Changyu Lee

A Report on a Hands-on Project: Text Representations and Text Classification for Sentiment Analysis

Published at
2026/02/01
Last edited time
2026/02/01 20:02
Created
2026/01/30 22:43
Section
NLP & Prompt Engineering
Status
Done
Series
From the Bottom
Tags
Project
Lecture Summary
AI summary
Keywords
SVM
LogisticRegression
Naive Bayes
Bigram
Language
ENG
Week
26-4

Project Overview

This project implements a sentiment classification pipeline using Amazon reviews. The task is binary classification (positive vs. negative) based on star ratings. I built a full pipeline from dataset preparation and text cleaning to feature extraction (bigram count vectors) and model training/evaluation using four classical classifiers: Perceptron, SVM, Logistic Regression, and Multinomial Naive Bayes.
Key design choices:
Labels: Positive = 4–5 stars (y = 1), Negative = 1–2 stars (y = 0); neutral (3-star) reviews are ignored.
Sampling: 100,000 positive + 100,000 negative reviews, shuffled (total 200,000).
Text processing: lowercasing, URL/HTML removal, non-alphabetic removal, contraction expansion, stopword removal, lemmatization.
Feature representation: bigram frequency vectors stored as a sparse matrix (CSR) to handle high dimensionality.
Evaluation: Accuracy, Precision, Recall, F1 for both training and testing splits (80/20).

Results

Analysis of Dataset

# 1. How many reviews received a 1-star rating?
num_1_star_reviews = df[df['star_rating'] == 1].shape[0]
print(f"Number of reviews with 1 star rating: {num_1_star_reviews}")

# 2. How many products have an average rating over 4 stars?
average_ratings = df.groupby('product_id')['star_rating'].mean()
num_products_over_4_stars = (average_ratings > 4).sum()
print(f"Number of products with average rating over 4 stars: {num_products_over_4_stars}")

# 3. How many reviews received more than 5 helpful votes?
num_reviews_over_5_helpful_votes = df[df['helpful_votes'] > 5].shape[0]
print(f"Number of reviews with more than 5 helpful votes: {num_reviews_over_5_helpful_votes}")
Python
Number of reviews with 1 star rating: 306576
Number of products with average rating over 4 stars: 182508
Number of reviews with more than 5 helpful votes: 154764
Plain Text

Relabeling and Sampling

In total, the dataset contains:
Positive reviews: 1998785
Neutral reviews: 193441
Negative reviews: 444776
Plain Text
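For reference, the relabeling and balanced sampling described above could look roughly like the following; this is a minimal sketch, assuming the raw reviews are in a pandas DataFrame df with a star_rating column (the variable names pos, neg, and the random seed are illustrative assumptions, not the project's actual code).

import pandas as pd

# Drop neutral (3-star) reviews and map ratings to binary labels:
# 4-5 stars -> y = 1 (positive), 1-2 stars -> y = 0 (negative).
labeled = df[df['star_rating'] != 3].copy()
labeled['y'] = (labeled['star_rating'] >= 4).astype(int)

# Balanced sample: 100,000 positive + 100,000 negative reviews, then shuffle.
pos = labeled[labeled['y'] == 1].sample(n=100_000, random_state=42)
neg = labeled[labeled['y'] == 0].sample(n=100_000, random_state=42)
data_ = pd.concat([pos, neg]).sample(frac=1, random_state=42).reset_index(drop=True)
Python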
Average length before cleaning: 317.2775
Average length after cleaning: 301.1736
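The cleaning step that produced these numbers is not shown above; below is a minimal sketch of what it might look like, assuming the contractions package is used for contraction expansion (the helper name clean_text, the exact regular expressions, and the order of operations are assumptions, not the project's actual code).

import re
import contractions  # assumption: the `contractions` package handles expansion

def clean_text(text: str) -> str:
    # Lowercase, strip URLs and HTML tags, expand contractions, keep letters only.
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # remove URLs
    text = re.sub(r'<[^>]+>', ' ', text)                 # remove HTML tags
    text = contractions.fix(text)                        # e.g. "don't" -> "do not"
    text = re.sub(r'[^a-z\s]', ' ', text)                # remove non-alphabetic characters
    return re.sub(r'\s+', ' ', text).strip()

df['review_body'] = df['review_body'].astype(str).apply(clean_text)
Python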

Pre-processing

Three examples before/after removing stop words:
[ 'i returned this case because i felt like it was going to break my phone that is as it sat on my belt my arms or anything that brushed against the phone would cause the slider to move id be walking around with it open without noticing i was afraid that it was going to snap off if i wasnt careful also the slider cover didnt really stay on very well it was nice to just be able to click the phone off of my belt and use it immediately but ive resorted to a belt clip that i have to open to take out the phone at least i feel that phone is safe', 'this is exactly what we needed and would order this item again it is just like the original no problems', 'cartridge empty upon arrival no ink disappointed' ]
Plain Text
[ 'returned case felt like going break phone sat belt arms anything brushed phone would cause slider move id walking around open without noticing afraid going snap wasnt careful also slider cover didnt really stay well nice able click phone belt use immediately ive resorted belt clip open take phone least feel phone safe', 'exactly needed would order item like original problems', 'cartridge empty upon arrival ink disappointed' ]
Plain Text
Three examples before/after performing lemmatization:
[ 'returned case felt like going break phone sat belt arms anything brushed phone would cause slider move id walking around open without noticing afraid going snap wasnt careful also slider cover didnt really stay well nice able click phone belt use immediately ive resorted belt clip open take phone least feel phone safe', 'exactly needed would order item like original problems', 'cartridge empty upon arrival ink disappointed' ]
Plain Text
복사
[ 'return case felt like go break phone sit belt arm anything brush phone would cause slider move id walk around open without notice afraid go snap wasnt careful also slider cover didnt really stay well nice able click phone belt use immediately ive resort belt clip open take phone least feel phone safe', 'exactly need would order item like original problem', 'cartridge empty upon arrival ink disappoint' ]
Plain Text
Average length before/after preprocessing:
Average length before preprocessing: 301.1736
Average length after preprocessing: 185.1686
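The stop-word removal and lemmatization shown above were done with NLTK; the following is a minimal sketch of how they might be implemented. Lemmatizing each token first as a verb and then as a noun is an assumption made to roughly match the example outputs (e.g. "returned" -> "return", "arms" -> "arm"), not necessarily the exact setup used.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def remove_stopwords(text: str) -> str:
    return ' '.join(w for w in text.split() if w not in stop_words)

def lemmatize(text: str) -> str:
    # Lemmatize as a verb, then as a noun ("returned" -> "return", "problems" -> "problem").
    return ' '.join(
        lemmatizer.lemmatize(lemmatizer.lemmatize(w, pos='v'), pos='n')
        for w in text.split()
    )

df['review_body'] = df['review_body'].apply(remove_stopwords).apply(lemmatize)
Python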

Perceptron

Perceptron Training Accuracy: 0.9922
Perceptron Training Precision: 0.9851
Perceptron Training Recall: 0.9994
Perceptron Training F1 Score: 0.9922
Perceptron Testing Accuracy: 0.8560
Perceptron Testing Precision: 0.8446
Perceptron Testing Recall: 0.8743
Perceptron Testing F1 Score: 0.8592
Plain Text
Hyperparameters
from sklearn.linear_model import Perceptron

model = Perceptron(
    max_iter=50,
    tol=1e-4,
    random_state=42
)
Python
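The same train/evaluate routine was applied to all four classifiers; below is a minimal sketch, assuming an 80/20 split produced with scikit-learn's train_test_split (the variable names and random seed are illustrative assumptions).

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 80/20 split of the sparse bigram matrix X and the label vector y.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train)

for split, X_part, y_part in [("Training", X_train, y_train), ("Testing", X_test, y_test)]:
    pred = model.predict(X_part)
    print(f"Perceptron {split} Accuracy: {accuracy_score(y_part, pred):.4f}")
    print(f"Perceptron {split} Precision: {precision_score(y_part, pred):.4f}")
    print(f"Perceptron {split} Recall: {recall_score(y_part, pred):.4f}")
    print(f"Perceptron {split} F1 Score: {f1_score(y_part, pred):.4f}")
Python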

SVM

SVM Training Accuracy: 0.9920
SVM Training Precision: 0.9847
SVM Training Recall: 0.9995
SVM Training F1 Score: 0.9920
SVM Testing Accuracy: 0.8447
SVM Testing Precision: 0.8161
SVM Testing Recall: 0.8919
SVM Testing F1 Score: 0.8523
Plain Text
Hyperparameters
from sklearn import svm

model_svm = svm.LinearSVC(random_state=42)
Python

Logistic Regression

Logistic Regression Training Accuracy: 0.9847
Logistic Regression Training Precision: 0.9727
Logistic Regression Training Recall: 0.9973
Logistic Regression Training F1 Score: 0.9849
Logistic Regression Testing Accuracy: 0.8663
Logistic Regression Testing Precision: 0.8386
Logistic Regression Testing Recall: 0.9088
Logistic Regression Testing F1 Score: 0.8723
Plain Text
Hyperparameters
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(
    max_iter=100,
    random_state=42
)
Python

Naive Bayes

Naive Bayes Training Accuracy: 0.9423
Naive Bayes Training Precision: 0.9724
Naive Bayes Training Recall: 0.9102
Naive Bayes Training F1 Score: 0.9403
Naive Bayes Testing Accuracy: 0.8409
Naive Bayes Testing Precision: 0.8448
Naive Bayes Testing Recall: 0.8371
Naive Bayes Testing F1 Score: 0.8409
Plain Text
Code
from sklearn.naive_bayes import MultinomialNB

model_nb = MultinomialNB()
Python

Experiment

Issue Handling

Reading Data Error

How I solved it
Using on_bad_lines="skip" when reading the file with pandas, which skips malformed rows instead of raising a parsing error.
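A minimal sketch of the fix, assuming the dataset is a tab-separated file (the file name and separator are assumptions):

import pandas as pd

# Skip malformed rows instead of raising a parser error.
df = pd.read_csv(
    "amazon_reviews.tsv",  # assumed file name
    sep="\t",
    on_bad_lines="skip",
)
Python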

Bigram Vectorization on Massive & Sparse Dataset

Memory Allocation Error Occurred
How I solved it
Using bigram features greatly increases the dimensionality of the feature space, making dense vector representations impractical due to memory constraints. To address this issue, each review was represented as a sparse dictionary that stores only the bigrams appearing in that document along with their frequencies. These sparse representations were then converted into a SciPy sparse matrix (COO format and later CSR), which significantly reduced memory usage while remaining compatible with linear classifiers. This approach enabled efficient training on a large-scale dataset without memory overflow.
import nltk
import numpy as np
from scipy.sparse import coo_matrix

# Dictionary mapping matrix: one sparse dict of {bigram index: count} per review
n = len(df['review_body'])
d = len(bigram_vocabs)

X: list[dict] = [{} for _ in range(n)]
for i, text in enumerate(df['review_body']):
    new_row_vec = {}
    bigrams = list(nltk.bigrams(text.split()))
    for bigram in bigrams:
        if bigram in bigram_vocabs:
            j = bigram_vocabs[bigram]  # get index of bigram
            new_row_vec[j] = new_row_vec.get(j, 0) + 1
    X[i] = new_row_vec

# add bigram features to data_
data_['x'] = X
dataset_all = data_[['y', 'x']].to_dict(orient='records')

rows = []
cols = []
data = []
y = np.empty(n, dtype=np.int32)
for i, ex in enumerate(dataset_all):
    y[i] = int(ex["y"])
    for j, v in ex["x"].items():
        rows.append(i)
        cols.append(int(j))
        data.append(float(v))

# Build a COO sparse matrix, then convert to CSR for efficient training
X = coo_matrix(
    (np.array(data, dtype=np.float32),
     (np.array(rows, dtype=np.int32), np.array(cols, dtype=np.int32))),
    shape=(n, d)
).tocsr()
Python
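For comparison, scikit-learn's CountVectorizer can produce an equivalent sparse bigram count matrix (CSR) directly; below is a minimal sketch of that alternative, which was not the approach used in this project.

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) extracts bigrams only; fit_transform returns a SciPy CSR matrix.
vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = vectorizer.fit_transform(df['review_body'])
print(X_bigram.shape)  # (number of reviews, number of bigram features)
Python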

What I’ve learned

Classical linear models are strong baselines for sentiment classification when paired with sparse n-gram features.
Text feature extraction often dominates the engineering complexity: the biggest challenge is not the classifier, but building an efficient representation for very high-dimensional sparse data.
Using NLTK, I’ve learned how to do lemmatization.
Bigram features can capture phrase-level sentiment signals (e.g., negations), but they require sparse vectorization and careful handling of vocabulary growth.
Model performance should be reported with multiple metrics (Accuracy / Precision / Recall / F1) because class-wise errors matter; F1 is especially useful when evaluating the balance between precision and recall.
For large-scale text classification, implementation details (CSR matrices, training time, scalability of the solver) are as important as the model choice itself.