Project Overview
This project implements a sentiment classification pipeline using Amazon reviews. The task is binary classification (positive vs. negative) based on star ratings. I built a full pipeline from dataset preparation and text cleaning to feature extraction (bigram count vectors) and model training/evaluation using four classical classifiers: Perceptron, SVM, Logistic Regression, and Multinomial Naive Bayes.
Key design choices:
• Labels: Positive = 4–5 stars (y = 1), Negative = 1–2 stars (y = 0), Neutral (3 stars) ignored.
• Sampling: 100,000 positive + 100,000 negative reviews, shuffled (200,000 total).
• Text processing: lowercasing, URL/HTML removal, non-alphabetic removal, contraction expansion, stopword removal, lemmatization.
• Feature representation: bigram frequency vectors stored as a sparse matrix (CSR) to handle high dimensionality.
• Evaluation: Accuracy, Precision, Recall, and F1 for both training and testing splits (80/20).
Results
Analysis of Dataset
# 1. How many reviews received a 1-star rating?
num_1_star_reviews = df[df['star_rating'] == 1].shape[0]
print(f"Number of reviews with 1 star rating: {num_1_star_reviews}")

# 2. How many products have an average rating over 4 stars?
average_ratings = df.groupby('product_id')['star_rating'].mean()
num_products_over_4_stars = (average_ratings > 4).sum()
print(f"Number of products with average rating over 4 stars: {num_products_over_4_stars}")

# 3. How many reviews received more than 5 helpful votes?
num_reviews_over_5_helpful_votes = df[df['helpful_votes'] > 5].shape[0]
print(f"Number of reviews with more than 5 helpful votes: {num_reviews_over_5_helpful_votes}")
Number of reviews with 1 star rating: 306576
Number of products with average rating over 4 stars: 182508
Number of reviews with more than 5 helpful votes: 154764
Relabeling and Sampling
• In total, the dataset contains:
Positive reviews: 1998785
Neutral reviews: 193441
Negative reviews: 444776
Average length before cleaning: 317.2775
Average length after cleaning: 301.1736
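A minimal sketch of the relabeling and balanced sampling described above (assuming df holds the raw reviews with a star_rating column, and using data_ to match the variable naming in later snippets):

import pandas as pd

# Drop neutral (3-star) reviews, then map stars to binary labels.
labeled = df[df['star_rating'] != 3].copy()
labeled['y'] = (labeled['star_rating'] >= 4).astype(int)  # 1 = positive (4-5), 0 = negative (1-2)

# Balanced sample: 100,000 reviews per class, then shuffle.
pos = labeled[labeled['y'] == 1].sample(n=100_000, random_state=42)
neg = labeled[labeled['y'] == 0].sample(n=100_000, random_state=42)
data_ = pd.concat([pos, neg]).sample(frac=1, random_state=42).reset_index(drop=True)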
Pre-processing
• Three examples before/after removing stop words:
[
'i returned this case because i felt like it was going to break my phone that is as it sat on my belt my arms or anything that brushed against the phone would cause the slider to move id be walking around with it open without noticing i was afraid that it was going to snap off if i wasnt careful also the slider cover didnt really stay on very well it was nice to just be able to click the phone off of my belt and use it immediately but ive resorted to a belt clip that i have to open to take out the phone at least i feel that phone is safe',
'this is exactly what we needed and would order this item again it is just like the original no problems',
'cartridge empty upon arrival no ink disappointed'
]
[
'returned case felt like going break phone sat belt arms anything brushed phone would cause slider move id walking around open without noticing afraid going snap wasnt careful also slider cover didnt really stay well nice able click phone belt use immediately ive resorted belt clip open take phone least feel phone safe',
'exactly needed would order item like original problems',
'cartridge empty upon arrival ink disappointed'
]
• Three examples before/after lemmatization:
[
'returned case felt like going break phone sat belt arms anything brushed phone would cause slider move id walking around open without noticing afraid going snap wasnt careful also slider cover didnt really stay well nice able click phone belt use immediately ive resorted belt clip open take phone least feel phone safe',
'exactly needed would order item like original problems',
'cartridge empty upon arrival ink disappointed'
]
[
'return case felt like go break phone sit belt arm anything brush phone would cause slider move id walk around open without notice afraid go snap wasnt careful also slider cover didnt really stay well nice able click phone belt use immediately ive resort belt clip open take phone least feel phone safe',
'exactly need would order item like original problem',
'cartridge empty upon arrival ink disappoint'
]
• Average length before/after preprocessing:
Average length before preprocessing: 301.1736
Average length after preprocessing: 185.1686
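A sketch of the stop-word removal and lemmatization steps using NLTK; the URL/HTML stripping and contraction handling are simplified here, and the verb-POS lemmatization is an assumption (the examples above suggest the actual POS handling may differ):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r'https?://\S+|<[^>]+>', ' ', text)  # strip URLs and HTML tags
    text = re.sub(r'[^a-z\s]', ' ', text)              # keep alphabetic characters only
    tokens = [t for t in text.split() if t not in stop_words]
    tokens = [lemmatizer.lemmatize(t, pos='v') for t in tokens]  # verb-POS lemmatization (assumption)
    return ' '.join(tokens)

df['review_body'] = df['review_body'].apply(preprocess)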
Perceptron
Perceptron Training Accuracy: 0.9922
Perceptron Training Precision: 0.9851
Perceptron Training Recall: 0.9994
Perceptron Training F1 Score: 0.9922
Perceptron Testing Accuracy: 0.8560
Perceptron Testing Precision: 0.8446
Perceptron Testing Recall: 0.8743
Perceptron Testing F1 Score: 0.8592
• Hyperparameters
from sklearn.linear_model import Perceptron
model = Perceptron(
    max_iter=50,
    tol=1e-4,
    random_state=42
)
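The metrics above come from the 80/20 split; a sketch of the evaluation loop, assuming X and y are the CSR feature matrix and label vector built in the Experiment section below (the same pattern applies to the other classifiers):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

# Report all four metrics on both splits.
for split, X_s, y_s in [("Training", X_train, y_train), ("Testing", X_test, y_test)]:
    pred = model.predict(X_s)
    print(f"Perceptron {split} Accuracy: {accuracy_score(y_s, pred):.4f}")
    print(f"Perceptron {split} Precision: {precision_score(y_s, pred):.4f}")
    print(f"Perceptron {split} Recall: {recall_score(y_s, pred):.4f}")
    print(f"Perceptron {split} F1 Score: {f1_score(y_s, pred):.4f}")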
SVM
SVM Training Accuracy: 0.9920
SVM Training Precision: 0.9847
SVM Training Recall: 0.9995
SVM Training F1 Score: 0.9920
SVM Testing Accuracy: 0.8447
SVM Testing Precision: 0.8161
SVM Testing Recall: 0.8919
SVM Testing F1 Score: 0.8523
• Hyperparameters
from sklearn import svm
model_svm = svm.LinearSVC(random_state=42)
Logistic Regression
Logistic Regression Training Accuracy: 0.9847
Logistic Regression Training Precision: 0.9727
Logistic Regression Training Recall: 0.9973
Logistic Regression Training F1 Score: 0.9849
Logistic Regression Testing Accuracy: 0.8663
Logistic Regression Testing Precision: 0.8386
Logistic Regression Testing Recall: 0.9088
Logistic Regression Testing F1 Score: 0.8723
• Hyperparameters
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(
    max_iter=100,
    random_state=42
)
Naive Bayes
Naive Bayes Training Accuracy: 0.9423
Naive Bayes Training Precision: 0.9724
Naive Bayes Training Recall: 0.9102
Naive Bayes Training F1 Score: 0.9403
Naive Bayes Testing Accuracy: 0.8409
Naive Bayes Testing Precision: 0.8448
Naive Bayes Testing Recall: 0.8371
Naive Bayes Testing F1 Score: 0.8409
• Code
from sklearn.naive_bayes import MultinomialNB
model_nb = MultinomialNB()
Experiment
Issue Handling
Reading Data Error
• How I solved it:
◦ Passed on_bad_lines="skip" to pandas.read_csv, so malformed rows are skipped instead of aborting the load.
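A sketch of the fix (the file name and separator are placeholders):

import pandas as pd

# Skip malformed rows instead of raising a ParserError.
df = pd.read_csv(
    "amazon_reviews.tsv",  # hypothetical path
    sep="\t",
    on_bad_lines="skip",
)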
Bigram Vectorization on Massive & Sparse Dataset
• A memory allocation error occurred when building the bigram feature matrix.
• How I solved it:
◦ Using bigram features greatly increases the dimensionality of the feature space, making dense vector representations impractical due to memory constraints. To address this issue, each review was represented as a sparse dictionary that stores only the bigrams appearing in that document along with their frequencies. These sparse representations were then converted into a SciPy sparse matrix (COO format and later CSR), which significantly reduced memory usage while remaining compatible with linear classifiers. This approach enabled efficient training on a large-scale dataset without memory overflow.
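The snippet below relies on a bigram_vocabs dictionary mapping each bigram tuple to a column index; since the original construction code is not shown, this is one plausible way to build it:

import nltk

bigram_vocabs: dict[tuple[str, str], int] = {}
for text in df['review_body']:
    for bigram in nltk.bigrams(text.split()):
        if bigram not in bigram_vocabs:
            bigram_vocabs[bigram] = len(bigram_vocabs)  # assign the next free column index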
import nltk
import numpy as np
from scipy.sparse import coo_matrix

# Dictionary-based sparse representation: one dict per review,
# mapping bigram column index -> frequency
n = len(df['review_body'])
d = len(bigram_vocabs)
X: list[dict] = [{} for _ in range(n)]
for i, text in enumerate(df['review_body']):
    new_row_vec = {}
    bigrams = list(nltk.bigrams(text.split()))
    for bigram in bigrams:
        if bigram in bigram_vocabs:
            j = bigram_vocabs[bigram]  # get the column index of this bigram
            new_row_vec[j] = new_row_vec.get(j, 0) + 1
    X[i] = new_row_vec

# Add bigram features to data_
data_['x'] = X
dataset_all = data_[['y', 'x']].to_dict(orient='records')

# Assemble COO triplets, then convert to CSR for efficient row access
rows = []
cols = []
data = []
y = np.empty(n, dtype=np.int32)
for i, ex in enumerate(dataset_all):
    y[i] = int(ex["y"])
    for j, v in ex["x"].items():
        rows.append(i)
        cols.append(int(j))
        data.append(float(v))
X = coo_matrix(
    (np.array(data, dtype=np.float32),
     (np.array(rows, dtype=np.int32),
      np.array(cols, dtype=np.int32))),
    shape=(n, d)
).tocsr()
What I’ve learned
• Classical linear models are strong baselines for sentiment classification when paired with sparse n-gram features.
• Text feature extraction often dominates the engineering complexity: the biggest challenge is not the classifier, but building an efficient representation for very high-dimensional sparse data.
• Using NLTK, I learned how to perform lemmatization.
• Bigram features can capture phrase-level sentiment signals (e.g., negations; see the toy example after this list), but they require sparse vectorization and careful handling of vocabulary growth.
• Model performance should be reported with multiple metrics (Accuracy / Precision / Recall / F1) because class-wise errors matter; F1 is especially useful when evaluating the balance between precision and recall.
• For large-scale text classification, implementation details (CSR matrices, training time, scalability of the solver) are as important as the model choice itself.
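As a toy illustration of the negation point above:

import nltk

tokens = "not good at all".split()
print(list(nltk.bigrams(tokens)))
# [('not', 'good'), ('good', 'at'), ('at', 'all')]
# The ('not', 'good') bigram keeps the negation attached to the sentiment word,
# which separate unigram features 'not' and 'good' would lose.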

