NLP Pipelines & Text Classification Methods

Published at

2026/02/20

Last edited time

2026/02/21 01:39

Created

2026/02/20 21:51

Section

NLP & Prompt Engineering

Status

Done

Series

From the Bottom

NLP System Pipeline

Feature Extraction

•

For example,

◦

Sentiment Analysis

◦

Named Entity Recognition

•

Limitations of Expert System

◦

Different expertise for different tasks

◦

Expertise is not transferrable across tasks, languages, and domains

◦

⇒ The solution is deep learning, a.k.a. representation learning

Machine Learning

•

Classification

◦

Linear Classifier / Logistic Regression / SVM

•

Structured Prediction

◦

Sequence Labeling / CNN / RNN / LSTM

•

Machine Translation 

◦

Seq2seq generation

◦

Transformer

•

Language Modeling

◦

Pre-training

◦

Post-training, RL

Text Classification Methods

•

Our goal: Build a system that predicts which of C classes an input document x belongs to

•

Some text classification problems include:

◦

Sentiment analysis

◦

Hate speech detection

◦

Authorship analysis

•

How to build a Text Classifier

Assume we have training data

(x^{(i)}, y^{(i)})

where i ranges from 1 to

N

Each input

x

is a document → Documents can have different numbers of words

Each training example has a corresponding label

y^{(i)}

Pre-Processing

•

Goal: Convert data into a standardized form that our models can easily ingest

•

Tokenization: splitting the text into units for processing

◦

e.g. Removing extra spaces, Removing “unhelpful” text, Splitting punctuation

•

Other optional operations include

◦

Expanding contractions and standardizing (e.g. won’t → will not)

◦

Converting capital letters to lowercase (His → his)

◦

Removing stopwords (a, the, about …)

◦

Stemming or lemmatization (e.g. running → run; poorly → poor)

•

Why do we standardize? It helps generalization

◦

But, as a trade-off: we lose information
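The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name `preprocess`, the regular expression, and the three-word `STOPWORDS` set are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "the", "about"}

def preprocess(text, lowercase=True, remove_stopwords=True):
    """Tokenize text and apply optional standardization steps."""
    if lowercase:
        text = text.lower()
    # Keep runs of word characters (allowing one internal apostrophe)
    # or single punctuation marks; whitespace, including extra spaces, is dropped.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens

print(preprocess("The  movie was great, wasn't it?"))
# ['movie', 'was', 'great', ',', "wasn't", 'it', '?']
```

Note the trade-off in action: after lowercasing and stopword removal, "The movie" and "movie" become indistinguishable.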

Naive Bayes

•

We model

p(y): For each label y, what is the probability of y occurring?

p(x|y): For each label y, which x's are likely to appear?

Modeling Naive Bayes

•

Using Bayes' rule,

p(y|x) = \frac{p(x|y)p(y)}{p(x)}

We can predict the class by choosing the label

y

that maximizes

p(y|x)

\hat{y} = \arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)p(y)}{p(x)} = \arg\max_y p(x|y)p(y)

Modeling P(y)

•

Modeling p(y) is easy: just count how often each y appears

C

: the # of possible classes

Our model learns a parameter

\pi_j = P(y=j)

for each possible

j

◦

Learning: \pi_j = count(y=j)/n

▪

n: # of training examples

▪

count(y=j): how often y = j in the training data
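As a small sketch of this counting rule, the prior can be estimated directly from the list of training labels. The function name `estimate_prior` is an assumption for illustration; the five labels are those of the toy documents used later in this note.

```python
from collections import Counter
from fractions import Fraction

def estimate_prior(labels):
    """Return pi_j = count(y = j) / n for every label j seen in training."""
    n = len(labels)
    counts = Counter(labels)
    return {j: Fraction(c, n) for j, c in counts.items()}

# Labels of the five toy documents: three positive, two negative.
labels = [+1, -1, +1, -1, +1]
pi = estimate_prior(labels)
print(pi[+1], pi[-1])  # 3/5 2/5
```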

Modeling P(x|y)

→ In this part, we are using Naive Bayes Method

•

Idea: Make a simplifying assumption about p(x|y) to make it possible to estimate

•

Assumption: Each word of the document x is conditionally independent of the other words given the label y: p(x|y) = \prod_k p(x_k|y), where x_k is the k-th word of x

◦

Note: This assumption does not have to be true, just has to be “close enough” so that classifier makes reasonable predictions

•

Naive Bayes posits its own probabilistic story about how the data was generated

•

Process

Each y^{(i)} was sampled from the prior distribution p(y)

Each word in x^{(i)} was sampled independently from the word distribution for label y^{(i)}

•

Why is the Naive Bayes Assumption OK?

Naïve Bayes assumes:

Once the label y is chosen, each word is generated independently.

So it assumes something like:

positive document generation:
great the movie good score great the ...

negative document generation:
bad worst movie is terrible worst bad ...

This is unrealistic. Real sentences are:

"the movie was great"
"the acting was terrible"

Words clearly depend on each other.

For example:

◦

"New" → likely followed by "York"

◦

"machine" → likely followed by "learning"

Not independent.

So the Naive Bayes assumption is clearly wrong.

So why is the Naive Bayes assumption still OK?

Because we don’t need exact probabilities. We only need the correct class to receive a higher probability than the others.

p(x|y_{true}) p(y_{true}) > p(x|y_{other}) p(y_{other})

For example,

Learning with Naive Bayes

•

How to learn? Just count occurrences of w

◦

Note: this formula has a flaw, we will fix it soon

•

Model learns parameter

◦

Total of |V| \times C parameters to learn

▪

V denotes the set of words in the dictionary

See the example below.

We begin with labeled text documents:

i	y(i)	x(i)
1	+1	great acting and score
2	-1	terrible directing
3	+1	great movie
4	-1	terrible
5	+1	amazing

Where:

•

x(i) = document (sequence of words)

•

y(i) = label

◦

+1 → positive

◦

−1 → negative

Our goal is to learn:

P(\text{word} \mid y=+1) \quad \text{and} \quad P(\text{word} \mid y=-1)

These are called the likelihood parameters and are denoted:

\tau_{w,+1} = P(w \mid y=+1)

\tau_{w,-1} = P(w \mid y=-1)

In Positive documents:

great acting and score
great movie
amazing

Count each word:

word	count
great	2
acting	1
and	1
score	1
movie	1
amazing	1
directing	0
terrible	0

Total words in positive class: 7

Let’s convert Counts into Probabilities

Formula:

P(w \mid y=+1) = \frac{\text{count of } w \text{ in positive docs}}{\text{total positive words}}

So:

word	probability
acting	1/7
and	1/7
amazing	1/7
directing	0
great	2/7
movie	1/7
score	1/7
terrible	0

This produces the green table in the image.

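The same procedure applies to the negative documents: they contain 3 words in total, so P(terrible | y=-1) = 2/3, P(directing | y=-1) = 1/3, and every other word gets probability 0.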
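The counting procedure from the worked example can be reproduced in code. The sketch below uses the five toy documents from the table above; the function name `estimate_likelihoods` and the use of exact fractions are choices made for this illustration, and no smoothing is applied yet.

```python
from collections import Counter
from fractions import Fraction

train = [
    (+1, "great acting and score"),
    (-1, "terrible directing"),
    (+1, "great movie"),
    (-1, "terrible"),
    (+1, "amazing"),
]

def estimate_likelihoods(data):
    """Return tau[label][word] = count(word, label) / total words in label."""
    vocab = sorted({w for _, doc in data for w in doc.split()})
    counts = {}
    for label, doc in data:
        counts.setdefault(label, Counter()).update(doc.split())
    tau = {}
    for label, c in counts.items():
        total = sum(c.values())
        # Words never seen with this label get probability 0 (unsmoothed).
        tau[label] = {w: Fraction(c[w], total) for w in vocab}
    return tau

tau = estimate_likelihoods(train)
print(tau[+1]["great"])     # 2/7
print(tau[+1]["terrible"])  # 0
print(tau[-1]["terrible"])  # 2/3
```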

Predicting with Naive Bayes

Problems of Naive Bayes

Too Many Zeros → Laplace Smoothing

What if both

p(x, y=+1)

and

p(x, y=-1)

have zero value?

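By Bayes' rule,

p(y=+1|x) = \frac{p(x, y=+1)}{p(x, y=+1) + p(x, y=-1)} = \frac{0}{0+0} = NaN

This happens often, because the model assigns probability 0 to many (word, label) pairs, and a single unseen (word, label) pair makes the whole product p(x, y) zero.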

•

Solution: Laplace Smoothing

\lambda

is a new hyperparameter

◦

Imagine that every (word, label) pair was seen an additional \lambda times
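One common way to write this idea is the add-\lambda estimate (count(w, y=j) + \lambda) / (total words in class j + \lambda |V|). The sketch below assumes that form; the function name `smoothed_likelihood` and the default \lambda = 1 are illustrative choices. The numbers come from the toy positive class: 7 words in total and a vocabulary of 8 words.

```python
def smoothed_likelihood(word_count, total_words, vocab_size, lam=1.0):
    """Add-lambda estimate: (count + lam) / (total + lam * |V|)."""
    return (word_count + lam) / (total_words + lam * vocab_size)

# "great" appears 2 times in positive documents: (2 + 1) / (7 + 8) = 3/15
print(smoothed_likelihood(2, 7, 8))  # 0.2

# "terrible" never appears in positive documents, but is no longer 0: 1/15
print(smoothed_likelihood(0, 7, 8) > 0)  # True
```

Because \lambda is added for every vocabulary word, the smoothed probabilities for a class still sum to 1.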

Numerical Underflow → Using Log Space

Given a long test example, the probability underflows.

Multiplying many small numbers causes numerical underflow: the result is so small that it is rounded to 0.

•

Solution: using log

\log p(x, y=j) = \log \pi_j + \sum_{w \in x} \log \tau_{w,j} (the sum runs over every word occurrence w in x)
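A short sketch makes the underflow concrete. The probabilities below (100 words with probability 1e-5 each, a prior of 0.6, and the `tau` values) are made-up numbers for illustration, and `log_joint` is a hypothetical helper name.

```python
import math

word_probs = [1e-5] * 100  # 100 words, each with probability 1e-5

# Direct product: 1e-500 is far below the smallest positive double, so it becomes 0.0.
product = 1.0
for p in word_probs:
    product *= p
print(product)  # 0.0

# Log space: sum the logs instead of multiplying the probabilities.
log_prob = sum(math.log(p) for p in word_probs)
print(round(log_prob, 2))  # -1151.29

def log_joint(words, prior, tau):
    """log p(x, y=j) = log pi_j + sum over words of log tau_{w,j}."""
    return math.log(prior) + sum(math.log(tau[w]) for w in words)

# Hypothetical smoothed parameters for one class.
tau = {"great": 0.2, "movie": 2 / 15}
print(log_joint(["great", "movie"], 0.6, tau) < 0)  # True
```

Since the logarithm is increasing, comparing log scores across classes picks the same label as comparing the original probabilities.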

Summary

Logistic & Softmax Regression

•

First decide on a formula we will use to make predictions

◦

Formula contains some numerical parameters which determine its output

◦

Optimize the parameters so that we make good predictions on the training data

◦

Discriminative: Focuses only on discriminating positives from negatives

▪

vs. Naive Bayes (generative approach): it models the entire process of generating (x, y)

Predicting with Logistic Regression

Convert the document x to a vector of features \phi(x) of length d

Choose (in advance) a weight vector w of length d and a bias b

•

So our parameters are w and b (d is the fixed feature dimension, not a learned parameter)

Compute a score as follows:

s = \left(\sum_{j=1}^{d} w_j \cdot \phi(x)_j\right) + b = w^T \phi(x) + b

Note: this is a linear model because the score is a linear function of the features

Predict y = +1 if score > 0, otherwise y = -1
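The scoring and prediction rule can be written out directly. In this sketch the weights, the bias, and the two-feature layout are hypothetical values chosen by hand, not learned parameters.

```python
def score(w, phi, b):
    """s = w^T phi(x) + b"""
    return sum(wj * xj for wj, xj in zip(w, phi)) + b

def predict(w, phi, b):
    """Predict +1 if the score is positive, otherwise -1."""
    return +1 if score(w, phi, b) > 0 else -1

# Hypothetical weights for the features [count("great"), count("terrible")].
w = [2.0, -3.0]
b = 0.5
print(predict(w, [1, 0], b))  # 1   (score  2.5)
print(predict(w, [0, 1], b))  # -1  (score -2.5)
```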

Feature Extraction Phase

•

Example: Unigram Count Features

◦

First construct a vocab: set of all words in the training data

◦

Unigram Count Features

▪

Pros

•

We can learn how each word influences label

•

Very simple

•

Can work reasonably well for some tasks

•

# of params we have to learn is not that large

▪

Cons

•

We lose all information about order of words

•

Doesn’t really capture what the sentence means
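A minimal sketch of unigram count features, assuming whitespace tokenization and a vocabulary stored as a sorted list. The helper names `build_vocab` and `unigram_counts` are made up for this example.

```python
from collections import Counter

def build_vocab(docs):
    """Sorted list of all distinct words in the training documents."""
    return sorted({w for doc in docs for w in doc.split()})

def unigram_counts(doc, vocab):
    """phi(x): one count per vocabulary word; unknown words are ignored."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

docs = ["great acting and score", "terrible directing", "great movie"]
vocab = build_vocab(docs)
print(vocab)
# ['acting', 'and', 'directing', 'great', 'movie', 'score', 'terrible']
print(unigram_counts("great great movie", vocab))
# [0, 0, 0, 2, 1, 0, 0]
```

The loss of word order is visible here: "movie great great" maps to exactly the same vector as "great great movie".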

Machine Learning Phase

•

How to choose w & b?

◦

Idea: Optimize w & b to ensure our predictions are accurate on the training data

◦

What does "good" mean?

▪

Good: make correct predictions

▪

Better: make confidently correct predictions

•

Score should be very high when y = +1. Score should be very low when y = -1

◦

Our approach: convert scores to probabilities and maximize the probability of the correct answer

cf) Sigmoid Function

•

Objective Function

◦

We want to optimize this objective function with respect to w & b

◦

for binary classification,

Then take the log and negate

•

Optimization

◦

Using Gradient Descent
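The training recipe above can be sketched end to end. The formulas in this note are not shown, so the code assumes the standard binary formulation: sigmoid(z) = 1 / (1 + e^{-z}), p(y|x) = sigmoid(y \cdot s) for y in {+1, -1}, and the loss is the negated sum of log-probabilities. The toy data, learning rate, and step count are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(w, b, X, y):
    """Negative log-likelihood: -sum_i log sigmoid(y_i * (w . x_i + b))."""
    total = 0.0
    for xi, yi in zip(X, y):
        s = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total -= math.log(sigmoid(yi * s))
    return total

def gradient_step(w, b, X, y, lr=0.1):
    """One step of batch gradient descent on the loss above."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        s = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # d/ds of -log sigmoid(y * s) is -y * sigmoid(-y * s)
        g = -yi * sigmoid(-yi * s)
        grad_w = [gw + g * xj for gw, xj in zip(grad_w, xi)]
        grad_b += g
    w = [wj - lr * gw for wj, gw in zip(w, grad_w)]
    return w, b - lr * grad_b

# Toy features: [count("great"), count("terrible")]
X = [[1, 0], [0, 1], [2, 0], [0, 2]]
y = [+1, -1, +1, -1]
w, b = [0.0, 0.0], 0.0
before = nll(w, b, X, y)
for _ in range(100):
    w, b = gradient_step(w, b, X, y)
after = nll(w, b, X, y)
print(before > after)  # True
```

After training, the weight for "great" is positive and the weight for "terrible" is negative, which matches the goal of pushing scores high for y = +1 and low for y = -1.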

Support Vector Machines (SVM)

•

This is another method for training a linear classifier

◦

Also learns w, b

•

Intuition: Find a separating hyperplane with large margin

•

The location of the hyperplane ends up depending on the support vectors

Multi-class / Multinomial Logistic Regression

•

What if we have C > 2 classes?

◦

Use the softmax function
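As a sketch, the softmax function turns C real-valued scores into a probability distribution: p(y=j|x) = e^{s_j} / \sum_k e^{s_k}. The assumption here is the usual setup in which each class j has its own score s_j; the example scores are made up.

```python
import math

def softmax(scores):
    """Convert C real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs.index(max(probs)))  # 0: the highest score gets the highest probability
print(softmax([0.0, 0.0]))      # [0.5, 0.5]
```

With C = 2, softmax over the scores [s, 0] gives the same probability as the sigmoid of s, so binary logistic regression is a special case.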

Bigram

•

Bigram = pair of consecutive words

A bigram is simply two consecutive words that appear together. For example, in "great movie," the bigram is ["great", "movie"].

When you use bigram features, you create a separate feature for each possible pair of words. If your vocabulary has |V| words, the number of possible bigrams is |V|² (vocabulary size squared). This means if you have 10,000 words in your vocabulary, you'd have 100 million possible bigram features.

For each bigram feature, the model learns a separate weight parameter. This allows the model to capture patterns like:

◦

"not good" (negative sentiment) vs "very good" (positive sentiment)

◦

"highly recommend" (strong positive) vs "do not" (likely negative)

In practice, you typically use both unigram and bigram features together. This gives your model:

◦

Unigrams: Individual word importance (e.g., "terrible" is negative)

◦

Bigrams: Word order and context (e.g., "not terrible" changes the meaning)

◦

Pros and cons
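The bigram features described above can be extracted with a few lines of code. This sketch assumes whitespace tokenization and stores features sparsely in a `Counter` rather than in a dense vector of length |V|²; the helper names are illustrative.

```python
from collections import Counter

def bigrams(tokens):
    """All pairs of consecutive tokens, in order."""
    return list(zip(tokens, tokens[1:]))

def unigram_and_bigram_counts(text):
    """Sparse feature counts keyed by the unigram or bigram itself."""
    tokens = text.split()
    return Counter(tokens) + Counter(bigrams(tokens))

print(bigrams("not terrible at all".split()))
# [('not', 'terrible'), ('terrible', 'at'), ('at', 'all')]
feats = unigram_and_bigram_counts("great movie")
print(feats[("great", "movie")], feats["great"])  # 1 1
```

A sparse representation matters in practice: most of the |V|² possible bigrams never occur in a given document, so only the observed ones need to be stored.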

Discriminative vs. Generative Comparison

Model Evaluation

Our goal is always to build a classifier that can classify any document

•

The model can overfit to the training data

•

We need to test generalization to unseen examples 

•

Split dataset into three parts

◦

Training set: Train the model

◦

Development set (Validation set): Select hyperparams

◦

Test set : evaluate final model’s performance
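A three-way split can be sketched as follows. The 80/10/10 proportions, the fixed seed, and the function name `train_dev_test_split` are assumptions for illustration; the note does not prescribe specific sizes.

```python
import random

def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into train / dev / test parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

The three parts never overlap, so the test set stays unseen until the final evaluation.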

Evaluation Metrics

Conclusion
