•
2-layer neural networks consist of a hidden layer (multiple logistic regression classifiers) and a final layer (linear for binary, softmax for multi-class)
•
Non-linear activation functions (sigmoid, tanh, ReLU) are essential for networks to learn complex patterns; without them, stacked layers collapse into one linear equation
•
Deep networks (more layers) generally outperform wide networks (more units per layer) in practice
•
CNNs apply convolutional filters over fixed-size windows, creating learned n-gram features that generalize to unseen word sequences via word embeddings
•
SGD partitions training data into batches; larger batches yield more accurate gradients but slower updates, while smaller batches are faster but noisier
•
Proper weight initialization (He, Xavier, or PyTorch default uniform) prevents symmetry and ensures effective training
•
Learning rate is critical: too high causes divergence, too low leads to slow convergence; learning rate schedules can improve optimization
•
Momentum and adaptive learning rates (e.g., Adam) help SGD overcome local minima and navigate non-convex loss surfaces more effectively
Neural Nets
MLP
•
2-layer NN for Binary Classification
◦
Hidden Layer = A bunch of Logistic regression classifiers
The logistic regression equation is: P(y=1|x) = σ(w·x + b)
where:
▪
σ is the sigmoid function
▪
w is the weight vector
▪
x is the input feature vector
▪
b is the bias term
◦
Final Layer = linear Model
•
What about 2-layer NN for Multi-Class Classification?
◦
Final Layer is changed to Softmax Regression (Only final layer changes when changing to a different task)
For multi-class classification, we use the softmax function to convert the output logits into probability distributions: P(y=k|x) = exp(z_k) / Σ_{j=1}^{K} exp(z_j)
where:
▪
z_k = w_k^T x + b_k is the logit for class k
▪
K is the total number of classes
▪
The softmax function ensures that all probabilities sum to 1
The predicted class is: ŷ = argmax_k P(y=k|x)
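The softmax computation above can be sketched in NumPy (a minimal illustration; the class count and logit values are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # logits z_k for K = 3 classes
p = softmax(z)                   # probabilities sum to 1
y_hat = int(np.argmax(p))        # predicted class: argmax_k P(y=k|x)
```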
In Matrix Form, Hidden Layer (Layer 1): h = σ(W_1 x + b_1)
where:
▪
h is the hidden layer activation vector
▪
W_1 is the weight matrix for the first layer
▪
x is the input feature vector
▪
b_1 is the bias vector for the first layer
▪
σ is the activation function (e.g., sigmoid, ReLU)
Output Layer (Layer 2): z = W_2 h + b_2
where:
▪
z is the output logit vector
▪
W_2 is the weight matrix for the second layer
▪
h is the hidden layer activation from Layer 1
▪
b_2 is the bias vector for the second layer
Final Prediction:
For binary classification: P(y=1|x) = σ(z); predict 1 if σ(z) > 0.5
For multi-class classification: P(y=k|x) = softmax(z)_k; ŷ = argmax_k P(y=k|x)
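The matrix-form forward pass above can be run end to end in NumPy (a minimal sketch; the layer sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

d_in, d_hidden, K = 4, 8, 3                       # illustrative sizes
W1, b1 = rng.normal(size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(K, d_hidden)), np.zeros(K)

x = rng.normal(size=d_in)        # input feature vector
h = sigmoid(W1 @ x + b1)         # hidden layer: h = sigma(W1 x + b1)
z = W2 @ h + b2                  # output logits: z = W2 h + b2
p = np.exp(z - z.max())          # softmax over the logits
p /= p.sum()
```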
•
Multi Layer Perceptron
We can add more layers!
◦
In practice, making networks deeper (adding more layers) often helps more than making them “wider” (more hidden units in each layer)
◦
Layers are fully connected as each neuron depends on every neuron in previous layer
3 Views for DL
1. DL as Learnable Non-Linear Functions (NNs as Non-linear Functions)
•
NNs compute a non-linear function of input to make predictions
◦
Non-linear Functions
▪
Sigmoid
▪
Tanh
▪
ReLU
▪
In practice: tanh and ReLU often preferred
•
Tanh: Better than sigmoid because outputs centered around zero
•
ReLU: very fast to compute
•
Why non-linearity is important?
◦
Without non-linear functions, the whole network collapses into one linear equation (a composition of linear maps is itself linear).
◦
Having a simple non-linear function between the two linear operations enables us to learn a complex non-linear function
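This collapse is easy to verify numerically: composing two linear layers with no non-linearity in between gives exactly one linear map W = W2·W1 with bias b = W2·b1 + b2 (sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2   # layer2(layer1(x)), no non-linearity

W, b = W2 @ W1, W2 @ b1 + b2          # the single equivalent linear map
one_layer = W @ x + b                 # identical output for every x
```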
2. DL as Feature Learning (NNs as Feature Learners)
•
Non linearities make NN’s more expressive
◦
XOR Problem
The XOR (exclusive OR) problem is a classic example that demonstrates why neural networks need non-linearity:
▪
XOR Truth Table:
•
Input: (0,0) → Output: 0
•
Input: (0,1) → Output: 1
•
Input: (1,0) → Output: 1
•
Input: (1,1) → Output: 0
▪
The problem: XOR is not linearly separable; you cannot draw a single straight line to separate the points where the output is 1 from those where it is 0
▪
Why 2-layer NN solves it:
•
A single-layer perceptron (linear classifier) cannot solve XOR
•
A 2-layer network with non-linear activation functions can learn to solve XOR
•
The hidden layer learns to transform the input space into a representation where the classes become linearly separable
•
Example architecture: 2 inputs → 2 hidden units (with sigmoid/tanh) → 1 output unit
This demonstrates that adding even one hidden layer with non-linearity dramatically increases the expressiveness of neural networks.
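The 2-2-1 architecture above can be checked with hand-picked weights (a step activation is used here for clarity; trained networks would use sigmoid or tanh): one hidden unit detects OR, the other detects AND, and the output fires on "OR but not AND", which is exactly XOR.

```python
import numpy as np

step = lambda t: (t > 0).astype(float)   # hard threshold stand-in for sigmoid/tanh

W1 = np.array([[1.0, 1.0],     # hidden unit 1: x1 OR x2
               [1.0, 1.0]])    # hidden unit 2: x1 AND x2
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])     # output: OR and not AND
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)      # hidden layer transforms the input space
    return step(W2 @ h + b2)   # classes are now linearly separable

outputs = [xor_net(np.array(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```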
◦
Modeling Negation with an MLP
3. DL as Assembling Building Blocks (NNs as a Set of Building Blocks)
•
Architecture: the arrangement of layers is called the architecture
•
In this view, DL is to design suitable neural architectures with various reusable building blocks
•
Power of DL: You can stack building blocks together any way you want
•
Building Blocks
1.
Linear Block
•
Input x: Vector of dimension d_in
•
Output y: Vector of dimension d_out
•
Formula: y = Wx + b
•
Params
◦
W: d_out × d_in matrix
◦
b: vector of dimension d_out
nn.Linear()
2.
Non-linearity Block
•
Input x: Any number/vector/matrix
•
Output y: Number/vector/matrix of same shape
•
Possible formulas
◦
Sigmoid / Tanh / ReLU, elementwise
•
Params: None
torch.sigmoid(), nn.functional.relu()
3.
Word Vector Block
•
Input w: A word
•
Output: A vector of length d
•
Formula: Return word_vecs[w]
•
Params:
◦
For each word w in vocab, there is a word vector param of shape (d,)
◦
|V| × d total params needed
4.
Mean Pooling Block
•
Input x: Matrix of shape (d, L)
•
Output y: Vector of dimension d
•
Formula: Average all the vecs along the “L” dim
torch.mean()
•
Why mean pooling?
◦
Documents have variable length
•
2 Layer MLP
◦
Linear Layer #1
◦
Non-Linearity
◦
Linear Layer #2
◦
Parameters of model = Total Params across all blocks
◦
A NN Text Classifier
▪
It’s a kind of Deep Averaging Network (DAN)
•
The DAN paper shows this outperforms methods like Naive Bayes on sentiment analysis
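Assembling the blocks above into the DAN-style classifier can be sketched in NumPy (the vocabulary, dimensions, and random weights are illustrative; a real implementation would use nn.Embedding and nn.Linear and train the parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {"i": 0, "really": 1, "like": 2, "spinach": 3}
d, d_hidden, K = 6, 4, 2                      # embed dim, hidden dim, # classes
word_vecs = rng.normal(size=(len(vocab), d))  # word vector block: |V| x d params
W1, b1 = rng.normal(size=(d_hidden, d)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(K, d_hidden)), np.zeros(K)

def classify(words):
    embs = word_vecs[[vocab[w] for w in words]]  # (L, d) embedding lookup
    x = embs.mean(axis=0)                        # mean pooling: any length L -> d
    h = np.tanh(W1 @ x + b1)                     # linear block #1 + non-linearity
    z = W2 @ h + b2                              # linear block #2 -> logits
    e = np.exp(z - z.max())
    return e / e.sum()                           # softmax over classes

p = classify(["i", "really", "like", "spinach"])
```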
Convolutional Neural Networks
What if we train a neural network to look at fixed-size windows?
•
Run that same network on every such window
•
Mean pool the resulting feature outputs
⇒ CNN
CNNs can create neural bag-of-n-grams features
•
Each intermediate feature vector can combine information from all words in the window
→ A convolution filter looks at a fixed-size window (e.g., 3 words) and produces one feature that summarizes their joint information.
→ This acts like a learned n-gram detector rather than a manually defined one.
•
Word vectors help us process unseen n-grams
◦
Why?
Traditional n-gram indicator features treat each exact word sequence as a completely independent feature. If the model has never seen a specific sequence during training, its feature value is zero. There is no notion of similarity between sequences.
However, in CNN-based models, each word is represented as a dense embedding vector. Semantically similar words (e.g., "like" and "love") are close to each other in embedding space.
When a convolution filter operates on embeddings, it does not depend on exact word identity. Instead, it computes a weighted combination of the embedding vectors within the window.
Therefore, if the model has seen "I really like spinach" and later encounters "I really love spinach"—even though the exact 3-gram "really love spinach" was never seen—the embeddings for "like" and "love" are similar. As a result, the convolution output will also be similar.
This allows the CNN to generalize beyond exact word sequences and detect patterns based on semantic similarity rather than exact matches.
In other words, CNNs with word embeddings learn soft, continuous n-gram features, whereas traditional n-gram models rely on hard, discrete matches. That is why word vectors help process unseen n-grams.
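The window-sliding idea can be sketched in NumPy: each filter spans one n-word window of embeddings, producing one feature per window, which are then mean-pooled into a fixed-size vector (all sizes and weights below are illustrative; in PyTorch this would be nn.Conv1d):

```python
import numpy as np

rng = np.random.default_rng(3)
L, d, n, F = 7, 6, 3, 5                 # sentence length, embed dim, window, filters
embs = rng.normal(size=(L, d))          # one embedding vector per word
filters = rng.normal(size=(F, n * d))   # each filter spans a whole n-word window

# Slide each filter over every window of n consecutive words.
windows = np.stack([embs[i:i + n].ravel() for i in range(L - n + 1)])  # (L-n+1, n*d)
features = np.maximum(0.0, windows @ filters.T)  # ReLU; learned n-gram features
pooled = features.mean(axis=0)                   # mean pooling -> fixed-size vector
```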
Optimization for DL
Stochastic Gradient Descent
•
In practice, partition the training set into batches:
◦
Desired batch size is another hyperparameter to tune
▪
Larger batch size = more accurate gradient, but slower
▪
Smaller batch size = faster, but may wander in suboptimal directions
•
SGD is most useful when training data is large, computing full gradient is expensive
•
In SGD, each parameter update is only approximately going towards the minimum
◦
But given enough time, you’ll end up in (almost) the same place
▪
+ each step is much faster
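The batching loop described above can be sketched as minibatch SGD on a toy linear-regression objective (the data, learning rate, and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)   # noisy targets

w, lr, batch_size = np.zeros(3), 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))               # shuffle each epoch
    for start in range(0, len(X), batch_size):  # partition into batches
        b = idx[start:start + batch_size]
        err = X[b] @ w - y[b]
        grad = X[b].T @ err / len(b)            # gradient on this batch only
        w -= lr * grad                          # approximate step toward the minimum
```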
Initialization
•
Neural network training objectives are non-convex, so the choice of optimization method really matters
◦
Different optimization techniques will converge to different local optima, some of which are much better than others
◦
Where you start determines what parameters you learn
•
Initial Approach: All-0’s init fails: every hidden unit starts identical (symmetry), receives identical gradients, and so never learns distinct features
•
How to Initialize neural networks
◦
Options
▪
He initialization: w ~ N(0, 2/n_in), variance scaled for ReLU
▪
Xavier initialization: w ~ N(0, 2/(n_in + n_out)) (uniform variant: U(±√(6/(n_in + n_out))))
▪
PyTorch default: w ~ U(−1/√n_in, 1/√n_in)
•
Uniform avoids large outliers!
▪
Usually you don’t tune these as hyperparameters, just use defaults
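The three schemes can be sketched directly from their standard formulas (assuming He draws from N(0, 2/n_in), Xavier from N(0, 2/(n_in + n_out)), and the PyTorch-style default uniformly from ±1/√n_in):

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_out = 400, 300

he      = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
xavier  = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))
bound   = 1.0 / np.sqrt(n_in)                        # PyTorch-style uniform bound
default = rng.uniform(-bound, bound, size=(n_out, n_in))
```

In PyTorch itself these correspond to nn.init.kaiming_normal_, nn.init.xavier_normal_, and the default nn.Linear reset.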
Importance of Learning Rate
•
For NN, learning rate matters a lot
•
Learning Rate Schedules: decay the learning rate over training (e.g., step decay or 1/t decay) so steps are large early for fast progress and small later for fine convergence
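One common schedule, step decay, is a one-liner (the drop factor and interval here are illustrative hyperparameters; PyTorch packages this as lr_scheduler.StepLR):

```python
def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the base rate by `drop` once per `every` epochs.
    return lr0 * drop ** (epoch // every)

rates = [step_decay(0.1, e) for e in (0, 9, 10, 25)]
```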
Two ways to improve SGD
•
Challenges for SGD: noisy per-batch gradients, local minima, and saddle points on non-convex loss surfaces
•
Improving it: momentum (accumulate a running average of past gradients) and adaptive per-parameter learning rates (e.g., Adam)
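A minimal sketch of SGD with momentum on an ill-conditioned quadratic “ravine” (the curvatures, learning rate, and momentum coefficient are illustrative; this is the heavy-ball form that PyTorch’s SGD optimizer also uses):

```python
import numpy as np

# Gradient of f(w) = 0.5 * w^T diag(1, 100) w, a narrow curved valley.
grad = lambda w: np.array([1.0, 100.0]) * w

w, v, lr, beta = np.array([1.0, 1.0]), np.zeros(2), 0.005, 0.9
for _ in range(300):
    v = beta * v + grad(w)   # momentum: running accumulation of gradients
    w = w - lr * v           # step along the smoothed direction
```

Momentum damps oscillation across the steep axis while accelerating along the shallow one, which plain SGD handles poorly.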
Computing Gradients
Gradients for all parameters are computed automatically via backpropagation (reverse-mode automatic differentiation)
NLP - Lecture Summary