Changyu Lee

Neural Nets

Published at
2026/02/22
Last edited time
2026/02/23 07:38
Created
2026/02/23 04:49
Section
NLP & Prompt Engineering
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
A practical overview of neural networks covers MLPs, activation functions, and CNNs. Key points include the importance of non-linear activation functions for learning complex patterns, the effectiveness of deeper networks over wider ones, and the role of SGD in optimization. Proper weight initialization and learning rate management are critical for training success, while CNNs leverage word embeddings for better generalization in feature learning.
Keywords
NLP
Language
ENG
Week
26-2-2
1 more property
2-layer neural networks consist of a hidden layer (multiple logistic regression classifiers) and a final layer (linear for binary, softmax for multi-class)
Non-linear activation functions (sigmoid, tanh, ReLU) are essential for networks to learn complex patterns; without them, stacked layers collapse into one linear equation
Deep networks (more layers) generally outperform wide networks (more units per layer) in practice
CNNs apply convolutional filters over fixed-size windows, creating learned n-gram features that generalize to unseen word sequences via word embeddings
SGD partitions training data into batches; larger batches yield more accurate gradients but slower updates, while smaller batches are faster but noisier
Proper weight initialization (He, Xavier, or PyTorch default uniform) prevents symmetry and ensures effective training
Learning rate is critical: too high causes divergence, too low leads to slow convergence; learning rate schedules can improve optimization
Momentum and adaptive learning rates (e.g., Adam) help SGD overcome local minima and navigate non-convex loss surfaces more effectively

Neural Nets

MLP

2-layer NN for Binary Classification
Hidden Layer = A bunch of Logistic regression classifiers
The logistic regression equation is:
P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}
where:
σ is the sigmoid function
w is the weight vector
x is the input feature vector
b is the bias term
Final Layer = linear Model
What about 2-layer NN for Multi-Class Classification?
Final Layer is changed to Softmax Regression (Only final layer changes when changing to a different task)
For multi-class classification, we use the softmax function to convert the output logits into probability distributions:
P(y=k|x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
where:
z_k = w_k^T x + b_k is the logit for class k
K is the total number of classes
The softmax function ensures that all probabilities sum to 1
The predicted class is: ŷ = argmax_k P(y=k|x)
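As a sketch, the softmax and argmax steps above in plain Python (the logits below are made-up toy values, not from the lecture):

```python
import math

def softmax(z):
    m = max(z)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]                 # made-up logits z_k for K = 3 classes
probs = softmax(logits)                  # probabilities summing to 1
pred = max(range(len(probs)), key=lambda k: probs[k])   # argmax over classes
```

Subtracting the max before exponentiating changes nothing mathematically but avoids overflow for large logits.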
In Matrix Form, Hidden Layer (Layer 1):
h = \sigma(W_1 x + b_1)
where:
h is the hidden layer activation vector
W_1 is the weight matrix for the first layer
x is the input feature vector
b_1 is the bias vector for the first layer
σ is the activation function (e.g., sigmoid, ReLU)
Output Layer (Layer 2):
z = W_2 h + b_2
where:
z is the output logit vector
W_2 is the weight matrix for the second layer
h is the hidden layer activation from Layer 1
b_2 is the bias vector for the second layer
Final Prediction:
For binary classification:
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
For multi-class classification:
P(y=k|x) = \text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
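The full two-layer forward pass can be sketched in plain Python; the weights, biases, and input below are made-up toy values, not from the lecture:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, b, x):
    # z = W x + b, with W stored as a list of rows
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Toy 3 -> 2 -> 1 network for binary classification (all values made up)
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0]]
b2 = [0.0]

x = [1.0, 2.0, 3.0]
h = [sigmoid(z) for z in linear(W1, b1, x)]   # h = sigma(W1 x + b1)
y_hat = sigmoid(linear(W2, b2, h)[0])         # y_hat = sigma(W2 h + b2)
```

For the multi-class case, only the last line changes: apply softmax to the vector `linear(W2, b2, h)` instead of sigmoid to a scalar.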
Multi Layer Perceptron
We can add more layers!
In practice, making networks deeper (adding more layers) often helps more than making them "wider" (more hidden units per layer)
Layers are fully connected: each neuron depends on every neuron in the previous layer

3 Views for DL

1. DL as Learnable Non-Linear Functions (NNs as Non-linear Functions)

NNs compute a non-linear function of input to make predictions
Non-linear Functions
Sigmoid
\sigma(x) = \frac{1}{1 + e^{-x}}
Tanh
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
ReLU
\text{ReLU}(x) = \max(0, x)
In practice: tanh and ReLU are often preferred
Tanh: Better than sigmoid because outputs centered around zero
ReLU: very fast to compute
Why non-linearity is important?
Without non-linear functions, the whole network collapses into one linear equation.
Having a simple non-linear function between the two linear operations enables us to learn a complex non-linear function
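A minimal sketch of this collapse: composing two 1-D linear "layers" yields another linear map, which is exactly what the activation functions prevent (all numbers below are toy values):

```python
import math

# Common activation functions, applied elementwise
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

# Without a non-linearity, two stacked 1-D "layers" collapse into one linear map
a1, b1 = 2.0, 1.0    # layer 1: x -> a1*x + b1  (toy values)
a2, b2 = 3.0, -1.0   # layer 2: x -> a2*x + b2
stacked = lambda x: a2 * (a1 * x + b1) + b2
collapsed = lambda x: (a2 * a1) * x + (a2 * b1 + b2)   # one equivalent linear map
```

Inserting `sigmoid` or `relu` between the two layers breaks this equivalence, which is what lets the stacked network represent non-linear functions.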

2. DL as Feature Learning (NNs as Feature Learners)

Non-linearities make NNs more expressive
XOR Problem
The XOR (exclusive OR) problem is a classic example that demonstrates why neural networks need non-linearity:
XOR Truth Table:
Input: (0,0) → Output: 0
Input: (0,1) → Output: 1
Input: (1,0) → Output: 1
Input: (1,1) → Output: 0
The problem: XOR is not linearly separable. You cannot draw a single straight line to separate the points where the output is 1 from those where it is 0
Why 2-layer NN solves it:
A single-layer perceptron (linear classifier) cannot solve XOR
A 2-layer network with non-linear activation functions can learn to solve XOR
The hidden layer learns to transform the input space into a representation where the classes become linearly separable
Example architecture: 2 inputs → 2 hidden units (with sigmoid/tanh) → 1 output unit
This demonstrates that adding even one hidden layer with non-linearity dramatically increases the expressiveness of neural networks.
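One concrete instance (hand-constructed for illustration, not from the lecture): a 2-2-1 network with ReLU hidden units and hand-picked weights that computes XOR exactly:

```python
def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    # Hand-picked weights for a 2 -> 2 -> 1 network with ReLU hidden units
    h1 = relu(x1 + x2)            # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)      # fires only when both inputs are on
    return h1 - 2.0 * h2          # outputs 0, 1, 1, 0 on the four XOR inputs
```

The hidden layer maps the four input points into a space where a single linear output unit can separate them, which is the transformation a trained network has to discover.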
Modeling Negation with an MLP

3. DL as Assembling Building Blocks (NNs as a Set of Building Blocks)

Architecture: the arrangement of layers is called the architecture
In this view, DL is about designing suitable neural architectures from reusable building blocks
Power of DL: You can stack building blocks together any way you want
Building Blocks
1.
Linear Block
Input x: Vector of dimension d_{in}
Output y: Vector of dimension d_{out}
Params
W: d_{out} × d_{in} matrix
b: d_{out} vector
nn.Linear()
2.
Non-linearity Block
Input x: Any number/vector/matrix
Output y: Number/vector/matrix of same shape
Possible formulas
Sigmoid / Tanh / ReLU, elementwise
Params: None
torch.sigmoid(), nn.functional.relu()
3.
Word Vector Block
Input w: A word
Output: A vector of length d
Formula: Return word_vecs[w]
Params:
For each word w in vocab, there is a word vector param v_w of shape d
|V| × d total params needed
4.
Mean Pooling Block
Input x: Matrix of shape (d, L)
Output y: Tensor of dimension d
Formula: Average all the vecs along the “L” dim
torch.mean()
Why mean pooling?
Documents have variable length
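A minimal mean-pooling sketch over a toy "document" of word vectors (values made up): however many vectors come in, a fixed-size vector comes out.

```python
def mean_pool(vectors):
    # Average L word vectors of dimension d into one d-dimensional vector
    L, d = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / L for i in range(d)]

doc = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3 toy word vectors, d = 2
pooled = mean_pool(doc)                        # one vector of dimension 2
```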
2 Layer MLP
Linear Layer #1
Non-Linearity
Linear Layer #2
Parameters of model = Total Params across all blocks
A NN Text Classifier
It’s a kind of Deep Averaging Network (DAN)
Paper shows this outperforms methods like Naive Bayes on sentiment analysis
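Putting the blocks together, a minimal pure-Python DAN sketch for binary classification; the vocabulary, word vectors, and weights below are all made-up toy values:

```python
import math

# Made-up vocabulary with 2-D word vectors (word vector block)
word_vecs = {"i": [0.1, 0.2], "really": [0.0, 0.5], "like": [0.9, 0.3]}

def mean_pool(vecs):
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def linear(W, b, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def dan(words, W1, b1, W2, b2):
    pooled = mean_pool([word_vecs[w] for w in words])    # word vector + mean pooling blocks
    h = [max(0.0, z) for z in linear(W1, b1, pooled)]    # linear block #1 + ReLU
    z = linear(W2, b2, h)[0]                             # linear block #2
    return 1.0 / (1.0 + math.exp(-z))                    # sigmoid for a binary label

score = dan(["i", "really", "like"],
            W1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
            W2=[[1.0, 1.0]], b2=[0.0])
```

The model's parameters are exactly the union of the parameters of its blocks: the word vectors plus (W1, b1, W2, b2).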

Convolutional Neural Networks

What if we train a neural network to look at fixed-size windows?
Run that same network on every such window
Mean pool the resulting feature outputs
⇒ CNN
CNNs can create neural bag-of-n-grams features
Each intermediate feature vector can combine information from all words in the window:
A convolution filter looks at a fixed-size window (e.g., 3 words) and produces one feature that summarizes their joint information.
This acts like a learned n-gram detector rather than a manually defined one.
Word vectors help us process unseen n-grams
Why?
Traditional n-gram indicator features treat each exact word sequence as a completely independent feature. If the model has never seen a specific sequence during training, its feature value is zero. There is no notion of similarity between sequences.
However, in CNN-based models, each word is represented as a dense embedding vector. Semantically similar words (e.g., "like" and "love") are close to each other in embedding space.
When a convolution filter operates on embeddings, it does not depend on exact word identity. Instead, it computes a weighted combination of the embedding vectors within the window.
Therefore, if the model has seen "I really like spinach" and later encounters "I really love spinach"—even though the exact 3-gram "really love spinach" was never seen—the embeddings for "like" and "love" are similar. As a result, the convolution output will also be similar.
This allows the CNN to generalize beyond exact word sequences and detect patterns based on semantic similarity rather than exact matches.
In other words, CNNs with word embeddings learn soft, continuous n-gram features, whereas traditional n-gram models rely on hard, discrete matches. That is why word vectors help process unseen n-grams.
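A toy sketch of this effect: one convolution filter applied to 3-word windows of made-up 2-D embeddings, where "like" and "love" are deliberately given nearby vectors, so the filter output for the unseen n-gram stays close to the seen one:

```python
# Toy 2-D embeddings; "like" and "love" are deliberately close in embedding space
emb = {"really": [0.2, 0.1], "like": [0.9, 0.8],
       "love": [0.85, 0.82], "spinach": [0.3, 0.4]}

# One convolution filter over a 3-word window = a dot product with the
# concatenation of the three embeddings (filter weights are made up)
w = [0.5, 1.0, 0.5, 1.0, 0.5, 1.0]

def conv_feature(words):
    window = [x for word in words for x in emb[word]]
    return sum(wi * xi for wi, xi in zip(w, window))

f_seen = conv_feature(["really", "like", "spinach"])
f_unseen = conv_feature(["really", "love", "spinach"])  # n-gram never seen in training
```

A discrete n-gram indicator feature would score the unseen trigram as zero; here the two filter outputs differ only by the small gap between the "like" and "love" embeddings.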

Optimization for DL

Stochastic Gradient Descent

In practice, partition the training set into batches
Desired batch size is another hyperparameter to tune
Larger batch size = more accurate gradient, but slower
Smaller batch size = faster, but may wander in suboptimal directions
SGD is most useful when training data is large, computing full gradient is expensive
In SGD, each parameter update is only approximately going towards the minimum
But given enough time, you’ll end up in (almost) the same place
+ each step is much faster
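The batching step can be sketched as follows (shuffle once per epoch, then slice; the data here is a toy list):

```python
import random

def minibatches(data, batch_size, seed=0):
    # Shuffle indices once (one "epoch"), then yield consecutive slices as batches
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

data = list(range(10))
batches = list(minibatches(data, batch_size=4))   # batch sizes 4, 4, 2
```

Each SGD update then computes the gradient on one batch only, which is why a step is much cheaper than a full-gradient step.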

Initialization

Neural networks are non-convex, so the choice of optimization method really matters
Different optimization techniques will converge to different local optima, some of which are much better than others
Where you start determines what parameters you learn
Initial Approach: All-0's init. This fails: with identical weights, every hidden unit computes the same output and receives the same gradient, so symmetry is never broken
How to Initialize neural networks
Options
He initialization: W \sim \mathcal{N}(0, \frac{2}{d_{in}})
Xavier initialization: W \sim \mathcal{N}(0, \frac{1}{d_{in}})
PyTorch default: W \sim \mathcal{U}(-\frac{1}{\sqrt{d_{in}}}, \frac{1}{\sqrt{d_{in}}})
Uniform avoids large outliers!
Usually you don’t tune these as hyperparameters, just use defaults
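Two of the options can be sketched in plain Python; note the He formula above gives the variance, so the sampling standard deviation is its square root (the layer sizes below are arbitrary):

```python
import math
import random

def he_normal(d_in, d_out, seed=0):
    # He: W ~ N(0, 2/d_in); 2/d_in is the variance, so std = sqrt(2/d_in)
    rng = random.Random(seed)
    std = math.sqrt(2.0 / d_in)
    return [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]

def default_uniform(d_in, d_out, seed=0):
    # PyTorch-style default: W ~ U(-1/sqrt(d_in), +1/sqrt(d_in)); no large outliers
    rng = random.Random(seed)
    bound = 1.0 / math.sqrt(d_in)
    return [[rng.uniform(-bound, bound) for _ in range(d_in)] for _ in range(d_out)]

W_he = he_normal(d_in=100, d_out=4)
W_u = default_uniform(d_in=100, d_out=4)
```

Random (rather than constant) values are what break the symmetry that all-zeros initialization suffers from.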

Importance of Learning Rate

For NN, learning rate matters a lot
Learning Rate Schedules
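One simple, commonly used schedule is step decay; this sketch halves the learning rate every fixed number of epochs (the drop factor and interval are arbitrary example choices, not from the lecture):

```python
def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the initial rate by `drop` once per `every` epochs
    return lr0 * (drop ** (epoch // every))
```

Starting with a larger rate and decaying it lets training make fast early progress without diverging near the end.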

Two ways to improve SGD

Challenges for SGD: non-convex loss surfaces with local minima, plateaus, and noisy per-batch gradients
Improving it: momentum, and adaptive learning rates (e.g., Adam)

Computing Gradients
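As a companion sketch for gradient computation, a finite-difference check on a toy one-parameter squared-error loss (all values made up); this is a standard way to verify an analytic/backprop gradient:

```python
def loss(w):
    # Toy squared-error loss for a 1-parameter model y_hat = w*x on one example
    x, y = 2.0, 3.0
    return (w * x - y) ** 2

def numeric_grad(f, w, eps=1e-6):
    # Central finite difference: approximates df/dw
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

g = numeric_grad(loss, 1.0)   # analytic gradient: 2*x*(w*x - y) = -4 at w = 1
```

Backpropagation computes the same derivative exactly and efficiently; the numeric version is only a sanity check.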
