•
2-layer neural networks consist of a hidden layer (multiple logistic regression classifiers) and a final layer (linear for binary, softmax for multi-class)
•
Non-linear activation functions (sigmoid, tanh, ReLU) are essential for networks to learn complex patterns; without them, stacked layers collapse into one linear equation
•
Deep networks (more layers) generally outperform wide networks (more units per layer) in practice
•
CNNs apply convolutional filters over fixed-size windows, creating learned n-gram features that generalize to unseen word sequences via word embeddings
•
SGD partitions training data into batches; larger batches yield more accurate gradients but slower updates, while smaller batches are faster but noisier
•
Proper weight initialization (He, Xavier, or PyTorch default uniform) prevents symmetry and ensures effective training
•
Learning rate is critical: too high causes divergence, too low leads to slow convergence; learning rate schedules can improve optimization
•
Momentum and adaptive learning rates (e.g., Adam) help SGD overcome local minima and navigate non-convex loss surfaces more effectively
Neural Nets
MLP
•
2-layer NN for Binary Classification
◦
Hidden Layer = A bunch of Logistic regression classifiers
The logistic regression equation is: P(y=1|x) = σ(w·x + b)
where:
▪
σ is the sigmoid function
▪
w is the weight vector
▪
x is the input feature vector
▪
b is the bias term
◦
Final Layer = linear Model
•
What about 2-layer NN for Multi-Class Classification?
◦
Final Layer is changed to Softmax Regression (Only final layer changes when changing to a different task)
For multi-class classification, we use the softmax function to convert the output logits into probability distributions: P(y=k|x) = exp(z_k) / Σ_{j=1}^{K} exp(z_j)
where:
▪
z_k = w_k^T x + b_k is the logit for class k
▪
K is the total number of classes
▪
The softmax function ensures that all probabilities sum to 1
The predicted class is: ŷ = argmax_k P(y=k|x)
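The softmax computation above can be sketched in NumPy (a minimal illustration; the class count and logit values are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # logits z_k for K = 3 classes
p = softmax(z)                   # probabilities sum to 1
y_hat = int(np.argmax(p))        # predicted class: argmax_k P(y=k|x)
```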
In Matrix Form, Hidden Layer (Layer 1): h = σ(W_1 x + b_1)
where:
▪
h is the hidden layer activation vector
▪
W_1 is the weight matrix for the first layer
▪
x is the input feature vector
▪
b_1 is the bias vector for the first layer
▪
σ is the activation function (e.g., sigmoid, ReLU)
Output Layer (Layer 2): z = W_2 h + b_2
where:
▪
z is the output logit vector
▪
W_2 is the weight matrix for the second layer
▪
h is the hidden layer activation from Layer 1
▪
b_2 is the bias vector for the second layer
Final Prediction:
For binary classification: P(y=1|x) = σ(z); predict 1 if σ(z) > 0.5
For multi-class classification: P(y=k|x) = softmax(z)_k; ŷ = argmax_k P(y=k|x)
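The matrix-form forward pass above can be run end to end in NumPy (a minimal sketch; the layer sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

d_in, d_hidden, K = 4, 8, 3                       # illustrative sizes
W1, b1 = rng.normal(size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(K, d_hidden)), np.zeros(K)

x = rng.normal(size=d_in)        # input feature vector
h = sigmoid(W1 @ x + b1)         # hidden layer: h = sigma(W1 x + b1)
z = W2 @ h + b2                  # output logits: z = W2 h + b2
p = np.exp(z - z.max())          # softmax over the logits
p /= p.sum()
```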
•
Multi Layer Perceptron
We can add more layers!
◦
In practice, making networks deeper (adding more layers) often helps more than making them “wider” (more hidden units in each layer)
◦
Layers are fully connected as each neuron depends on every neuron in previous layer
3 Views for DL
1. DL as Learnable Non-Linear Functions (NNs as Non-linear Functions)
•
NNs compute a non-linear function of input to make predictions
◦
Non-linear Functions
▪
Sigmoid
▪
Tanh
▪
ReLU
▪
In practice: tanh and ReLU often preferred
•
Tanh: Better than sigmoid because outputs centered around zero
•
ReLU: very fast to compute
•
Why non-linearity is important?
◦
Without non-linear functions, the whole network collapses into one linear equation (a composition of linear maps is itself linear).
◦
Having a simple non-linear function between the two linear operations enables us to learn a complex non-linear function
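This collapse is easy to verify numerically: composing two linear layers with no non-linearity in between gives exactly one linear map W = W2·W1 with bias b = W2·b1 + b2 (sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2   # layer2(layer1(x)), no non-linearity

W, b = W2 @ W1, W2 @ b1 + b2          # the single equivalent linear map
one_layer = W @ x + b                 # identical output for every x
```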
2. DL as Feature Learning (NNs as Feature Learners)
•
Non linearities make NN’s more expressive
◦
XOR Problem
The XOR (exclusive OR) problem is a classic example that demonstrates why neural networks need non-linearity:
▪
XOR Truth Table:
•
Input: (0,0) → Output: 0
•
Input: (0,1) → Output: 1
•
Input: (1,0) → Output: 1
•
Input: (1,1) → Output: 0
▪
The problem: XOR is not linearly separable; you cannot draw a single straight line to separate the points where the output is 1 from those where it is 0
▪
Why 2-layer NN solves it:
•
A single-layer perceptron (linear classifier) cannot solve XOR
•
A 2-layer network with non-linear activation functions can learn to solve XOR
•
The hidden layer learns to transform the input space into a representation where the classes become linearly separable
•
Example architecture: 2 inputs → 2 hidden units (with sigmoid/tanh) → 1 output unit
This demonstrates that adding even one hidden layer with non-linearity dramatically increases the expressiveness of neural networks.
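The 2-2-1 architecture above can be checked with hand-picked weights (a step activation is used here for clarity; trained networks would use sigmoid or tanh): one hidden unit detects OR, the other detects AND, and the output fires on "OR but not AND", which is exactly XOR.

```python
import numpy as np

step = lambda t: (t > 0).astype(float)   # hard threshold stand-in for sigmoid/tanh

W1 = np.array([[1.0, 1.0],     # hidden unit 1: x1 OR x2
               [1.0, 1.0]])    # hidden unit 2: x1 AND x2
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])     # output: OR and not AND
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)      # hidden layer transforms the input space
    return step(W2 @ h + b2)   # classes are now linearly separable

outputs = [xor_net(np.array(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```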
◦
Modeling Negation with an MLP
3. DL as Assembling Building Blocks (NNs as a Set of Building Blocks)
•
Architecture: the arrangement of layers is called the architecture
•
In this view, DL is to design suitable neural architectures with various reusable building blocks
•
Power of DL: You can stack building blocks together any way you want
•
Building Blocks
1.
Linear Block
•
Input x: Vector of dimension d_in
•
Output y: Vector of dimension d_out
•
Formula: y = Wx + b
•
Params
◦
W: d_out × d_in matrix
◦
b: vector of dimension d_out
nn.Linear()
2.
Non-linearity Block
•
Input x: Any number/vector/matrix
•
Output y: Number/vector/matrix of same shape
•
Possible formulas
◦
Sigmoid / Tanh / ReLU, elementwise
•
Params: None
torch.sigmoid(), nn.functional.relu()
3.
Word Vector Block
•
Input w: A word
•
Output: A vector of length d
•
Formula: Return word_vecs[w]
•
Params:
◦
For each word w in vocab, there is a word vector param of shape (d,)
◦
|V| × d total params needed
4.
Mean Pooling Block
•
Input x: Matrix of shape (d, L)
•
Output y: Vector of dimension d
•
Formula: Average all the vecs along the “L” dim
torch.mean()
•
Why mean pooling?
◦
Documents have variable length
•
2 Layer MLP
◦
Linear Layer #1
◦
Non-Linearity
◦
Linear Layer #2
◦
Parameters of model = Total Params across all blocks
◦
A NN Text Classifier
▪
It’s a kind of Deep Averaging Network (DAN)
•
The DAN paper shows this outperforms methods like Naive Bayes on sentiment analysis
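Assembling the blocks above into the DAN-style classifier can be sketched in NumPy (the vocabulary, dimensions, and random weights are illustrative; a real implementation would use nn.Embedding and nn.Linear and train the parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {"i": 0, "really": 1, "like": 2, "spinach": 3}
d, d_hidden, K = 6, 4, 2                      # embed dim, hidden dim, # classes
word_vecs = rng.normal(size=(len(vocab), d))  # word vector block: |V| x d params
W1, b1 = rng.normal(size=(d_hidden, d)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(K, d_hidden)), np.zeros(K)

def classify(words):
    embs = word_vecs[[vocab[w] for w in words]]  # (L, d) embedding lookup
    x = embs.mean(axis=0)                        # mean pooling: any length L -> d
    h = np.tanh(W1 @ x + b1)                     # linear block #1 + non-linearity
    z = W2 @ h + b2                              # linear block #2 -> logits
    e = np.exp(z - z.max())
    return e / e.sum()                           # softmax over classes

p = classify(["i", "really", "like", "spinach"])
```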
Convolutional Neural Networks
What if we train a neural network to look at fixed-size windows?
•
Run that same network on every such window
•
Mean pool the resulting feature outputs
⇒ CNN
CNNs can create neural bag-of-n-grams features
•
Each intermediate feature vector can combine information from all words in the window
→ A convolution filter looks at a fixed-size window (e.g., 3 words) and produces one feature that summarizes their joint information.
→ This acts like a learned n-gram detector rather than a manually defined one.
•
Word vectors help us process unseen n-grams
◦
Why?
Traditional n-gram indicator features treat each exact word sequence as a completely independent feature. If the model has never seen a specific sequence during training, its feature value is zero. There is no notion of similarity between sequences.
However, in CNN-based models, each word is represented as a dense embedding vector. Semantically similar words (e.g., "like" and "love") are close to each other in embedding space.
When a convolution filter operates on embeddings, it does not depend on exact word identity. Instead, it computes a weighted combination of the embedding vectors within the window.
Therefore, if the model has seen "I really like spinach" and later encounters "I really love spinach"—even though the exact 3-gram "really love spinach" was never seen—the embeddings for "like" and "love" are similar. As a result, the convolution output will also be similar.
This allows the CNN to generalize beyond exact word sequences and detect patterns based on semantic similarity rather than exact matches.
In other words, CNNs with word embeddings learn soft, continuous n-gram features, whereas traditional n-gram models rely on hard, discrete matches. That is why word vectors help process unseen n-grams.
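The window-sliding idea can be sketched in NumPy: each filter spans one n-word window of embeddings, producing one feature per window, which are then mean-pooled into a fixed-size vector (all sizes and weights below are illustrative; in PyTorch this would be nn.Conv1d):

```python
import numpy as np

rng = np.random.default_rng(3)
L, d, n, F = 7, 6, 3, 5                 # sentence length, embed dim, window, filters
embs = rng.normal(size=(L, d))          # one embedding vector per word
filters = rng.normal(size=(F, n * d))   # each filter spans a whole n-word window

# Slide each filter over every window of n consecutive words.
windows = np.stack([embs[i:i + n].ravel() for i in range(L - n + 1)])  # (L-n+1, n*d)
features = np.maximum(0.0, windows @ filters.T)  # ReLU; learned n-gram features
pooled = features.mean(axis=0)                   # mean pooling -> fixed-size vector
```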
Optimization for DL
Stochastic Gradient Descent
•
In practice, partition the training set into batches:
◦
Desired batch size is another hyperparameter to tune
▪
Larger batch size = more accurate gradient, but slower
▪
Smaller batch size = faster, but may wander in suboptimal directions
•
SGD is most useful when training data is large, computing full gradient is expensive
•
In SGD, each parameter update is only approximately going towards the minimum
◦
But given enough time, you’ll end up in (almost) the same place
▪
+ each step is much faster
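The batching loop described above can be sketched as minibatch SGD on a toy linear-regression objective (the data, learning rate, and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)   # noisy targets

w, lr, batch_size = np.zeros(3), 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))               # shuffle each epoch
    for start in range(0, len(X), batch_size):  # partition into batches
        b = idx[start:start + batch_size]
        err = X[b] @ w - y[b]
        grad = X[b].T @ err / len(b)            # gradient on this batch only
        w -= lr * grad                          # approximate step toward the minimum
```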
Initialization
•
Neural network training objectives are non-convex, so the choice of optimization method really matters
◦
Different optimization techniques will converge to different local optima, some of which are much better than others
◦
Where you start determines what parameters you learn
•
Initial Approach: All-0’s init fails: every hidden unit starts identical (symmetry), receives identical gradients, and so never learns distinct features
•
How to Initialize neural networks
◦
Options
▪
He initialization: w ~ N(0, 2/n_in), variance scaled for ReLU
▪
Xavier initialization: w ~ N(0, 2/(n_in + n_out)) (uniform variant: U(±√(6/(n_in + n_out))))
▪
PyTorch default: w ~ U(−1/√n_in, 1/√n_in)
•
Uniform avoids large outliers!
▪
Usually you don’t tune these as hyperparameters, just use defaults
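The three schemes can be sketched directly from their standard formulas (assuming He draws from N(0, 2/n_in), Xavier from N(0, 2/(n_in + n_out)), and the PyTorch-style default uniformly from ±1/√n_in):

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_out = 400, 300

he      = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
xavier  = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))
bound   = 1.0 / np.sqrt(n_in)                        # PyTorch-style uniform bound
default = rng.uniform(-bound, bound, size=(n_out, n_in))
```

In PyTorch itself these correspond to nn.init.kaiming_normal_, nn.init.xavier_normal_, and the default nn.Linear reset.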
Importance of Learning Rate
•
For NN, learning rate matters a lot
•
Learning Rate Schedules: decay the learning rate over training (e.g., step decay or 1/t decay) so steps are large early for fast progress and small later for fine convergence
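One common schedule, step decay, is a one-liner (the drop factor and interval here are illustrative hyperparameters; PyTorch packages this as lr_scheduler.StepLR):

```python
def step_decay(lr0, epoch, drop=0.5, every=10):
    # Multiply the base rate by `drop` once per `every` epochs.
    return lr0 * drop ** (epoch // every)

rates = [step_decay(0.1, e) for e in (0, 9, 10, 25)]
```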
Two ways to improve SGD
•
Challenges for SGD: noisy per-batch gradients, local minima, and saddle points on non-convex loss surfaces
•
Improving it: momentum (accumulate a running average of past gradients) and adaptive per-parameter learning rates (e.g., Adam)
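A minimal sketch of SGD with momentum on an ill-conditioned quadratic “ravine” (the curvatures, learning rate, and momentum coefficient are illustrative; this is the heavy-ball form that PyTorch’s SGD optimizer also uses):

```python
import numpy as np

# Gradient of f(w) = 0.5 * w^T diag(1, 100) w, a narrow curved valley.
grad = lambda w: np.array([1.0, 100.0]) * w

w, v, lr, beta = np.array([1.0, 1.0]), np.zeros(2), 0.005, 0.9
for _ in range(300):
    v = beta * v + grad(w)   # momentum: running accumulation of gradients
    w = w - lr * v           # step along the smoothed direction
```

Momentum damps oscillation across the steep axis while accelerating along the shallow one, which plain SGD handles poorly.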
Computing Gradients
Gradients for all parameters are computed automatically via backpropagation (reverse-mode automatic differentiation)
NLP - Lecture Summary