Recurrent Neural Networks for Sequence Labeling

Published at

2026/02/22

Last edited time

2026/02/23 22:36

Created

2026/02/23 07:41

Section

NLP & Prompt Engineering

Status

Done

Series

From the Bottom

Structured Prediction & Sequence Labeling

Structured Prediction

•

Y consists of multiple components Y = {y1, y2, y3}

•

(Strong) correlations between output components

◦

e.g. words in one sentence

•

Exponential output space

•

Types of Sequence Labeling

◦

Part-of-Speech Tagging

◦

Named Entity Recognition

◦

Information Extraction
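
To make the "exponential output space" point concrete, here is a minimal sketch that counts the possible tag sequences for a sentence. It assumes every word can take any of K tags, so the count is K to the power of the sentence length; the function name is illustrative, and 45 is the Penn Treebank tagset size cited in these notes.

```python
def output_space_size(num_tags: int, sentence_length: int) -> int:
    """Number of distinct tag sequences when each word can take any tag."""
    return num_tags ** sentence_length

# 45 tags, sentences of increasing length
for length in (1, 5, 10, 20):
    print(length, output_space_size(45, length))
```

Even a 10-word sentence already has far more candidate labelings than could be enumerated one by one, which is why structured models rely on independence assumptions and dynamic programming.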

Part-of-Speech (POS) Tagging

•

Word classes or syntactic categories

•

Reveal useful information about the syntactic role of a word

◦

Different words have different syntactic functions

◦

It is a disambiguation task

▪

Each word might have different senses/functions

◦

The Penn Treebank tagset defines 45 tags for English

•

POS tagging has not been solved yet

•

In POS tagging, we can observe that

The function (or POS) of a word depends on its context

Certain POS combinations are extremely unlikely

Better to make predictions on entire sentences instead of individual words

•

Do we need structured models if the feature representations of the input sentence are perfect?

◦

In theory, no. If the feature representation of each word perfectly captures all contextual information in the sentence, we could predict each tag independently. In that case, modeling the joint probability of the tag sequence would not be necessary.

◦

However, in practice, a word’s correct tag often depends not only on the input words but also on neighboring tags. For example, certain POS tags are more likely to follow specific tags (e.g., a determiner is often followed by a noun).

◦

Therefore, we cannot safely assume that tag predictions are fully independent.

•

So, we need structured models that

◦

explicitly model dependencies between output tags,

◦

capture global consistency across the entire sequence,

◦

and learn transition patterns such as $P(y_j \mid y_{j-1})$

•

One feature vector for each word
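
As a rough illustration of the transition patterns P(y_j | y_{j-1}) mentioned above, the sketch below estimates them by counting tag bigrams. The toy tag sequences and the function name are hypothetical, and plain relative-frequency counting is only one simple way to obtain such probabilities.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of tag sequences (not from the lecture).
tagged_sentences = [
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN", "VBZ"],
    ["NN", "VBZ", "DT", "NN"],
]

def estimate_transitions(sequences):
    """Relative-frequency estimate of P(y_j | y_{j-1}) from tag bigram counts."""
    bigram_counts = defaultdict(Counter)
    for tags in sequences:
        for prev_tag, tag in zip(tags, tags[1:]):
            bigram_counts[prev_tag][tag] += 1
    return {
        prev_tag: {tag: count / sum(counts.values()) for tag, count in counts.items()}
        for prev_tag, counts in bigram_counts.items()
    }

transitions = estimate_transitions(tagged_sentences)
print(transitions["DT"])  # distribution over tags that follow a determiner
```

In this toy data a determiner is never followed by another determiner, which is the kind of "extremely unlikely combination" a structured model can exploit.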

Recurrent Neural Network (RNN)

•

Problems of Simple RNN

No future contexts

•

Future information is important for sequence labeling tasks

Inefficient (Sequential Computation)

Hard to train → Vanishing/Exploding Gradients

•

Why is this not a serious problem for multi-layer FFN/CNN?

No repeated weight multiplication: In RNNs, the same weight matrix is multiplied repeatedly across many time steps, which can exponentially amplify or diminish gradients. In FFNs/CNNs, each layer has different weights, so this compounding effect doesn't occur in the same way.

Skip connections and normalization: Modern architectures like ResNet use skip connections and batch normalization, which help gradients flow more smoothly through deep networks.

Independent layer computations: Each layer's computation is independent of previous time steps, making the gradient flow more stable compared to the temporal dependencies in RNNs.

Limited size of hidden states (Memory Cost)
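
The repeated-weight-multiplication argument above can be checked numerically. This is a deliberately simplified sketch: it assumes a scalar, linear recurrence h_t = w * h_{t-1} and ignores the nonlinearity and the input, so the gradient of the last state with respect to the first is just w multiplied by itself once per time step.

```python
def gradient_through_time(weight: float, num_steps: int) -> float:
    """d h_T / d h_0 for the scalar linear recurrence h_t = weight * h_{t-1}."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= weight  # the same weight is applied at every time step
    return grad

for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_time(w, 50))
```

A weight slightly below 1 shrinks the gradient toward zero over 50 steps, while a weight slightly above 1 blows it up; only a weight of exactly 1 keeps it stable.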

Bidirectional RNNs (Solving Prob#1)
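
A minimal sketch of the bidirectional idea, assuming a scalar tanh RNN with hand-picked weights (all values and function names here are hypothetical): one RNN reads the sentence left to right, a second reads it right to left, and the two states are paired at each position so that every word sees both past and future context.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Scalar tanh RNN; returns the hidden state at every position."""
    hidden, states = 0.0, []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

def run_birnn(inputs, fwd_params, bwd_params):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = run_rnn(inputs, *fwd_params)
    backward = run_rnn(inputs[::-1], *bwd_params)[::-1]  # re-align to input order
    return list(zip(forward, backward))

inputs = [1.0, -0.5, 0.25]  # hypothetical scalar "word features"
states = run_birnn(inputs, fwd_params=(0.8, 0.5), bwd_params=(0.6, -0.4))
for position, (fwd, bwd) in enumerate(states):
    print(position, round(fwd, 3), round(bwd, 3))
```

The forward state at position 0 depends only on the first input, whereas the backward state at position 0 depends on the whole sentence, which is exactly the future context a unidirectional RNN lacks.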

Advanced RNN Variants (Solving Prob#3)

LSTMs for Sequence Labeling

LSTM: Long Short-Term Memory

•

Key idea of LSTM

◦

LSTM introduces a separate cell state that acts as long-term memory.

◦

Information can flow through this cell state with minimal modification.

◦

Gates control what to forget, what to add, and what to output.

•

Input gate: controls how much new information is written into memory. It decides which parts of the candidate vector should be stored.

◦

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

•

Forget gate: controls how much of the previous memory should be kept. If the value is close to 0, information is forgotten. If it is close to 1, information is preserved.

◦

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

•

Candidate vector: a temporary vector containing new information computed from the current input and previous hidden state. It is combined with the input gate before being added to memory.

◦

$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$

•

Cell state update: the new cell state is formed by keeping part of the old memory (controlled by the forget gate) and adding selected new information (controlled by the input gate).

◦

$C_t = f_t \odot C_{t-1} + i_t \odot g_t$

•

Output gate: controls how much of the internal memory is exposed as the hidden state. Not all stored information needs to be shown at every time step.

◦

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

•

Final hidden state: the hidden state is a filtered version of the cell state. It is passed to the next time step or next layer.

◦

$h_t = o_t \odot \tanh(C_t)$

•

Why LSTM works better

◦

The additive memory update helps reduce the vanishing gradient problem.

◦

The gating mechanism allows the model to selectively remember important information. This enables learning long-range dependencies in text, speech, and time-series data.

•

For Sequence Labeling, a simple bidirectional LSTM model is not good enough
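
The gate equations in this section can be traced in a single LSTM step. The sketch below is written in plain Python for readability rather than speed; the parameter layout (a dict mapping each gate name to a `(W, U, b)` triple) and the constant toy weights are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: input, forget, candidate, cell update, output, hidden."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, x_t, U, h_prev, b)]

    i_t = gate("i", sigmoid)    # input gate
    f_t = gate("f", sigmoid)    # forget gate
    g_t = gate("g", math.tanh)  # candidate vector
    o_t = gate("o", sigmoid)    # output gate
    # Additive cell update: keep part of the old memory, add gated new content.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def constant_params(input_dim, hidden_dim, weight, bias):
    """Hypothetical toy parameters: every weight equal, every bias equal."""
    W = [[weight] * input_dim for _ in range(hidden_dim)]
    U = [[weight] * hidden_dim for _ in range(hidden_dim)]
    return W, U, [bias] * hidden_dim

params = {name: constant_params(2, 2, 0.1, 0.0) for name in "ifgo"}
h_t, c_t = lstm_step([1.0, -1.0], [0.0, 0.0], [0.5, -0.5], params)
print(h_t, c_t)
```

With these toy values every pre-activation is zero, so each sigmoid gate equals 0.5 and the candidate is 0: the new cell state is simply half of the old one, which makes the role of the forget gate easy to see.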

Bidirectional LSTMs + CNNs for Sequence Labeling

BiLSTM-CNNs-CRF for Sequence Labeling

•

BiLSTM + Char-level CNN is not strong enough → How about combining structured models with LSTM?

•

Log-Linear Models

◦

What are features in log-linear models?

However, there is a feature sparsity problem. As we add richer features (bigrams, word–tag pairs, etc.), the feature space explodes.

▪

Most features are never observed

▪

Most feature vectors are extremely sparse

▪

We don’t have enough data to estimate all weights reliably

◦

We need independence assumptions to compute the numerator

◦

The denominator (normalizer) sums over all possible outputs.

▪

To compute that denominator efficiently, we need independence assumptions.
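
To show what the numerator and denominator look like in practice, here is a minimal log-linear classifier for a single word. The feature templates, weights, and tag list are hypothetical; the point is only that the score is a weighted sum of sparse features and that the normalizer sums over every candidate output.

```python
import math

def log_linear_probs(weights, feature_fn, x, outputs):
    """p(y | x) = exp(w . f(x, y)) / sum over y' of exp(w . f(x, y'))."""
    scores = {
        y: sum(weights.get(name, 0.0) * value for name, value in feature_fn(x, y).items())
        for y in outputs
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - max_score) for y, s in scores.items()}
    normalizer = sum(exp_scores.values())  # the denominator sums over all outputs
    return {y: value / normalizer for y, value in exp_scores.items()}

def features(word, tag):
    """Hypothetical sparse indicator features for a (word, tag) pair."""
    return {
        f"word={word.lower()},tag={tag}": 1.0,
        f"suffix={word[-2:]},tag={tag}": 1.0,
        f"capitalized,tag={tag}": 1.0 if word[0].isupper() else 0.0,
    }

# Most possible features have no learned weight, reflecting feature sparsity.
weights = {"word=book,tag=NN": 1.2, "word=book,tag=VB": 0.7, "suffix=ok,tag=NN": 0.3}
probs = log_linear_probs(weights, features, "book", ["NN", "VB", "JJ"])
print(probs)
```

Here the output set has only three tags, so the sum is trivial. When the output is a whole tag sequence, the same sum ranges over exponentially many sequences, which is where the independence assumptions come in.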

Maximum Entropy Markov Models (MEMMs)

•

For the independence assumption, MEMMs use the Markov property (Markov assumption)

In the denominator, we sum only over the possible tags for the current position j, so normalization is a local computation.

If the tag set has K tags, each local denominator costs O(K)

•

MEMMs are flexible in how features can be combined
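
The local normalization described above can be sketched as follows. The tag set, the two feature types (a transition indicator and a word-tag indicator), and all weights are hypothetical; a real MEMM would use much richer features, but the normalizer would still range over only the K tags at the current position.

```python
import math
from itertools import product

TAGS = ["DT", "NN", "VB"]  # hypothetical tag set, K = 3

def local_score(prev_tag, tag, word, weights):
    """w . f(y_{j-1}, y_j, x_j) using two indicator features."""
    return weights.get(("trans", prev_tag, tag), 0.0) + weights.get(("emit", word, tag), 0.0)

def memm_local_probs(prev_tag, word, weights):
    """P(y_j | y_{j-1}, x_j): the normalizer sums over the K tags at position j only."""
    exp_scores = {tag: math.exp(local_score(prev_tag, tag, word, weights)) for tag in TAGS}
    normalizer = sum(exp_scores.values())  # O(K) terms
    return {tag: value / normalizer for tag, value in exp_scores.items()}

def memm_sequence_prob(words, tags, weights, start="<s>"):
    """Probability of a tag sequence as a product of locally normalized factors."""
    prob, prev_tag = 1.0, start
    for word, tag in zip(words, tags):
        prob *= memm_local_probs(prev_tag, word, weights)[tag]
        prev_tag = tag
    return prob

weights = {
    ("trans", "<s>", "DT"): 1.0, ("trans", "DT", "NN"): 1.5, ("trans", "NN", "VB"): 1.0,
    ("emit", "the", "DT"): 2.0, ("emit", "dog", "NN"): 2.0, ("emit", "barks", "VB"): 1.0,
}
words = ["the", "dog", "barks"]
print(memm_sequence_prob(words, ["DT", "NN", "VB"], weights))

# Sanity check: the probabilities of all K^n tag sequences sum to 1.
total = sum(memm_sequence_prob(words, tags, weights) for tags in product(TAGS, repeat=len(words)))
print(round(total, 6))
```

Because every factor is normalized on its own, the sequence distribution is valid without ever summing over whole sequences during training.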

Conditional Random Fields (CRFs)

•

A CRF is a globally normalized model

◦

CRFs make a weaker independence assumption than MEMMs

•

How to compute?

◦

Viterbi Algorithm

•

BiLSTM-CNN-CRF
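
A minimal sketch of decoding and global normalization in a linear-chain CRF. It assumes per-position emission scores (which, in a BiLSTM-based tagger, would come from the network) and a tag-to-tag transition score matrix; the toy numbers are hypothetical. The notes name only Viterbi; using the forward algorithm for the normalizer is a standard companion technique added here as an assumption.

```python
import math
from itertools import product

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(emissions, transitions, tags):
    """Global score: sum of emission and transition scores along one tag path."""
    score = emissions[0][tags[0]]
    for j in range(1, len(tags)):
        score += transitions[tags[j - 1]][tags[j]] + emissions[j][tags[j]]
    return score

def viterbi(emissions, transitions):
    """Highest-scoring tag path in O(n * K^2) via dynamic programming."""
    num_tags = len(transitions)
    best = list(emissions[0])  # best[k]: best score of any path ending in tag k
    backpointers = []
    for j in range(1, len(emissions)):
        new_best, pointers = [], []
        for k in range(num_tags):
            prev = max(range(num_tags), key=lambda p: best[p] + transitions[p][k])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][k] + emissions[j][k])
        best = new_best
        backpointers.append(pointers)
    last = max(range(num_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):  # follow backpointers to the start
        path.append(pointers[path[-1]])
    return path[::-1], best[last]

def log_partition(emissions, transitions):
    """log Z over all K^n paths, computed with the forward algorithm."""
    num_tags = len(transitions)
    alpha = list(emissions[0])
    for j in range(1, len(emissions)):
        alpha = [
            emissions[j][k] + log_sum_exp([alpha[p] + transitions[p][k] for p in range(num_tags)])
            for k in range(num_tags)
        ]
    return log_sum_exp(alpha)

# Hypothetical scores: 3 positions, 2 tags.
emissions = [[2.0, 0.5], [0.3, 1.0], [1.5, 1.4]]
transitions = [[0.1, 1.0], [0.8, -0.5]]
path, best_score = viterbi(emissions, transitions)
log_z = log_partition(emissions, transitions)
print(path, best_score, math.exp(best_score - log_z))  # best path and its probability
```

Both routines visit each position once and consider every pair of adjacent tags, so the global sum over exponentially many sequences never has to be enumerated explicitly.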
