•
Word vectors represent words as numerical vectors that capture semantic meaning
•
Similar words have similar vectors based on the distributional hypothesis—words appearing in similar contexts have similar meanings
•
Word2Vec learns dense word vectors by predicting co-occurrence patterns within sliding windows, using positive examples (real co-occurrences) and negative samples
•
The model optimizes both word vectors (features) and context vectors (classifier weights) simultaneously through gradient descent
•
Word vectors enable solving analogies through vector arithmetic (e.g., king - man + woman ≈ queen) and measuring similarity via cosine similarity
•
For text classification, variable-length documents can be converted to fixed-size vectors using mean pooling, though this loses word order and phrase information
•
Key limitations include inability to handle polysemy (multiple word meanings) and word sense disambiguation without broader sentence context
•
Word vectors can reflect biases present in training data, capturing historical correlations rather than purely semantic relationships
Word Vectors
•
Definition: A vector that represents a word's meaning
◦
Similar words should have similar vectors
◦
Different components of the vector may represent different properties of a word
•
Word vectors can be used as features for NLP models
•
To feed neural nets sentences, need to represent each word as a vector
Another View: A Word Vector Layer as a Function
•
A parametric function from vocabulary words to vectors
•
Input: A word
•
Output: A vector of length d
•
Formula: Return word_vecs[w]
•
Parameters:
◦
For each word, a word vector of shape (d,)
◦
Total parameters: |V| × d (vocabulary size × vector dimension)
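As a sketch, the layer above is just a lookup table; the toy vocabulary, dimension d = 4, and random initialization below are all illustrative:

```python
import numpy as np

# Hypothetical toy vocabulary and vector dimension.
vocab = ["the", "cat", "sat", "on", "mat"]
d = 4
rng = np.random.default_rng(0)

# One vector of shape (d,) per word: |V| x d = 20 parameters in total.
word_vecs = {w: rng.normal(size=d) for w in vocab}

def embed(w):
    """The word-vector layer as a function: word -> vector of length d."""
    return word_vecs[w]
```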
Lexical Semantics
•
Word vectors should capture lexical semantics (word-level meaning)
◦
Synonymy or Antonymy
▪
Example: "happy" vs "joyful" (synonyms), "hot" vs "cold" (antonyms)
◦
Hypernymy / Hyponymy
▪
Example: "animal" (hypernym) vs "dog" (hyponym)
◦
Similarity
▪
Example: "king" and "queen" are similar in meaning
◦
Various features
▪
Sentiment
•
Example: "excellent" (positive) vs "terrible" (negative)
▪
Formality
•
Example: "greetings" (formal) vs "hey" (informal)
•
The Distributional Hypothesis (from linguistics)
→ Words appearing in similar contexts have similar meanings.
Counting Co-occurrences to Create Word Vectors
For each word, count how many times it co-occurs with every other word
•
Define co-occurrence in terms of a sliding window of words
•
Store all counts in a big table
→ Result: Counts of each word with each possible neighbor word
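The counting procedure above can be sketched as follows (the example sentence is made up):

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each word co-occurs with each neighbor
    inside a sliding window of +/- `window` tokens."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)
```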
Cosine Similarity
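Cosine similarity compares two vectors by the angle between them, ignoring their lengths. A minimal implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos of the angle between a and b: a.b / (||a|| ||b||).
    1 means same direction, 0 orthogonal, -1 opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```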
•
Dense vs. Sparse Vectors
Vectors built by counting co-occurrences are:
1.
Quite long: one dimension per vocabulary word, far more than needed to capture lexical semantics
2.
Quite sparse: many entries will be 0
→ We need to find lower-dimensional dense vectors that capture word meaning
Word2Vec
•
Learning Dense word vectors
◦
Idea: the vector for word w should help you predict which words co-occur with w
▪
Captures distribution of context words for w
•
Creating a Word2Vec Dataset
To create a dataset for Word2Vec from a raw text corpus, count real co-occurrences within a sliding window to get positive examples, and randomly sample word pairs to get negative ones.
◦
Why is it a fake supervised learning problem?
The Word2Vec model frames learning word vectors as a supervised learning task, but it's "fake" because we're not actually interested in predicting whether words co-occur—that's just a means to an end. The real goal is to learn useful word representations (embeddings) as a byproduct of training on this artificial task. In other words, the prediction task itself is not our true objective; it's simply a clever way to force the model to learn semantic relationships between words.
◦
How about Sampling Negatives?
◦
Baseline: sample negatives according to the frequency p(w) of each word in the data
◦
Improvement: sample according to the α-weighted frequency p(w)^α / Σ_{w'} p(w')^α, where p(w) is the frequency of word w in the data and α is typically set to 0.75
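The α-weighted distribution can be sketched as below; the word counts are made up, and the point is that flattening with α = 0.75 shifts probability mass toward rare words:

```python
import numpy as np

def negative_sampling_dist(counts, alpha=0.75):
    """Turn raw word counts into the alpha-weighted negative-sampling
    distribution p(w)^alpha / sum_w' p(w')^alpha."""
    raw = np.array(list(counts.values()), dtype=float)
    p = raw / raw.sum()
    weighted = p ** alpha
    return dict(zip(counts, weighted / weighted.sum()))

# Made-up counts: "the" is common, "aardvark" is rare.
dist = negative_sampling_dist({"the": 1000, "cat": 10, "aardvark": 1})
```

With α = 1 this reduces to sampling by raw frequency; with α = 0.75 a rare word like "aardvark" gets a larger share of the mass than its raw frequency (here roughly 0.005 instead of 0.001).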
Word2Vec Model
•
2 sets of parameters
1.
Word vector for each word
•
features - the actual word vectors
2.
Context vector for each word
•
classifier weights for the task corresponding to w as a context word
•
Goal: the word vector can be used by a linear classifier to do any of the N "was this a context word?" tasks
•
Objective (looks just like logistic regression): for a (word, context) pair, maximize log σ(c · w) for positive examples and log σ(−c · w) for negative samples
•
Training word2vec
◦
Using Gradient Descent same as logistic regression
Is this a convex problem? No
◦
We optimize weights (context vectors) and features (word vectors) at the same time!
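A sketch of one such gradient-descent step, assuming the logistic (log-loss) objective; `sgns_step` and the learning rate are illustrative, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w_vec, c_vec, label, lr=0.1):
    """One gradient-descent step on the logistic loss for a single
    (word, context) pair; label is 1 for a real co-occurrence and
    0 for a negative sample. BOTH the word vector (the features)
    and the context vector (the classifier weights) move, which is
    why the problem is not convex."""
    p = sigmoid(np.dot(w_vec, c_vec))
    g = p - label  # derivative of the log loss w.r.t. the score
    return w_vec - lr * g * c_vec, c_vec - lr * g * w_vec
```

Repeated positive-pair updates push the predicted co-occurrence probability σ(w · c) upward, pulling the two vectors together.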
Using Word Vectors
Solving Analogies
In vector space, an analogy resembles a parallelogram
•
Same relationship between apple and tree holds between grape and vine
To solve an analogy a : b :: c : ?, find the word in the vocabulary whose vector is most similar to v_b − v_a + v_c
•
Common choice: Cosine Similarity
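A sketch of analogy solving via vector arithmetic plus cosine similarity; the toy 2-d vectors are contrived so that the parallelogram is exact:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(word_vecs, a, b, c):
    """a : b :: c : ?  Return the vocabulary word whose vector is
    closest (by cosine) to v_b - v_a + v_c, excluding a, b, c."""
    target = word_vecs[b] - word_vecs[a] + word_vecs[c]
    candidates = {w: v for w, v in word_vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Contrived 2-d vectors forming an exact parallelogram.
vecs = {"man": np.array([1.0, 0.0]), "king": np.array([1.0, 1.0]),
        "woman": np.array([2.0, 0.0]), "queen": np.array([2.0, 1.0]),
        "apple": np.array([-1.0, 2.0])}
answer = solve_analogy(vecs, "man", "king", "woman")
```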
Bias in word vectors
•
word2vec doesn't know the difference between a semantic relationship and a historical correlation
◦
e.g. “nurse” may co-occur more with “she” than with “he” in available data, but this is not a semantic relationship
Word Vectors and Classification
•
Can we use word vectors for text classification?
◦
Classifier expects a fixed-size feature vector, learns fixed number of weights
◦
Why do word vectors help?
◦
How do we deal with variable number of words per document?
1.
Mean Pooling
•
Final result has the same dimension as a single word vector
•
Note: Each word in a shorter sentence now implicitly has higher “importance”
This means that when using mean pooling to average word vectors across a document, words in shorter sentences effectively contribute more to the final averaged vector compared to words in longer sentences. This is because each word vector is weighted equally in the averaging process, so in a sentence with fewer words, each individual word has a proportionally larger impact on the mean. For example, in a 3-word sentence, each word contributes 1/3 to the average, while in a 10-word sentence, each word only contributes 1/10.
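A sketch of mean pooling with made-up 2-d vectors, illustrating both the fixed output size and the higher per-word weight in shorter documents:

```python
import numpy as np

def mean_pool(word_vecs, tokens):
    """Average word vectors into a single fixed-size document vector
    (same dimension as one word vector, regardless of length)."""
    return np.mean([word_vecs[t] for t in tokens], axis=0)

# Made-up 2-d word vectors.
vecs = {"good": np.array([1.0, 0.0]),
        "movie": np.array([0.0, 1.0]),
        "very": np.array([0.0, 0.0])}

short = mean_pool(vecs, ["good", "movie"])            # "good" weighs 1/2
longer = mean_pool(vecs, ["very", "good", "movie"])   # "good" weighs 1/3
```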
•
Remaining Issues
1.
Still a bag-of-words model (no word order, phrase information)
2.
Polysemy: Words can have multiple possible meanings depending on context (e.g. bat)
3.
Word sense disambiguation
•
This requires looking at the broader sentence-level context
•
How to improve?
◦
Need to model interactions between words
◦
Need to model order of words
Conclusion
NLP - Lecture Summary