Changyu Lee

Word Vector

Published at
2026/02/20
Last edited time
2026/02/23 07:38
Created
2026/02/21 00:25
Section
NLP & Prompt Engineering
Status
Done
Series
From the Bottom
Tags
Lecture Summary
AI summary
Word vectors represent words as numerical vectors capturing semantic meaning, with similar words having similar vectors. Word2Vec learns these vectors by predicting co-occurrence patterns, optimizing both word and context vectors through gradient descent. Key applications include solving analogies, measuring similarity via cosine similarity, and converting variable-length documents into fixed-size vectors using mean pooling. Limitations include challenges with polysemy and biases in training data, which can reflect historical correlations rather than true semantic relationships.
Keywords
NLP
Language
ENG
Week
26-2-1
Word vectors represent words as numerical vectors that capture semantic meaning
Similar words have similar vectors based on the distributional hypothesis—words appearing in similar contexts have similar meanings
Word2Vec learns dense word vectors by predicting co-occurrence patterns within sliding windows, using positive examples (real co-occurrences) and negative samples
The model optimizes both word vectors (features) and context vectors (classifier weights) simultaneously through gradient descent
Word vectors enable solving analogies through vector arithmetic (e.g., king - man + woman ≈ queen) and measuring similarity via cosine similarity
For text classification, variable-length documents can be converted to fixed-size vectors using mean pooling, though this loses word order and phrase information
Key limitations include inability to handle polysemy (multiple word meanings) and word sense disambiguation without broader sentence context
Word vectors can reflect biases present in training data, capturing historical correlations rather than purely semantic relationships

Word Vectors

Definition: A vector that represents a word's meaning
Similar words should have similar vectors
Different components of the vector may represent different properties of a word
Word vectors can be used as features for NLP models
To feed sentences to neural nets, each word needs to be represented as a vector

Another View: A Word Vector Layer as a Function

A parametric function from vocabulary words to vectors
Input: A word
Output: A vector of length d
Formula: Return word_vecs[w]
Parameters:
For each word w, a word vector v_w of shape d
|V| * d total parameters needed
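This lookup-table view can be sketched in a few lines of Python; the toy vocabulary and dimension d = 4 here are illustrative assumptions, not from the lecture:

```python
import random

# Toy vocabulary; in practice |V| is tens of thousands of words.
vocab = ["the", "cat", "sat", "on", "mat"]
d = 4  # vector length

random.seed(0)
# |V| * d parameters: one length-d vector per word.
word_vecs = {w: [random.gauss(0, 1) for _ in range(d)] for w in vocab}

def embed(w):
    """Input: a word. Output: its vector of length d."""
    return word_vecs[w]

print(len(embed("cat")))  # 4
```

The "function" is just a table lookup, but because the entries are trained parameters, the layer behaves like any other learned layer in a network.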

Lexical Semantics

Word vectors should capture lexical semantics (word-level meaning)
Synonymy or Antonymy
Example: "happy" vs "joyful" (synonyms), "hot" vs "cold" (antonyms)
Hypernymy / Hyponymy
Example: "animal" (hypernym) vs "dog" (hyponym)
Similarity
Example: "king" and "queen" are similar in meaning
Various features
Sentiment
Example: "excellent" (positive) vs "terrible" (negative)
Formality
Example: "greetings" (formal) vs "hey" (informal)
The Distributional Hypothesis (from linguistics)
→ Words appearing in similar contexts have similar meanings.

Counting Co-occurrences to Create Word Vec

For each word, count how many times it co-occurs with every other word
Define co-occurrence in terms of a sliding window of words
Store all counts in a big table
→ Result: Counts of each word with each possible neighbor word
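The counting procedure above can be sketched as follows; the window size of 2 and the toy sentence are illustrative assumptions:

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """For each word, count how often every other word appears
    within +/- `window` positions of it (the sliding window)."""
    counts = defaultdict(Counter)
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[center][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens)
```

Each row of the resulting table is one word's (sparse) count vector over all possible neighbor words.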
Cosine Similarity
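A minimal implementation of cosine similarity, which scores two vectors by the angle between them rather than by their magnitudes:

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = u.v / (|u| |v|): 1 for same direction, 0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```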
Dense vs. Sparse Vectors
Vectors built by counting co-occurrences are:
1. Quite long: too many dimensions to capture lexical semantics
2. Quite sparse: many entries will be 0
→ We need to find lower-dimensional dense vectors that capture word meaning

Word2Vec

Learning Dense word vectors
Idea: v_w should help you predict which words co-occur with w
Captures distribution of context words for w
Creating a Word2Vec Dataset
To create a dataset for Word2Vec from a raw text corpus, count real co-occurrences within a sliding window as positive examples and randomly sample words as negative ones.
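A sketch of this dataset construction; the window size of 2 and the choice of k = 2 negatives per positive are illustrative assumptions:

```python
import random

def make_word2vec_examples(tokens, vocab, window=2, k=2, seed=0):
    """Positive examples: (center, context, 1) pairs found in a sliding
    window. Negatives: (center, random word, 0), k per positive."""
    rng = random.Random(seed)
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            examples.append((center, tokens[j], 1))  # real co-occurrence
            for _ in range(k):                       # sampled negatives
                examples.append((center, rng.choice(vocab), 0))
    return examples

tokens = "the cat sat on the mat".split()
examples = make_word2vec_examples(tokens, sorted(set(tokens)))
```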
Why is it a fake supervised learning problem?
The Word2Vec model frames learning word vectors as a supervised learning task, but it's "fake" because we're not actually interested in predicting whether words co-occur—that's just a means to an end. The real goal is to learn useful word representations (embeddings) as a byproduct of training on this artificial task. In other words, the prediction task itself is not our true objective; it's simply a clever way to force the model to learn semantic relationships between words.
How about Sampling Negatives?
Baseline: sample according to the frequency p(w) of each word in the data
Improvement: sample according to the α-weighted frequency
p_\alpha(w) = \frac{f(w)^\alpha}{\sum_{w'} f(w')^\alpha}
where f(w) is the frequency of word w in the data, and α is typically set to 0.75
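The α-weighted distribution can be computed directly from token counts; the toy corpus below is an illustrative assumption:

```python
from collections import Counter

def negative_sampling_probs(tokens, alpha=0.75):
    """p_alpha(w) = f(w)**alpha / sum over w' of f(w')**alpha.
    alpha < 1 flattens the distribution, up-weighting rare words."""
    freq = Counter(tokens)
    weights = {w: count ** alpha for w, count in freq.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

probs = negative_sampling_probs("the cat sat on the mat".split())
```

With α = 0.75, a word twice as frequent gets less than twice the sampling probability, so very common words are sampled relatively less often.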

Word2Vec Model

2 parameters:
1. A word vector for each word
features - the actual word vectors
2. A context vector for each word
classifier weights for the task corresponding to w as context
Goal: v_w can be used by a linear classifier to do any of the N "was this a context word?" tasks
Objective (looks just like logistic regression)
Training Word2Vec
Use gradient descent, the same as for logistic regression
Is this a convex problem? No
We optimize weights and features at the same time!
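One such gradient step can be sketched as below, assuming the standard logistic (negative-sampling) loss; the toy vectors and learning rate are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgd_step(word_vecs, ctx_vecs, w, c, label, lr=0.1):
    """One logistic-regression-style update for a (word, context, label)
    example. Both the word vector (the features) and the context vector
    (the classifier weights) are moved along the gradient."""
    v, u = word_vecs[w], ctx_vecs[c]
    p = sigmoid(sum(a * b for a, b in zip(v, u)))  # P(real pair | w, c)
    g = p - label  # gradient of the log loss w.r.t. the score v.u
    for i in range(len(v)):
        dv, du = g * u[i], g * v[i]  # compute both before mutating
        v[i] -= lr * dv
        u[i] -= lr * du
    return p

# Repeated updates on a positive pair push its probability up.
wv = {"cat": [0.1, -0.2]}
cv = {"sat": [0.05, 0.1]}
p_before = sgd_step(wv, cv, "cat", "sat", 1)
p_after = sgd_step(wv, cv, "cat", "sat", 1)
```

Because both v and u are updated on every example, the objective is non-convex even though each single classifier looks like logistic regression.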

Using Word Vectors

Solving Analogies

In vector space, an analogy resembles a parallelogram
The same relationship between apple and tree holds between grape and vine
To solve an analogy, find the word in the vocabulary whose vector v_w is most similar to the target vector (e.g., v_king - v_man + v_woman)
Common choice: cosine similarity
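The parallelogram idea can be sketched with hand-made 2-d vectors; these are illustrative toys, not learned embeddings:

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def most_similar(target, word_vecs, exclude=()):
    """Return the word whose vector is most cosine-similar to `target`,
    skipping the query words themselves."""
    return max((w for w in word_vecs if w not in exclude),
               key=lambda w: cos(word_vecs[w], target))

# Toy 2-d vectors: dim 0 ~ "royalty", dim 1 ~ "gender" (illustrative only).
vecs = {"king": [1.0, 1.0], "queen": [1.0, -1.0],
        "man": [0.0, 1.0], "woman": [0.0, -1.0]}
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]
answer = most_similar(target, vecs, exclude={"king", "man", "woman"})
print(answer)  # queen
```

Excluding the query words is standard practice, since the nearest neighbor of the target vector is often one of the inputs themselves.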

Bias in word vectors

word2vec doesn't know the difference between a semantic relationship and a historical correlation
e.g. "nurse" may co-occur more with "she" than "he" in the available data, but this is not a semantic relationship

Word Vectors and Classification

Can we use word vectors for text classification?
Classifier expects a fixed-size feature vector, learns fixed number of weights
Why do word vectors help?
How do we deal with variable number of words per document?
1. Mean Pooling
The final result has the same dimension as a single word vector
Note: each word in a shorter sentence now implicitly has higher "importance"
Because every word vector is weighted equally in the average, words in shorter documents contribute proportionally more to the final vector: in a 3-word sentence each word contributes 1/3 of the mean, while in a 10-word sentence each word contributes only 1/10.
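A minimal mean-pooling sketch; the toy 2-d vectors are illustrative:

```python
def mean_pool(word_vecs, tokens):
    """Average a document's word vectors into one fixed-size vector
    with the same dimension d as a single word vector."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    d = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(d)]

# Toy 2-d vectors (illustrative).
wv = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
doc_vec = mean_pool(wv, ["good", "movie"])  # [0.5, 0.5]
```

Whatever the document length, the output has dimension d, so a classifier with a fixed number of weights can consume it.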
Remaining Issues
1. Still a bag-of-words model (no word order or phrase information)
2. Polysemy: words can have multiple meanings depending on context (e.g. "bat")
3. Word sense disambiguation: this requires looking at the broader sentence-level context
How to improve?
Need to model interactions between words
Need to model order of words

Conclusion
