•
Word vectors represent words as numerical vectors that capture semantic meaning
•
Similar words have similar vectors based on the distributional hypothesis—words appearing in similar contexts have similar meanings
•
Word2Vec learns dense word vectors by predicting co-occurrence patterns within sliding windows, using positive examples (real co-occurrences) and negative samples
•
The model optimizes both word vectors (features) and context vectors (classifier weights) simultaneously through gradient descent
•
Word vectors enable solving analogies through vector arithmetic (e.g., king - man + woman ≈ queen) and measuring similarity via cosine similarity
•
For text classification, variable-length documents can be converted to fixed-size vectors using mean pooling, though this loses word order and phrase information
•
Key limitations include inability to handle polysemy (multiple word meanings) and word sense disambiguation without broader sentence context
•
Word vectors can reflect biases present in training data, capturing historical correlations rather than purely semantic relationships
Word Vectors
•
Definition: A vector that represents a word's meaning
◦
Similar words should have similar vectors
◦
Different components of the vector may represent different properties of a word
•
Word vectors can be used as features for NLP models
•
To feed neural nets sentences, need to represent each word as a vector
Another View: A Word Vector Layer as a Function
•
A parametric function from vocabulary words to vectors
•
Input: A word
•
Output: A vector of length d
•
Formula: Return word_vecs[w]
•
Parameters:
◦
For each word, a word vector of shape (d,)
◦
Total parameters: |V| × d (vocabulary size × vector dimension)
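As a sketch, the layer above is just a lookup table; the toy vocabulary, dimension d = 4, and random initialization below are all illustrative:

```python
import numpy as np

# Hypothetical toy vocabulary and vector dimension.
vocab = ["the", "cat", "sat", "on", "mat"]
d = 4
rng = np.random.default_rng(0)

# One vector of shape (d,) per word: |V| x d = 20 parameters in total.
word_vecs = {w: rng.normal(size=d) for w in vocab}

def embed(w):
    """The word-vector layer as a function: word -> vector of length d."""
    return word_vecs[w]
```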
Lexical Semantics
•
Word vectors should capture lexical semantics (word-level meaning)
◦
Synonymy or Antonymy
▪
Example: "happy" vs "joyful" (synonyms), "hot" vs "cold" (antonyms)
◦
Hypernymy / Hyponymy
▪
Example: "animal" (hypernym) vs "dog" (hyponym)
◦
Similarity
▪
Example: "king" and "queen" are similar in meaning
◦
Various features
▪
Sentiment
•
Example: "excellent" (positive) vs "terrible" (negative)
▪
Formality
•
Example: "greetings" (formal) vs "hey" (informal)
•
The Distributional Hypothesis (from linguistics)
→ Words appearing in similar contexts have similar meanings.
Counting Co-occurrences to Create Word Vectors
For each word, count how many times it co-occurs with every other word
•
Define co-occurrence in terms of a sliding window of words
•
Store all counts in a big table
→ Result: Counts of each word with each possible neighbor word
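The counting procedure above can be sketched as follows (the example sentence is made up):

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each word co-occurs with each neighbor
    inside a sliding window of +/- `window` tokens."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)
```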
Cosine Similarity
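Cosine similarity compares two vectors by the angle between them, ignoring their lengths. A minimal implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos of the angle between a and b: a.b / (||a|| ||b||).
    1 means same direction, 0 orthogonal, -1 opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```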
•
Dense vs. Sparse Vectors
Vectors built by counting co-occurrences are:
1.
Quite long: one dimension per vocabulary word, far more than needed to capture lexical semantics
2.
Quite sparse: many entries will be 0
→ We need to find lower-dimensional dense vectors that capture word meaning
Word2Vec
•
Learning Dense word vectors
◦
Idea: the vector for word w should help you predict which words co-occur with w
▪
Captures distribution of context words for w
•
Creating a Word2Vec Dataset
To create a dataset for Word2Vec from a raw text corpus, count real co-occurrences within a sliding window to get positive examples, and randomly sample word pairs to get negative ones.
◦
Why is it a fake supervised learning problem?
The Word2Vec model frames learning word vectors as a supervised learning task, but it's "fake" because we're not actually interested in predicting whether words co-occur—that's just a means to an end. The real goal is to learn useful word representations (embeddings) as a byproduct of training on this artificial task. In other words, the prediction task itself is not our true objective; it's simply a clever way to force the model to learn semantic relationships between words.
◦
How about Sampling Negatives?
◦
Baseline: sample negatives according to the frequency p(w) of each word in the data
◦
Improvement: sample according to the α-weighted frequency p(w)^α / Σ_{w'} p(w')^α, where p(w) is the frequency of word w in the data and α is typically set to 0.75
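The α-weighted distribution can be sketched as below; the word counts are made up, and the point is that flattening with α = 0.75 shifts probability mass toward rare words:

```python
import numpy as np

def negative_sampling_dist(counts, alpha=0.75):
    """Turn raw word counts into the alpha-weighted negative-sampling
    distribution p(w)^alpha / sum_w' p(w')^alpha."""
    raw = np.array(list(counts.values()), dtype=float)
    p = raw / raw.sum()
    weighted = p ** alpha
    return dict(zip(counts, weighted / weighted.sum()))

# Made-up counts: "the" is common, "aardvark" is rare.
dist = negative_sampling_dist({"the": 1000, "cat": 10, "aardvark": 1})
```

With α = 1 this reduces to sampling by raw frequency; with α = 0.75 a rare word like "aardvark" gets a larger share of the mass than its raw frequency (here roughly 0.005 instead of 0.001).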
Word2Vec Model
•
2 sets of parameters
1.
Word vector for each word
•
features - the actual word vectors
2.
Context vector for each word
•
classifier weights for the task corresponding to w as a context word
•
Goal: the word vector can be used by a linear classifier to do any of the N "was this a context word?" tasks
•
Objective (looks just like logistic regression): for a (word, context) pair, maximize log σ(c · w) for positive examples and log σ(−c · w) for negative samples
•
Training word2vec
◦
Using Gradient Descent same as logistic regression
Is this a convex problem? No
◦
We optimize weights (context vectors) and features (word vectors) at the same time!
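A sketch of one such gradient-descent step, assuming the logistic (log-loss) objective; `sgns_step` and the learning rate are illustrative, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w_vec, c_vec, label, lr=0.1):
    """One gradient-descent step on the logistic loss for a single
    (word, context) pair; label is 1 for a real co-occurrence and
    0 for a negative sample. BOTH the word vector (the features)
    and the context vector (the classifier weights) move, which is
    why the problem is not convex."""
    p = sigmoid(np.dot(w_vec, c_vec))
    g = p - label  # derivative of the log loss w.r.t. the score
    return w_vec - lr * g * c_vec, c_vec - lr * g * w_vec
```

Repeated positive-pair updates push the predicted co-occurrence probability σ(w · c) upward, pulling the two vectors together.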
Using Word Vectors
Solving Analogies
In vector space, an analogy resembles a parallelogram
•
Same relationship between apple and tree holds between grape and vine
To solve an analogy a : b :: c : ?, find the word in the vocabulary whose vector is most similar to v_b − v_a + v_c
•
Common choice: Cosine Similarity
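A sketch of analogy solving via vector arithmetic plus cosine similarity; the toy 2-d vectors are contrived so that the parallelogram is exact:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(word_vecs, a, b, c):
    """a : b :: c : ?  Return the vocabulary word whose vector is
    closest (by cosine) to v_b - v_a + v_c, excluding a, b, c."""
    target = word_vecs[b] - word_vecs[a] + word_vecs[c]
    candidates = {w: v for w, v in word_vecs.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Contrived 2-d vectors forming an exact parallelogram.
vecs = {"man": np.array([1.0, 0.0]), "king": np.array([1.0, 1.0]),
        "woman": np.array([2.0, 0.0]), "queen": np.array([2.0, 1.0]),
        "apple": np.array([-1.0, 2.0])}
answer = solve_analogy(vecs, "man", "king", "woman")
```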
Bias in word vectors
•
word2vec doesn't know the difference between a semantic relationship and a historical correlation
◦
e.g. “nurse” may co-occur more with “she” than with “he” in available data, but this is not a semantic relationship
Word Vectors and Classification
•
Can we use word vectors for text classification?
◦
Classifier expects a fixed-size feature vector, learns fixed number of weights
◦
Why do word vectors help?
◦
How do we deal with variable number of words per document?
1.
Mean Pooling
•
Final result has the same dimension as a single word vector
•
Note: Each word in a shorter sentence now implicitly has higher “importance”
This means that when using mean pooling to average word vectors across a document, words in shorter sentences effectively contribute more to the final averaged vector compared to words in longer sentences. This is because each word vector is weighted equally in the averaging process, so in a sentence with fewer words, each individual word has a proportionally larger impact on the mean. For example, in a 3-word sentence, each word contributes 1/3 to the average, while in a 10-word sentence, each word only contributes 1/10.
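A sketch of mean pooling with made-up 2-d vectors, illustrating both the fixed output size and the higher per-word weight in shorter documents:

```python
import numpy as np

def mean_pool(word_vecs, tokens):
    """Average word vectors into a single fixed-size document vector
    (same dimension as one word vector, regardless of length)."""
    return np.mean([word_vecs[t] for t in tokens], axis=0)

# Made-up 2-d word vectors.
vecs = {"good": np.array([1.0, 0.0]),
        "movie": np.array([0.0, 1.0]),
        "very": np.array([0.0, 0.0])}

short = mean_pool(vecs, ["good", "movie"])            # "good" weighs 1/2
longer = mean_pool(vecs, ["very", "good", "movie"])   # "good" weighs 1/3
```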
•
Remaining Issues
1.
Still a bag-of-words model (no word order, phrase information)
2.
Polysemy: Words can have multiple possible meanings depending on context (e.g. bat)
3.
Word sense disambiguation
•
This requires looking at the broader sentence-level context
•
How to improve?
◦
Need to model interactions between words
◦
Need to model order of words
Conclusion
NLP - Lecture Summary