Word Embeddings

I started with NLP a while back, and I will keep updating this as I learn more or correct misunderstandings:

There are broadly two types of word embeddings:
1) Prediction based word embeddings
2) Count based word embeddings

Word2vec is an example of a prediction based model. GloVe, on the other hand, is a count based word embedding model which uses co-occurrence matrices to accomplish the same tasks.
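
To make the count based idea concrete, here is a minimal sketch of the raw co-occurrence counting that such models start from (this is not GloVe itself, just the counting step); the toy corpus, window size, and variable names are all assumptions for illustration.

```python
# Minimal co-occurrence counting sketch (illustrative, not the full GloVe pipeline).
from collections import defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the rug"]  # toy corpus (assumption)
window = 2  # context window size (assumption)

cooccur = defaultdict(float)  # (word, context_word) -> count
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        # look at up to `window` words on each side of the target
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooccur[(target, tokens[j])] += 1.0

print(cooccur[("sat", "on")])  # how often "on" appears near "sat"
```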

WORD2VEC:

Rough order:

  1. Select Target Word – Pick the first word of the sentence as the target word.
  2. Identify Context Words – Choose words within a window size around the target word.
  3. Compute Word Vectors – Generate word vector representations for both the target and context words.
  4. Calculate Score – Compute the dot product of the target and context word vectors to get a score.
  5. Apply Softmax – Convert the score into a probability distribution using the softmax function.
  6. Compute Probability – Calculate p(context word | target word) and maximize it.
  7. Update Word Vectors – Adjust the word vectors using stochastic gradient descent (SGD) to improve predictions.
  8. Repeat Iterations – Continue the process for all words in the sentence, treating each as a target word.
  9. Train Model – After multiple iterations, the model learns meaningful word representations for predictions (steps 4–6 are sketched in code after this list).
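
To illustrate steps 4–6, here is a minimal numpy sketch of the skip-gram forward pass: a dot product between the target vector and every context (output) vector, followed by a softmax over the vocabulary. The tiny vocabulary, dimension, and random vectors are assumptions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]       # toy vocabulary (assumption)
d = 4                                            # embedding dimension (tiny for illustration)
V = len(vocab)

W_target = rng.normal(scale=0.1, size=(V, d))    # target (input) word vectors
W_context = rng.normal(scale=0.1, size=(V, d))   # context (output) word vectors

target_idx = vocab.index("cat")
v_t = W_target[target_idx]                       # vector of the target word

scores = W_context @ v_t                         # dot product with every context vector (the "Score")
probs = np.exp(scores - scores.max())            # softmax (shifted for numerical stability)
probs /= probs.sum()

context_idx = vocab.index("sat")
print(probs[context_idx])                        # p(context word "sat" | target word "cat")
```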

The model starts by selecting the first word of the sentence as the target word and builds a word vector for it of dimension d, where d is typically 100, 200 or 500 depending on how many semantic relations of the words we want to capture.

At the same time it looks at the words within the window of k words around the target and builds word vectors for them (the context words). The dot product of the target and context vectors is then calculated to get a "Score", and this score is passed through a softmax function to turn it into a probability distribution over the vocabulary (negative sampling here reduces the computing cost and time significantly). From this distribution the skip-gram probability p(context word | target word) is obtained, and the goal is to maximize it so that the model gets better at predicting context words given a target word. Maximizing this probability pushes up the score, i.e. the similarity, between words that appear together (and pushes it down for words that do not).

The score and probability improve by updating the word vectors over many iterations (a word that is the target in one iteration will be a context word in another). The updates are done with stochastic gradient descent: plain batch gradient descent is impractical here because of the sheer number of words and training pairs, since computing the gradient over the whole dataset for every single update would take far too long. SGD instead updates the vectors using the gradient of each individual sample. Eventually the score and probability saturate, and every word ends up with two vector representations (one as a target and one as a context word). This is how the model is trained to predict context words from target words (skip-gram) or target words from context words (CBOW).
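
As a rough illustration of how one such SGD update with negative sampling might look, here is a minimal sketch of a single skip-gram update step; the learning rate, number of negative samples, and the toy matrices are assumptions for illustration, not the exact word2vec implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d = 5, 4                                    # toy vocabulary size and dimension (assumptions)
W_target = rng.normal(scale=0.1, size=(V, d))  # target vectors (one row per word)
W_context = rng.normal(scale=0.1, size=(V, d)) # context vectors (the second matrix per word)
lr = 0.05                                      # learning rate (assumption)

def sgd_step(t_idx, c_idx, neg_idx):
    """One negative-sampling SGD update for a (target, context) pair."""
    v_t = W_target[t_idx]

    # Positive pair: push sigmoid(u_c . v_t) towards 1.
    u_c = W_context[c_idx]
    grad_pos = sigmoid(u_c @ v_t) - 1.0
    grad_t = grad_pos * u_c
    W_context[c_idx] -= lr * grad_pos * v_t

    # Negative samples: push sigmoid(u_n . v_t) towards 0.
    for n in neg_idx:
        u_n = W_context[n]
        grad_neg = sigmoid(u_n @ v_t)
        grad_t += grad_neg * u_n
        W_context[n] -= lr * grad_neg * v_t

    W_target[t_idx] -= lr * grad_t

# Example: target word 1, true context word 2, two randomly chosen negatives.
sgd_step(1, 2, neg_idx=[0, 4])
```

Real implementations draw the negative samples from a smoothed unigram distribution and loop this kind of update over every (target, context) pair in the corpus.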