Started with NLP a while back, and I will keep on updating this as and when I learn more or correct some misunderstandings:
There are broadly two types of word embeddings:
1) Prediction-based word embeddings
2) Count-based word embeddings
Word2vec is an example of a prediction-based embedding. GloVe, on the other hand, is a count-based word embedding model that builds its word vectors from a co-occurrence matrix to accomplish the same task.
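To make the count-based idea concrete, here is a minimal sketch of the raw co-occurrence counting such models start from. This is not GloVe itself (GloVe additionally fits vectors to these counts); the toy corpus, window size, and variable names are just assumptions for illustration:

```python
from collections import defaultdict

# Toy corpus and window size -- both made up for illustration.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

# cooc[w1][w2] = number of times w2 appears within `window` words of w1.
cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[word][sentence[j]] += 1

print(cooc["the"]["cat"])  # how often "cat" appears near "the"
```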
Rough order:
Select Target Word – The model starts with the first word of the sentence as the target word and creates a word vector of dimension d for it, where d is typically 100, 200, or 500 depending on how much semantic information we want each vector to capture.
Look at the Context Window – At the same time, the model looks at the words within a window of k words around the target and creates word vectors for them (the context words).
Score and Softmax – The dot product of the target vector and a context vector gives a "score". This score is passed through a softmax function over the vocabulary to turn it into a probability distribution; negative sampling is used here to reduce the computing cost and training time significantly.
Maximize the Probability – In the skip-gram model the probability p(context word | target word) is computed, and the goal is to maximize it so that the model gets better at predicting context words given a target word. Maximizing this probability raises the score (similarity) between words that actually appear together and lowers it for words that do not (the negative samples).
Update the Word Vectors – The scores and probabilities improve by updating the word vectors over many iterations (a word that is a target in one iteration will be a context word in another). The updates use stochastic gradient descent: ordinary gradient descent cannot be used here because of the immense vocabulary and data size, since computing the gradient over the entire dataset for every update would take far too long. Stochastic gradient descent instead updates the vectors using the gradient of individual samples.
Convergence – Eventually the score and probability saturate, and every word ends up with two word vectors (one as a target and one as a context). This is how the model is trained to predict context words from target words (skip-gram) or target words from context words (CBOW). A minimal sketch of one such training step follows below.
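Here is a minimal sketch of one skip-gram training step with negative sampling, assuming a tiny vocabulary and made-up hyperparameters (embedding size, learning rate, number of negative samples). It is only meant to show the score → sigmoid → SGD update loop described above, not to reproduce a real implementation like gensim's Word2Vec, which adds subsampling, a unigram noise distribution for negatives, and many optimizations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny vocabulary and hyperparameters -- all made up for illustration.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
word2idx = {w: i for i, w in enumerate(vocab)}
d, lr, num_neg = 8, 0.05, 3            # embedding size, learning rate, negatives

# Each word gets two vectors: one as a target (W_in) and one as a context (W_out).
W_in = rng.normal(scale=0.1, size=(len(vocab), d))
W_out = rng.normal(scale=0.1, size=(len(vocab), d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(target, context):
    """One skip-gram update with negative sampling for a (target, context) pair."""
    t, c = word2idx[target], word2idx[context]
    # Random negative samples (a real implementation would avoid picking the true context word).
    neg = rng.choice(len(vocab), size=num_neg)
    v_t = W_in[t]

    # Positive pair: push sigmoid(score) toward 1.
    score = sigmoid(v_t @ W_out[c])
    grad_t = (score - 1.0) * W_out[c]
    W_out[c] -= lr * (score - 1.0) * v_t

    # Negative pairs: push sigmoid(score) toward 0.
    for n in neg:
        score_n = sigmoid(v_t @ W_out[n])
        grad_t += score_n * W_out[n]
        W_out[n] -= lr * score_n * v_t

    # Update the target vector last.
    W_in[t] -= lr * grad_t

# One pass over a toy sentence with a context window of 1 word on each side.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
for i, target in enumerate(sentence):
    for j in (i - 1, i + 1):
        if 0 <= j < len(sentence):
            sgd_step(target, sentence[j])
```

W_in and W_out here are the "2 word vector matrices" mentioned above: every word has one row in each, and after training it is usually W_in (or sometimes the average of the two) that is kept as the final word embeddings.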