word2vec_deeper_understanding

Chapter 1. Fundamentals

In continuous word representation, we want to learn a dimension-reduced vector for each word. These vectors can then serve as input to a neural network, and the process is somewhat recursive: the vectors feed the network, and training the network in turn updates the vectors. To represent a window of $$k$$ words $$[w_1, \dots, w_k]$$, we concatenate the representation of each word and get:

            $$ x = [C(w_1)^T, \dots, C(w_k)^T] $$
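A minimal NumPy sketch of this lookup-and-concatenate step, assuming a hypothetical vocabulary size `V`, embedding dimension `d`, window size `k`, and an embedding matrix `C` with one row per word (all names and sizes are illustrative, not from the original text):

```python
import numpy as np

# Hypothetical sizes: vocabulary V, embedding dimension d, window of k words.
V, d, k = 10_000, 50, 4

# C is the representation matrix: one d-dimensional row C[w] per word id w.
C = np.random.randn(V, d) * 0.01

def window_input(word_ids, C):
    """Concatenate the representations of the words in a window into one vector x."""
    return np.concatenate([C[w] for w in word_ids])  # shape: (k * d,)

x = window_input([12, 7, 431, 9], C)
print(x.shape)  # (200,)
```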

We learn these representations by gradient descent (a first-order method, in contrast to second-order methods such as Newton's method). Within each gradient step, the neural network parameters and each representation $$C(w)$$ are updated as:

            $$ C(w) \leftarrow C(w) - \alpha \nabla_{C(w)} \ell $$
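As a sketch of that update, assuming backpropagation has already produced the gradient of the loss with respect to the concatenated input $$x$$ (the gradient values and the helper name below are placeholders, not part of the original text), each word's slice of that gradient is applied to its own row of $$C$$:

```python
import numpy as np

def sgd_step_on_embeddings(C, word_ids, grad_x, alpha):
    """Apply C(w) <- C(w) - alpha * dL/dC(w) for each word in the window.

    grad_x is the gradient of the loss with respect to the concatenated
    input x; it is split back into one d-dimensional slice per window
    position, and each slice updates the corresponding row of C.
    """
    d = C.shape[1]
    for i, w in enumerate(word_ids):
        C[w] -= alpha * grad_x[i * d:(i + 1) * d]

# Example: pretend backprop gave us a gradient for a 4-word window.
V, d, k = 10_000, 50, 4
C = np.random.randn(V, d) * 0.01
grad_x = np.random.randn(k * d)  # stand-in for the true gradient of the loss w.r.t. x
sgd_step_on_embeddings(C, [12, 7, 431, 9], grad_x, alpha=0.025)
```

Note that only the rows of $$C$$ belonging to words in the current window receive an update in a given step; all other representations are untouched.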