CS224n: Natural Language Processing with Deep Learning

Lecture Notes: Part 1

Word Vectors 1: Introduction, SVD, and Word2Vec

Authors: Francois Chaubard et al.

We begin with a general discussion of what NLP is.

What is so special about human (natural) language? Human language is a system specifically constructed to convey meaning, and it is not produced by a physical manifestation of any kind. In that way, it is very different from vision or any other machine learning task.

Most words are just symbols for an extra-linguistic entity: the word is a signifier that maps to a signified idea or thing.

For instance, the word “rocket” refers to the concept of a rocket, and by extension can designate an instance of a rocket. There are some exceptions, when we use words and letters for expressive signaling, as in “Whooompaa”. On top of this, the symbols of language can be encoded in several modalities: voice, gesture, writing, etc., that are transmitted via continuous signals to the brain, which itself appears to encode things in a continuous manner.

There are several different levels of tasks in NLP, from speech processing to semantic interpretation and discourse processing. The goal of NLP is to design algorithms that allow computers to “understand” natural language in order to perform some task. Example tasks come in varying levels of difficulty.

**Easy**:

- Spell Checking

- Keyword Search

- Finding Synonyms

**Medium**:

- Parsing information form websites, documents, etc

**Hard:**

- Machine Translation

- Semantic Analysis (What is the meaning of a query statement?)

- Coreference (e.g. What does “he” or “it” refer to given a document?)

- Question Answering

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves.

There are an estimated 13 million tokens for the English language, but are they all completely unrelated? Feline to cat, hotel to motel? I think not. Thus, we want to encode word tokens each in some vector that represents a point in some sort of $N$-dimensional “word” space (such that $N \ll$ 13 million) that is sufficient to encode all the semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).

So let’s dive into our first word vector, and arguably the most simple, the one-hot vector: represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, $|V|$ is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:

We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. So maybe we can reduce the size of this space from $\mathbb{R}^{|V|}$ to something smaller and thus find a subspace that encodes the relationships between words.
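As a concrete illustration, here is a minimal sketch of one-hot encoding with NumPy; the five-word vocabulary is hypothetical, standing in for a sorted dictionary of millions of words:

```python
import numpy as np

# Hypothetical sorted vocabulary; the real |V| would be in the millions.
vocab = ["aardvark", "cat", "hotel", "motel", "zebra"]

def one_hot(word, vocab):
    """Return the |V| x 1 one-hot column vector for `word`."""
    vec = np.zeros((len(vocab), 1))
    vec[vocab.index(word)] = 1.0
    return vec

v_hotel = one_hot("hotel", vocab)
v_motel = one_hot("motel", vocab)

# The dot product of any two distinct one-hot vectors is 0:
# the encoding carries no notion of similarity.
print((v_hotel.T @ v_motel).item())  # 0.0
```

Even “hotel” and “motel” are orthogonal here, which is exactly the problem the rest of the section addresses.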

For this class of methods to find word embeddings, we first loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix $X$, and then perform Singular Value Decomposition on $X$ to get a $USV^{T}$ decomposition. We then use the rows of $U$ as the word embeddings for all words in our dictionary. Let us discuss a few choices of $X$.

As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents. For instance, “banks”, “bonds”, “stocks”, “money”, etc. are probably likely to appear together. But “banks”, “octopus”, “banana”, and “hockey” would probably not consistently appear together. We use this fact to build a word-document matrix $X$ in the following manner: loop over billions of documents, and each time word $i$ appears in document $j$, add one to entry $X_{ij}$. This is obviously a very large matrix, and it scales with the number of documents. So perhaps we can try something better.

The same kind of logic applies here; however, the matrix $X$ stores co-occurrences of words, thereby becoming an affinity matrix. In this method we count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all the words in the corpus. We display an example below. Let our corpus contain just three sentences and the window size be 1:

- I enjoy flying.
- I like NLP.
- I like deep learning.

The resulting counts matrix will then be:

| $X$ | I | like | enjoy | deep | learning | NLP | flying | . |
|---|---|---|---|---|---|---|---|---|
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
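A short sketch, assuming the three-sentence corpus above is tokenized on whitespace, of how the window-based counts matrix $X$ can be built:

```python
import numpy as np

corpus = ["I enjoy flying .", "I like NLP .", "I like deep learning ."]
window = 1  # count co-occurrences within 1 word on either side

# Build the vocabulary in order of first appearance.
vocab = []
for sent in corpus:
    for tok in sent.split():
        if tok not in vocab:
            vocab.append(tok)
index = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    toks = sent.split()
    for i, tok in enumerate(toks):
        # Add one for every word inside the window around position i.
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                X[index[tok], index[toks[j]]] += 1

print(X[index["I"], index["like"]])  # 2: "I like" occurs in two sentences
```

Because the window is symmetric, $X$ comes out symmetric as well.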

We now perform SVD on $X$, observe the singular values (the diagonal entries in the resulting $S$ matrix), and cut them off at some index $k$ based on the desired percentage variance captured: $\frac{\sum_{i=1}^{k}\sigma_{i}}{\sum_{i=1}^{|V|}\sigma_{i}}$. We then take the submatrix $U_{1:|V|, 1:k}$ to be our word embedding matrix. This would thus give us a $k$-dimensional representation of every word in the vocabulary.

Applying SVD to $X$:

Reducing dimensionality by selecting first $k$ singular vectors:
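The two steps above can be sketched with NumPy; the tiny matrix here is a stand-in for a real counts matrix, and the 90% variance threshold is an arbitrary illustration:

```python
import numpy as np

# Toy co-occurrence matrix (rows/columns indexed by a 3-word vocabulary);
# in practice X comes from the corpus scan described above.
X = np.array([[0, 2, 1],
              [2, 0, 1],
              [1, 1, 0]], dtype=float)

# Full SVD: X = U S V^T, with S returned as a vector of singular values.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Choose the smallest k whose singular values capture >= 90% of the total.
target = 0.90
k = 1
while S[:k].sum() / S.sum() < target:
    k += 1

embeddings = U[:, :k]  # a k-dimensional vector for every word
print(embeddings.shape)
```

The rows of `embeddings` are then used as the word vectors.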

Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic information but are associated with many other problems:

- The dimensions of the matrix change very often (new words are added and the corpus changes in size).
- The matrix is extremely sparse since most words do not co-occur.
- The matrix is very high dimensional in general.
- Quadratic cost to train (i.e. to perform SVD).
- Requires the incorporation of some hacks on $X$ to account for the drastic imbalance in word frequency.

Several solutions exist to resolve some of the issues discussed above:

- Ignore function words such as “the”, “he”, “has”, etc.
- Apply a ramp window, i.e. weight the co-occurrence count based on the distance between the words in the document.
- Use Pearson correlation and set negative counts to 0 instead of using just the raw count.

As we will see in the next section, iteration-based methods solve many of these issues in a far more elegant manner.

Let’s step back and try a new approach. Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able to encode the probability of a word given its context.

The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. This idea is a very old one, dating back to 1986. We call this method “backpropagating” the errors. The simpler the model and the task, the faster it will be to train.

Several approaches have been tested. Collobert et al. designed models for NLP whose first step is to transform each word into a vector. For each particular task they train not only the model’s parameters but also the vectors, and achieve great performance, while computing good word vectors!

In this class, we will present a simpler, more recent, probabilistic method by Mikolov: word2vec. Word2vec is a software package that actually includes:

- 2 algorithms: continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors. Skip-gram does the opposite, and predicts the distribution (probability) of context words from a center word.
- 2 training methods: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all words in the vocabulary.

First, we need to create a model that will assign a probability to a sequence of tokens. Let’s start with an example:

“The cat jumped over the puddle.”

A good language model will give this sentence a high probability because it is a completely valid sentence, syntactically and semantically. Similarly, the sentence “stock boil fish is toy” should have a very low probability because it makes no sense. Mathematically, we can call this probability on any given sequence of $n$ words: $P(w_{1}, w_{2}, \cdots, w_{n})$. We can take the unary (unigram) language model approach and break apart this probability by assuming the word occurrences are completely independent: $P(w_{1}, w_{2}, \cdots, w_{n}) = \prod_{i=1}^{n} P(w_{i})$. However, we know this is a bit ludicrous because we know the next word is highly contingent upon the previous sequence of words, and the silly sentence example might actually score highly. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as: $P(w_{1}, w_{2}, \cdots, w_{n}) = \prod_{i=2}^{n} P(w_{i} \mid w_{i-1})$. Again this is certainly a bit naïve, since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note that with the word-word matrix with a context of size 1, we can basically learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset.
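A minimal sketch of the unigram and bigram models as maximum-likelihood counts; the six-token corpus is hypothetical and far too small for real estimates:

```python
from collections import Counter

corpus = "the cat jumped over the puddle".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    # P(w) under the independence assumption
    return unigrams[w] / total

def p_bigram(w, prev):
    # P(w | prev) via maximum-likelihood counts
    return bigrams[(prev, w)] / unigrams[prev]

# Unigram model: product of independent word probabilities.
p_uni = 1.0
for w in corpus:
    p_uni *= p_unigram(w)

# Bigram model: condition each word on its predecessor.
p_bi = p_unigram(corpus[0])
for prev, w in zip(corpus, corpus[1:]):
    p_bi *= p_bigram(w, prev)

print(p_uni, p_bi)
```

On this toy corpus the bigram probability (1/12) is far larger than the unigram one, reflecting how much conditioning on the previous word helps.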

Now that we understand how we can think about a sequence of tokens having a probability, let us observe some example models that could learn these probabilities.

One approach is to treat {“The”, “cat”, “over”, “the”, “puddle”} as a context and, from these words, be able to predict or generate the center word “jumped”. This type of model we call a Continuous Bag of Words (CBOW) Model.

Let’s discuss the CBOW Model above in greater detail. First, we set up our known parameters. Let the known parameters in our model be the sentence represented by one-hot word vectors. The input one-hot vectors (the context) we will represent with $x^{(c)}$, and the output as $y^{(c)}$. In the CBOW model, since we only have one output, we just call this $y$, which is the one-hot vector of the known center word. Now let’s define the unknowns in our model.

We create two matrices, $V \in \mathbb{R}^{n \times |V|}$ and $U \in \mathbb{R}^{|V| \times n}$, where $n$ is an arbitrary size which defines the size of our embedding space. $V$ is the input word matrix such that the $i$-th column of $V$ is the $n$-dimensional embedded vector for word $w_{i}$ when it is an input to this model. We denote this $n \times 1$ vector as $v_{i}$. Similarly, $U$ is the output word matrix. The $j$-th row of $U$ is an $n$-dimensional embedded vector for word $w_{j}$ when it is an output of the model. We denote this row of $U$ as $u_{j}$. Note that we do in fact learn two vectors for every word $w_{i}$ (i.e. input word vector $v_{i}$ and output word vector $u_{i}$).

We break down the way this model works in these steps:

- Generate one-hot vectors for the input context of size $m$: $(x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)})$.
- Get the embedded word vectors for the context: $v_{c-m} = Vx^{(c-m)}, v_{c-m+1} = Vx^{(c-m+1)}, \ldots, v_{c+m} = Vx^{(c+m)}$.
- Average these vectors to get $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \cdots + v_{c+m}}{2m} \in \mathbb{R}^{n}$.
- Generate a score vector $z = U\hat{v} \in \mathbb{R}^{|V|}$. As the dot product of similar vectors is higher, it will push similar words close to each other in order to achieve a high score.
- Turn the scores into probabilities $\hat{y} = \mathrm{softmax}(z) \in \mathbb{R}^{|V|}$.
- We desire our generated probabilities, $\hat{y} \in \mathbb{R}^{|V|}$, to match the true probabilities, $y \in \mathbb{R}^{|V|}$, which also happen to be the one-hot vector of the actual word.
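The steps above can be sketched as a CBOW forward pass with random (untrained) matrices; the vocabulary size, embedding dimension, and context indices are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n, m = 10, 4, 2             # |V|, embedding dim, window size

V = rng.normal(size=(n, V_size))    # input word matrix (columns = v_i)
U = rng.normal(size=(V_size, n))    # output word matrix (rows = u_j)

# One-hot vectors for the 2m context words (indices chosen arbitrarily).
context_ids = [1, 2, 4, 5]
xs = np.zeros((len(context_ids), V_size))
for row, idx in enumerate(context_ids):
    xs[row, idx] = 1.0

v_hat = (V @ xs.T).mean(axis=1)     # average the embedded context vectors
z = U @ v_hat                       # score vector in R^{|V|}
y_hat = np.exp(z) / np.exp(z).sum() # softmax -> probability distribution

print(y_hat.shape)
```

With trained $V$ and $U$, `y_hat` would place most of its mass on the true center word.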

So now that we have an understanding of how our model would work if we had $V$ and $U$, how would we learn these two matrices? Well, we need to create an objective function. Very often when we are trying to learn a probability from some true probability, we look to information theory to give us a measure of distance between two distributions. Here, we use a popular choice of distance/loss measure, cross entropy $H(\hat{y}, y)$.

The intuition for the use of cross-entropy in the discrete case can be derived from the formulation of the loss function: $H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_{j} \log(\hat{y}_{j})$. Let us concern ourselves with the case at hand, which is that $y$ is a one-hot vector. Thus we know that the above loss simplifies to simply: $H(\hat{y}, y) = -y_{c} \log(\hat{y}_{c})$. In this formulation, $c$ is the index where the correct word’s one-hot vector is 1. We can now consider the case where our prediction was perfect and thus $\hat{y}_{c} = 1$. We can then calculate $H(\hat{y}, y) = -1 \log(1) = 0$. Thus for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus $\hat{y}_{c} = 0.01$. As before, we calculate our loss to be $H(\hat{y}, y) = -1 \log(0.01) \approx 4.605$. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance. We thus formulate our optimization objective as:

$\begin{aligned} \text{minimize } J &= -\log P(w_{c} \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) \\ &= -\log P(u_{c} \mid \hat{v}) \\ &= -\log \frac{\exp(u_{c}^{T}\hat{v})}{\sum_{j=1}^{|V|} \exp(u_{j}^{T}\hat{v})} \\ &= -u_{c}^{T}\hat{v} + \log \sum_{j=1}^{|V|} \exp(u_{j}^{T}\hat{v}) \end{aligned}$

We use stochastic gradient descent to update all relevant word vectors $u_{c}$ and $v_{j}$.
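The two numeric cases above can be checked directly; with a one-hot $y$, the loss reduces to $-\log \hat{y}_{c}$ at the correct index $c$:

```python
import numpy as np

def cross_entropy(y_hat, y):
    # With one-hot y, H(y_hat, y) = -log(y_hat_c) at the correct index c.
    c = int(np.argmax(y))
    return float(-np.log(y_hat[c]))

y = np.array([0.0, 1.0, 0.0])           # correct word at index c = 1

perfect = np.array([0.0, 1.0, 0.0])     # y_hat_c = 1
bad = np.array([0.89, 0.01, 0.10])      # y_hat_c = 0.01

print(cross_entropy(perfect, y) == 0.0)  # True: no loss for a perfect prediction
print(round(cross_entropy(bad, y), 3))   # 4.605
```

This confirms the zero-loss and ~4.605-loss cases worked through above.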

Another approach is to create a model such that, given the center word “jumped”, the model will be able to predict or generate the surrounding words “The”, “cat”, “over”, “the”, “puddle”. Here the word “jumped” is the center word, and the surrounding words are the context. We call this type of model a Skip-Gram model.

Let’s discuss the Skip-Gram model above. The setup is largely the same, but we essentially swap our $x$ and $y$, i.e. $x$ in CBOW is now $y$ and vice-versa. The input one-hot vector (the center word) we will represent with an $x$ (since there is only one), and the output vectors as $y^{(j)}$. We define $V$ and $U$ the same as in CBOW.

We break down the way this model works in these steps:

- Generate our one-hot input vector $x \in \mathbb{R}^{|V|}$ of the center word.
- Get the embedded word vector for the center word: $v_{c} = Vx \in \mathbb{R}^{n}$.
- Generate a score vector $z = Uv_{c}$.
- Turn the score vector into probabilities, $\hat{y} = \mathrm{softmax}(z)$. Note that $\hat{y}_{c-m}, \ldots, \hat{y}_{c-1}, \hat{y}_{c+1}, \ldots, \hat{y}_{c+m}$ are the probabilities of observing each context word.
- We desire our generated probability vector to match the true probabilities, which are $y^{(c-m)}, \ldots, y^{(c-1)}, y^{(c+1)}, \ldots, y^{(c+m)}$, the one-hot vectors of the actual output.
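Analogously to CBOW, the skip-gram forward pass can be sketched with random untrained matrices (sizes and the center-word index are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
V_size, n = 10, 4                   # |V| and embedding dimension

V = rng.normal(size=(n, V_size))    # input word matrix
U = rng.normal(size=(V_size, n))    # output word matrix

c = 3                               # index of the center word
x = np.zeros(V_size)
x[c] = 1.0                          # one-hot input vector

v_c = V @ x                         # embedded center word, in R^n
z = U @ v_c                         # one score per vocabulary word
y_hat = np.exp(z) / np.exp(z).sum() # distribution over context words

# The same y_hat is compared against each context position's one-hot vector.
print(y_hat.shape)
```

Note that a single distribution `y_hat` is reused for every context position, which is exactly what the Naïve Bayes assumption in the next paragraphs licenses.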

As in CBOW, we need to generate an objective function for us to evaluate the model. A key difference here is that we invoke a Naïve Bayes assumption to break out the probabilities. If you have not seen this before, then simply put, it is a strong conditional independence assumption. In other words, given the center word, all output words are completely independent.

$\begin{aligned} \text{minimize } J &= -\log P(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} \mid w_{c}) \\ &= -\log \prod_{j=0, j \neq m}^{2m} P(w_{c-m+j} \mid w_{c}) \\ &= -\log \prod_{j=0, j \neq m}^{2m} P(u_{c-m+j} \mid v_{c}) \\ &= -\log \prod_{j=0, j \neq m}^{2m} \frac{\exp(u_{c-m+j}^{T} v_{c})}{\sum_{k=1}^{|V|} \exp(u_{k}^{T} v_{c})} \\ &= -\sum_{j=0, j \neq m}^{2m} u_{c-m+j}^{T} v_{c} + 2m \log \sum_{k=1}^{|V|} \exp(u_{k}^{T} v_{c}) \end{aligned}$

With this objective function, we can compute the gradients with respect to known parameters and at each iteration update them via Stochastic Gradient Descent.

Note that $J = -\sum_{j=0, j \neq m}^{2m} \log P(u_{c-m+j} \mid v_{c}) = \sum_{j=0, j \neq m}^{2m} H(\hat{y}, y_{c-m+j})$, where $H(\hat{y}, y_{c-m+j})$ is the cross-entropy between the probability vector $\hat{y}$ and the one-hot vector $y_{c-m+j}$.

Let’s take a second to look at the objective function. Note that the summation over $∣V∣$ is computationally huge! Any update we do or evaluation of the objective function would take $O(∣V∣)$ time which if we recall is in the millions. A simple idea is we could instead just approximate it.

For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We “sample” from a noise distribution $P_{n}(w)$ whose probabilities match the ordering of the frequency of the vocabulary. To augment our formulation of the problem to incorporate negative sampling, all we need to do is update the:

- Objective function
- Gradients
- Update rules

While negative sampling is based on the Skip-Gram model, it is in fact optimizing a different objective. Consider a pair $(w, c)$ of word and context. Did this pair come from the training data? Let’s denote by $P(D=1 \mid w, c)$ the probability that $(w, c)$ came from the corpus data. Correspondingly, $P(D=0 \mid w, c)$ will be the probability that $(w, c)$ did not come from the corpus data. First, let’s model $P(D=1 \mid w, c)$ with the sigmoid function: $P(D=1 \mid w, c, \theta) = \sigma(v_{c}^{T} v_{w}) = \frac{1}{1 + e^{-v_{c}^{T} v_{w}}}$. Now we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and to maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. Here we take $\theta$ to be the parameters of the model, and in our case it is $V$ and $U$.

Note that maximizing the likelihood is the same as minimizing the negative log likelihood:

$J = -\sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_{w}^{T} v_{c})} - \sum_{(w,c) \in \tilde{D}} \log \frac{1}{1 + \exp(u_{w}^{T} v_{c})}$

Note that $\tilde{D}$ is a “false” or “negative” corpus, where we would have sentences like “stock boil fish is toy”: unnatural sentences that should get a low probability of ever occurring. We can generate $\tilde{D}$ on the fly by randomly sampling from the word bank.
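A sketch of the per-pair terms of this objective; the vectors here are random stand-ins for trained parameters, and $K = 5$ negatives is an arbitrary choice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_w, v_c, negative_us):
    """-log sigma(u_w . v_c) - sum_k log sigma(-u_k . v_c)."""
    loss = -np.log(sigmoid(u_w @ v_c))       # true (w, c) pair: push score up
    for u_k in negative_us:
        loss -= np.log(sigmoid(-(u_k @ v_c)))  # sampled negatives: push down
    return float(loss)

rng = np.random.default_rng(2)
v_c = rng.normal(size=4)                         # center word vector
u_w = rng.normal(size=4)                         # true outside word vector
negatives = [rng.normal(size=4) for _ in range(5)]  # K = 5 sampled negatives

print(neg_sampling_loss(u_w, v_c, negatives))
```

Each update touches only $K + 1$ output vectors instead of all $|V|$, which is the entire point of the approximation.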

For skip-gram, our new objective function for observing the context word $c-m+j$ given the center word $c$ would be $-\log \sigma(u_{c-m+j}^{T} v_{c}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_{k}^{T} v_{c})$. For CBOW, our new objective function for observing the center word $u_{c}$ given the context vector $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \cdots + v_{c+m}}{2m}$ would be $-\log \sigma(u_{c}^{T} \hat{v}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_{k}^{T} \hat{v})$. In the above formulation, $\{\tilde{u}_{k} \mid k = 1, \ldots, K\}$ are sampled from $P_{n}(w)$. Let’s discuss what $P_{n}(w)$ should be. While there is much discussion of what makes the best approximation, what seems to work best is the unigram model raised to the power of 3/4. Why 3/4? Here is an example that might help gain some intuition:

is: $0.9^{3/4} = 0.92$

Constitution: $0.09^{3/4} = 0.16$

Bombastic: $0.01^{3/4} = 0.032$

“Bombastic” is now 3x more likely to be sampled, while “is” only went up marginally.
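The renormalized noise distribution can be computed directly; the three frequencies are the illustrative ones above, not real corpus statistics:

```python
import numpy as np

# Hypothetical unigram frequencies from the example above.
freqs = {"is": 0.90, "Constitution": 0.09, "bombastic": 0.01}

words = list(freqs)
p = np.array([freqs[w] for w in words]) ** 0.75  # raise to the 3/4 power
p /= p.sum()                                     # renormalize to get P_n(w)

sampling = dict(zip(words, p))
print({w: round(x, 3) for w, x in sampling.items()})
```

After renormalization, rare words like “bombastic” gain probability mass at the expense of very frequent words like “is”.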

Mikolov et al. also present hierarchical softmax as a much more efficient alternative to the normal softmax. In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower dimensional vectors.

Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from root to leaf. In this model, there is no output representation for words. Instead, each node of the graph (except the root and the leaves) is associated with a vector that the model is going to learn.

In this model, the probability of a word $w$ given a vector $w_{i}$, $P(w \mid w_{i})$, is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to $w$. The main advantage of computing the probability this way is that the cost is only $O(\log(|V|))$, corresponding to the length of the path.

Let’s introduce some notation. Let $L(w)$ be the number of nodes in the path from the root to the leaf $w$. For instance, $L(w_{2})$ in Figure 4 is 3. Let’s write $n(w, i)$ as the $i$-th node in this path with associated vector $v_{n(w,i)}$. So $n(w, 1)$ is the root, while $n(w, L(w))$ is the father of $w$. Now for each inner node $n$, we arbitrarily choose one of its children and call it $ch(n)$ (e.g. always the left node). Then, we can compute the probability as

$P(w \mid w_{i}) = \prod_{j=1}^{L(w)-1} \sigma\left([n(w, j+1) = ch(n(w, j))] \cdot v_{n(w,j)}^{T} v_{w_{i}}\right)$

where $[x]$ is 1 if $x$ is true and $-1$ otherwise, and $\sigma(\cdot)$ is the sigmoid function.

This formula is fairly dense, so let’s examine it more closely. First, we compute a product of terms based on the shape of the path from the root $(n(w, 1))$ to the leaf $(w)$. If we assume $ch(n)$ is always the left node of $n$, then if we sum the probabilities for going to the left and right nodes, you can check that for any value of $v_{n}^{T} v_{w_{i}}$, $\sigma(v_{n}^{T} v_{w_{i}}) + \sigma(-v_{n}^{T} v_{w_{i}}) = 1$.

The normalization also ensures that $\sum_{w=1}^{|V|} P(w \mid w_{i}) = 1$, just as in the original softmax.

Finally, we compare the similarity of our input vector $v_{w_{i}}$ to each inner node vector $v_{n(w,j)}$ using a dot product. Let’s run through an example. Taking $w_{2}$ in Figure 4, we must take two left edges and then a right edge to reach $w_{2}$ from the root, so

$P(w_{2} \mid w_{i}) = \sigma(v_{n(w_{2},1)}^{T} v_{w_{i}}) \cdot \sigma(v_{n(w_{2},2)}^{T} v_{w_{i}}) \cdot \sigma(-v_{n(w_{2},3)}^{T} v_{w_{i}})$

To train the model, our goal is still to minimize the negative log likelihood $-\log P(w \mid w_{i})$. But instead of updating output vectors per word, we update the vectors of the nodes in the binary tree that are on the path from root to leaf node.

The speed of this method is determined by the way in which the binary tree is constructed and words are assigned to leaf nodes. Mikolov et al. use a binary Huffman tree, which assigns frequent words shorter paths in the tree.
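A sketch of the path-probability computation; the tree itself is abstracted into a list of inner-node vectors and ±1 direction labels, all randomly initialized for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(path_vectors, directions, v_wi):
    """Multiply sigma(d_j * v_n(w,j) . v_wi) along the root-to-leaf path.

    directions[j] is +1 when the path goes to the chosen child ch(n)
    (e.g. the left child) and -1 otherwise.
    """
    p = 1.0
    for v_n, d in zip(path_vectors, directions):
        p *= sigmoid(d * (v_n @ v_wi))
    return float(p)

rng = np.random.default_rng(3)
v_wi = rng.normal(size=4)                        # input word vector
inner = [rng.normal(size=4) for _ in range(3)]   # vectors on a length-3 path

# e.g. two left edges then a right edge, as for w_2 in Figure 4:
p_w2 = path_probability(inner, [+1, +1, -1], v_wi)
print(0.0 < p_w2 < 1.0)  # True
```

Only the $O(\log |V|)$ vectors on this path are touched per update, rather than all $|V|$ output vectors.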