
word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method (1402.3722v1)

Published 15 Feb 2014 in cs.CL, cs.LG, and stat.ML

Abstract: The word2vec software of Tomas Mikolov and colleagues (https://code.google.com/p/word2vec/ ) has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations. This note is an attempt to explain equation (4) (negative sampling) in "Distributed Representations of Words and Phrases and their Compositionality" by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean.

Citations (1,568)

Summary

  • The paper clarifies the mathematical derivation behind negative sampling in word2vec, enabling deeper understanding of word embedding models.
  • It analyzes how the skip-gram model transforms the computationally expensive softmax into an efficient binary classification through noise sampling.
  • The paper highlights practical considerations like dynamic window sizing and subsampling to enhance the quality of learned embeddings.

Negative Sampling in word2vec Explained

The paper "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method" by Yoav Goldberg and Omer Levy provides an elucidation of the mathematical underpinnings behind the negative-sampling technique in word2vec as introduced by Mikolov et al. The paper aims to demystify the dense mathematical formulations and concepts introduced in Mikolov's influential work on word embeddings.

The Skip-Gram Model

The starting point for understanding negative sampling is the skip-gram model. In the skip-gram model, the task is to predict context words given a target word in a large corpus. The goal is to maximize the conditional probability $p(c \mid w)$, where $c$ is a context word and $w$ is the target word. The parameters $\theta$ are tuned to maximize the likelihood of the observed word-context pairs.

Mathematically, the objective is:

$$\arg\max_\theta \prod_{(w, c) \in D} p(c \mid w; \theta)$$

where $D$ is the set of all observed word-context pairs. The optimization can be rephrased in terms of the log-likelihood, which turns the product into a summation that is easier to work with:

$$\arg\max_\theta \sum_{(w, c) \in D} \log p(c \mid w; \theta)$$

The conditional probability $p(c \mid w; \theta)$ is traditionally computed using the softmax function:

$$p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$$

where $v_c$ and $v_w$ are the vector representations of the context and the word, respectively, and $C$ is the set of all possible contexts.
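
To make the cost of the normalization term concrete, here is a minimal NumPy sketch of the softmax formulation above. The embedding matrices `W` and `C`, their dimensions, and the random initialization are illustrative assumptions, not values from the paper or the word2vec code; the point is that every evaluation of $p(c \mid w)$ touches all $|C|$ context vectors.

```python
# Minimal sketch of the skip-gram softmax, using toy embedding matrices
# W (target-word vectors) and C (context vectors); sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
W = rng.normal(scale=0.1, size=(vocab_size, dim))  # target-word vectors v_w
C = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors v_c

def softmax_prob(w_idx: int, c_idx: int) -> float:
    """p(c | w; theta) via the full softmax -- O(|C|) work per pair."""
    scores = C @ W[w_idx]       # v_{c'} . v_w for every possible context c'
    scores -= scores.max()      # shift for numerical stability
    exp_scores = np.exp(scores)
    return float(exp_scores[c_idx] / exp_scores.sum())

print(softmax_prob(42, 7))
```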

Challenges and Negative Sampling

Computing the softmax is expensive because the normalization term requires a summation over all possible contexts. One way to alleviate this cost is hierarchical softmax, but the focus of Goldberg and Levy's paper is negative sampling, the alternative introduced by Mikolov et al.

The essence of negative sampling is to reformulate the objective to distinguish between observed word-context pairs (positive examples) and randomly generated word-context pairs that are not observed (negative examples). The aim is to maximize the log probability of distinguishing the observed pairs from the noise, which is computationally more feasible.

For a given word-context pair (w,c)(w, c), negative sampling defines the probability of the pair being observed in the data:

$$p(D = 1 \mid w, c; \theta) = \sigma(v_c \cdot v_w)$$

where $\sigma(x)$ is the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. The goal is then to maximize:

$$\arg\max_\theta \sum_{(w, c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w, c) \in D'} \log \sigma(-v_c \cdot v_w)$$

In this setup, $D'$ is the set of negative examples generated by sampling. Mikolov et al. utilize a heuristic to sample negative contexts based on their frequency in the corpus, raised to the $3/4$ power.
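
As a contrast with the full softmax, here is a minimal NumPy sketch of this per-pair objective under a toy setup (the embedding matrices, the stand-in unigram counts, and the number of negatives $k$ are illustrative assumptions, not values from the paper): one sigmoid term for the observed pair plus $k$ terms for contexts drawn from the frequency-raised-to-$3/4$ noise distribution, with no normalization over the full context vocabulary.

```python
# Minimal sketch of the negative-sampling objective for a single observed pair.
# Toy embedding matrices, unigram counts, and k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
W = rng.normal(scale=0.1, size=(vocab_size, dim))    # target-word vectors v_w
C = rng.normal(scale=0.1, size=(vocab_size, dim))    # context vectors v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

counts = rng.integers(1, 1000, size=vocab_size)      # stand-in unigram counts
noise = counts ** 0.75                                # frequency^(3/4) heuristic
noise = noise / noise.sum()                           # normalized noise distribution

def pair_objective(w_idx: int, c_idx: int, k: int = 5) -> float:
    """log sigma(v_c . v_w) plus k sampled terms log sigma(-v_c' . v_w)."""
    pos = np.log(sigmoid(C[c_idx] @ W[w_idx]))          # observed pair: push sigma toward 1
    neg_idx = rng.choice(vocab_size, size=k, p=noise)   # draw negative contexts from the noise
    neg = np.log(sigmoid(-(C[neg_idx] @ W[w_idx]))).sum()  # negatives: push sigma toward 0
    return float(pos + neg)

print(pair_objective(42, 7))
```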

Remarks and Practical Insights

Goldberg and Levy make several key observations:

  1. Redefinition of Objective: Negative sampling does not explicitly model $p(c \mid w)$ but rather an objective related to the joint distribution of $w$ and $c$.
  2. Optimization Difficulty: While fixing word representations yields a convex logistic regression problem, jointly optimizing word and context representations renders the problem non-convex.

Context Definitions in word2vec

The authors also discuss the intricacies involved in defining contexts in the word2vec implementation, which includes:

  • Dynamic Window Size: The context window size is not fixed; it is sampled dynamically for each token, up to a specified maximum.
  • Subsampling and Pruning: Rare words are pruned and frequent words are down-sampled, which indirectly increases the effective context window size (both behaviors are illustrated in the sketch after this list).
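
The sketch below illustrates these two preprocessing steps. The function name, the threshold $t$, and the keep-probability $\sqrt{t / f(w)}$ are assumptions (the commonly cited word2vec subsampling rule), not details given in the summary above.

```python
# Minimal sketch of dynamic windows and frequent-word subsampling.
# The keep-probability sqrt(t / f(w)) is an assumed, commonly cited rule.
import random
from collections import Counter

def extract_pairs(tokens, max_window=5, t=1e-4, seed=0):
    rnd = random.Random(seed)
    counts = Counter(tokens)
    total = sum(counts.values())
    freq = {w: c / total for w, c in counts.items()}

    # Down-sample frequent words: keep a token with probability sqrt(t / f(w)).
    kept = [w for w in tokens
            if freq[w] <= t or rnd.random() < (t / freq[w]) ** 0.5]

    pairs = []
    for i, w in enumerate(kept):
        win = rnd.randint(1, max_window)            # dynamic window size per token
        for j in range(max(0, i - win), min(len(kept), i + win + 1)):
            if j != i:
                pairs.append((w, kept[j]))          # (word, context) training pair
    return pairs

# Toy usage; t=1.0 disables subsampling so the tiny example still yields pairs.
sentence = "the quick brown fox jumps over the lazy dog".split()
print(extract_pairs(sentence, max_window=2, t=1.0)[:5])
```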

Theoretical and Practical Implications

The central justification for negative sampling remains intuitive rather than formally proven. The conceptual basis is the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. Because words that share many contexts end up with similar vectors under this objective, the learned embeddings can be expected to capture semantic similarity in a meaningful way.

Speculation on Future Developments

Considering advances in computational resources and optimization algorithms, future research may aim at refining negative sampling or exploring alternative methods to achieve more efficient and interpretable embeddings. There is also potential for investigating theoretical frameworks that provide a rigorous understanding of why and when negative sampling produces high-quality embeddings.

In conclusion, Goldberg and Levy's paper serves as an invaluable resource for demystifying the math and rationale behind negative sampling in word2vec, making it accessible for researchers aiming to explore word embedding methodologies.