- The paper clarifies the mathematical derivation behind negative sampling in word2vec, enabling deeper understanding of word embedding models.
- It analyzes how the skip-gram model transforms the computationally expensive softmax into an efficient binary classification through noise sampling.
- The paper highlights practical considerations like dynamic window sizing and subsampling to enhance the quality of learned embeddings.
Negative Sampling in word2vec Explained
The paper "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method" by Yoav Goldberg and Omer Levy provides an elucidation of the mathematical underpinnings behind the negative-sampling technique in word2vec as introduced by Mikolov et al. The paper aims to demystify the dense mathematical formulations and concepts introduced in Mikolov's influential work on word embeddings.
The Skip-Gram Model
The starting point for understanding negative sampling is the skip-gram model. In the skip-gram model, the task is to predict the context words given a target word in a large corpus. The goal is to maximize the conditional probability $p(c \mid w)$, where $c$ is a context word and $w$ is the target word. The parameters $\theta$ are tuned to maximize the likelihood of the observed word-context pairs.
Mathematically, the objective is:
$$\arg\max_\theta \prod_{(w,c) \in D} p(c \mid w; \theta)$$
where $D$ is the set of all observed word-context pairs. The optimization can be rephrased in terms of the log-likelihood, transforming the product into a summation for computational efficiency:
$$\arg\max_\theta \sum_{(w,c) \in D} \log p(c \mid w; \theta)$$
The conditional probability $p(c \mid w; \theta)$ is traditionally computed using the softmax function:
$$p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$$
where $v_c$ and $v_w$ are the vector representations of the context and word, respectively, and $C$ is the set of all possible contexts.
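To make the cost of this normalization concrete, here is a minimal NumPy sketch of the full-softmax probability $p(c \mid w; \theta)$; the vocabulary size, dimensionality, and variable names are illustrative assumptions, not values from the paper. Note that the denominator requires a dot product with every context vector in the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100

W = rng.normal(scale=0.1, size=(vocab_size, dim))  # target-word vectors v_w
C = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors v_c

def softmax_prob(w_idx, c_idx):
    """p(c | w): the denominator sums over the entire context vocabulary."""
    scores = C @ W[w_idx]          # v_{c'} . v_w for every possible context c'
    scores -= scores.max()         # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[c_idx] / exp_scores.sum()

print(softmax_prob(42, 7))
```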
Challenges and Negative Sampling
Calculating the softmax is computationally expensive because the normalization term requires summing over all possible contexts. Hierarchical softmax is one way to alleviate this, but Goldberg and Levy's paper focuses on negative sampling, the alternative introduced by Mikolov et al.
The essence of negative sampling is to reformulate the objective to distinguish between observed word-context pairs (positive examples) and randomly generated word-context pairs that are not observed (negative examples). The aim is to maximize the log probability of distinguishing the observed pairs from the noise, which is computationally more feasible.
For a given word-context pair $(w, c)$, negative sampling defines the probability of the pair being observed in the data:
$$p(D = 1 \mid w, c; \theta) = \sigma(v_c \cdot v_w)$$
where $\sigma(x)$ is the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. The goal is then to maximize:
$$\arg\max_\theta \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c') \in D'} \log \sigma(-v_{c'} \cdot v_w)$$
In this setup, $D'$ is the set of negative examples generated by sampling. Mikolov et al. use a heuristic that samples negative contexts in proportion to their unigram frequency in the corpus raised to the $3/4$ power.
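As an illustration, the following self-contained sketch evaluates the negative-sampling objective for a single $(w, c)$ pair, drawing $k$ negatives from a unigram distribution raised to the $3/4$ power. The corpus counts, sizes, and names are assumptions for demonstration, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 100, 5

W = rng.normal(scale=0.1, size=(vocab_size, dim))   # target-word vectors v_w
C = rng.normal(scale=0.1, size=(vocab_size, dim))   # context vectors v_c

counts = rng.integers(1, 1_000, size=vocab_size).astype(float)  # fake corpus counts
noise_dist = counts ** 0.75                                     # unigram^(3/4) heuristic
noise_dist /= noise_dist.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w_idx, c_idx):
    """log sigma(v_c . v_w) plus, over k sampled negatives, log sigma(-v_c' . v_w)."""
    pos = np.log(sigmoid(C[c_idx] @ W[w_idx]))
    neg_idx = rng.choice(vocab_size, size=k, p=noise_dist)
    neg = np.log(sigmoid(-(C[neg_idx] @ W[w_idx]))).sum()
    return pos + neg

print(sgns_objective(42, 7))
```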
Remarks and Practical Insights
Goldberg and Levy make several key observations:
- Redefinition of Objective: Negative sampling does not explicitly model $p(c \mid w)$; instead, it optimizes a quantity related to the joint distribution of $w$ and $c$.
- Optimization Difficulty: If the word representations are held fixed, the problem reduces to convex logistic regression; jointly optimizing word and context representations makes it non-convex. A sketch of one joint gradient update follows this list.
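To illustrate the joint optimization, here is a hedged sketch of one stochastic gradient-ascent step on the per-pair objective, updating the target vector, the positive context vector, and the sampled negative context vectors together. The function name, learning rate, and array shapes are assumptions, not the word2vec C implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(w_vec, c_vec, neg_vecs, lr=0.025):
    """One ascent step on log sigma(v_c.v_w) + sum_i log sigma(-v_{c'_i}.v_w).

    w_vec: (dim,) target vector, c_vec: (dim,) positive context vector,
    neg_vecs: (k, dim) sampled negative context vectors; all updated in place.
    """
    g_pos = 1.0 - sigmoid(c_vec @ w_vec)       # gradient coefficient for the positive pair
    g_neg = -sigmoid(neg_vecs @ w_vec)         # one coefficient per negative context
    grad_w = g_pos * c_vec + g_neg @ neg_vecs  # gradient w.r.t. v_w
    grad_c = g_pos * w_vec                     # gradient w.r.t. v_c
    grad_neg = np.outer(g_neg, w_vec)          # gradients w.r.t. each v_{c'}
    w_vec += lr * grad_w
    c_vec += lr * grad_c
    neg_vecs += lr * grad_neg
```

Because both the word and context vectors are free parameters, the same update moves two coupled sets of weights, which is what makes the joint objective non-convex.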
Context Definitions in word2vec
The authors also discuss the intricacies involved in defining contexts in the word2vec implementation, which includes:
- Dynamic Window Size: The maximal window size is a parameter; the effective window for each token is sampled dynamically between 1 and that maximum, so words closer to the target serve as contexts more often than distant ones.
- Subsampling and Pruning: Rare words are pruned and very frequent words are down-sampled before contexts are extracted, which indirectly increases the effective context window size; a small sketch of both tricks follows this list.
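The sketch below illustrates both tricks under assumed hyperparameters: the effective window size is drawn uniformly up to a maximum, and frequent tokens are discarded with probability $1 - \sqrt{t/f(w)}$ (the subsampling formula given in Mikolov et al.'s paper; the released word2vec code uses a slightly different variant). The constants and function names are illustrative.

```python
import math
import random

MAX_WINDOW = 5        # maximal window size (assumed value)
SUBSAMPLE_T = 1e-5    # subsampling threshold t (assumed value)

def dynamic_window():
    """Sample the effective window size for one target word."""
    return random.randint(1, MAX_WINDOW)

def keep_token(freq):
    """Keep a token whose relative corpus frequency is `freq` with probability 1 - p_discard."""
    discard_prob = max(0.0, 1.0 - math.sqrt(SUBSAMPLE_T / freq))
    return random.random() > discard_prob

print(dynamic_window(), keep_token(0.01), keep_token(1e-6))
```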
Theoretical and Practical Implications
The central justification for negative sampling remains intuitive rather than formally proven. Its conceptual basis is the distributional hypothesis: words that occur in similar contexts tend to have similar meanings. This grounding suggests that the learned embeddings capture semantic similarity in a meaningful way.
Speculation on Future Developments
Considering advances in computational resources and optimization algorithms, future research may aim at refining negative sampling or exploring alternative methods to achieve more efficient and interpretable embeddings. There is also potential for investigating theoretical frameworks that provide a rigorous understanding of why and when negative sampling produces high-quality embeddings.
In conclusion, Goldberg and Levy's paper serves as an invaluable resource for demystifying the math and rationale behind negative sampling in word2vec, making it accessible for researchers aiming to explore word embedding methodologies.