Skip-gram with Negative Sampling
- Skip-gram with Negative Sampling is a neural model that learns distributed representations of words and nodes by maximizing observed co-occurrence and repelling negative pairs.
- It optimizes a stochastic binary classification objective that implicitly factorizes a shifted PMI matrix, enhancing semantic regularities and downstream task performance.
- SGNS forms the basis for models like word2vec and graph embeddings, supporting efficient incremental training and applications in NLP and network science.
Skip-gram with Negative Sampling (SGNS) is a neural probabilistic model for learning distributed representations of words and nodes, and a foundational technique for unsupervised representation learning across natural language processing and network science. SGNS forms the core of word2vec and its derivatives, as well as graph embedding algorithms such as node2vec, LINE, and DeepWalk. The defining principle of SGNS is to maximize, for each observed (target, context) pair, the probability of co-occurrence under a shallow neural model, while repelling representations of randomly sampled “negative” context pairs to avoid trivial solutions. Empirically, SGNS embeddings preserve semantic regularities, encode higher-order co-occurrence structure, and serve as a universal substrate for downstream similarity and classification tasks.
1. Mathematical Formulation and Objective
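Let $V$ be a vocabulary of items (e.g., words), and $D$ a collection of observed $(w, c)$ pairs from a training corpus. SGNS learns two sets of $d$-dimensional embeddings, $\{\mathbf{u}_w\}$ (“input”/“target”) and $\{\mathbf{v}_c\}$ (“context”/“output”), via the following stochastic binary classification objective:
$$\ell(w, c) = \log \sigma(\mathbf{u}_w^\top \mathbf{v}_c) + k\, \mathbb{E}_{c' \sim P_n}\left[\log \sigma(-\mathbf{u}_w^\top \mathbf{v}_{c'})\right]$$
where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid nonlinearity, $P_n$ is a noise distribution (typically a smoothed or empirical unigram), and $k > 0$ is the number of negative samples per positive pair (Shi et al., 2014, Landgraf et al., 2017). The global objective, summed over all positive pairs, is maximized via stochastic gradient ascent.
Summing over all word–context pairs and writing the empirical distribution as $\hat{p}(w, c) = \#(w, c)/|D|$, with marginal $\hat{p}(w) = \sum_c \hat{p}(w, c)$, the corpus-level loss is: $$\mathcal{L} = -\sum_{w, c} \left[\hat{p}(w, c) \log \sigma(\mathbf{u}_w^\top \mathbf{v}_c) + k\, \hat{p}(w)\, P_n(c) \log \sigma(-\mathbf{u}_w^\top \mathbf{v}_c)\right]$$ This loss decomposes into an attraction term for observed co-occurrences and a repulsion term for negative (random) samples (Assylbekov et al., 2020, Shi et al., 2014).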
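As a rough illustration of the attraction and repulsion terms, the sketch below evaluates the corpus-level loss on a toy co-occurrence table. The counts, the embedding values, and the helper names (`sgns_loss`, `log_sigmoid`, `dot`) are illustrative assumptions, and the noise distribution is assumed to be the empirical context unigram.

```python
import math
from collections import Counter

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_loss(counts, U, V, k):
    """Negative SGNS objective, normalized by the number of observed pairs.

    counts: dict mapping (word, context) -> co-occurrence count
    U, V:   dicts mapping word / context -> embedding (list of floats)
    k:      number of negative samples per positive pair
    Assumption: the noise distribution is the empirical context unigram.
    """
    total = sum(counts.values())
    w_count, c_count = Counter(), Counter()
    for (w, c), n in counts.items():
        w_count[w] += n
        c_count[c] += n

    # Attraction: observed pairs, weighted by their empirical probability.
    attraction = sum(n / total * log_sigmoid(dot(U[w], V[c]))
                     for (w, c), n in counts.items())
    # Repulsion: every (w, c) cell, weighted by k * p(w) * P_n(c).
    repulsion = sum(k * (w_count[w] / total) * (c_count[c] / total)
                    * log_sigmoid(-dot(U[w], V[c]))
                    for w in w_count for c in c_count)
    return -(attraction + repulsion)

counts = {("cat", "purrs"): 4, ("cat", "barks"): 1,
          ("dog", "barks"): 4, ("dog", "purrs"): 1}
U = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
V_good = {"purrs": [1.0, -1.0], "barks": [-1.0, 1.0]}   # agrees with the counts
V_bad = {"purrs": [-1.0, 1.0], "barks": [1.0, -1.0]}    # contradicts the counts

loss_good = sgns_loss(counts, U, V_good, k=1)
loss_bad = sgns_loss(counts, U, V_bad, k=1)
```

Embeddings whose inner products agree with the observed co-occurrences (`V_good`) give a lower loss than embeddings that contradict them (`V_bad`).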
2. Theoretical Properties and PMI Factorization
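A key theoretical insight is that SGNS implicitly factorizes a shifted Pointwise Mutual Information (PMI) matrix. Under ideal conditions (an embedding dimension large enough to fit every cell independently, and a noise distribution equal to the empirical context unigram), the optimal inner product of the target embedding $\mathbf{u}_w$ and context embedding $\mathbf{v}_c$ satisfies: $$\mathbf{u}_w^\top \mathbf{v}_c = \mathrm{PMI}(w, c) - \log k$$ where
$$\mathrm{PMI}(w, c) = \log \frac{\hat{p}(w, c)}{\hat{p}(w)\, \hat{p}(c)},$$
$\hat{p}$ denotes empirical (relative-frequency) probabilities, and $k$ is the number of negative samples per positive pair.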
This relation establishes SGNS as a stochastic approximation to low-rank matrix factorization of the shifted PMI, unifying it with explicit count-model approaches such as GloVe and SPPMI-SVD (Shi et al., 2014, Landgraf et al., 2017, Assylbekov et al., 2020).
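To make the factorization target concrete, the following sketch builds the shifted PMI values from toy counts and checks numerically that the shifted PMI of a cell is a stationary point of that cell's expected objective. The counts and the function names `shifted_pmi` and `cell_gradient` are illustrative assumptions; the noise distribution is assumed to be the empirical context unigram.

```python
import math
from collections import Counter

def shifted_pmi(counts, k):
    """Return {(w, c): PMI(w, c) - log k} for every observed pair."""
    total = sum(counts.values())
    w_count, c_count = Counter(), Counter()
    for (w, c), n in counts.items():
        w_count[w] += n
        c_count[c] += n
    return {
        (w, c): math.log(n * total / (w_count[w] * c_count[c])) - math.log(k)
        for (w, c), n in counts.items()
    }

def cell_gradient(x, p_wc, p_w, p_c, k):
    """Derivative in x of p_wc*log(sigmoid(x)) + k*p_w*p_c*log(sigmoid(-x))."""
    sig = 1.0 / (1.0 + math.exp(-x))
    return p_wc * (1.0 - sig) - k * p_w * p_c * sig

counts = {("cat", "purrs"): 4, ("cat", "barks"): 1,
          ("dog", "barks"): 4, ("dog", "purrs"): 1}
k = 2
target = shifted_pmi(counts, k)

# For ("cat", "purrs"): p(w, c) = 4/10, p(w) = 5/10, p(c) = 5/10.
# At x = PMI - log k the per-cell gradient vanishes up to rounding.
grad = cell_gradient(target[("cat", "purrs")], p_wc=0.4, p_w=0.5, p_c=0.5, k=k)
```

Explicit count-based methods such as SPPMI-SVD factorize a matrix of this kind directly (after clipping negative entries to zero), whereas SGNS approaches it stochastically under a rank constraint.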
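From the weighted logistic PCA perspective, SGNS can be viewed as fitting a low-rank log-odds model to a matrix of binomial proportions constructed from observed and negatively sampled pairs. Each $(w, c)$ cell accumulates, in expectation, $n_{wc} = \#(w, c) + k\, \#(w)\, P_n(c)$ trials, with “successes” $\#(w, c)$, yielding (Landgraf et al., 2017): $$\max_{\{\mathbf{u}_w\}, \{\mathbf{v}_c\}} \sum_{w, c} n_{wc} \left[x_{wc} \log \sigma(\theta_{wc}) + (1 - x_{wc}) \log \sigma(-\theta_{wc})\right], \qquad x_{wc} = \frac{\#(w, c)}{n_{wc}}, \quad \theta_{wc} = \mathbf{u}_w^\top \mathbf{v}_c,$$ where $\mathbf{u}_w$ and $\mathbf{v}_c$ are the target and context embeddings, $k$ is the number of negative samples per positive pair, and $P_n$ is the noise distribution.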
3. Optimization, Incrementality, and Regularization
SGNS is typically trained with stochastic gradient descent (SGD) on the negated objective, since the potentially vast number of unique $(w, c)$ pairs and sampled negatives makes full-batch optimization impractical (Shi et al., 2014, Kaji et al., 2017, Peng et al., 2019).
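A single stochastic update can be sketched as follows. This is a minimal sketch rather than a reference implementation: the function name `sgns_step`, the learning rate, the zero initialization of context vectors, and the fixed negatives (which would normally be drawn from the noise distribution) are all assumptions made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_step(U, V, w, c, negatives, lr):
    """One stochastic gradient ascent step on a single positive pair (w, c).

    U, V:      lists of embedding vectors (lists of floats), indexed by item id
    negatives: context ids standing in for draws from the noise distribution
    """
    u = U[w]
    grad_u = [0.0] * len(u)
    # Label 1 for the observed context, 0 for each sampled negative.
    for ctx, label in [(c, 1.0)] + [(n, 0.0) for n in negatives]:
        v = V[ctx]
        g = label - sigmoid(dot(u, v))   # derivative of the log-likelihood term
        for i in range(len(u)):
            grad_u[i] += g * v[i]        # accumulate the gradient for u_w
            v[i] += lr * g * u[i]        # update the context vector in place
    for i in range(len(u)):
        u[i] += lr * grad_u[i]           # apply the accumulated update to u_w

# Toy run: item 0 co-occurs with context 1; contexts 2 and 3 act as negatives.
rng = random.Random(0)
dim, vocab_size = 8, 4
U = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
V = [[0.0] * dim for _ in range(vocab_size)]   # context vectors start at zero
for _ in range(200):
    sgns_step(U, V, w=0, c=1, negatives=[2, 3], lr=0.05)

positive_score = dot(U[0], V[1])   # pushed up by the attraction term
negative_score = dot(U[0], V[2])   # pushed down by the repulsion term
```

Each step touches only the target vector and the $1 + k$ context vectors involved, which is why the per-update cost is independent of the vocabulary size.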
Incremental SGNS: Recent works have developed “single-pass” SGNS variants for streaming or dynamic data, where the noise distribution $P_n$ is adapted incrementally based on observed frequencies (Kaji et al., 2017, Peng et al., 2019). Theoretical bounds confirm that, under mild assumptions, incremental SGNS achieves almost the same objective value and embedding quality as conventional multi-pass algorithms, with substantial efficiency gains.
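The idea of adapting the noise distribution to a stream can be illustrated with a deliberately simplified sketch. It is not the algorithm of the cited papers: the class `StreamingNoise`, its methods, and the choice to recompute weights on every draw are assumptions made for clarity, and a practical implementation would use a more efficient sampling structure.

```python
import random
from collections import Counter

class StreamingNoise:
    """Smoothed-unigram noise distribution maintained from running counts."""

    def __init__(self, alpha=0.75, seed=0):
        self.alpha = alpha
        self.counts = Counter()
        self.rng = random.Random(seed)

    def observe(self, item):
        """Update the running frequency of an item seen in the stream."""
        self.counts[item] += 1

    def probability(self, item):
        """Current noise probability, proportional to count ** alpha."""
        norm = sum(n ** self.alpha for n in self.counts.values())
        return self.counts[item] ** self.alpha / norm

    def sample(self, k):
        """Draw k negatives from the distribution implied by counts so far."""
        items = list(self.counts)
        weights = [self.counts[i] ** self.alpha for i in items]
        return self.rng.choices(items, weights=weights, k=k)

noise = StreamingNoise()
for token in "the cat sat on the mat the end".split():
    noise.observe(token)
    negatives = noise.sample(k=2)   # uses only the counts seen so far
```

Because the distribution is rebuilt from running counts, no preliminary pass over the corpus is needed to estimate frequencies before training starts.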
Regularization: The SGNS objective is invariant under invertible linear transformations of the embeddings, introducing ambiguity in geometry absent further constraints. Quadratic (Frobenius-norm) regularization removes all but orthogonal ambiguities, yielding uniqueness up to rotation and improving analogy accuracy at higher embedding dimensions (Mu et al., 2018).
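The invariance can be checked numerically. In the sketch below, a target vector is mapped by an arbitrary invertible matrix $A$ and the context vector by $A^{-\top}$; the inner product (and hence the SGNS objective) is unchanged, while the squared norms that a Frobenius penalty would measure are not. The matrix, the vectors, and the helper names are illustrative assumptions.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def transpose(A):
    return [list(col) for col in zip(*A)]

A = [[2.0, 1.0], [0.5, 3.0]]             # any invertible matrix works here
A_inv_T = transpose(inverse_2x2(A))

u, v = [0.3, -1.2], [0.7, 0.4]
u_new = matvec(A, u)                      # u -> A u
v_new = matvec(A_inv_T, v)                # v -> A^{-T} v

score_before = dot(u, v)
score_after = dot(u_new, v_new)           # (A u)^T (A^{-T} v) = u^T v

penalty_before = dot(u, u) + dot(v, v)    # what a quadratic regularizer sees
penalty_after = dot(u_new, u_new) + dot(v_new, v_new)
```

Since the penalty changes under a general $A$ but not under an orthogonal one, adding it to the objective leaves only rotational freedom, consistent with the uniqueness result cited above.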
Riemannian Optimization: Advanced algorithms such as RO-SGNS optimize directly over the Riemannian manifold of fixed-rank matrices, providing more stable solutions with higher SGNS objective values and improved word similarity results, especially for large embedding dimensions (Fonarev et al., 2017).
4. Noise Distributions and Negative Sampling
The behavior of SGNS is fundamentally governed by the negative sampling distribution $P_n$ (Wang et al., 2020, Caselles-Dupré et al., 2018, Chen et al., 2017). The most common choice is an “$\alpha$-smoothed” unigram with exponent $\alpha$ in $[0, 1]$ (with $\alpha = 3/4$ standard for words, but task-specific in recommendation and graph embedding):
$$P_n(c) = \frac{\#(c)^{\alpha}}{\sum_{c'} \#(c')^{\alpha}},$$
where $\#(c)$ is the number of occurrences of item $c$ in the training data.
Key findings include:
- Uniform $P_n$ ($\alpha = 0$) leads to slower convergence and poor embeddings.
- Empirical unigram $P_n$ ($\alpha = 1$) quickly overfits frequent words.
- $3/4$-power smoothing offers a balance between informativeness and variance.
- Adaptive and context-conditional noise distributions (caSGN) further improve performance by selectively targeting “hard” negatives (Wang et al., 2020, Chen et al., 2017).
In the batch setting, negative sampling efficiently approximates the expensive full softmax normalization (Shi et al., 2014).
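The effect of the smoothing exponent is easy to see on a small example. The sketch below compares uniform, smoothed, and raw unigram noise distributions and then draws a few negatives; the counts and the function name `smoothed_unigram` are illustrative assumptions.

```python
import random

def smoothed_unigram(counts, alpha=0.75):
    """Return {item: P_n(item)} with P_n proportional to count ** alpha."""
    weights = {item: n ** alpha for item, n in counts.items()}
    norm = sum(weights.values())
    return {item: w / norm for item, w in weights.items()}

counts = {"the": 1000, "cat": 50, "axolotl": 1}

uniform = smoothed_unigram(counts, alpha=0.0)    # every item equally likely
smoothed = smoothed_unigram(counts, alpha=0.75)  # common setting for words
unigram = smoothed_unigram(counts, alpha=1.0)    # raw relative frequency

# Draw five negatives from the smoothed distribution.
rng = random.Random(0)
items = list(smoothed)
negatives = rng.choices(items, weights=[smoothed[i] for i in items], k=5)
```

Smoothing moves probability mass from very frequent items toward rare ones, so a rare item such as `"axolotl"` is sampled more often than under the raw unigram but less often than under the uniform distribution.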
5. Generalizations and Interpretations
Matrix and Tensor Factorization: SGNS extends naturally to higher-order data: for time-evolving graphs or multi-relational data, SGNS is generalized to multidimensional tensor factorization, either as explicit Canonical-Polyadic (CP) decompositions or as implicit factorization via higher-order skip-gram objectives (Piaggesi et al., 2020).
Information-Theoretic View: At optimality, squared Euclidean distance in SGNS embeddings is a monotonic function of $\mathrm{csPMI}(x, y)$ (the co-occurrence shifted PMI of words $x$ and $y$), directly connecting geometric distance to likelihood-based similarity (Ethayarajh et al., 2018).
Graph Embedding: In networks, SGNS serves as a flexible primitive, fitting node embeddings using co-occurrence statistics from random walks or edge lists, with the SGNS negative term approximating a dimension-level re-centering operator (Liu et al., 2024). Dimension regularization can, in specific regimes, replace explicit negative sampling, yielding substantial computational savings while preserving downstream accuracy.
6. Empirical Properties and Applications
Empirically, SGNS-derived embeddings perform robustly across disparate domains:
- Word Embeddings: High-quality results on analogy, similarity, and analogy transfer tasks. Captures both first- and second-order similarity structure, similar to low-rank SVD but more scalable (Schlechtweg et al., 2019, Ethayarajh et al., 2018).
- Recommendation and Item Embedding: Sensitivity to hyperparameters such as window size, negative exponent, and subsampling threshold is acute; tuned parameters can yield up to 700% performance improvements over NLP defaults (Caselles-Dupré et al., 2018).
- Graph Embedding: Supports large-scale, dynamic, and attributed graphs, matching or exceeding static retraining in both link prediction and clustering, with up to 22× speedups for local updates (Peng et al., 2019, Liu et al., 2024).
- Generalized Similarity Learning: SGNS-style cross-entropy objectives provide better preservation of high-similarity pairs compared to standard $\ell_2$ losses in matrix factorization frameworks, especially for node classification and link prediction (Zhu et al., 2021).
7. Extensions, Limitations, and Broader Connections
SGNS forms the foundation for advanced representation learning models:
- Bayesian Neural Embeddings: Variational-Bayes SGNS yields uncertainty-aware embeddings, slightly outperforming point-estimate SGNS on standard benchmarks (Barkan, 2016).
- Word-Context Classification Framework: SGNS is a special case of a broader “Word–Context Classification” model family, parameterized by the choice of noise distribution, factorization structure, and objective (Wang et al., 2020). Adaptive negative sampling via GAN-style objectives further enhances robustness.
- Hyperbolic Embedding: The removal of the sigmoid in SGNS results in direct low-rank factorizations of squashed shifted PMI matrices, connecting learned probabilities to hyperbolic geometry and complex network structure (Assylbekov et al., 2020).
Limitations of SGNS include sensitivity to negative sampling strategies, ambiguity in embedding geometry absent regularization, and the assumption of independence between word and context distributions. Quadratic regularization, incremental training, and dimension-level constraints have each been proposed to address these issues. SGNS’s pervasive influence on language, network, and item representation learning, combined with ongoing theoretical developments, ensures its continued relevance as both a methodological benchmark and a substrate for future model development.