Skip-gram Model

Updated 7 June 2026

The Skip-gram model is a predictive framework that learns continuous word embeddings by maximizing the likelihood of context words given a target word.
The model utilizes scalable techniques such as negative sampling and frequency-based subsampling to enhance efficiency and improve semantic representations.
Empirical evaluations show that improvements in subsampling and negative sampling yield better accuracy on analogy tasks and provide up to 2–3× faster training.

The Skip-gram model is a foundational approach within distributional semantics and neural network–based word representation learning, introduced explicitly in the context of the word2vec framework for capturing syntactic and semantic relationships among words in large text corpora. It aims to produce continuous distributed vector representations (embeddings) of words such that words with similar contexts in a corpus are positioned closely in the embedding space. The Skip-gram objective is to maximize the likelihood of context words given a center word, with model training relying on scalable techniques such as negative sampling or hierarchical softmax. Extensions and empirical refinements, particularly in subsampling schemes for frequent words and the design of negative sampling distributions, have driven efficiency and embedding quality improvements (Mikolov et al., 2013, Jiao et al., 2019).

1. Model Architecture and Objective Function

The Skip-gram model defines a predictive framework: for a given target word $w_t$ in a sequence $(w_{t-c}, \ldots, w_{t}, \ldots, w_{t+c})$ with context window size $c$, the model seeks to maximize the conditional likelihood of observing the surrounding context words $w_{t+j}$, $j \in \{-c, \ldots, -1, 1, \ldots, c\}$, given $w_t$. The Skip-gram log-likelihood for a training corpus of $T$ tokens is

$\mathcal{L} = \sum_{t=1}^{T}\sum_{-c\leq j \leq c, j\neq 0} \log\, p(w_{t+j}\mid w_t)$

where $p(w_O \mid w_I)$ is parameterized with two sets of vectors ($v$ for center/target words, $u$ for context words) and typically modeled via a softmax over the full vocabulary of size $V$,

$p(w_O \mid w_I) = \frac{\exp(u_{w_O}^{\top} v_{w_I})}{\sum_{w=1}^{V} \exp(u_{w}^{\top} v_{w_I})}$

Evaluating the softmax denominator requires a sum over the entire vocabulary for every training pair. This computational bottleneck led to efficient alternatives, including hierarchical softmax and negative sampling (Mikolov et al., 2013).
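The cost of the full softmax can be seen in a minimal sketch. The function name `softmax_prob` and the toy vectors below are illustrative assumptions, not part of word2vec; the point is only that the denominator visits every vocabulary word.

```python
import math

def softmax_prob(center_vec, context_vecs, target_index):
    """Return p(w_O | w_I) under the full softmax.

    center_vec   -- v for the center word w_I
    context_vecs -- list of u vectors, one per vocabulary word
    target_index -- position of w_O in the vocabulary
    """
    # Score every vocabulary word against the center word: u_w . v_{w_I}
    scores = [sum(u_i * v_i for u_i, v_i in zip(u, center_vec))
              for u in context_vecs]
    # Subtracting the max score keeps exp() stable and leaves the ratio unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    # The denominator sums over all V words, which is the bottleneck.
    return exps[target_index] / sum(exps)

# Toy vocabulary of 3 words with 2-dimensional vectors (made-up numbers).
v_center = [1.0, 0.0]
u_context = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
probs = [softmax_prob(v_center, u_context, i) for i in range(3)]
```

With a realistic vocabulary of hundreds of thousands of words, this sum would be repeated for every (center, context) pair, which is what the approximations avoid.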

2. Negative Sampling and Improvements

Negative sampling replaces the multi-class softmax loss with local binary classification tasks: for each (center, context) positive pair, sample $k$ “negative” context words from a noise distribution $P_n(w)$, and train the model to distinguish true context pairs from noise. The canonical noise distribution in word2vec is

$P_n(w) \propto U(w)^{3/4}$

where $U(w)$ is the empirical unigram frequency. This heuristic empirically improves rare word embedding quality but is not optimal. Recent work proposes semantics-aware subsampling of negatives using a sub-sampled unigram distribution, where the threshold for discarding frequent words is grounded in the semantic–syntactic balance as implied by Zipf’s law (Jiao et al., 2019). This approach adaptively reduces the gradient wasted on function words and further improves semantic analogy task accuracies.
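As a rough illustration, the canonical noise distribution (unigram counts raised to the 3/4 power, then normalized) and the per-pair negative-sampling loss, $-\log \sigma(u_{w_O}^{\top} v_{w_I}) - \sum_{i=1}^{k} \log \sigma(-u_{w_i}^{\top} v_{w_I})$, might be sketched as follows. The helper names and the toy counts are assumptions made for this example; the semantics-aware distribution of Jiao et al. is not implemented here.

```python
import math
import random

def noise_distribution(counts, power=0.75):
    """P_n(w) proportional to count(w)**power, normalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(v_center, u_pos, u_negs):
    """Loss for one positive pair plus its sampled negatives."""
    # The true context word should score high against the center word.
    loss = -math.log(sigmoid(dot(u_pos, v_center)))
    # Each sampled noise word should score low.
    for u_neg in u_negs:
        loss -= math.log(sigmoid(-dot(u_neg, v_center)))
    return loss

# Toy counts (made up): one very frequent word and a few rare ones.
counts = {"the": 1000, "cat": 10, "sat": 10, "zygote": 1}
p_n = noise_distribution(counts)

# Draw k = 5 negatives from the flattened distribution.
rng = random.Random(0)
negatives = rng.choices(list(p_n), weights=list(p_n.values()), k=5)
```

In this toy example the 3/4 power lowers the share of "the" relative to its raw unigram share and raises the share of "zygote", which is the flattening effect the heuristic relies on.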

3. Subsampling of Frequent Words

Training data for the Skip-gram model is highly redundant for frequent function words, leading to both computational inefficiency and suboptimal embedding geometries. To address this, the model adopts a straightforward discarding scheme: for each word $w_i$ with empirical frequency $f(w_i)$, discard it during training with probability

$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$

where the threshold $t$ is typically $10^{-5}$. This reduces the dominance of the highest-frequency tokens, reallocates learning capacity toward mid-frequency and rare words, and accelerates convergence. Empirical results on a 1-billion word corpus showed a 2–3× speedup with preserved or improved rare-word accuracy when $t = 10^{-5}$ (Mikolov et al., 2013).
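A short sketch of the discarding rule $P(w) = 1 - \sqrt{t / f(w)}$ follows. The function names are hypothetical, and clipping the probability at zero for words rarer than the threshold is an assumption added here so that the value stays a valid probability.

```python
import math
import random

def discard_prob(freq, t=1e-5):
    """Discard probability 1 - sqrt(t / f(w)), clipped at 0 (assumption)."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, rng=None):
    """Drop each token independently with its discard probability."""
    rng = rng or random.Random()
    return [w for w in tokens if rng.random() >= discard_prob(freqs[w], t)]

# Worked values with t = 1e-5:
p_frequent = discard_prob(0.05)   # a word making up 5% of tokens: about 0.986
p_threshold = discard_prob(1e-5)  # a word exactly at the threshold: 0.0
p_rare = discard_prob(1e-6)       # rarer than the threshold: clipped to 0.0
```

A word that accounts for 5% of all tokens is dropped roughly 98.6% of the time, while words at or below the threshold frequency are always kept.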

4. Phrases and Compositionality

The original Skip-gram model is word-based and agnostic to n-gram or phrase structure, which limits its ability to represent non-compositional phrases whose meaning is not a simple combination of their parts (e.g., “Air Canada”). Mikolov et al. introduced a heuristic phrase detection method to automatically identify phrase candidates in large text corpora, training phrase-level embeddings in the Skip-gram framework. This allows the vector space to encode both single-token and multi-token lexical units, improving the downstream utility of the learned representations for a variety of NLP tasks (Mikolov et al., 2013).
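The phrase heuristic in Mikolov et al. (2013) scores each bigram as $\text{score}(w_i, w_j) = \frac{\text{count}(w_i w_j) - \delta}{\text{count}(w_i)\,\text{count}(w_j)}$, where $\delta$ discounts very infrequent pairs, and treats bigrams above a threshold as phrases. The sketch below is one possible reading of that rule: the function names, the greedy left-to-right merge, the underscore joiner, and the toy sentence and threshold are all assumptions for illustration.

```python
from collections import Counter

def phrase_scores(tokens, delta=1.0):
    """Score each adjacent pair: (count(a b) - delta) / (count(a) * count(b))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n_ab - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

def merge_phrases(tokens, scores, threshold):
    """Greedily join adjacent pairs whose score exceeds the threshold."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))  # e.g. "air_canada" becomes one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = ("air canada flew north while air canada crews rested "
        "and cold air moved over canada")
tokens = text.split()
scores = phrase_scores(tokens)
merged = merge_phrases(tokens, scores, threshold=0.05)
```

Here only the pair ("air", "canada") occurs more than once, so it is the only bigram with a positive score; the standalone occurrences of "air" and "canada" remain separate tokens.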

5. Empirical Evaluation and Performance

The Skip-gram model, with its softmax variants and improved negative sampling distributions, has been systematically evaluated on intrinsic tasks (such as word similarity, analogy resolution) and extrinsic tasks. For analogy tasks on a 1B-token corpus, subsampled negative sampling improved total accuracy from 63.0% to 65.3%, with particular gains on semantic analogies. Subsampling of training data reduced wall-clock time by 2–3×, with analogous improvements in computational efficiency reported across model variants (Mikolov et al., 2013, Jiao et al., 2019).

6. Algorithmic Implementation

A typical Skip-gram training pipeline proceeds as follows:

Frequency calculation: One pass over the corpus to compute the frequency $f(w)$ for every vocabulary word $w$ and precompute its discard probability $P(w)$.
Subsampling: During training, discard each token $w_i$ with probability $P(w_i)$.
Context extraction: For each center word $w_t$ in the filtered sentence, extract pairs $(w_t, w_{t+j})$ for each $j \in \{-c, \ldots, -1, 1, \ldots, c\}$.
Model update: For each pair $(w_t, w_{t+j})$, perform a Skip-gram update (positive) and $k$ negative sampling updates (negatives). Negative examples are drawn according to the chosen noise distribution (either the unigram distribution raised to the 3/4 power, $U(w)^{3/4}$, or a semantics-aware sub-sampled unigram distribution) (Mikolov et al., 2013, Jiao et al., 2019).
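The context-extraction and model-update steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the word2vec implementation: the function names, learning rate, vector size, and toy sentence are hypothetical; negatives are drawn uniformly (every toy word occurs once, so the 3/4-power distribution would also be uniform); and negatives equal to the true context word are simply skipped.

```python
import math
import random

def extract_pairs(sentence, c):
    """Return (center, context) pairs for a window of size c."""
    pairs = []
    for t, center in enumerate(sentence):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(sentence):
                pairs.append((center, sentence[t + j]))
    return pairs

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgd_step(v, u, center, context, negatives, lr=0.05):
    """One negative-sampling update; v and u map word -> list of floats."""
    v_c = v[center]
    grad_center = [0.0] * len(v_c)
    # Label 1 for the true context word, 0 for each sampled negative.
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u_w = u[word]
        score = sum(a * b for a, b in zip(u_w, v_c))
        g = sigmoid(score) - label  # gradient of the logistic loss w.r.t. score
        for i in range(len(v_c)):
            grad_center[i] += g * u_w[i]  # accumulate using the old u_w[i]
            u_w[i] -= lr * g * v_c[i]     # then update the context vector
    # Apply the accumulated gradient to the center vector last.
    for i in range(len(v_c)):
        v_c[i] -= lr * grad_center[i]

rng = random.Random(0)
sentence = ["the", "quick", "brown", "fox", "jumps"]
vocab = sorted(set(sentence))
dim = 4
v = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
u = {w: [0.0] * dim for w in vocab}

pairs = extract_pairs(sentence, c=1)
for _ in range(50):  # a few passes over the toy sentence
    for center, context in pairs:
        negatives = [w for w in rng.choices(vocab, k=2) if w != context]
        sgd_step(v, u, center, context, negatives)
```

Frequency counting and subsampling would run before `extract_pairs` in a fuller pipeline; they are left out here to keep the update step readable.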

7. Extensions, Limitations, and Research Directions

The Skip-gram model catalyzed the field of neural vector representations for language, but its context-independence and word-level granularity are limiting for polysemous words and phrase meaning. Recent advances in noise distribution design, semantics-calibrated negative sampling, and adaptive frequency-based subsampling directly build on the original methodology. Ongoing research explores dynamic context-dependent embeddings, integration with large-vocabulary LLMs, and scalable training paradigms that optimize the trade-off between computational efficiency and representational fidelity (Mikolov et al., 2013, Jiao et al., 2019).

References

Mikolov et al. (2013). Distributed Representations of Words and Phrases and their Compositionality.

Jiao et al. (2019). Improving Word Representations: A Sub-sampled Unigram Distribution for Negative Sampling.
