Dynamic Span Masking in Language Models
- Dynamic span masking is a pretraining method that selects contiguous token spans based on PMI to identify strongly collocated n-grams.
- It segments input texts into maximal collocated spans using a curated vocabulary and applies a masking budget with varied substitution modes.
- Empirical results show that this approach accelerates training and improves downstream performance on tasks like SQuAD2.0 and RACE compared to traditional masking methods.
Dynamic span masking is a pretraining strategy for masked language models (MLMs) in which spans—sequences of contiguous tokens—are selected for masking based on data-driven, sequence-specific criteria rather than uniform or random heuristics. In particular, dynamic span masking can be instantiated using PMI-Masking, a methodology that selects highly collocated $n$-grams according to principled extensions of pointwise mutual information (PMI), thereby producing more challenging and semantically meaningful pretext tasks compared to random token masking. This approach has been demonstrated to accelerate training and improve downstream performance of transformer-based models such as BERT, while providing a unified framework that encompasses existing heuristic masking schemes (Levine et al., 2020).
1. Principle of Pointwise Mutual Information for Span Selection
PMI-based dynamic span masking leverages the statistical association between tokens. The bigram PMI for two tokens $w_1, w_2$ is defined as:

$$\mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)},$$

where $p(w_1, w_2)$ is the empirical probability of the bigram and $p(w_1)$, $p(w_2)$ are unigram probabilities. For $n$-grams, the naïve generalization,

$$\mathrm{Naive\text{-}PMI}(w_1 \ldots w_n) = \log \frac{p(w_1 \ldots w_n)}{\prod_{i=1}^{n} p(w_i)},$$

is susceptible to high PMI values due to strongly collocated subspans (e.g., "Kuala Lumpur" within "Kuala Lumpur is"). To mitigate this, PMI-Masking introduces a principled $n$-gram PMI:

$$\mathrm{PMI}_n(w_1 \ldots w_n) = \min_{\sigma \in \mathrm{seg}(w_1 \ldots w_n)} \log \frac{p(w_1 \ldots w_n)}{\prod_{s \in \sigma} p(s)},$$

where $\mathrm{seg}(w_1 \ldots w_n)$ denotes the set of all nontrivial contiguous segmentations of the $n$-gram. This minimum identifies the weakest collocational link within the $n$-gram, ensuring that only uniformly strong collocations are selected.
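To make the scoring concrete, the following Python sketch computes the segmentation-minimised $\mathrm{PMI}_n$ score from a precomputed probability table. The helper names (`segmentations`, `pmi_n`) and the `prob` lookup (a mapping from token tuples to empirical probabilities) are illustrative assumptions, not part of any reference implementation.

```python
import math
from itertools import combinations

def segmentations(ngram):
    """Yield all nontrivial contiguous segmentations of an n-gram.

    A segmentation splits the n-gram into two or more contiguous pieces;
    the trivial segmentation (the n-gram itself) is excluded.
    """
    n = len(ngram)
    # Choose cut positions between tokens; at least one cut => nontrivial.
    for k in range(1, n):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [ngram[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def pmi_n(ngram, prob):
    """Segmentation-minimised PMI of a multi-token n-gram.

    `prob` maps token tuples to empirical probabilities. The score is the
    minimum, over all nontrivial contiguous segmentations, of
    log p(ngram) / prod(p(segment)), i.e. the weakest collocational link.
    """
    p_joint = prob[tuple(ngram)]
    return min(
        math.log(p_joint / math.prod(prob[tuple(seg)] for seg in segmentation))
        for segmentation in segmentations(tuple(ngram))
    )
```

For a bigram, the only nontrivial segmentation is the two unigrams, so `pmi_n` reduces to the standard bigram PMI above.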
2. Estimation and Construction of Collocation Vocabulary
Empirical probabilities are estimated from counts over a large pretraining corpus (Wikipedia and BookCorpus, optionally augmented with OpenWebText). Unigram and $n$-gram probabilities are defined as relative frequencies:

$$p(w) = \frac{\#(w)}{N}, \qquad p(w_1 \ldots w_n) = \frac{\#(w_1 \ldots w_n)}{N},$$

where $\#(\cdot)$ counts occurrences in the corpus and $N$ is the total number of tokens. Every contiguous span of length $n \geq 2$ (up to a fixed maximum span length) occurring at least 10 times in the corpus is considered as a candidate. Candidates of each length are ranked by their $\mathrm{PMI}_n$ score, and the top-ranked spans are merged across all lengths to yield a collocation vocabulary large enough to cover approximately half of all corpus tokens (as determined by a held-out annotation study).
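A minimal sketch of this vocabulary-construction step, reusing the `pmi_n` helper above, is shown below. The maximum span length and vocabulary size are assumed hyperparameters rather than values prescribed by the source, and for brevity candidates of all lengths are ranked jointly rather than per length.

```python
from collections import Counter

def build_collocation_vocab(corpus_tokens, vocab_size, max_len=5, min_count=10):
    """Sketch of collocation-vocabulary construction.

    corpus_tokens: flat list of (sub-word) tokens from the pretraining corpus.
    vocab_size / max_len are illustrative settings; min_count follows the
    at-least-10-occurrences criterion described above.
    """
    total = len(corpus_tokens)

    # 1. Count all contiguous n-grams up to max_len (unigrams are needed for
    #    the probability estimates in the PMI_n denominator).
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(total - n + 1):
            counts[tuple(corpus_tokens[i:i + n])] += 1

    # 2. Convert counts to empirical probabilities (relative frequencies).
    prob = {ngram: c / total for ngram, c in counts.items()}

    # 3. Score every sufficiently frequent multi-token span with PMI_n and
    #    keep the top-ranked spans, merged across lengths.
    candidates = [
        (pmi_n(ngram, prob), ngram)
        for ngram, c in counts.items()
        if len(ngram) >= 2 and c >= min_count
    ]
    candidates.sort(reverse=True)
    return {ngram for _, ngram in candidates[:vocab_size]}
```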
3. Dynamic Span-Masking Algorithm
Given a sequence of tokens $x = (x_1, \ldots, x_T)$ and a collocation vocabulary $\mathcal{V}$, the dynamic span-masking procedure consists of:
- Segmentation: Partition $x$ into a list of non-overlapping units, where each unit is either the longest available collocated $n$-gram from $\mathcal{V}$ or a single token if no match is found. The process greedily selects the longest span from the current position, preventing overlap.
- Sampling Mask Units: With a masking budget of 15% of the sequence's tokens (the standard MLM budget), units are selected uniformly at random from those available, and each selected unit's token count is added to the cumulative mask total until the budget is reached.
- Applying Masking Modes: For each selected span:
- With probability 0.8, replace every token in the span with [MASK];
- With probability 0.1, substitute each token with a randomly chosen token;
- With probability 0.1, leave tokens unchanged.
This procedure generalizes multiple prior masking schemes by varying the construction of the collocation vocabulary $\mathcal{V}$. The selection process ensures disjoint spans and always gives precedence to the longest matching collocation at each step, as sketched below.
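The following Python sketch ties the three steps together under simplifying assumptions: `collocations` is a set of token tuples drawn from the collocation vocabulary, `token_vocab` is a list of tokens used for random substitution, and the 15% budget and maximum span length are assumed hyperparameters rather than values prescribed by the source.

```python
import random

MASK_TOKEN = "[MASK]"

def segment(tokens, collocations, max_len=5):
    """Greedy longest-match segmentation into non-overlapping units.

    Each unit is either the longest collocation from `collocations` starting
    at the current position, or a single token if no collocation matches.
    """
    units, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in collocations:
                units.append(tokens[i:i + n])
                i += n
                break
        else:
            units.append(tokens[i:i + 1])
            i += 1
    return units

def dynamic_span_mask(tokens, collocations, token_vocab, budget=0.15, rng=random):
    """Apply dynamic span masking to one token sequence.

    Units are sampled uniformly at random (shuffle + prefix) until roughly
    `budget` of the tokens are covered; each selected unit is replaced
    span-wise using the 80/10/10 modes described above.
    """
    units = segment(tokens, collocations)
    order = list(range(len(units)))
    rng.shuffle(order)

    target = budget * len(tokens)
    selected, covered = set(), 0
    for idx in order:
        if covered >= target:
            break
        selected.add(idx)
        covered += len(units[idx])

    out, labels = [], []
    for idx, unit in enumerate(units):
        if idx not in selected:
            out.extend(unit)
            labels.extend([None] * len(unit))   # not a prediction target
            continue
        r = rng.random()                        # one substitution mode per span
        for tok in unit:
            labels.append(tok)                  # MLM target for every masked position
            if r < 0.8:
                out.append(MASK_TOKEN)          # replace whole span with [MASK]
            elif r < 0.9:
                out.append(rng.choice(token_vocab))  # random-token substitution
            else:
                out.append(tok)                 # leave tokens unchanged
    return out, labels
```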
4. Relationship to Prior Masking Schemes
Dynamic span masking via PMI-Masking encompasses and formally extends several previously proposed strategies:
| Masking Scheme | Collocation Vocabulary | Masking Unit Definition |
|---|---|---|
| Random-Token Masking | None (no multi-token collocations) | Single tokens |
| Whole-Word Masking | All sub-word tokens constituting the same word | Whole words |
| Entity/Phrase Masking | Spans corresponding to parsed named entities or syntactic phrases | Semantic or syntactic phrases |
| Random-Span Masking | Large random sample from geometric length spans | Random contiguous token spans |
| PMI-Masking | Top $\mathrm{PMI}_n$-ranked $n$-grams from corpus statistics | Strongly collocated $n$-grams |
By this mechanism, dynamic span masking serves as a smoothly parameterizable superset of prior approaches, selecting spans based on corpus-calibrated co-occurrence rather than arbitrary heuristic criteria (Levine et al., 2020).
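In code terms, each row of the table amounts to a different collocation set passed to the same masking routine sketched above; the toy tokens and vocabularies below are hypothetical placeholders used only to illustrate the parameterization.

```python
# The same dynamic_span_mask routine reproduces prior schemes by swapping
# the collocation set (all values here are illustrative placeholders).
tokens = ["kuala", "lumpur", "is", "the", "capital", "of", "malaysia"]
token_vocab = sorted(set(tokens))

no_collocations = set()                    # empty vocabulary -> random-token masking
pmi_collocations = {("kuala", "lumpur")}   # PMI-selected collocation -> PMI-Masking

masked, labels = dynamic_span_mask(tokens, pmi_collocations, token_vocab)
```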
5. Empirical Results and Training Efficiency
Experiments utilizing BERT-base-scale models (12 layers, 768-dimensional hidden representations, 3072-dimensional feed-forward layers) on Wikipedia + BookCorpus (16 GB) and an augmented set with OpenWebText (54 GB) demonstrate that PMI-Masking offers superior training efficiency and downstream task performance relative to random-span and random-token masking. Notable findings include:
- On SQuAD2.0, PMI-Masking attains F1 ≈ 80.3 after approximately 600 K steps, a level that random-span masking requires roughly 1 M steps to reach.
- With additional data (54 GB), performance continues to scale, with random-span masking lagging behind until 2.4 M steps.
- On 1 M steps (16 GB), PMI-Masking yields SQuAD2.0 F1 81.4, RACE accuracy 68.4, and GLUE 84.1, outperforming both random-token and random-span masking.
- At 2.4 M steps (54 GB), PMI-Masking matches or exceeds the RACE performance of models trained with substantially more data and larger batch sizes (e.g., RoBERTa, trained on 160 GB of text, achieves 73.0%, versus 73.2% for PMI-Masking trained on 54 GB).
- PMI-Masking improves on SpanBERT by +2.2 points on RACE, using comparable setup and steps.
The adoption of dynamic span masking thus delivers both accelerated convergence and improved end-of-training metrics, reducing computational and environmental costs.
6. Implications and Future Directions
Dynamic span masking, as operationalized by PMI-Masking, demonstrates that mask budgets targeted at random single tokens underexploit bidirectional context in MLM pretraining. By focusing mask selection on strongly collocated spans, models are compelled to resolve higher-level language patterns rather than relying on local, shallow clues. The approach is fully data-driven, requiring no external annotation or syntactic resources, while effectively unifying existing masking paradigms under a principled formalism.
A plausible implication is that adaptive or hybrid span selection schemes—e.g., dynamically updating the collocation vocabulary during training as corpus statistics evolve, or integrating PMI-based selection with syntactic signals—could further enhance sample efficiency and downstream accuracy. The observed reduction in required pretraining steps and improved resource utilization position dynamic span masking as a central component in the development of scalable, environmentally conscious LLMs (Levine et al., 2020).