
Dynamic Span Masking in Language Models

Updated 6 January 2026
  • Dynamic span masking is a pretraining method that selects contiguous token spans based on PMI to identify strongly collocated n-grams.
  • It segments input texts into maximal collocated spans using a curated vocabulary and applies a masking budget with varied substitution modes.
  • Empirical results show that this approach accelerates training and improves downstream performance on tasks like SQuAD2.0 and RACE compared to traditional masking methods.

Dynamic span masking is a pretraining strategy for masked language models (MLMs) in which spans (sequences of contiguous tokens) are selected for masking based on data-driven, sequence-specific criteria rather than uniform or random heuristics. In particular, dynamic span masking can be instantiated using PMI-Masking, a methodology that selects highly collocated $n$-grams according to principled extensions of pointwise mutual information (PMI), thereby producing more challenging and semantically meaningful pretext tasks than random token masking. This approach has been shown to accelerate training and improve downstream performance of transformer-based models such as BERT, while providing a unified framework that encompasses existing heuristic masking schemes (Levine et al., 2020).

1. Principle of Pointwise Mutual Information for Span Selection

PMI-based dynamic span masking leverages the statistical association between tokens. The bigram PMI for two tokens $w_1, w_2$ is defined as:

$$\mathrm{PMI}(w_1 w_2) = \log\frac{p(w_1,w_2)}{p(w_1)\,p(w_2)}$$

where $p(w_1,w_2)$ is the empirical probability of the bigram and $p(w_1)$, $p(w_2)$ are unigram probabilities. For $n$-grams, the naïve generalization,

$$\mathrm{Naive\text{-}PMI}_n(w_1\cdots w_n) = \log\frac{p(w_1\cdots w_n)}{\prod_{j=1}^n p(w_j)},$$

is susceptible to high PMI values due to strongly collocated subspans (e.g., "Kuala Lumpur" within "Kuala Lumpur is"). To mitigate this, PMI-Masking introduces a principled $n$-gram PMI:

$$\mathrm{PMI}_n(w_1\cdots w_n) = \min_{\sigma \in \mathrm{seg}(w_1\cdots w_n)} \log \frac{p(w_1\cdots w_n)}{\prod_{s \in \sigma} p(s)}$$

where $\mathrm{seg}(w_1\cdots w_n)$ denotes all nontrivial contiguous segmentations, i.e., all ways of splitting the $n$-gram into at least two contiguous segments. This minimum identifies the weakest collocational link within the $n$-gram, ensuring that only uniformly strong collocations are selected.
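
The minimum over segmentations can be computed directly by enumerating internal cut points. The sketch below is only an illustration of the scoring rule, not the authors' implementation; it assumes a hypothetical `prob` lookup mapping token tuples (including single tokens) to the empirical probabilities defined in Section 2.

```python
import itertools
import math

def pmi_n(ngram, prob):
    """Principled n-gram PMI: the minimum, over all nontrivial contiguous
    segmentations, of log p(ngram) / (product of segment probabilities).

    `ngram` is a tuple of tokens; `prob` is a hypothetical dict mapping
    token tuples to empirical probabilities."""
    n = len(ngram)
    joint = prob[ngram]
    scores = []
    # A nontrivial segmentation corresponds to a nonempty subset of the
    # n-1 internal cut points.
    for k in range(1, n):
        for cuts in itertools.combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            segments = [ngram[a:b] for a, b in zip(bounds, bounds[1:])]
            scores.append(math.log(joint / math.prod(prob[s] for s in segments)))
    return min(scores)
```

For a bigram this reduces to the standard PMI; for longer spans the minimum penalizes candidates whose apparent strength comes from a single strongly collocated sub-span, as in the "Kuala Lumpur is" example above.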

2. Estimation and Construction of Collocation Vocabulary

Empirical probabilities are estimated from counts over a large pretraining corpus (English Wikipedia and BookCorpus, optionally augmented with OpenWebText). Unigram and $n$-gram probabilities are defined as:

$$p(w) = \frac{\text{count}(w)}{\sum_t \text{count}(t)}$$

$$p(w_1\cdots w_n) = \frac{\text{count}(w_1\cdots w_n)}{\sum_{\text{all }n\text{-grams}} 1}$$

Every contiguous span of length $2 \leq n \leq 5$ occurring at least 10 times is considered. Candidate spans of each length $n$ are ranked by their $\mathrm{PMI}_n$ score, and the top-ranked spans are merged across lengths to yield a collocation vocabulary of size $M = 800{,}000$, sufficient to cover approximately half of all corpus tokens (as determined by a held-out annotation study).
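
A compact sketch of this construction, reusing the hypothetical `pmi_n` scorer above, might look as follows. The per-length normalizers follow the probability definitions in this section; ranking is done jointly across lengths for brevity, whereas the text above describes ranking per length and then merging.

```python
from collections import Counter

def build_collocation_vocab(corpus_tokens, min_count=10, max_n=5,
                            vocab_size=800_000):
    """Count spans of length 1..max_n, keep n-grams (n >= 2) occurring at
    least `min_count` times, rank them by PMI_n, and keep the top spans."""
    counts = Counter()
    total = len(corpus_tokens)
    for n in range(1, max_n + 1):
        for i in range(total - n + 1):
            counts[tuple(corpus_tokens[i:i + n])] += 1

    # Empirical probabilities with a per-length normalizer (the number of
    # n-gram positions in the corpus), matching the definitions above.
    positions = {n: max(total - n + 1, 1) for n in range(1, max_n + 1)}
    prob = {gram: c / positions[len(gram)] for gram, c in counts.items()}

    candidates = [g for g, c in counts.items() if len(g) >= 2 and c >= min_count]
    ranked = sorted(candidates, key=lambda g: pmi_n(g, prob), reverse=True)
    return set(ranked[:vocab_size])
```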

3. Dynamic Span-Masking Algorithm

Given a sequence of tokens $T = (t_1, \ldots, t_L)$ and a collocation vocabulary $\mathcal{C}$, the dynamic span-masking procedure consists of:

  1. Segmentation: Partition $T$ into a list of non-overlapping units, where each unit is either the longest available collocated $n$-gram from $\mathcal{C}$ or a single token if no match is found. The process greedily selects the longest span from the current position, preventing overlap.
  2. Sampling Mask Units: With a mask budget of $b = 0.15L$ tokens, sampling proceeds by selecting units uniformly at random from those available, adding their token counts to the cumulative mask total until the budget is reached.
  3. Applying Masking Modes: For each selected span:
    • With probability 0.8, replace every token in the span with [MASK];
    • With probability 0.1, substitute each token with a randomly chosen token;
    • With probability 0.1, leave tokens unchanged.

This procedure generalizes multiple prior masking schemes by varying the construction of $\mathcal{C}$. The selection process ensures disjoint spans and always gives precedence to the longest matching collocation at each step; a minimal sketch of the full procedure is given below.
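
The sketch covers the greedy longest-match segmentation, uniform unit sampling under a 15% budget, and the span-wise 80/10/10 substitution. The `colloc_vocab` (a set of token tuples) and `vocab` (a token list for random replacement) arguments are hypothetical inputs, not prescribed by the source.

```python
import random

def pmi_mask(tokens, colloc_vocab, mask_token="[MASK]",
             vocab=None, budget_frac=0.15, max_n=5):
    """Sketch of dynamic span masking with a collocation vocabulary."""
    # 1. Greedy segmentation into longest collocations or single tokens.
    units, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in colloc_vocab:
                units.append((i, n))
                i += n
                break
        else:
            units.append((i, 1))
            i += 1

    # 2. Sample whole units uniformly at random until ~15% of tokens are covered.
    budget = int(budget_frac * len(tokens))
    selected, covered = [], 0
    for unit in random.sample(units, len(units)):
        if covered >= budget:
            break
        selected.append(unit)
        covered += unit[1]

    # 3. Apply the 80/10/10 masking modes span-wise.
    masked = list(tokens)
    labels = [None] * len(tokens)  # MLM targets for the selected positions
    for start, n in selected:
        r = random.random()
        for j in range(start, start + n):
            labels[j] = tokens[j]
            if r < 0.8:
                masked[j] = mask_token
            elif r < 0.9:
                masked[j] = random.choice(vocab) if vocab else tokens[j]
            # else: leave the original token unchanged
    return masked, labels
```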

4. Relationship to Prior Masking Schemes

Dynamic span masking via PMI-Masking encompasses and formally extends several previously proposed strategies:

| Masking Scheme | Collocation Vocabulary $\mathcal{C}$ | Masking Unit Definition |
|---|---|---|
| Random-Token Masking | $\mathcal{C} = \emptyset$ | Single tokens |
| Whole-Word Masking | All sub-word tokens constituting the same word | Whole words |
| Entity/Phrase Masking | Spans corresponding to parsed named entities or syntactic phrases | Semantic or syntactic phrases |
| Random-Span Masking | Random sample of contiguous spans with geometrically distributed lengths | Random contiguous token spans |
| PMI-Masking | Top PMI-ranked $n$-grams from corpus statistics | Strongly collocated $n$-grams |

By this mechanism, dynamic span masking serves as a smoothly parameterizable superset of prior approaches, selecting spans based on corpus-calibrated co-occurrence rather than arbitrary heuristic criteria (Levine et al., 2020).
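
As a usage illustration reusing the hypothetical sketches above, swapping the collocation vocabulary is all that distinguishes two of the schemes in the table: an empty vocabulary degenerates to random-token masking, while a corpus-derived vocabulary yields PMI-Masking. Here `tokens`, `corpus_tokens`, and `wordpiece_vocab` are assumed inputs.

```python
# Random-token masking: empty collocation vocabulary, so every unit is a single token.
masked, labels = pmi_mask(tokens, colloc_vocab=set(), vocab=wordpiece_vocab)

# PMI-Masking: vocabulary built from corpus statistics as in Section 2.
colloc = build_collocation_vocab(corpus_tokens)
masked, labels = pmi_mask(tokens, colloc_vocab=colloc, vocab=wordpiece_vocab)
```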

5. Empirical Results and Training Efficiency

Experiments utilizing BERT$_\mathrm{BASE}$ (12 layers, 768-dimensional hidden states, 3072-dimensional feed-forward layers) on Wikipedia + BookCorpus (16 GB) and an augmented corpus including OpenWebText (54 GB) demonstrate that PMI-Masking offers superior training efficiency and downstream task performance relative to random-span and random-token masking. Notable findings include:

  • On SQuAD2.0, PMI-Masking attains F1 ≈ 80.3 after approximately 600 K steps, compared to 1 M steps for random-span masking.
  • With additional data (54 GB), performance continues to scale, with random-span masking lagging behind until 2.4 M steps.
  • After 1 M steps (16 GB), PMI-Masking yields SQuAD2.0 F1 81.4, RACE accuracy 68.4, and GLUE 84.1, outperforming both random-token and random-span masking.
  • At 2.4 M steps (54 GB), PMI-Masking matches or exceeds the RACE performance of models trained with substantially more data and larger batch sizes (e.g., RoBERTa$_\mathrm{BASE}$ with 160 GB and $4\times10^9$ examples achieves 73.0% vs. 73.2% for PMI-Masking with 54 GB and $6.14\times10^8$ examples).
  • PMI-Masking improves on SpanBERT$_\mathrm{BASE}$ by +2.2 points on RACE, using a comparable setup and number of steps.

The adoption of dynamic span masking thus delivers both accelerated convergence and improved end-of-training metrics, reducing computational and environmental costs.

6. Implications and Future Directions

Dynamic span masking, as operationalized by PMI-Masking, demonstrates that mask budgets targeted at random single tokens underexploit bidirectional context in MLM pretraining. By focusing mask selection on strongly collocated spans, models are compelled to resolve higher-level language patterns rather than relying on local, shallow clues. The approach is fully data-driven, requiring no external annotation or syntactic resources, while effectively unifying existing masking paradigms under a principled formalism.

A plausible implication is that adaptive or hybrid span selection schemes (e.g., dynamically updating $\mathcal{C}$ during training as corpus statistics evolve, or integrating PMI-based selection with syntactic signals) could further enhance sample efficiency and downstream accuracy. The observed reduction in required pretraining steps and improved resource utilization position dynamic span masking as a central component in the development of scalable, environmentally conscious LLMs (Levine et al., 2020).

References

Levine et al. (2020). PMI-Masking: Principled Masking of Correlated Spans.
