Dynamic Span Masking in Language Models
- Dynamic span masking is a pretraining method that selects contiguous token spans based on PMI to identify strongly collocated n-grams.
- It segments input texts into maximal collocated spans using a curated vocabulary and applies a masking budget with varied substitution modes.
- Empirical results show that this approach accelerates training and improves downstream performance on tasks like SQuAD2.0 and RACE compared to traditional masking methods.
Dynamic span masking is a pretraining strategy for masked language models (MLMs) in which spans—sequences of contiguous tokens—are selected for masking based on data-driven, sequence-specific criteria rather than uniform or random heuristics. In particular, dynamic span masking can be instantiated using PMI-Masking, a methodology that selects highly collocated $n$-grams according to principled extensions of pointwise mutual information (PMI), thereby producing more challenging and semantically meaningful pretext tasks compared to random token masking. This approach has been demonstrated to accelerate training and improve downstream performance of transformer-based models such as BERT, while providing a unified framework that encompasses existing heuristic masking schemes (Levine et al., 2020).
1. Principle of Pointwise Mutual Information for Span Selection
PMI-based dynamic span masking leverages the statistical association between tokens. The bigram PMI for two tokens $w_1, w_2$ is defined as:

$$\mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)},$$

where $p(w_1, w_2)$ is the empirical probability of the bigram and $p(w_1)$, $p(w_2)$ are unigram probabilities. For $n$-grams, the naïve generalization,

$$\mathrm{Naive\text{-}PMI}(w_1 \ldots w_n) = \log \frac{p(w_1 \ldots w_n)}{\prod_{i=1}^{n} p(w_i)},$$

is susceptible to high PMI values due to strongly collocated subspans (e.g., "Kuala Lumpur" within "Kuala Lumpur is"). To mitigate this, PMI-Masking introduces a principled $n$-gram PMI:

$$\mathrm{PMI}_n(w_1 \ldots w_n) = \min_{\sigma \in \mathrm{seg}(w_1 \ldots w_n)} \log \frac{p(w_1 \ldots w_n)}{\prod_{s \in \sigma} p(s)},$$

where $\mathrm{seg}(w_1 \ldots w_n)$ denotes the set of all nontrivial contiguous segmentations of the $n$-gram. This minimum identifies the weakest collocational link within the $n$-gram, ensuring that only uniformly strong collocations are selected.
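To make the scoring concrete, the following Python sketch computes the segmentation-minimised $\mathrm{PMI}_n$ score from a precomputed probability table. The helper names (`segmentations`, `pmi_n`) and the `prob` lookup (a mapping from token tuples to empirical probabilities) are illustrative assumptions, not part of any reference implementation.

```python
import math
from itertools import combinations

def segmentations(ngram):
    """Yield all nontrivial contiguous segmentations of an n-gram.

    A segmentation splits the n-gram into two or more contiguous pieces;
    the trivial segmentation (the n-gram itself) is excluded.
    """
    n = len(ngram)
    # Choose cut positions between tokens; at least one cut => nontrivial.
    for k in range(1, n):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [ngram[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def pmi_n(ngram, prob):
    """Segmentation-minimised PMI of a multi-token n-gram.

    `prob` maps token tuples to empirical probabilities. The score is the
    minimum, over all nontrivial contiguous segmentations, of
    log p(ngram) / prod(p(segment)), i.e. the weakest collocational link.
    """
    p_joint = prob[tuple(ngram)]
    return min(
        math.log(p_joint / math.prod(prob[tuple(seg)] for seg in segmentation))
        for segmentation in segmentations(tuple(ngram))
    )
```

For a bigram, the only nontrivial segmentation is the two unigrams, so `pmi_n` reduces to the standard bigram PMI above.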
2. Estimation and Construction of Collocation Vocabulary
Empirical probabilities are estimated from counts over a large pretraining corpus (Wikipedia and BookCorpus, optionally augmented with OpenWebText). Unigram and $n$-gram probabilities are defined as relative frequencies:

$$p(w) = \frac{\#(w)}{N}, \qquad p(w_1 \ldots w_n) = \frac{\#(w_1 \ldots w_n)}{N},$$

where $\#(\cdot)$ counts occurrences in the corpus and $N$ is the total number of tokens. Every contiguous span of length $n \geq 2$ (up to a fixed maximum span length) occurring at least 10 times in the corpus is considered as a candidate. Candidates of each length are ranked by their $\mathrm{PMI}_n$ score, and the top-ranked spans are merged across all lengths to yield a collocation vocabulary large enough to cover approximately half of all corpus tokens (as determined by a held-out annotation study).
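A minimal sketch of this vocabulary-construction step, reusing the `pmi_n` helper above, is shown below. The maximum span length and vocabulary size are assumed hyperparameters rather than values prescribed by the source, and for brevity candidates of all lengths are ranked jointly rather than per length.

```python
from collections import Counter

def build_collocation_vocab(corpus_tokens, vocab_size, max_len=5, min_count=10):
    """Sketch of collocation-vocabulary construction.

    corpus_tokens: flat list of (sub-word) tokens from the pretraining corpus.
    vocab_size / max_len are illustrative settings; min_count follows the
    at-least-10-occurrences criterion described above.
    """
    total = len(corpus_tokens)

    # 1. Count all contiguous n-grams up to max_len (unigrams are needed for
    #    the probability estimates in the PMI_n denominator).
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(total - n + 1):
            counts[tuple(corpus_tokens[i:i + n])] += 1

    # 2. Convert counts to empirical probabilities (relative frequencies).
    prob = {ngram: c / total for ngram, c in counts.items()}

    # 3. Score every sufficiently frequent multi-token span with PMI_n and
    #    keep the top-ranked spans, merged across lengths.
    candidates = [
        (pmi_n(ngram, prob), ngram)
        for ngram, c in counts.items()
        if len(ngram) >= 2 and c >= min_count
    ]
    candidates.sort(reverse=True)
    return {ngram for _, ngram in candidates[:vocab_size]}
```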
3. Dynamic Span-Masking Algorithm
Given a sequence of tokens $x = (x_1, \ldots, x_T)$ and a collocation vocabulary $\mathcal{V}$, the dynamic span-masking procedure consists of:
- Segmentation: Partition $x$ into a list of non-overlapping units, where each unit is either the longest available collocated $n$-gram from $\mathcal{V}$ or a single token if no match is found. The process greedily selects the longest span from the current position, preventing overlap.
- Sampling Mask Units: With a masking budget of 15% of the sequence's tokens (the standard MLM budget), units are selected uniformly at random from those available, and each selected unit's token count is added to the cumulative mask total until the budget is reached.
- Applying Masking Modes: For each selected span:
- With probability 0.8, replace every token in the span with [MASK];
- With probability 0.1, substitute each token with a randomly chosen token;
- With probability 0.1, leave tokens unchanged.
This procedure generalizes multiple prior masking schemes by varying the construction of the collocation vocabulary $\mathcal{V}$. The selection process ensures disjoint spans and always gives precedence to the longest matching collocation at each step, as sketched below.
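The following Python sketch ties the three steps together under simplifying assumptions: `collocations` is a set of token tuples drawn from the collocation vocabulary, `token_vocab` is a list of tokens used for random substitution, and the 15% budget and maximum span length are assumed hyperparameters rather than values prescribed by the source.

```python
import random

MASK_TOKEN = "[MASK]"

def segment(tokens, collocations, max_len=5):
    """Greedy longest-match segmentation into non-overlapping units.

    Each unit is either the longest collocation from `collocations` starting
    at the current position, or a single token if no collocation matches.
    """
    units, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in collocations:
                units.append(tokens[i:i + n])
                i += n
                break
        else:
            units.append(tokens[i:i + 1])
            i += 1
    return units

def dynamic_span_mask(tokens, collocations, token_vocab, budget=0.15, rng=random):
    """Apply dynamic span masking to one token sequence.

    Units are sampled uniformly at random (shuffle + prefix) until roughly
    `budget` of the tokens are covered; each selected unit is replaced
    span-wise using the 80/10/10 modes described above.
    """
    units = segment(tokens, collocations)
    order = list(range(len(units)))
    rng.shuffle(order)

    target = budget * len(tokens)
    selected, covered = set(), 0
    for idx in order:
        if covered >= target:
            break
        selected.add(idx)
        covered += len(units[idx])

    out, labels = [], []
    for idx, unit in enumerate(units):
        if idx not in selected:
            out.extend(unit)
            labels.extend([None] * len(unit))   # not a prediction target
            continue
        r = rng.random()                        # one substitution mode per span
        for tok in unit:
            labels.append(tok)                  # MLM target for every masked position
            if r < 0.8:
                out.append(MASK_TOKEN)          # replace whole span with [MASK]
            elif r < 0.9:
                out.append(rng.choice(token_vocab))  # random-token substitution
            else:
                out.append(tok)                 # leave tokens unchanged
    return out, labels
```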
4. Relationship to Prior Masking Schemes
Dynamic span masking via PMI-Masking encompasses and formally extends several previously proposed strategies:
| Masking Scheme | Collocation Vocabulary | Masking Unit Definition |
|---|---|---|
| Random-Token Masking | None (no multi-token collocations) | Single tokens |
| Whole-Word Masking | All sub-word tokens constituting the same word | Whole words |
| Entity/Phrase Masking | Spans corresponding to parsed named entities or syntactic phrases | Semantic or syntactic phrases |
| Random-Span Masking | Large random sample from geometric length spans | Random contiguous token spans |
| PMI-Masking | Top $\mathrm{PMI}_n$-ranked $n$-grams from corpus statistics | Strongly collocated $n$-grams |
By this mechanism, dynamic span masking serves as a smoothly parameterizable superset of prior approaches, selecting spans based on corpus-calibrated co-occurrence rather than arbitrary heuristic criteria (Levine et al., 2020).
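In code terms, each row of the table amounts to a different collocation set passed to the same masking routine sketched above; the toy tokens and vocabularies below are hypothetical placeholders used only to illustrate the parameterization.

```python
# The same dynamic_span_mask routine reproduces prior schemes by swapping
# the collocation set (all values here are illustrative placeholders).
tokens = ["kuala", "lumpur", "is", "the", "capital", "of", "malaysia"]
token_vocab = sorted(set(tokens))

no_collocations = set()                    # empty vocabulary -> random-token masking
pmi_collocations = {("kuala", "lumpur")}   # PMI-selected collocation -> PMI-Masking

masked, labels = dynamic_span_mask(tokens, pmi_collocations, token_vocab)
```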
5. Empirical Results and Training Efficiency
Experiments utilizing BERT-base-scale models (12 layers, 768-dimensional hidden representations, 3072-dimensional feed-forward layers) on Wikipedia + BookCorpus (16 GB) and an augmented set with OpenWebText (54 GB) demonstrate that PMI-Masking offers superior training efficiency and downstream task performance relative to random-span and random-token masking. Notable findings include:
- On SQuAD2.0, PMI-Masking attains F1 ≈ 80.3 after approximately 600 K steps, a level that random-span masking requires roughly 1 M steps to reach.
- With additional data (54 GB), performance continues to scale, with random-span masking lagging behind until 2.4 M steps.
- On 1 M steps (16 GB), PMI-Masking yields SQuAD2.0 F1 81.4, RACE accuracy 68.4, and GLUE 84.1, outperforming both random-token and random-span masking.
- At 2.4 M steps (54 GB), PMI-Masking matches or exceeds the RACE performance of models trained with substantially more data and larger batch sizes (e.g., RoBERTa, trained on 160 GB of text, achieves 73.0%, versus 73.2% for PMI-Masking trained on 54 GB).
- PMI-Masking improves on SpanBERT by +2.2 points on RACE, using comparable setup and steps.
The adoption of dynamic span masking thus delivers both accelerated convergence and improved end-of-training metrics, reducing computational and environmental costs.
6. Implications and Future Directions
Dynamic span masking, as operationalized by PMI-Masking, demonstrates that mask budgets targeted at random single tokens underexploit bidirectional context in MLM pretraining. By focusing mask selection on strongly collocated spans, models are compelled to resolve higher-level language patterns rather than relying on local, shallow clues. The approach is fully data-driven, requiring no external annotation or syntactic resources, while effectively unifying existing masking paradigms under a principled formalism.
A plausible implication is that adaptive or hybrid span selection schemes—e.g., dynamically updating the collocation vocabulary during training as corpus statistics evolve, or integrating PMI-based selection with syntactic signals—could further enhance sample efficiency and downstream accuracy. The observed reduction in required pretraining steps and improved resource utilization position dynamic span masking as a central component in the development of scalable, environmentally conscious LLMs (Levine et al., 2020).