Span-Level Masking Pretraining
- Span-level masking is a technique that masks consecutive segments in sequential data, compelling models to use broader contextual and structural cues.
- Its variants—random, PMI, and salient span masking—improve efficiency and accuracy by emphasizing high-level semantic relationships and robust pattern recovery.
- Integrated into pretraining objectives across modalities, span-level masking has demonstrated significant downstream improvements in tasks like QA, recognition, and segmentation.
Span-level masking refers to the strategy of masking out contiguous spans—rather than isolated elements—in sequential data during self-supervised pretraining, compelling models to leverage contextual or structural cues over local, low-level patterns. Originally introduced for masked language modeling, span-level masking now underpins state-of-the-art learning objectives in natural language, speech, and image domains. Its variants, including semantic, task-aware, and information-theoretic approaches, outperform random or single-token masking both in efficiency and downstream accuracy by emphasizing inter-unit dependencies, robust pattern recovery, and richer, high-level representations.
1. Formal Definitions and Strategies
Span-level masking is characterized by the masking of consecutive, semantically or algorithmically selected blocks—spans—of input units. The formalism varies by modality and task:
- Text: Spans correspond to n-gram token sequences (words, phrases, entities), identified via algorithmic (random, geometric, PMI) or semantic (NER, temporal taggers) means (Levine et al., 2020, Cole et al., 2023).
- Speech: Spans comprise consecutive acoustic frames aligned to speech activity or linguistic units (e.g., entire phonemes) (Zhang et al., 2022).
- Images (especially for text images): Spans are columns of image patches aligned to character or word boundaries (Tang et al., 11 May 2025).
The general span-level mask construction process involves: (1) identifying potential span candidates, (2) sampling span boundaries and lengths, and (3) generating a binary mask marking all input indices covered by the selected span set $M_s$; a generic construction sketch follows below.
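A minimal sketch of this construction, assuming SpanBERT-style geometric length sampling and a fixed masking budget (the `budget`, `p`, and `max_span_len` parameters below are illustrative defaults, not values prescribed by any cited work):

```python
import random

def build_span_mask(seq_len, budget=0.15, p=0.2, max_span_len=10):
    """Return a boolean mask covering ~budget*seq_len positions with contiguous spans.

    Span lengths are drawn from a truncated geometric distribution and span
    starts are drawn uniformly, as in random/geometric span masking.
    """
    mask = [False] * seq_len
    target = int(budget * seq_len)
    covered = 0
    while covered < target:
        # (2) sample a span length ~ Geometric(p), truncated at max_span_len
        length = 1
        while random.random() > p and length < max_span_len:
            length += 1
        # (2) sample a start position uniformly over valid offsets
        start = random.randrange(0, max(1, seq_len - length + 1))
        # (3) mark every index covered by the span in the binary mask
        for i in range(start, min(start + length, seq_len)):
            if not mask[i]:
                mask[i] = True
                covered += 1
    return mask

# Example: mask roughly 15% of a 64-token sequence in contiguous spans.
print(build_span_mask(64))
```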
Table: Examples of Span Masking Selection
| Domain | Span Type | Selection Method |
|---|---|---|
| Text | Entity, PMI-n-gram | NER, PMI/segmentation |
| Speech | Phoneme, speech interval | ASR force-alignment, VAD |
| Images (text) | Patch columns (chars/words) | Uniform over columns, blockwise |
2. Mathematical Formulations and Algorithms
Span-level masking mechanisms rely on explicit selection rules and systematic masking logic. Key methods include:
- PMI-Masking: For a token sequence $w_1, \dots, w_n$, candidate spans are ranked by the length-$n$ "weakest-link" PMI:

$$\operatorname{PMI}_n(w_1, \dots, w_n) = \min_{\sigma \in \operatorname{seg}(w_1 \dots w_n)} \log \frac{p(w_1 \dots w_n)}{\prod_{s \in \sigma} p(s)},$$

where $\operatorname{seg}(w_1 \dots w_n)$ ranges over all segmentations of the n-gram into two or more contiguous sub-spans. Spans with high PMI are jointly masked, emphasizing semantically collocated units (Levine et al., 2020); a scoring sketch appears after this list.
- Random/Geometric Span Masking: Span lengths are sampled (e.g., geometric distribution), and locations are selected uniformly. Used in SpanBERT and as a baseline in many works.
- Salient Span Masking (SSM/TSM): Only spans labeled as entities, dates (SSM), or temporal expressions (TSM) are eligible. A single such span is masked per sample, selected uniformly from all candidate spans (Cole et al., 2023).
- Block and Column Span Masking (Image-Text): For images partitioned into an $h \times w$ grid of patches ($N$ patches in total), horizontal spans of up to $S$ contiguous patch columns are masked until a target fraction $R$ of patches is covered. Overlap control via guard-bands is integrated for diverse mask patterns (Tang et al., 11 May 2025):

```
while |M_s| < R ⋅ N:
    s ← Uniform(1, S)          # span width
    l ← Uniform(0, w - s)      # left index
    for col in l … l+s-1:
        for row in 1 … h:
            M_s ← M_s ∪ {patch_index(row, col)}
```
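For the PMI criterion above, a minimal sketch of the weakest-link score, assuming corpus n-gram probabilities are already available in a `prob` lookup (the probabilities below are made up for illustration):

```python
import math
from itertools import combinations

def segmentations(ngram):
    """All ways to split an n-gram into at least two contiguous segments."""
    n = len(ngram)
    for k in range(1, n):                          # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [ngram[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def weakest_link_pmi(ngram, prob):
    """min over segmentations of log p(ngram) / prod p(segment)."""
    log_p_full = math.log(prob[ngram])
    return min(
        log_p_full - sum(math.log(prob[seg]) for seg in seg_list)
        for seg_list in segmentations(ngram)
    )

# Toy example with made-up probabilities.
prob = {
    ("new", "york", "city"): 1e-4,
    ("new", "york"): 2e-4, ("york", "city"): 1.5e-4,
    ("new",): 1e-2, ("york",): 3e-4, ("city",): 5e-3,
}
print(weakest_link_pmi(("new", "york", "city"), prob))
```

In PMI-Masking proper, corpus n-grams are ranked by this score and the top-scoring collocations form the maskable vocabulary (Levine et al., 2020).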
3. Integration with Pretraining Objectives
Span-masked inputs are incorporated into the standard masked modeling objectives:
- Text (Transformer MLMs): Masked tokens are replaced with [MASK], random, or original tokens. The loss is the masked-LM cross-entropy computed only over positions in the masked spans:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta\!\left(x_i \mid \tilde{x}\right),$$

where $M$ is the set of span-masked positions and $\tilde{x}$ the corrupted input. For SSM/TSM, the span-sampling distribution is uniform over detected entity or temporal spans (Cole et al., 2023).
- Speech: Masked frames (from speech/phoneme spans) are set to zero; the model predicts the original frames. A frame-level reconstruction loss (L1 in Mockingjay-style models) is aggregated over the masked spans:

$$\mathcal{L}_{\text{speech}} = \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_1,$$

where $x_i$ and $\hat{x}_i$ denote the original and reconstructed acoustic frames (Zhang et al., 2022).
- Images: Mask tokens substitute the masked patches; the mean squared reconstruction error is computed over only the span-masked indices:

$$\mathcal{L}_{\text{img}} = \frac{1}{\lvert M \rvert} \sum_{i \in M} \lVert \hat{p}_i - p_i \rVert_2^2,$$

where $p_i$ and $\hat{p}_i$ are the original and reconstructed patch contents.
In multi-strategy settings, such as the Multi-Masking Strategy (MMS) for text images, the total self-supervised loss aggregates span, block, and random masking losses (Tang et al., 11 May 2025).
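As a concrete illustration of restricting the reconstruction loss to span-masked positions (the image and speech cases above differ only in the error term), here is a minimal PyTorch-style sketch; tensor shapes and names are illustrative, not taken from any cited implementation:

```python
import torch

def masked_mse_loss(pred, target, mask):
    """Mean squared reconstruction error over span-masked positions only.

    pred, target: (batch, num_patches, patch_dim) reconstructed vs. original patches.
    mask:         (batch, num_patches) boolean, True where a patch was masked.
    Assumes at least one masked patch per batch.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (batch, num_patches)
    return per_patch[mask].mean()                     # average only over masked patches

# Toy usage: one contiguous 4-patch span masked per sample.
b, n, d = 2, 16, 64
pred, target = torch.randn(b, n, d), torch.randn(b, n, d)
mask = torch.zeros(b, n, dtype=torch.bool)
mask[:, 4:8] = True
print(masked_mse_loss(pred, target, mask).item())
```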
4. Empirical Results and Comparative Performance
Span-level masking consistently outperforms random token/patch masking and other heuristics across diverse domains and tasks:
- Speech Representation Learning:
- On Librispeech phoneme classification and speaker recognition, both speech-level and phoneme-level masking yield statistically significant improvements.
- For instance, under Mockingjay (train-clean-100), speech+phoneme-level masking achieves frame-wise speaker accuracy of 98.2%, compared to 68.4% with random masking. Combined masking also improves phoneme classification by ∼1 percentage point (Zhang et al., 2022).
- Phoneme-level masking yields sharper spectrogram reconstructions, capturing speaker/phoneme attributes more robustly.
- Text MLMs:
- PMI-masking converges to SpanBERT-equivalent SQuAD2.0 F1 (∼80.3) in half as many pretraining steps; ultimate F1 reaches 83.6 (dev) and 83.3 (test), surpassing random-span and naive-PMI variants (Levine et al., 2020).
- GLUE and RACE scores are marginally higher, with SQuAD2.0 F1 rising to 84.9 when adding OpenWebText.
- Task-Aware (Salient) Span Masking:
- Salient span masking (SSM) and temporal span masking (TSM) confer an average gain of +5.8 points and +1.1 points, respectively, on temporal question answering tasks, with further gains (+0.29) from mixing both. Performance on general natural QA also increases by up to +2.8 EM (Cole et al., 2023).
- Image-Text (Scene Text Recognition):
- With ViT backbones and MMS, averaging random/block/span masking outperforms the best single-strategy MAE by 1.8%, with final accuracy of 81.2% (vs. random-only 77.8%), and increases PSNR in text image super-resolution (Tang et al., 11 May 2025).
5. Rationale and Theoretical Underpinnings
Span-level masking compels models to infer missing information using broader contextual clues:
- Reduces Shortcut Reliance: By masking semantically coherent or contiguous spans, models cannot simply interpolate or use local continuity, but must draw on higher-order contextual or linguistic dependencies (Levine et al., 2020, Tang et al., 11 May 2025).
- Semantic Alignment: Masking full entities (SSM/TSM), phonemes, or character columns ensures masked regions correspond to semantically meaningful units. This is especially critical for tasks that depend on relationship understanding (e.g., factual QA, relation extraction, temporal reasoning, and text recognition with occlusions) (Zhang et al., 2022, Cole et al., 2023, Tang et al., 11 May 2025).
- Efficient Training: Information-theoretic methods (PMI) select statistically salient n-grams, focusing pretraining on predictive, high-value substructures. This increases convergence speed and enhances transferred task performance (Levine et al., 2020).
6. Comparative Analysis of Span Selection Strategies
Span-level masking encompasses multiple selection paradigms. The table below illustrates the primary variants:
| Method | Selection Criterion | Principal Benefit |
|---|---|---|
| Random-Span | Uniform over span starts/lengths | Baseline; covers arbitrary n-grams |
| PMI-Masking | High-PMI n-grams from corpus | Focused on collocation/stats; no external labels |
| Salient Span (SSM/TSM) | Detected entities, dates, temporal spans | Task-aware, interpretable, semantic ground truth |
| Block/Column (vision) | Continuous patch columns | Correlates with higher-level visual/character structure |
PMI-masking subsumes random-span, whole-word, and phrase masking via a unified score and corpus-driven policy (Levine et al., 2020). Salient span masking leverages linguistic knowledge via sequence labeling or pattern matching (NER, SUTime) (Cole et al., 2023). Vision-based span masking translates these concepts to the 2D space of text images, where patch columns approximate word/character occlusion (Tang et al., 11 May 2025).
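To make the salient-span policy concrete, the sketch below masks a single entity or temporal span chosen uniformly from precomputed tagger offsets; the token list and span offsets are hypothetical stand-ins for NER/SUTime output:

```python
import random

def mask_salient_span(tokens, salient_spans, mask_token="[MASK]"):
    """Mask one salient span (entity/date/temporal expression), chosen uniformly.

    salient_spans: list of (start, end) half-open token index ranges.
    """
    if not salient_spans:
        return list(tokens)                      # nothing salient to mask
    start, end = random.choice(salient_spans)    # uniform over candidate spans
    return [mask_token if start <= i < end else t for i, t in enumerate(tokens)]

tokens = "Marie Curie won the Nobel Prize in 1903 .".split()
salient_spans = [(0, 2), (4, 6), (7, 8)]         # hypothetical NER/date offsets
print(mask_salient_span(tokens, salient_spans))
```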
7. Future Directions and Applications
The span-level masking paradigm generalizes broadly:
- Task-Specific Pretraining: Masking spans most relevant to target phenomena—events, quantities, sentiments—tailors representations for improved sample efficiency and task accuracy (Cole et al., 2023).
- Learned Masking Policies: Data-driven, adaptive masking, e.g., via joint training or reinforcement learning, offers further flexibility but may risk overfitting (Levine et al., 2020, Cole et al., 2023).
- Multimodal Integration: The principles extend to speech, audio, and image domains where span alignments correspond to transient objects or structured signals (Zhang et al., 2022, Tang et al., 11 May 2025).
- General Downstream Benefit: Span masking, by emphasizing meaningful intervals, improves QA, recognition, segmentation, and super-resolution, and is especially effective under limited supervision or in zero-shot transfer (Cole et al., 2023, Tang et al., 11 May 2025).
The core insight is that span-level masking not only aligns with the real-world structure of the data but also systematically regularizes and strengthens the inferred representations across a wide suite of tasks. The evidence from empirical benchmarks demonstrates consistent, sometimes dramatic, improvements relative to both random and token-level masking regimes.