Span-Corruption Pretraining Advances

Updated 24 November 2025
  • Span-corruption pretraining is a self-supervised approach that masks and reconstructs contiguous spans to enforce holistic span-level reasoning in language and code models.
  • It employs diverse masking strategies—including random, autoregressive, and structure-aware techniques—to target semantically meaningful segments such as AST subtrees and salient entities.
  • Empirical results demonstrate improved performance in QA, retrieval, and code generation while reducing pretraining iterations and computational overhead.

Span-corruption pretraining refers to a family of self-supervised objectives and model architectures that train neural text or code models by masking and reconstructing contiguous spans of input tokens. In contrast to token-level masking as in BERT, span-corruption systematically masks multi-token segments and tasks the model with reconstructing those segments from context. This design enforces span-level reasoning and enables masking of semantically meaningful units (e.g., code subtrees, named entities, temporal expressions, sentences, or arbitrary-length text segments). Span-corruption objectives are foundational in modern encoder, decoder, and encoder–decoder architectures such as SpanBERT, T5, GLM, COSTA, AST-T5, SpacTor-T5, and targeted curriculum variants for domain adaptation. Model instantiations differ substantially in how they select, mask, and reconstruct spans, as well as in the auxiliary losses and masking strategies that complement the main span-recovery task.

1. Core Span-Corruption Objectives

The span-corruption paradigm is instantiated in several principal forms:

  • Random Span Masking (e.g., SpanBERT, T5, BART): Randomly sample disjoint, contiguous spans of the input sequence, covering a fixed fraction of tokens (typically 15%), and replace each span with a unique sentinel token or [MASK] symbol. The model is trained to recover the original tokens of all masked spans from the corrupted context. Span lengths are frequently sampled from geometric or Poisson distributions, and masking occurs at the word or subword level. For instance, SpanBERT draws span lengths from a geometric distribution $P(\ell)=(1-p)^{\ell-1}p$, truncated at $\ell_{\max}$, giving a mean span length of about 3.8 tokens (Joshi et al., 2019). A minimal sketch of this procedure appears after this list.
  • Order-Permuted or Autoregressive Infilling (GLM): After masking spans, reconstruct them in a random order rather than left-to-right, with each span predicted autoregressively conditioned on the corrupted input and previously generated spans. GLM introduces two-dimensional (row, column) positional encodings to distinguish between original context and generated span positions, hiding span length information from the encoder (Du et al., 2021).
  • Sentinel-based Infilling (T5, SpacTor): Masked spans are replaced by unique sentinel tokens ([S₀], [S₁], ...) in the input. The decoder must then output, for each sentinel, the corresponding original span, concatenated in order. Only the input corruption differs; the underlying encoder-decoder Transformer can be standard (Ye et al., 24 Jan 2024).
  • AST-Aware or Structure-Guided Span Corruption: For code domains, spans are not sampled randomly but are aligned to syntactic subtrees in the Abstract Syntax Tree (AST). AST-T5 masks out entire code subtrees, forcing recovery of semantically coherent code fragments. Mask selection is performed by dynamic programming to minimize the breaking of AST boundaries, coupled with recursive subtree masking (Gong et al., 5 Jan 2024).
  • Salient Span Masking (SSM/TSM): Rather than random masking, training is biased toward "salient" units—named entities, dates, or temporally relevant expressions—identified by external taggers or parsers. Exactly one such span is masked per training sentence, so the model oversamples fact-rich contexts (Cole et al., 2023).
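The following is a minimal sketch of the random span corruption described above, combining SpanBERT-style geometric span-length sampling with T5-style sentinel replacement. The 15% budget, p = 0.2, maximum span length of 10, and the <extra_id_i> sentinel format are illustrative assumptions rather than the exact configuration of any published implementation.

```python
import random

def span_corrupt(tokens, mask_ratio=0.15, p=0.2, max_len=10, seed=0):
    """Sketch of random span corruption: sample disjoint spans whose lengths
    follow a truncated geometric distribution P(l) = (1 - p)^(l - 1) * p,
    then replace each span with a unique sentinel and emit the spans as the
    reconstruction target (T5-style)."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * mask_ratio))   # total tokens to corrupt
    spans, attempts = [], 0
    while sum(e - s for s, e in spans) < budget and attempts < 100:
        attempts += 1
        length = 1
        while length < max_len and rng.random() > p:  # geometric span length
            length += 1
        start = rng.randrange(0, max(1, len(tokens) - length))
        if any(start < e and s < start + length for s, e in spans):
            continue                                  # keep spans disjoint
        spans.append((start, start + length))
    spans.sort()
    corrupted, target, prev = [], [], 0
    for i, (s, e) in enumerate(spans):
        corrupted += tokens[prev:s] + [f"<extra_id_{i}>"]
        target += [f"<extra_id_{i}>"] + tokens[s:e]
        prev = e
    corrupted += tokens[prev:]
    return corrupted, target

toks = "def add ( a , b ) : return a + b".split()
inp, tgt = span_corrupt(toks)
print(inp)  # context with sentinels where spans were removed
print(tgt)  # sentinel-delimited spans to be reconstructed
```

Applied to a tokenized sequence, this yields the (corrupted input, target) pair consumed by a T5-style encoder-decoder.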

2. Algorithms for Span Selection and Masking

Span-corruption methods differ in how they sample and select spans:

  • Random Sampling: Uniform or geometric random selection of start positions and lengths, under the constraint that the total number of masked tokens approximates a target ratio (e.g., 15% in T5 and SpanBERT, up to 25% in AST-T5 for code) (Gong et al., 5 Jan 2024, Joshi et al., 2019).
  • Stratified or Granular Sampling: In contrastive span-corruption (e.g., COSTA), positive spans are sampled at multiple granularities—words, phrases, sentences, paragraphs—by independently sampling T spans per granularity and aggregating them per document. Span lengths are drawn from beta or geometric distributions depending on the level (Ma et al., 2022).
  • Syntactic or Semantic Targeting: AST-aware methods use code parsing tools (e.g., Tree-sitter) to select spans that align with AST subtrees, balancing span size with tree coherence via dynamic programming that minimizes boundary breaks (Gong et al., 5 Jan 2024). In SSM/TSM, masking is restricted to named entities or temporal expressions identified by external taggers such as SUTime (Cole et al., 2023); a sketch of this style of selection follows this list.
  • Curriculum and Hybrid Masking: SpacTor-T5 applies a curriculum, beginning with span corruption plus replaced token detection (RTD), before switching to pure span corruption. Early training enforces global attention to all tokens (RTD), transitioning to stronger span modeling in later stages (Ye et al., 24 Jan 2024).
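Below is a minimal sketch of salient span selection, using spaCy's named-entity recognizer as a stand-in for the external taggers mentioned above (the cited work uses dedicated tools such as SUTime for temporal expressions). The single-span-per-sentence rule follows the SSM/TSM description, while the tagger choice, pipeline name, and sentinel format are assumptions.

```python
import random
import spacy  # assumed available, with the small English model installed

nlp = spacy.load("en_core_web_sm")  # NER pipeline standing in for SUTime-style taggers

def salient_span_mask(sentence, rng=random.Random(0)):
    """Mask exactly one externally identified salient span (named entity or
    date) per sentence, as in salient span masking; return (input, target)."""
    doc = nlp(sentence)
    if not doc.ents:
        return None                       # skip sentences with no salient span
    ent = rng.choice(doc.ents)            # choose one salient span to mask
    corrupted = sentence[:ent.start_char] + "<extra_id_0>" + sentence[ent.end_char:]
    target = "<extra_id_0> " + ent.text
    return corrupted, target

print(salient_span_mask("Marie Curie won the Nobel Prize in Physics in 1903."))
```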

3. Span Reconstruction and Model Architectures

Different architectures and decoding schemes are used for reconstruction:

  • Encoder-Only (SpanBERT, COSTA): A standard encoder processes the masked input; each masked token is predicted independently from its context, or span-boundary representations serve as span summaries (SpanBERT's SBO). In COSTA, the final [CLS] embedding is encouraged to be close to its own masked spans and far from those of other documents, using a group-wise contrastive loss (Ma et al., 2022, Joshi et al., 2019).
  • Encoder-Decoder (T5, GLM, AST-T5, SpacTor-T5): Masked inputs feed into the encoder; the decoder is tasked to reconstruct all masked spans, either sequentially or in a permuted/autoregressive order. AST-T5 and GLM adhere strictly to span-level modeling, while SpacTor introduces extra detection heads only during early training (Du et al., 2021, Gong et al., 5 Jan 2024, Ye et al., 24 Jan 2024).
  • Auxiliary Objectives: SpanBERT adds a span-boundary objective (SBO), predicting each masked token from the embeddings of the tokens immediately to the left and right of its span, simulating downstream span reasoning (Joshi et al., 2019); an SBO-style head is sketched below. COSTA supplements its contrastive loss with a standard MLM objective. SpacTor-T5 combines RTD (from ELECTRA) with SC losses, showing that a hybrid objective, followed by pure span corruption, accelerates convergence (Ye et al., 24 Jan 2024).
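As a concrete illustration of the span-boundary objective, the PyTorch sketch below predicts a masked token from the encoder states of the two tokens just outside the span, combined with a relative-position embedding. The two-layer MLP with GELU and layer normalization follows the general shape described for SpanBERT, but the dimensions, layer placement, and names here are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class SpanBoundaryHead(nn.Module):
    """SBO-style head: predict token i inside a masked span from the encoder
    states of the tokens just outside the span boundaries plus a relative-
    position embedding for i."""
    def __init__(self, hidden, vocab_size, max_span_len=16):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden),
        )
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, left_boundary, right_boundary, rel_pos):
        # left_boundary, right_boundary: (batch, hidden) encoder states of the
        # tokens immediately before/after the span; rel_pos: (batch,) offsets.
        h = torch.cat([left_boundary, right_boundary, self.pos_emb(rel_pos)], dim=-1)
        return self.out(self.mlp(h))      # (batch, vocab_size) logits

head = SpanBoundaryHead(hidden=768, vocab_size=30522)
logits = head(torch.randn(4, 768), torch.randn(4, 768), torch.tensor([0, 1, 2, 3]))
print(logits.shape)  # torch.Size([4, 30522])
```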

4. Specializations: Code, Factuality, Temporal Reasoning

Span-corruption pretraining has been extended to specialized domains:

  • Code Structure-Awareness: AST-T5 masks spans aligned with AST subtrees, requiring the model to reconstruct syntactically valid and semantically meaningful code blocks. Segmentation and masking are chosen to preserve structural integrity, with a higher mask ratio (25%) and spans ranging from single tokens to full function bodies (Gong et al., 5 Jan 2024).
  • Salient/Factual Span Masking: SSM and TSM select spans corresponding to knowledge-rich ("salient") expressions such as named entities and dates or temporally explicit fragments. Pretraining on these objectives improves closed-book QA and temporal understanding, particularly in zero-shot settings where models must generalize without further fine-tuning. The improvement is largely due to oversampling challenging contexts, independent of the precise span type (Cole et al., 2023).
  • Dense Retrieval: Contrastive span corruption as in COSTA maximizes discriminability by pulling global document representations toward their own sampled spans and pushing them away from other documents' spans. Multi-granular sampling and a group-wise loss underlie strong gains in retrieval MRR and NDCG over vanilla MLM methods (Ma et al., 2022); a sketch of such a group-wise loss follows this list.
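The following is a minimal sketch of a group-wise contrastive loss in the spirit of COSTA: each document embedding is pulled toward its own sampled span embeddings and pushed away from every other document's spans in the batch, with the positive terms averaged within the group. The temperature, normalization, and in-batch negative scheme are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def groupwise_contrastive_loss(doc_emb, span_emb, temperature=0.05):
    """doc_emb: (B, D) [CLS]-style document embeddings.
    span_emb: (B, K, D) K sampled span embeddings per document.
    Pull each document toward its own spans (positives) and away from all
    spans of other documents in the batch (negatives)."""
    B, K, D = span_emb.shape
    doc = F.normalize(doc_emb, dim=-1)                         # (B, D)
    spans = F.normalize(span_emb.reshape(B * K, D), dim=-1)    # (B*K, D)
    sim = doc @ spans.T / temperature                          # (B, B*K)
    log_prob = F.log_softmax(sim, dim=-1)
    # Positives for document b are its own K spans: columns b*K .. b*K + K-1.
    idx = torch.arange(B).unsqueeze(1) * K + torch.arange(K)   # (B, K)
    pos_log_prob = log_prob.gather(1, idx)                     # (B, K)
    return -(pos_log_prob.mean(dim=1)).mean()                  # group then batch mean

loss = groupwise_contrastive_loss(torch.randn(8, 128), torch.randn(8, 4, 128))
print(float(loss))
```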

5. Empirical Results and Efficiency Gains

Span-corruption pretraining exhibits strong empirical benefits:

  • NLU and Span Selection: SpanBERT achieves substantial improvements on span-based QA (SQuAD 1.1 F1 up to 94.6%), coreference resolution, and relation extraction, outpacing BERT trained with token masking. The SBO loss and contiguous span masking are critical to these advances (Joshi et al., 2019).
  • Multi-task Generalization: GLM with autoregressive blank infilling consistently exceeds the performance of BERT, RoBERTa, and T5 on NLU, conditional and unconditional generation benchmarks; e.g., SuperGLUE dev average 77.0 for GLM_large vs. 72.0 for BERT_large (Du et al., 2021).
  • Retrieval and Fact Recall: COSTA's contrastive span pretraining yields absolute MRR improvements of 1–2.6 pts on MS MARCO and TREC passage/document ranking (Ma et al., 2022). SSM/TSM improve QA and temporal understanding by up to +4.82 EM (SituatedQA) and +39.95% acc (TimeDIAL-0 zero-shot) (Cole et al., 2023).
  • Pretraining Efficiency: SpacTor-T5 matches or exceeds standard T5 span corruption on downstream metrics (e.g., SuperGLUE, SQuAD) with 50% fewer pretraining iterations and 40% fewer total FLOPs by leveraging an early-stage hybrid curriculum (Ye et al., 24 Jan 2024).
  • Code Generation: AST-T5, with AST-aware span corruption and a higher mask ratio, outperforms structure-agnostic T5 variants, surpassing CodeT5 by up to 3 EM points on code translation tasks (Gong et al., 5 Jan 2024).

6. Comparisons and Limitations

Span corruption generally outperforms token-level masking on contextual and compositional reasoning, but it presents its own challenges:

  • Comparison to Token MLM: Token-level MLM (BERT) can overfit to local context, making token prediction trivial when unmasked neighbors are highly informative. Span masking requires broader contextual understanding and better captures inter-span dependencies; analyses of SpanBERT and GLM indicate that the objective is harder and yields improved performance on span-level tasks (Du et al., 2021, Joshi et al., 2019).
  • Algorithmic Overhead: Structure-aware masking (e.g., AST-aware dynamic programming, named-entity recognition for SSM) can introduce non-trivial preprocessing cost, but it requires no change to the model architecture (Gong et al., 5 Jan 2024, Cole et al., 2023).
  • Decoder Bypass: Encoder-decoder models risk "bypassing" the encoder if the decoder conditions on weak or unmasked signals. COSTA's encoder-only, decoder-free setup remedies this by removing any bypass, forcing the encoder to synthesize all needed semantics in its output (Ma et al., 2022).
  • Curriculum and Ablation: Hybrid objectives (SpacTor) yield faster convergence only if the additional losses are phased out after a curriculum period; excessive adversarial noise (strong generators in RTD) degrades span recovery if carried into late-stage training (Ye et al., 24 Jan 2024). A sketch of such a phase-out schedule follows this list.
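A minimal sketch of the curriculum phase-out described above: an auxiliary RTD-style loss is added to the span-corruption loss for the first tau steps and dropped afterwards. The weight, horizon, and function interface are illustrative assumptions, not SpacTor-T5's actual schedule.

```python
def hybrid_loss(sc_loss, rtd_loss, step, tau=120_000, rtd_weight=1.0):
    """Phase-out schedule for a hybrid objective: span corruption (SC) plus an
    auxiliary replaced-token-detection (RTD) loss during the first tau steps,
    pure span corruption afterwards."""
    if step < tau:
        return sc_loss + rtd_weight * rtd_loss   # stage 1: hybrid objective
    return sc_loss                               # stage 2: span corruption only

# The auxiliary term contributes before the switch and vanishes after it.
print(hybrid_loss(sc_loss=2.1, rtd_loss=0.7, step=50_000))    # 2.8
print(hybrid_loss(sc_loss=2.1, rtd_loss=0.7, step=200_000))   # 2.1
```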

7. Extensions and Future Directions

Span-corruption as a pretraining principle is actively evolving:

  • Learned or Task-Driven Span Selection: Rather than static distributions, masking could be learned adaptively (e.g., via co-occurrence statistics or PMI for salient spans), and task-driven masking schemes (e.g., targeting domain-specific entities) have begun to emerge (Cole et al., 2023).
  • Domain Adaptation: Salient span approaches may generalize to scientific, medical, or retrieval contexts by targeting high-information spans relevant to the target application (Cole et al., 2023).
  • Structural and Multi-Task Generalization: GLM demonstrates that span-corruption with blank infilling and two-dimensional positional encoding can serve simultaneously as NLU, seq2seq, and pure LM pretraining objectives, unified by a single model (Du et al., 2021).
  • Open Problems: Best practices for mask ratio, span length, and hybrid curriculum design remain open questions, with empirical evidence so far favoring non-uniform and structure-aware strategies.

A plausible implication is that span-corruption pretraining will remain a core ingredient in both general-domain and domain-adapted LMs, offering a flexible interface for combining self-supervised learning, structure-awareness, and task-adaptivity.
