
Span Corruption Pre-Training

Updated 9 February 2026
  • Span corruption pre-training is a method that replaces variable-length contiguous spans with sentinel tokens to enhance both local and global context learning.
  • It leverages probabilistic span selection, using geometric or Poisson distributions and auxiliary objectives to improve reconstruction in encoder-decoder models.
  • Its applications span NLP tasks and protein structure prediction, consistently delivering empirical performance gains across diverse benchmarks.

Span corruption pre-training is a foundational paradigm in modern self-supervised learning for language models and structural representation. It generalizes masked language modeling by replacing (or corrupting) variable-length contiguous spans—rather than isolated tokens or fixed contiguous blocks—in input sequences with sentinel or mask tokens, and formulating the model’s pre-training task as reconstructing or distinguishing the corrupted content. The method has been central to advances in Transformer-based encoder-decoder architectures, hierarchical and discourse-aware modeling, and contrastive representation learning for retrieval, and it extends successfully beyond language to domains such as protein structure prediction.

1. Core Formulation and Variants

Span corruption, in its canonical form, involves the following workflow (a minimal construction sketch follows the list):

  • Given a token sequence x = (x_1, …, x_N), sample disjoint, possibly variable-length spans using a distribution over span lengths (commonly geometric, Poisson, or a mixture of uniforms).
  • Replace each span with a unique sentinel token (e.g., <extra_id_i> or <special_token_z>).
  • The decoder is tasked to autoregressively reconstruct the original content, typically by outputting a sequence interleaving sentinels and the corresponding gold spans in order.
  • The cross-entropy loss is computed over the target sequence, sometimes with auxiliary objectives.
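
The following minimal sketch, written against these four steps and assuming whitespace tokenization and T5-style <extra_id_i> sentinels, shows how the corrupted input and reconstruction target are built; it is an illustration, not the implementation used by any of the cited models.

```python
import random

def corrupt_spans(tokens, noise_density=0.15, mean_span_length=3, seed=0):
    """Minimal T5-style span corruption sketch: sample disjoint spans, replace
    each with a sentinel in the input, and build the target as sentinel plus
    the original span tokens, in left-to-right order."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * noise_density))
    masked, spans, attempts = set(), [], 0
    while sum(len(s) for s in spans) < n_to_mask and attempts < 100:
        attempts += 1
        # Geometric span length with mean ~ mean_span_length, lightly truncated.
        length, p = 1, 1.0 / mean_span_length
        while rng.random() > p and length < 10:
            length += 1
        length = min(length, len(tokens))
        start = rng.randrange(0, len(tokens) - length + 1)
        span = list(range(start, start + length))
        if any(i in masked for i in span):        # keep spans disjoint
            continue
        masked.update(span)
        spans.append(span)
    spans.sort(key=lambda s: s[0])

    inputs, target, sentinel_id, i = [], [], 0, 0
    while i < len(tokens):
        if spans and i == spans[0][0]:            # start of a masked span
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)
            target.append(sentinel)
            target.extend(tokens[j] for j in spans[0])
            i = spans[0][-1] + 1
            spans.pop(0)
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, target

tokens = "span corruption replaces variable length contiguous spans with sentinel tokens".split()
inputs, target = corrupt_spans(tokens, noise_density=0.3)
print(inputs)   # original tokens with each masked span collapsed to one sentinel
print(target)   # each sentinel followed by the gold tokens of its span, in order
```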

Key variants include:

  • Geometric or Poisson span length distributions: T5 and DEPTH use geometric span lengths (λ = 3); DeltaLM uses geometric lengths for monolingual spans (mean 3) and fixed-length spans for translation (Bamberger et al., 2024, Ma et al., 2021).
  • Sentinel ID assignment: Randomized per-span in DEPTH to prevent position leakage; monotonic/ordered in T5 and DeltaLM (Bamberger et al., 2024, Ma et al., 2021).
  • Auxiliary corruption strategies: Sentence/paragraph shuffle or replacement at coarser levels for discourse structure (Mim et al., 2020).
  • Contrastive span prediction: Discriminative alignment of the encoder’s global representation with its own sampled span embeddings instead of autoregressive decoding, as sketched below (Ma et al., 2022).
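
As a concrete illustration of the contrastive variant, the sketch below pools sampled span embeddings from a generic encoder and aligns each span with the global embedding of its own sequence against in-batch negatives; the mean-pooling, the temperature tau, and the direction of the loss are assumptions for illustration rather than COSTA's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_span_loss(token_emb, span_slices, tau=0.05):
    """Sketch of contrastive span prediction: pull each sequence's global
    embedding toward its own span embeddings and away from spans of other
    sequences in the batch.

    token_emb:   [B, L, D] encoder outputs
    span_slices: list of B lists of (start, end) index pairs
    """
    global_emb = F.normalize(token_emb[:, 0], dim=-1)            # [B, D], e.g. a [CLS]-style position
    span_embs, owners = [], []
    for b, spans in enumerate(span_slices):
        for s, e in spans:
            span_embs.append(token_emb[b, s:e].mean(dim=0))      # mean-pool each sampled span
            owners.append(b)
    span_embs = F.normalize(torch.stack(span_embs), dim=-1)      # [S, D]
    owners = torch.tensor(owners)

    logits = span_embs @ global_emb.T / tau                      # [S, B] span-to-sequence similarities
    return F.cross_entropy(logits, owners)                       # each span must match its own sequence

# Usage with random tensors standing in for encoder outputs:
B, L, D = 4, 32, 16
token_emb = torch.randn(B, L, D)
spans = [[(2, 6), (10, 13)] for _ in range(B)]
print(contrastive_span_loss(token_emb, spans))
```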

Table 1 summarizes representative span corruption frameworks:

Method         | Span Sampling     | Reconstruction Target            | Domain
T5 / DeltaLM   | Geometric         | Sentinels + span tokens          | NLP
DEPTH          | Geometric         | Hierarchical (sentence + span)   | NLP
SSR            | Poisson           | Rewrite noisy spans              | NLP
ERNIE-GEN      | Multi-granularity | Dual flow (word/span)            | NLP
COSTA          | Multi-granularity | Contrastive (no decoder)         | IR
SMPC (protein) | Poisson           | Span-masked, bi-level reconstruction | Protein

2. Sampling Strategies and Masking Objectives

The span selection mechanism is crucial to how efficiently the model learns both local and global context; a sampling sketch follows the list:

  • T5/DEPTH: Spans are sampled from a geometric distribution (λ = 3), with an overall masking rate of 30% in DEPTH; masking targets non-sentence tokens and skips sentence boundary markers. Randomization of the sentinel IDs ensures the model cannot trivially infer original positions, a critical modification for the discourse objectives in DEPTH (Bamberger et al., 2024).
  • ERNIE-GEN: Employs simultaneous sampling from short (Uniform(1,4), 40%) and long (Uniform(4,32), 60%) span distributions to ensure multi-granularity context learning (Xiao et al., 2020).
  • DeltaLM: In monolingual objectives, masks 15% of tokens in spans (average length 3), whereas in translation, 50% of tokens are masked in short spans (length=3), reflecting increased difficulty in the translation context (Ma et al., 2021).
  • Protein SMPC: Spans are drawn from a Poisson(λ = 6) distribution over residues, masking 30% of total residues per protein chain (Zhao et al., 2024).
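
The sketch below collects these span-length samplers in one place, using the parameter values quoted above (geometric with mean 3, Poisson with λ = 6, and the 40/60 short/long uniform mixture); the function names are illustrative and not taken from any of the cited codebases.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_span_length(mean_length=3):
    # T5/DEPTH-style: geometric span lengths with mean ~3.
    return int(rng.geometric(p=1.0 / mean_length))

def poisson_span_length(lam=6):
    # SMPC-style: Poisson(λ=6) over residues; re-draw zeros so spans are non-empty.
    length = 0
    while length == 0:
        length = int(rng.poisson(lam))
    return length

def multi_granularity_span_length(short_fraction=0.4):
    # ERNIE-GEN-style mixture: short spans U(1,4) 40% of the time, long spans U(4,32) otherwise.
    if rng.random() < short_fraction:
        return int(rng.integers(1, 5))       # uniform over {1, ..., 4}
    return int(rng.integers(4, 33))          # uniform over {4, ..., 32}

print([geometric_span_length() for _ in range(5)])
print([poisson_span_length() for _ in range(5)])
print([multi_granularity_span_length() for _ in range(5)])
```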

The reconstruction target generally consists of the concatenated sentinels and original spans, preserving span order. In contrast, SSR replaces the typical "fill-in-the-blank" with an error-correcting objective: the input includes machine-generated corruptions which must be rewritten to the ground truth, yielding a strong alignment with editing tasks (Zhou et al., 2021).

3. Architectural and Hierarchical Integration

Span corruption pre-training is tightly linked to the evolution of model architectures:

  • Encoder-Decoder Models: T5, DeltaLM, DEPTH, and ERNIE-GEN use encoder-decoder Transformers where span corruption loss is defined on the decoder. DEPTH extends this by imposing hierarchical sentence markers (<SENT_i>, <EOSEN>) in both vocabulary and attention masks—forcing local semantic pooling and global discourse encoding (Bamberger et al., 2024).
  • Hierarchical Attention Masks (DEPTH): Non-sentence tokens attend bidirectionally in the encoder, but sentence marker tokens attend only to their own sentence. In the decoder, sentence markers attend only to previous sentence markers, isolating discourse flow. Cross-attention for non-sentence tokens covers the entire encoder, while sentence markers cross-attend only to encoded sentence markers; a sketch of the encoder-side mask appears after this list (Bamberger et al., 2024).
  • Contrastive Architectures (COSTA): Removes the decoder, adds a span-pooling head and a projector atop the encoder for contrastive alignment, encouraging span-to-global matching completely within the encoder (Ma et al., 2022).
  • Protein Graph Bi-level Architectures: Masked span corruption applies at both residue (removing all side-chain atoms but Cα in masked spans) and atom levels (additive Gaussian noise), with message passing on dual graphs and cross-hierarchy feature sharing (Zhao et al., 2024).
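
A minimal sketch of the encoder-side hierarchical mask described above, under the assumption that each position is labeled as an ordinary token or a sentence marker; this is one plausible reading of the scheme, not DEPTH's actual code.

```python
import numpy as np

def hierarchical_encoder_mask(token_types, sentence_ids):
    """Illustrative DEPTH-style encoder self-attention mask.

    token_types:  'tok' for ordinary tokens, 'sent' for sentence-marker tokens
                  (e.g. <SENT_i>, <EOSEN>).
    sentence_ids: sentence index of every position.
    Returns a boolean [L, L] matrix; True means "query may attend to key".
    """
    L = len(token_types)
    mask = np.zeros((L, L), dtype=bool)
    for q in range(L):
        for k in range(L):
            if token_types[q] == 'tok':
                # Ordinary tokens attend bidirectionally over the whole input.
                mask[q, k] = True
            else:
                # Sentence markers attend only within their own sentence.
                mask[q, k] = sentence_ids[q] == sentence_ids[k]
    return mask

types = ['sent', 'tok', 'tok', 'sent', 'tok', 'tok']
sents = [0, 0, 0, 1, 1, 1]
print(hierarchical_encoder_mask(types, sents).astype(int))
```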

4. Combined and Hybrid Objectives

Numerous frameworks integrate span corruption with auxiliary discrimination or correction tasks to exploit additional signals:

  • DEPTH: Jointly optimizes sentence un-shuffling (discourse) and standard span-corruption (reconstruction), balancing both via equal cross-entropy loss terms (Bamberger et al., 2024).
  • SpacTor-T5: Combines span corruption with replaced token detection (RTD) in a two-stage curriculum. Early training alternates between RTD (ELECTRA-style all-token discrimination) and standard span corruption, then transitions to pure span corruption after an empirically determined inflection point (e.g., 250k steps for T5-Base), as sketched after this list. This schedule achieves a 40% reduction in total pre-training FLOPs while matching or exceeding baseline performance (Ye et al., 2024).
  • Essay/Discourse Models: Augment token-level MLM with shuffle/drop/copy corruptions at sentence and paragraph levels, requiring downstream classifiers to distinguish original vs. corrupted structures. The multi-task loss adds a corruption-label discrimination loss to the MLM loss (Mim et al., 2020).
  • SSR: Rather than mere denoising, leverages a large PTLM as a span “corruptor,” and trains the main model to rewrite these imperfect spans into the gold text (Zhou et al., 2021).
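
A minimal sketch of the staged switching in SpacTor-style training, together with DEPTH's equal-weight joint loss; how the two stage-1 losses are mixed (summed, weighted, or alternated per batch) is left here as an implementation detail, and the loss values and function names are placeholders rather than the papers' actual code.

```python
def pretraining_loss(step, span_corruption_loss, rtd_loss, switch_step=250_000):
    """SpacTor-style two-stage curriculum (sketch): use both objectives early
    on, then train on span corruption alone after an empirically chosen
    inflection point (~250k steps for T5-Base)."""
    if step < switch_step:
        return span_corruption_loss + rtd_loss   # stage 1: hybrid objective
    return span_corruption_loss                  # stage 2: pure span corruption

def depth_joint_loss(reconstruction_loss, unshuffle_loss):
    """DEPTH-style joint objective (sketch): equal-weight sum of the span
    reconstruction and sentence un-shuffling cross-entropy terms."""
    return reconstruction_loss + unshuffle_loss

# Placeholder scalar losses standing in for values computed by the model.
print(pretraining_loss(step=100_000, span_corruption_loss=1.8, rtd_loss=0.6))  # both terms
print(pretraining_loss(step=400_000, span_corruption_loss=1.2, rtd_loss=0.5))  # span corruption only
print(depth_joint_loss(reconstruction_loss=1.65, unshuffle_loss=0.9))
```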

5. Empirical Impact and Comparative Evaluation

Span corruption has consistently demonstrated performance gains across diverse benchmarks and modeling goals:

  • DEPTH: Outperforms standard T5 in span-corruption (reconstruction) loss throughout training, particularly in the early stages. At 64k steps in from-scratch training, DEPTH achieves a reconstruction loss of 1.65 vs. T5’s 1.95, despite the added sentence-level objective (Bamberger et al., 2024).
  • SpacTor-T5: Achieves downstream performance parity with standard T5-Base using half the number of training steps and 40% fewer FLOPs. Continual hybrid objectives (no curriculum) degrade performance after the initial phase, underscoring the benefit of staged switching (Ye et al., 2024).
  • Contrastive Span Prediction (COSTA): Yields significant improvements in retrieval metrics, e.g., MRR@10=0.366 (MS MARCO Passage) vs. BERT-base (0.335), outperforming both vanilla masked/autoencoding and earlier weak-decoder baselines (Ma et al., 2022).
  • Protein SMPC: Span-masked bi-level models show marked improvements on protein function (EC number F-max: 0.914 vs. 0.898), small-molecule binding site (IoU: 60.1 vs. 57.7), and nucleic acid binding tasks (Zhao et al., 2024).
  • Discourse Corruption: Multi-level span corruptions (combining complete shuffle, partial shuffle, drop, and replacement) enable state-of-the-art prediction of essay organization, with best models achieving MSE of 0.155 on the ICLE benchmark (Mim et al., 2020).

Table 2: Representative Evaluation Metrics

Model/Task    | Metric                         | Baseline | Span-Corruption Variant | Gain
DEPTH vs. T5  | Reconstruction loss @64k steps | 1.95     | 1.65                    | -0.30
SpacTor-T5    | Steps to parity                | 1M       | 0.5M                    | -50%
COSTA         | MRR@10 (MS MARCO Passage)      | 0.335    | 0.366                   | +0.031
Essay Org.    | MSE (ICLE)                     | 0.175    | 0.155                   | -0.020
Protein SMPC  | EC F-max                       | 0.898    | 0.914                   | +0.016

6. Generalizations and Domain Diversity

The span corruption methodology has proven robust across diverse modeling challenges:

  • Multilingual Pre-Training: DeltaLM and similar models employ span corruption for both monolingual and translation denoising, covering over 100 languages (Ma et al., 2021).
  • Discourse and Hierarchical Modeling: Span corruption enables hierarchically-structured objectives (DEPTH, essay scoring), teaching models both sentence-level ordering and intra-sentence dependency (Bamberger et al., 2024, Mim et al., 2020).
  • Dense Retrieval and Representation Learning: Contrastive alignment over sampled spans replaces generative reconstruction, producing more discriminative embeddings for retrieval (Ma et al., 2022).
  • Structural Biology: The span mask paradigm enables bi-level pre-training of protein structures, demonstrating that masking entire 3D residue spans prevents local information leakage, forcing long-range context learning (Zhao et al., 2024).

A plausible implication is that the fundamental insight of span corruption—forcing models to learn both local and global context via structured masking—generalizes well to any domain with naturally chunked or hierarchical data.

7. Limitations, Ablations, and Best Practices

Research has identified important nuances and practical recommendations:

  • Sentinel Randomization: In discourse-aware objectives (DEPTH), randomized sentinel ID assignment is necessary; monotonic schemes can leak position information, invalidating permutation-based training (Bamberger et al., 2024).
  • Span Granularity: Both COSTA and ERNIE-GEN demonstrate that multi-granularity span sampling (word, phrase, sentence, paragraph) improves generalization, but tuning the number and size of spans is critical, since too many or too few spans degrade results (Ma et al., 2022, Xiao et al., 2020).
  • Auxiliary Loss Weighting: Empirically, increasing the weight on sentence-level losses did not yield further gains in DEPTH (Bamberger et al., 2024).
  • Curriculum Learning: SSR and SpacTor-T5 show that scheduling example difficulty or hybrid objectives (curriculum learning) provides consistent boosts, whereas anti-curriculum or naive mixing can harm performance (Zhou et al., 2021, Ye et al., 2024).
  • Decoder Bypass Effect: COSTA argues that even weak decoders in autoencoder frameworks allow bypass, diluting encoder learning. Discriminative, decoderless objectives resolve this issue (Ma et al., 2022).
  • Arbitrary Span Corruption for Discourse: Shuffling, dropping, and replacement at sentence or paragraph scale can be composed to form multi-way discrimination losses for models targeting long-form or hierarchical structure prediction (Mim et al., 2020).

Initializations, masking ratios, span distributions, learning rates, and joint training schedules are all critical hyper-parameters; in practice, values optimized in early T5 or BART studies are often reused.
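
For orientation, the illustrative configurations below collect the hyper-parameter values quoted in this article; the field names are ad hoc and do not correspond to any framework's API.

```python
# Illustrative configurations assembled from the values quoted above;
# field names are ad hoc, not any framework's configuration schema.
SPAN_CORRUPTION_CONFIGS = {
    "depth": {
        "span_length_dist": "geometric", "mean_span_length": 3,
        "mask_rate": 0.30, "sentinel_ids": "randomized",
        "aux_objective": "sentence_unshuffling",
    },
    "deltalm_monolingual": {
        "span_length_dist": "geometric", "mean_span_length": 3,
        "mask_rate": 0.15, "sentinel_ids": "monotonic",
    },
    "deltalm_translation": {
        "span_length_dist": "fixed", "span_length": 3,
        "mask_rate": 0.50, "sentinel_ids": "monotonic",
    },
    "protein_smpc": {
        "span_length_dist": "poisson", "poisson_lambda": 6,
        "mask_rate": 0.30, "levels": ["residue", "atom"],
    },
}
```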


References:
(Bamberger et al., 2024) DEPTH: Discourse Education through Pre-Training Hierarchically.
(Zhou et al., 2021) Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting.
(Xiao et al., 2020) ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation.
(Ye et al., 2024) SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection.
(Ma et al., 2021) DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders.
(Ma et al., 2022) Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction.
(Mim et al., 2020) Corruption Is Not All Bad: Incorporating Discourse Structure into Pre-training via Corruption for Essay Scoring.
(Zhao et al., 2024) Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains.
