
Contrastive Span Prediction

Updated 23 February 2026
  • Contrastive span prediction is a method that refines neural span representations by pulling positive pairs together and pushing negative pairs apart with an explicit contrastive loss.
  • It integrates with various NLP tasks such as dense retrieval, machine reading comprehension, constituency parsing, and named entity recognition to provide fine-grained discrimination.
  • The approach leverages transformer-based architectures and combined loss strategies to achieve state-of-the-art results across diverse datasets and benchmarks.

Contrastive span prediction refers to a family of contrastive learning methods that directly supervise the geometry of span representations within deep neural models, typically to encourage fine-grained discrimination between spans of interest in sequence tasks. The central objective is to pull together the embeddings of "positive" span pairs (semantically or structurally associated) and push away "negatives" (semantically or structurally unrelated), leveraging explicit supervision at the span level rather than, or in addition to, document, sentence, or token representations. This paradigm has emerged as a unifying principle across dense retrieval, machine reading comprehension, constituency parsing, named entity recognition, and fine-grained document analysis, demonstrating state-of-the-art results across these areas.

1. Principles and Formalization of Contrastive Span Prediction

Contrastive span prediction augments standard pre-training or downstream objectives with a loss enforcing that representations of certain spans—typically defined by gold annotations, structural relations, or task-specific criteria—are closer in embedding space, while unrelated or adversarially perturbed spans are farther apart. Formally, given a set of candidate spans $S$ in a batch with associated span representations $r_i$, the usual supervised contrastive loss is:

$$\mathcal{L}_{\text{scl}} = -\sum_{i} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(r_i, r_p)/\tau)}{\sum_{a \neq i} \exp(\mathrm{sim}(r_i, r_a)/\tau)}$$

where $P(i)$ is the set of positive spans for $i$, $\mathrm{sim}$ is typically cosine similarity, and $\tau$ is a temperature hyperparameter. The particular definitions of positive and negative pairs vary by task and method (Ma et al., 2022, Si et al., 2022, Guo et al., 27 May 2025).
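As a concrete illustration, the loss above can be sketched in plain numpy. The label-based definition of $P(i)$ used here (spans sharing a label are positives) is one common choice; the batch construction is illustrative rather than any specific paper's setup:

```python
import numpy as np

def supervised_contrastive_loss(reps, labels, tau=0.1):
    """L_scl over span embeddings: for each anchor i, pull spans
    sharing its label (the positive set P(i)) together and push
    all other spans in the batch away, scaled by temperature tau."""
    # Cosine similarity via L2 normalization followed by dot products.
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = reps @ reps.T / tau

    n = len(labels)
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors with no positives contribute nothing
        denom = np.sum(np.exp(sim[i][np.arange(n) != i]))  # sum over all a != i
        # -(1/|P(i)|) * sum_{p in P(i)} log( exp(sim(i,p)/tau) / denom )
        total -= np.mean([np.log(np.exp(sim[i, p]) / denom) for p in positives])
    return total
```

A quick sanity check of the intended geometry: when labels agree with the embedding clusters the loss is lower than when positives lie in opposite clusters.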

Contrasts may be established between:

  • Full texts and their random subspans (COSTA (Ma et al., 2022))
  • Answerable questions and carefully constructed unanswerable variants at the span level (spanCL (Ji et al., 2022))
  • Gold constituent spans and structurally-adjacent (positive) or boundary-adjacent (negative) spans in parsing (Guo et al., 27 May 2025)
  • Spans of the same or different entity labels in NER (Si et al., 2022)

2. Model Architectures and Span Representations

In contrastive span prediction, span representations are built to capture salient local and contextual semantics, often coupling transformer-based contextual embeddings with explicit boundary or pooling schemes.

Dense Retrieval (COSTA):

  • Transformer encoder (e.g., BERT-base) maps the token sequence $x_1,\dots,x_n$ to hidden states $h_0,\dots,h_n$; $h_0$ (the [CLS] token) is projected to the text representation $z_i$.
  • Span representations $z_{i,p}$ are obtained by average pooling over the hidden states of the span tokens, then projected via an MLP with a $\tanh$ nonlinearity (Ma et al., 2022).

Constituency Parsing:

  • Token embeddings from BERT are refined via a self-attentive transformer.
  • Spans $(i,j)$ are encoded as $r_{i,j} = [\,h_j - h_i\,;\,h_j + h_i\,]$, with child/parent/sibling structural relations (Guo et al., 27 May 2025).

NER:

  • Span $s_{i,j}$ is represented by boundaries, difference, and interaction: $h_i \,\|\, h_j \,\|\, (h_i - h_j) \,\|\, (h_i \odot h_j)$, followed by a 2-layer MLP with a nonlinearity (Si et al., 2022).

Machine Reading Comprehension:

  • Span representations $z^Q = [h^{Q}_{y_s}; h^{Q}_{y_e}]$ are formed by concatenating the representations of the gold start/end tokens for the question and passage input (Ji et al., 2022).

Document Structure Tasks:

  • Paragraph-level and section-level embeddings fused with contextual and structural encoding (e.g., GATs in Sci-SpanDet, (Yin et al., 1 Oct 2025)), with span predictions handled via BIO-CRF layers and pointer mechanisms.
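The span featurizations above differ mainly in how boundary hidden states are combined. A minimal numpy sketch (shapes and the projection weights are illustrative assumptions, not the papers' exact configurations):

```python
import numpy as np

def costa_span(h, i, j, W, b):
    """COSTA: average-pool hidden states over [i, j), then project with tanh."""
    return np.tanh(h[i:j].mean(axis=0) @ W + b)

def parsing_span(h, i, j):
    """Parsing: r_{i,j} = [h_j - h_i ; h_j + h_i] (boundary difference and sum)."""
    return np.concatenate([h[j] - h[i], h[j] + h[i]])

def ner_span(h, i, j):
    """NER: [h_i ; h_j ; h_i - h_j ; h_i * h_j], fed to a 2-layer MLP downstream."""
    return np.concatenate([h[i], h[j], h[i] - h[j], h[i] * h[j]])

def mrc_span(h, s, e):
    """MRC: concatenate the start and end token representations."""
    return np.concatenate([h[s], h[e]])
```

With hidden size $d$, the parsing and MRC features are $2d$-dimensional and the NER feature is $4d$-dimensional, which is why the downstream MLP widths differ per task.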

3. Positive and Negative Span Mining

Span contrastive frameworks hinge on the mining of informative positive and negative pairs:

| Paper/Task | Positive Pairs | Negative Pairs |
|---|---|---|
| COSTA | Full text & its own spans | Any span/text from other documents |
| spanCL (MRC) | Answerable questions & paraphrases | Unanswerable question perturbations |
| SCL-RAI (NER) | Same-label spans | Other-label spans in the batch |
| CTPT (Parsing) | Structurally adjacent spans | Boundary-adjacent invalid spans |
| Sci-SpanDet | Same-source, same-section paragraphs | Different-section/source paragraphs |

Positive span construction typically leverages task structure: paraphrasing, structural adjacency, or gold-label identity. Negatives may be drawn adversarially (e.g., semantically plausible but invalid spans), via random sampling, or via batch-level construction.
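For instance, COSTA-style positive mining can be sketched as sampling random subspans of a document, which then act as positives for that document (and for each other), while spans drawn from other documents in the batch supply negatives. The length bounds here are illustrative, not the paper's settings:

```python
import numpy as np

def sample_positive_spans(n_tokens, n_spans=4, min_len=3, max_len=8, seed=0):
    """Draw random (start, end) token subspans of one document.

    Each sampled span is a positive for its source document; spans
    sampled from other documents in the batch serve as negatives."""
    rng = np.random.default_rng(seed)
    spans = []
    for _ in range(n_spans):
        length = int(rng.integers(min_len, min(max_len, n_tokens) + 1))
        start = int(rng.integers(0, n_tokens - length + 1))
        spans.append((start, start + length))
    return spans
```

Batch-level negatives then come for free: every span whose source document differs from the anchor's is a negative, which is the in-batch construction mentioned above.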

4. Combined Objectives and Training Strategies

Contrastive span prediction is seldom used in isolation; it is integrated with canonical loss terms for each sequence task:

  • Joint loss formulations combine a supervised loss (e.g., cross-entropy for NER/QA, tree max-margin for parsing) with a scaled contrastive span loss, e.g., $\mathcal{L} = (1-\lambda)\mathcal{L}_{\mathrm{ce}} + \lambda\mathcal{L}_{\mathrm{scl}}$ (Si et al., 2022), or $\mathcal{L} = \lambda_1\mathcal{L}_{\mathrm{span}} + \lambda_2\mathcal{L}_{\mathrm{spanCL}}$ (Ji et al., 2022).
  • In parsing, strict stagewise separation is sometimes crucial: contrastive span pre-training is performed before adding the tree-scoring loss, as simultaneous optimization destabilizes training (Guo et al., 27 May 2025).
  • For retrieval, COSTA combines a group-wise contrastive loss with MLM: $\mathcal{L} = \lambda\mathcal{L}_{\mathrm{GWC}} + \mathcal{L}_{\mathrm{MLM}}$ (Ma et al., 2022).
  • For generation or detection tasks, additional losses (e.g., CRF NLL, pointer B/MCE, cluster/prototype separation) supplement contrastive objectives (Yin et al., 1 Oct 2025).
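The loss-mixing schemes above reduce to simple scalar interpolations; the weight values below are hypothetical settings, not tuned values from the papers:

```python
def interpolated_loss(ce, scl, lam=0.3):
    """L = (1 - lambda) * L_ce + lambda * L_scl  (Si et al.-style mixing)."""
    return (1 - lam) * ce + lam * scl

def weighted_sum_loss(span, span_cl, lam1=1.0, lam2=0.5):
    """L = lambda1 * L_span + lambda2 * L_spanCL  (spanCL-style mixing)."""
    return lam1 * span + lam2 * span_cl
```

The practical difference is that the interpolated form keeps the total loss scale roughly constant as $\lambda$ varies, while the weighted sum lets each term's scale be set independently.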

Hyperparameters such as the loss weights ($\lambda$), temperature ($\tau$), and span sampling rates are tuned to maximize discriminative power and convergence properties.

5. Applications and Empirical Results

Contrastive span prediction establishes new standards across diverse NLP tasks:

Dense Retrieval: COSTA achieves MRR@10 of 0.366 on MS MARCO with static hard negatives, a +7% absolute improvement over previous weak-decoder baselines, and demonstrates superior clustering of true positive pairs in t-SNE visualization (Ma et al., 2022).

Machine Reading Comprehension: spanCL yields consistent dev-set improvements (EM/F1) of 0.9–2.1/0.8–2.0 points across BERT, RoBERTa, and ALBERT backbones on SQuAD 2.0, by directly contrasting answerable and subtly-unanswerable span variants (Ji et al., 2022).

Named Entity Recognition: SCL-RAI surpasses previous SOTA by up to 8.64 F1 on real-world datasets. The contrastive span loss improves robustness to unlabeled entities and dramatically shrinks intra-class span clusters (Si et al., 2022).

Constituency Parsing: On cross-domain multi-dataset benchmarks, CTPT on LLM back-generated silver treebanks attains average MCTB F1 of 88.48, outstripping domain-adaptive masked LM pre-training and natural-data bootstrapping strategies (Guo et al., 27 May 2025).

AI-Generated Text Detection: Sci-SpanDet integrates contrastive learning over section-aligned and source-conditioned paragraph embeddings, yielding F1(AI)=80.17, Span-F1=74.36, and superior calibration on a 100K-sample cross-disciplinary dataset (Yin et al., 1 Oct 2025). Removing the multi-level contrastive terms causes a marked drop in Span-F1.

6. Variants, Ablations, and Design Insights

Several design decisions are consistently substantiated:

  • Explicit span-level contrast is more effective than sentence-level or [CLS]-based contrast for tasks where fine-grained discrimination is essential (e.g., distinguishing answerable from subtly-unanswerable variants in MRC) (Ji et al., 2022).
  • In NER, pulling together spans of the same gold label forms tight, robust clusters even under the presence of noisy "O" (non-entity) spans; retrieval-augmented inference further stabilizes predictions (Si et al., 2022).
  • Structural span mining (parent, sibling, child in parsing) and close-boundary negatives generate particularly informative span contrasts (Guo et al., 27 May 2025).
  • In ablation studies, key factors such as the inclusion of paragraph-level spans (COSTA), span count per example, span encoder nonlinearity, and temperature parameter can markedly alter downstream performance (Ma et al., 2022).
  • Combining contrastive span pre-training with task-specific fine-tuning (e.g., max-margin tree loss) is more stable and effective than simultaneous optimization in parsing (Guo et al., 27 May 2025).

A plausible implication is that task-specific construction of positive/negative pairs, especially when exploiting structural or semantic task properties, is vital to the success of contrastive span prediction.

7. Limitations and Prospects

Contrastive span prediction frameworks introduce certain costs, such as increased sampling and memory requirements (e.g., high span count per batch), and sensitivity to negative/positive sampling strategies. Addressing these challenges—via curriculum or adaptive span sampling, extending to cross-lingual or long-document settings, or integrating with multi-vector or late-interaction retrieval models—constitutes an active area of research (Ma et al., 2022, Guo et al., 27 May 2025). Empirical evidence also suggests diminishing returns past moderate span sampling rates, and that effective loss balancing and encoder architectures are crucial. Theoretical investigation of the optimality and geometry of learned span representations remains an open direction.

In conclusion, contrastive span prediction has become a central architectural and training motif in contemporary sequence prediction research. By directly supervising the geometry of span representations, these methods have delivered substantial advances in retrieval, comprehension, parsing, NER, and document-level detection. The underlying paradigm—span-centric contrastive learning—now serves as a design primitive for robust, generalizing, and fine-grained sequence encoders.
