Error Span Detection (ESD)
- Error Span Detection (ESD) is a technique that explicitly localizes and labels contiguous error spans in text using structured prediction methods.
- It employs models like transformer sequence taggers, multi-task encoders, or generative LLMs to tag tokens with BIO labels and severity levels.
- ESD enhances applications such as grammatical error correction and machine translation evaluation by providing interpretable, severity-aware insights for robust diagnostics.
Error Span Detection (ESD) is a class of methods in natural language processing designed to explicitly localize, delineate, and label error spans within a text. ESD plays a critical role in both grammatical error correction and fine-grained evaluation of natural language generation outputs, such as machine translation. Unlike scalar metrics that provide only an overall quality estimate, ESD yields structured, interpretable diagnostics by pinpointing where errors occur and, in many frameworks, assigning severity levels.
1. Formal Definition and Problem Formulation
In its canonical form, ESD operates over a sequence of tokens (possibly after subword segmentation) and aims to identify all contiguous subsequences (spans) where the text diverges from correctness under a specific reference or annotation protocol. For grammatical error correction (GEC), this involves aligning to a corrected reference and labeling tokens that participate in edits. For reference-free machine translation evaluation, ESD operates without gold references, instead localizing spans perceived as erroneous given the source and the system output.
Formally, ESD reduces to a sequence labeling or structured prediction task. Typical label sets include the BIO (Begin, Inside, Outside) schema for error boundary identification (Chen et al., 2020), or categorical severity tags such as {OK, minor, major, critical} for fine-grained error severity as in modern MT evaluation (Guerreiro et al., 2023). In generative LLM-based evaluation, error annotations are structured as sets of character or token spans, each with an attached severity (Lyu et al., 8 Dec 2025).
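To make the reduction concrete, the sketch below converts character-offset error spans into token-level BIO tags; the whitespace tokenizer and the tag inventory are illustrative assumptions rather than the exact scheme of any cited system.

```python
# Minimal sketch: encode annotated error spans as BIO tags over whitespace
# tokens. The tag inventory (O / B-<sev> / I-<sev>) is an illustrative
# assumption, not the exact scheme of any cited system.

def spans_to_bio(text, spans):
    """spans: list of (start_char, end_char, severity) tuples."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = ["O"] * len(tokens)
    for start, end, sev in spans:
        inside = False
        for i, (ts, te) in enumerate(offsets):
            if ts < end and te > start:          # token overlaps the span
                tags[i] = ("I-" if inside else "B-") + sev
                inside = True
    return list(zip(tokens, tags))

print(spans_to_bio("she go to school yesterday", [(4, 6, "minor")]))
# [('she', 'O'), ('go', 'B-minor'), ('to', 'O'), ('school', 'O'), ('yesterday', 'O')]
```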
2. Architectures and Learning Paradigms
2.1 Transformer Sequence Taggers
In GEC, ESD is typically cast as a token-level sequence labeling problem. A standard implementation is a RoBERTa-based transformer encoder, where each token or subword is embedded and contextualized through multiple self-attention layers. The final representations are consumed by a linear classifier (softmax or CRF head), predicting the BIO tag for each input position (Chen et al., 2020). A conditional random field (CRF) layer can be applied on top to capture label dependencies, with the output probability

$$P(y \mid x) = \frac{\exp S(x, y)}{\sum_{y'} \exp S(x, y')}, \qquad S(x, y) = \sum_{i=1}^{n} \left( T_{y_{i-1}, y_i} + E_{i, y_i} \right),$$

where the global score $S(x, y)$ aggregates transition potentials $T$ and emission potentials $E$.
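These quantities can be computed directly; the following is a minimal sketch with random potentials standing in for a trained tagger's outputs (the shapes and the three-label tag set are assumptions for illustration).

```python
import numpy as np

# Minimal linear-chain CRF arithmetic with toy potentials. E[i, y] is the
# emission score for label y at position i; T[y_prev, y] is the transition
# score. Both would normally come from a trained tagger.
rng = np.random.default_rng(0)
n_pos, n_labels = 5, 3                    # e.g., labels {O, B-ERR, I-ERR}
E = rng.normal(size=(n_pos, n_labels))
T = rng.normal(size=(n_labels, n_labels))

def sequence_score(y):
    """Global score S(x, y): sum of emission plus transition potentials."""
    s = E[0, y[0]]
    for i in range(1, len(y)):
        s += T[y[i - 1], y[i]] + E[i, y[i]]
    return s

def log_partition():
    """log sum_y' exp S(x, y') via the forward algorithm, O(n |Y|^2)."""
    alpha = E[0]                          # log-scores of length-1 prefixes
    for i in range(1, n_pos):
        alpha = E[i] + np.logaddexp.reduce(alpha[:, None] + T, axis=0)
    return np.logaddexp.reduce(alpha)

y = [0, 1, 2, 0, 0]
print(sequence_score(y) - log_partition())   # log P(y | x)
```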
2.2 Multi-Task Encoder-Only Models
xCOMET (Guerreiro et al., 2023) integrates ESD into a multi-task, encoder-only architecture using XLM-R as the backbone. For each input triple (translation, source, reference), the model yields both a pooled sentence representation (for global quality) and per-token logits over the error severity classes. Cross-entropy and mean-squared error losses are combined with data-driven balancing to optimize both granular ESD and overall segment evaluation.
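Schematically, the combined objective pairs a regression loss on the pooled representation with a token-level classification loss; the sketch below assumes particular tensor shapes and a fixed balancing weight, and is not xCOMET's actual implementation.

```python
import torch
import torch.nn.functional as F

# Schematic multi-task loss combining sentence-level quality regression with
# token-level severity classification. The shapes and the balancing weight
# `lam` are illustrative assumptions, not xCOMET's actual configuration.
def multitask_loss(pooled, token_logits, quality_target, severity_labels,
                   lam=1.0, ignore_index=-100):
    # pooled: (batch,) predicted segment quality
    # token_logits: (batch, seq_len, n_severity_classes)
    # severity_labels: (batch, seq_len) gold per-token severity ids
    sent_loss = F.mse_loss(pooled, quality_target)
    tok_loss = F.cross_entropy(
        token_logits.flatten(0, 1),          # (batch*seq_len, classes)
        severity_labels.flatten(),
        ignore_index=ignore_index,           # mask padding / special tokens
    )
    return sent_loss + lam * tok_loss

B, L, C = 2, 7, 4
loss = multitask_loss(torch.rand(B), torch.randn(B, L, C),
                      torch.rand(B), torch.randint(0, C, (B, L)))
```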
2.3 Generative LLMs
Recent MT evaluation research employs generative LLMs for ESD by prompting the model to output error annotations directly (e.g., GEMBA-MQM–style JSON). The task is formulated as generating the most likely annotation conditioned on system inputs (Lyu et al., 8 Dec 2025). Decoding strategies such as Maximum a Posteriori (MAP) and Minimum Bayes Risk (MBR) are used to select the final annotation.
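For concreteness, the sketch below parses one plausible JSON annotation shape into character spans; the field names and severity vocabulary are assumptions, as the exact schema varies by prompt.

```python
import json

# Parse an LLM-emitted error annotation into (start, end, severity) spans.
# The JSON shape below is one plausible convention; actual GEMBA-MQM-style
# prompts may use different field names.
raw = '''{"errors": [
  {"span": "go", "start": 4, "end": 6, "severity": "minor"},
  {"span": "yesterday", "start": 17, "end": 26, "severity": "major"}
]}'''

def parse_annotation(payload):
    spans = []
    for err in json.loads(payload).get("errors", []):
        sev = err.get("severity", "minor").lower()
        spans.append((err["start"], err["end"], sev))
    return spans

print(parse_annotation(raw))   # [(4, 6, 'minor'), (17, 26, 'major')]
```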
3. Training Regimes and Data Annotation
ESD systems are trained on large-scale corpora, often with a staged curriculum:
- GEC: Pretrained on 256M synthetic sentences (random noising + backtranslation), fine-tuned on curated datasets (FCE, Lang-8, NUCLE, W&I+LOCNESS) (Chen et al., 2020).
- xCOMET: Warm-up on direct assessment (DA) data, intensive MQM word-level supervision, and further sentence-level fine-tuning (Guerreiro et al., 2023).
- Generative LLMs: Zero- or few-shot ESD on WMT MQM annotations, with additional DPO-based distillation to align greedy decoding with MBR-optimal decisions (Lyu et al., 8 Dec 2025).
All systems rely on gold span-level annotations, with severity labels sourced from MQM annotations in MT and span boundaries derived via alignment with corrected references in GEC.
4. Decoding and Inference Strategies
ESD output is determined via different decoding strategies:
- Sequence Taggers: For softmax outputs, labels are assigned pointwise by $\hat{y}_i = \arg\max_{y \in \mathcal{Y}} P(y \mid h_i)$. For CRF heads, the Viterbi algorithm recovers the highest-scoring label sequence with complexity $O(n|\mathcal{Y}|^2)$ (Chen et al., 2020).
- Span Grouping: Adjacent non-OK tags are merged into error spans; each span's severity is defined as the maximal token severity it contains (see the sketch after this list) (Guerreiro et al., 2023).
- MBR Decoding (MT Evaluation): Instead of MAP, ESD outputs are chosen to minimize expected loss (equivalently, maximize expected utility) under a utility function $U$ (e.g., SoftF1 or ScoreSim) over a sampled support set of candidate annotations $\mathcal{E}$:

$$\hat{e} = \arg\max_{e \in \mathcal{E}} \frac{1}{|\mathcal{E}|} \sum_{e' \in \mathcal{E}} U(e, e').$$

MBR improves correlation with human annotation compared to MAP (Lyu et al., 8 Dec 2025).
- Distillation: Because MBR is computationally heavy (requiring $N$ sampled annotations and $O(N^2)$ utility evaluations), DPO-based distillation fine-tunes the LLM so that a single greedy pass recreates MBR-optimal ESD decisions (Lyu et al., 8 Dec 2025).
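The span-grouping rule and MBR selection referenced above can be sketched as follows; the severity ordering and the overlap-based utility are illustrative simplifications, not the cited papers' exact definitions.

```python
# Sketch of (a) grouping per-token severity tags into spans and (b) MBR
# selection among candidate annotations. The severity ranking and the
# overlap-based utility are illustrative assumptions.
SEV_RANK = {"OK": 0, "minor": 1, "major": 2, "critical": 3}

def group_spans(tags):
    """Merge adjacent non-OK tags; span severity = max token severity."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["OK"]):       # sentinel closes last span
        if tag != "OK" and start is None:
            start, worst = i, tag
        elif tag != "OK":
            worst = max(worst, tag, key=SEV_RANK.get)
        elif start is not None:
            spans.append((start, i, worst))        # [start, i) token span
            start = None
    return spans

def soft_f1(a, b):
    """Soft overlap utility between two annotations (assumed form)."""
    def cover(x):  # set of token positions covered by annotation x
        return {i for s, e, _ in x for i in range(s, e)}
    ca, cb = cover(a), cover(b)
    if not ca and not cb:
        return 1.0
    return 2 * len(ca & cb) / (len(ca) + len(cb))

def mbr_select(candidates):
    """Pick the candidate with the highest average utility to the others."""
    return max(candidates,
               key=lambda e: sum(soft_f1(e, ep) for ep in candidates))

tags = ["OK", "minor", "major", "OK", "critical"]
print(group_spans(tags))   # [(1, 3, 'major'), (4, 5, 'critical')]
```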
5. Evaluation Metrics and Empirical Results
5.1 Token/Character-Level Precision and Recall
GEC ESD uses token-level precision, recall, and $F_{0.5}$ metrics, matching prior grammatical error detection standards. For instance, a RoBERTa-base + pretrained ESD system reports competitive $F_{0.5}$ on CoNLL-14's second annotation set (Chen et al., 2020).
MT-based ESD, as in xCOMET and MBR evaluation, computes span-level F1 at the character level, accounting for both partial and complete matches (Guerreiro et al., 2023, Lyu et al., 8 Dec 2025).
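A character-level span F1 of this kind can be computed from covered character sets, as in the sketch below (the cited papers' exact matching rules, e.g., severity weighting, may differ).

```python
# Sketch: character-level precision/recall/F1 between predicted and gold
# error spans, counting per-character overlap so partial matches earn
# partial credit. Severity handling is omitted for brevity.
def char_f1(pred_spans, gold_spans):
    """Spans are (start_char, end_char) pairs, end-exclusive."""
    pred = {i for s, e in pred_spans for i in range(s, e)}
    gold = {i for s, e in gold_spans for i in range(s, e)}
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(char_f1([(4, 6), (10, 16)], [(4, 6), (12, 20)]))  # ~0.667, partial overlap
```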
5.2 Severity-Aware Aggregates
Error counts are weighted by severity to derive an aggregate penalty, e.g., with MQM-style weights (minor = 1, major = 5, critical = 10):

$$\text{penalty} = 1 \cdot n_{\text{minor}} + 5 \cdot n_{\text{major}} + 10 \cdot n_{\text{critical}},$$

with an inferred MQM score normalized to $[0, 1]$ (Guerreiro et al., 2023). This provides an interpretable mapping from local ESD outputs to global quality metrics.
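As a worked example under the weights above, two minor errors and one major error accrue a penalty of $2 \cdot 1 + 1 \cdot 5 = 7$; a minimal sketch of the mapping to a normalized score follows (the normalization cap is an illustrative assumption).

```python
# Sketch: severity-weighted penalty and a normalized quality score.
# The weights follow the MQM-style convention above; the normalization
# constant (maximum penalty) is an illustrative assumption.
WEIGHTS = {"minor": 1, "major": 5, "critical": 10}
MAX_PENALTY = 25  # assumed cap for normalization

def mqm_score(spans):
    """spans: list of (start, end, severity); returns a score in [0, 1]."""
    penalty = sum(WEIGHTS[sev] for _, _, sev in spans)
    return max(0.0, 1.0 - penalty / MAX_PENALTY)

print(mqm_score([(4, 6, "minor"), (10, 16, "minor"), (17, 26, "major")]))
# penalty = 1 + 1 + 5 = 7  ->  score = 1 - 7/25 = 0.72
```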
5.3 Empirical Performance
- GEC ESD: Pretrained base models reach strong span-detection $F_{0.5}$, with >75% recall on short (1–2 token) errors. Multi-token and idiomatic errors remain more challenging (Chen et al., 2020).
- xCOMET: On WMT23 QE test sets, xCOMET-XXL achieves character-level F1=0.257 (major+minor), exceeding GPT-3.5 and on par with GPT-4 (Guerreiro et al., 2023).
- MBR ESD: MBR-SoftF1 yields consistent gains over MAP decoding at the system, sentence, and span levels, e.g., SoftF1 = .932 with N = 256 candidates and SPA = .848 vs. .823 for the MAP baseline (Lyu et al., 8 Dec 2025).
| Model | Span F1 | System PA | Sentence Acc* |
|---|---|---|---|
| xCOMET-XXL | .257 | .82 | — |
| Llama-MBR-SoftF1 | .932 | .848 | .571 |
PA: soft pairwise accuracy; Acc (*): calibrated pairwise accuracy. Span F1 is not directly comparable across rows (character-level F1 for xCOMET-XXL; SoftF1 for Llama-MBR-SoftF1).
MBR-SoftF1 also aligns error distributions more closely with human annotation, reducing over-generation of major errors relative to MAP decoding (Lyu et al., 8 Dec 2025).
6. Robustness, Ablations, and Limitations
xCOMET and related models have undergone robustness analysis using synthetic perturbations and hallucination injection (Guerreiro et al., 2023). xCOMET detects >90% of major/critical localized errors and achieves high AUROCs on hallucination benchmarks. MBR ablations show that candidate set diversity is crucial: performance gains vanish for small N; improvements plateau beyond N=256 (Lyu et al., 8 Dec 2025). Span-level metrics based on hard F1 can penalize non-overlapping predictions excessively, motivating smooth alternatives like SoftF1.
Key limitations include the lack of explicit error-type annotation (most systems yield only severity labels), susceptibility to subword tokenization artifacts, incomplete coverage of low-resource languages, and inference latency for multi-pass or MBR models (Guerreiro et al., 2023, Lyu et al., 8 Dec 2025). Ongoing work explores integrated error-type classification, more precise span boundary labeling (e.g., BIO tagging on subwords), and lower-latency architectures (e.g., adapter-style unary encoders or distillation techniques).
7. Applications and Impact
ESD supports a range of applications:
- Efficient GEC: By restricting detailed correction to spans pre-localized by ESD, systems reduce inference cost by 2–3× compared to full-sentence seq2seq models, with comparable accuracy (Chen et al., 2020).
- MT Evaluation: ESD enables sentence- and system-level evaluation metrics driven by interpretable error localization and severity weighting, facilitating detailed system debugging and quality estimation without scalar collapse (Guerreiro et al., 2023, Lyu et al., 8 Dec 2025).
- Automation and Post-Editing: ESD’s fine-grained output supports semi-automatic post-editing workflows where only localized spans are flagged and routed downstream.
- Transparency and Robustness: The interpretability of error spans empowers error analysis, robustness studies, and stakeholder-facing reporting, as exemplified in xCOMET’s color-coded visualization interface (Guerreiro et al., 2023).
A plausible implication is that as ESD frameworks grow in scope, they will become increasingly central to the evaluation and deployment of high-stakes NLP systems, enabling diagnostic insight absent from scalarized metrics. Continuous advances in utility-driven decoding and severity-calibrated annotation are likely to yield further gains in both accuracy and practical usability.