Error Span Annotation in NLG Evaluation
- Error Span Annotation (ESA) is a methodology that localizes, classifies, and weights erroneous text spans, enabling detailed evaluation and training for NLG models.
- ESA employs formal schemas, detailed protocols, and both human and AI-assisted workflows to achieve precise error detection with reduced annotation cost.
- ESA enhances model fine-tuning by integrating weighted loss functions and benchmarked evaluation metrics that improve translation quality and post-editing accuracy.
Error Span Annotation (ESA) is a methodology for localizing, classifying, and weighting erroneous spans within text, primarily for the evaluation, training, and diagnosis of Natural Language Generation tasks such as Machine Translation (MT) and LLM outputs. ESA bridges the granularity of full error taxonomies—such as the MQM standard—and the simplicity of global quality scores, enabling fine-grained signal for both human evaluation and model learning at manageable annotation cost and cognitive load (Kocmi et al., 2024, Zhang et al., 2024, Kasner et al., 11 Apr 2025, Chen et al., 2020). ESA encompasses protocol design, formal span schemas, loss functions for model training, and standardized evaluation metrics.
1. Formalization, Protocols, and Schemas
ESA operates over a sequence or pair of sequences (e.g., source S and translation T) by identifying a set of error spans. Each error span is a contiguous substring in the output (and possibly also in the source), labeled with severity and, optionally, with an error type and rationale.
The ESA annotation schema (as formalized by TranslationCorrect and related pipelines) is as follows (Wasti et al., 23 Jun 2025):
| Field | Type | Description |
|---|---|---|
| source_text | string | Original input segment S |
| mt_text | string | System output segment T |
| error_spans | list | List of error-span records, each with: |
| – start_index_trans | integer | 0-based char offset in T (start) |
| – end_index_trans | integer | 0-based char offset in T (exclusive end) |
| – error_type | string | Category from a fixed set (e.g. Omission, Grammar, Addition, etc.) |
| – error_severity | string | “Minor” or “Major” |
| – description | string | Brief human-readable explanation or correction hint |
Additional fields may capture source alignment, associated spans, or minimal corrections (He et al., 6 Mar 2025).
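As a concrete illustration, the schema can be captured with Python dataclasses. This is a minimal sketch following the field names in the table above; the validation helper is illustrative rather than part of any cited toolkit.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ErrorSpan:
    start_index_trans: int      # 0-based char offset in T (start, inclusive)
    end_index_trans: int        # 0-based char offset in T (end, exclusive)
    error_type: str             # e.g. "Omission", "Grammar", "Addition"
    error_severity: str         # "Minor" or "Major"
    description: str = ""       # brief explanation or correction hint

@dataclass
class ESASegment:
    source_text: str                                # original input segment S
    mt_text: str                                    # system output segment T
    error_spans: List[ErrorSpan] = field(default_factory=list)

    def validate(self) -> None:
        """Check that every span is well-formed and lies inside the translation."""
        for s in self.error_spans:
            assert 0 <= s.start_index_trans < s.end_index_trans <= len(self.mt_text)
            assert s.error_severity in {"Minor", "Major"}
```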
Annotation is governed by detailed guidelines: errors are marked as minimal spans covering the mistake, with careful differentiation by severity. Categories are mapped to either coarse-grained tags (as in base ESA (Kocmi et al., 2024)) or more refined MQM-derived taxonomies (Addition, Omission, Mistranslation, etc.) (Wasti et al., 23 Jun 2025).
For tasks beyond MT, ESA generalizes to span-level error detection in GEC (Chen et al., 2020), hallucination identification in data-to-text (Kasner et al., 11 Apr 2025), and compositional error taxonomies in PLM-generated text (He et al., 6 Mar 2025).
2. Human and AI-Assisted Annotation Workflows
Human ESA protocols are exemplified by Appraise-style interfaces in which annotators:
- Read source and translation side-by-side.
- Highlight each erroneous span, marking “Major” or “Minor” severity (or applicable category).
- Use a special “[MISSING]” tag for omissions.
- Assign a direct [0,100] quality score after marking errors (Kocmi et al., 2024).
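As a worked example, the record below shows one segment annotated under this protocol using the dataclass sketch above; the sentence pair, character offsets, and final score are invented purely for illustration.

```python
segment = ESASegment(
    source_text="Der Vertrag wurde gestern unterzeichnet.",
    mt_text="The contract was signed.",
    error_spans=[
        ErrorSpan(
            start_index_trans=23,          # the final "." where content is missing
            end_index_trans=24,
            error_type="Omission",
            error_severity="Major",
            description='"gestern" (yesterday) is untranslated; marked via [MISSING]',
        )
    ],
)
segment.validate()
direct_score = 72                          # [0,100] quality score given after span marking
```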
AI-assisted ESA leverages high-recall QE models (e.g., GPT-4 in few-shot mode) to pre-populate candidate spans. Annotators then post-edit these suggestions, removing, resizing, re-categorizing, or confirming them as needed. This pipeline, as validated in ESA^AI, halves per-span annotation time (71 s → 31 s) and allows further budget reduction by omitting segments with no predicted errors (≈24% savings) (Zouhar et al., 2024). Automation bias is minimal, as annotators do not over-rely on imperfect AI hints, and quality-control perturbations show error detection rates comparable to human-only annotation.
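A schematic of this pre-annotate-then-post-edit loop is sketched below; the `qe_model.predict_spans` and `human_post_edit` interfaces are assumed placeholders, not APIs from the cited work.

```python
def esa_ai_annotate(segments, qe_model, human_post_edit, skip_clean=True):
    """AI-assisted ESA: pre-populate candidate spans, then let humans post-edit.

    Segments with no predicted errors can optionally be skipped entirely,
    which is where the reported ~24% budget saving comes from.
    """
    annotated = []
    for seg in segments:
        candidate_spans = qe_model.predict_spans(seg.source_text, seg.mt_text)
        if skip_clean and not candidate_spans:
            # No predicted errors: accept the segment as-is without human review.
            annotated.append((seg, []))
            continue
        # Annotators remove, resize, re-categorize, or confirm the suggestions.
        final_spans = human_post_edit(seg, candidate_spans)
        annotated.append((seg, final_spans))
    return annotated
```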
LLMs achieve inter-annotator agreement on par with human crowdworkers and can serve as cost-effective, scalable span annotators in a range of tasks (MT error detection, propaganda, data-to-text) when provided with appropriate zero-shot or few-shot structured-output prompts (Kasner et al., 11 Apr 2025).
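A minimal sketch of using an LLM as a span annotator with structured output follows; the prompt wording and the `call_llm` helper are assumptions, and only the idea of requesting machine-parseable span records follows the cited setup.

```python
import json

PROMPT = """You are an error annotator. Given a source and its translation,
return a JSON list of error spans. Each item must have the keys
start_index_trans, end_index_trans, error_type, error_severity, description.
Return [] if the translation is error-free.

Source: {src}
Translation: {mt}
JSON:"""

def llm_annotate(src: str, mt: str, call_llm) -> list:
    """Query the LLM and parse its structured output into span records."""
    raw = call_llm(PROMPT.format(src=src, mt=mt))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []   # treat unparseable output as "no spans found"
```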
3. ESA in Model Training and Fine-Tuning
ESA not only supports evaluation but is a cornerstone of novel fine-tuning objectives for NLG models. The Training with Annotations (TWA) framework integrates span-level error signals into MT model learning (Zhang et al., 2024):
- Weights: Each error span identified in the output is assigned a weight per the MQM mapping (e.g., $w_s = 5$ for major errors, $w_s = 1$ for minor errors, $w_s = 0.1$ for minor punctuation). Tokens inside span $s$ inherit $w_s$. Tokens before the first error receive weight $1$; tokens after the first error are given weight $0$ as "off-trajectory."
- Loss Functions: For tokens inside error spans ($w_t > 0$), a weighted unlikelihood loss is applied:
$$\mathcal{L}_{\mathrm{UL}} = -\sum_{t \in \text{error spans}} w_t \log\bigl(1 - p_\theta(y_t \mid y_{<t}, x)\bigr).$$
For non-error tokens before the first error ($w_t = 1$), standard cross-entropy $-\log p_\theta(y_t \mid y_{<t}, x)$ is used. Spans with weight zero are skipped. A sketch of this objective appears after this list.
- Trajectory Cut-off: Only "on-trajectory" spans (before first error) are positively reinforced.
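The following PyTorch-style sketch illustrates the TWA objective under the assumptions stated above; the severity weights and the reduction are illustrative choices consistent with the description, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative severity weights following the MQM-style mapping described above.
SEVERITY_WEIGHT = {"Major": 5.0, "Minor": 1.0, "Minor-punctuation": 0.1}

def twa_loss(logits, targets, error_mask, token_weights):
    """Training-with-Annotations loss for a single sequence (sketch).

    logits:        (T, V) model outputs over the vocabulary
    targets:       (T,)   token ids of the annotated system output
    error_mask:    (T,)   bool, True for tokens inside an annotated error span
    token_weights: (T,)   w_t: severity weight inside error spans, 1.0 before
                          the first error, 0.0 after it (off-trajectory)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p(y_t | y_<t, x)

    p = token_logp.exp().clamp(max=1 - 1e-6)
    unlikelihood = -token_weights * torch.log1p(-p)   # push annotated error tokens down
    cross_entropy = -token_weights * token_logp       # keep on-trajectory tokens likely

    loss = torch.where(error_mask, unlikelihood, cross_entropy)
    return loss.sum() / token_weights.sum().clamp(min=1e-8)
```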
Ablation studies confirm that training with error-token penalties and off-trajectory filtering yields the largest accuracy gains. TWA outperforms supervised fine-tuning on filtered data and sequence-level preference-optimization baselines on the En→De and Zh→En WMT23 MT tasks (Zhang et al., 2024).
4. Evaluation Metrics and Benchmarking
ESA protocols utilize matched-pair and aggregate scoring for system comparison. Key metrics include:
- Segment Score: Direct [0,100] annotation, or an MQM-like weighted penalty (e.g., $5\cdot\#\text{major} + 1\cdot\#\text{minor}$), though the latter is unbounded and less favored in practical ESA (Kocmi et al., 2024).
- Span Detection: Token-level or span-level precision, recall, and $F_1$ (or $F_{0.5}$, as appropriate). Hard and soft metrics are computed, with the latter ignoring error type (Chen et al., 2020, He et al., 6 Mar 2025). A minimal scoring sketch follows this list.
- Agreement: Intra-/inter-annotator agreement measured with Kendall's $\tau$, Pearson's $r$, and recall of major errors for reliability quantification (Kocmi et al., 2024, Kasner et al., 11 Apr 2025).
- System Ranking: Pairwise accuracy, Spearman correlation, and significance clustering vs. WMT gold (Kocmi et al., 2024, Zouhar et al., 2024).
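The soft span-detection scoring can be sketched as follows; the character-overlap matching rule is one reasonable choice, and the exact criterion varies across the cited works.

```python
def span_prf(pred_spans, gold_spans, min_overlap=1):
    """Span-level precision/recall/F1 with character-overlap ("soft") matching.

    pred_spans, gold_spans: lists of (start, end) character offsets.
    A predicted span counts as correct if it overlaps some gold span by at
    least `min_overlap` characters; error types are ignored here.
    """
    def overlaps(a, b):
        return min(a[1], b[1]) - max(a[0], b[0]) >= min_overlap

    matched_pred = sum(any(overlaps(p, g) for g in gold_spans) for p in pred_spans)
    matched_gold = sum(any(overlaps(p, g) for p in pred_spans) for g in gold_spans)
    precision = matched_pred / len(pred_spans) if pred_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```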
ESA-based human protocols provide higher inter- and intra-annotator agreement and faster annotation times than full MQM, with no loss in system-level accuracy (Kocmi et al., 2024, Zouhar et al., 2024).
ESA data has enabled the development of benchmarked span-level error detectors, evaluated with metrics such as average precision (AP) and precision at top-k. Reliable alignment between predicted and gold error spans is crucial in these evaluation protocols (Klie et al., 2022).
5. ESA in Automated Error Detection and Correction
ESA has driven innovations in model-based error span identification and correction:
- Erroneous Span Detection and Correction (ESD/ESC): A pipeline of sequence tagging to mark error spans (using the BIO scheme; a minimal decoding sketch follows this list) followed by span-targeted correction (seq2seq over bracketed error tokens), which yields GEC accuracy comparable to monolithic seq2seq models with up to 2.5× faster inference (Chen et al., 2020).
- LLM-Based Automatic ESA & Filtering: LLMs (GEMBA-MQM, MQM-APE) can localize and label error spans with MQM-style categories. By leveraging automatic post-editing (APE) and quality differentials, only “impactful” spans that improve an external metric (e.g., CometKiwi, BLEURT) are retained, improving interpretability and alignment with human annotation (Lu et al., 2024). Segment-level and system-level improvements, as well as qualitative error-type distribution shifts (less over-generation of minor/style errors), are demonstrated.
- Automatic Benchmarks: ESA protocols have enabled comprehensive diagnostic tasks (error detection, error-type classification, rationale generation) in PLM evaluation, as instantiated in TGEA (He et al., 6 Mar 2025).
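To make the ESD tagging step concrete, here is a minimal BIO-decoding sketch; the tag names and decoding rule follow the standard BIO convention rather than code released with Chen et al. (2020).

```python
def bio_to_spans(tokens, tags):
    """Convert per-token BIO tags into (start_token, end_token) error spans.

    Tags use the usual convention: "B" opens an erroneous span, "I" continues
    it, and "O" marks tokens outside any error span. End offsets are exclusive.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                      # a new error span begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":                    # close any open span
            if start is not None:
                spans.append((start, i))
                start = None
        # tag == "I": the span continues, nothing to do
    if start is not None:
        spans.append((start, len(tags)))
    return spans

# Example: "go to" (token indices 1-2) is marked as one erroneous span.
print(bio_to_spans(["He", "go", "to", "school", "."], ["O", "B", "I", "O", "O"]))
# -> [(1, 3)]
```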
6. Limitations, Open Problems, and Future Directions
Despite strong empirical results, several challenges in ESA remain (Klie et al., 2022, Wasti et al., 23 Jun 2025, Kocmi et al., 2024):
- Boundary alignment and detection of missing spans remain difficult—most metrics assume perfect gold spans, and misaligned boundaries can dominate F-scores.
- Error ambiguity and label inconsistency complicate ground-truth definitions; soft-labeling or probabilistic disagreement modeling is proposed as an avenue.
- Span-type classification is not present in base ESA protocols (e.g., Appraise) but is being added via MQM-inspired taxonomies for more diagnostic feedback (Wasti et al., 23 Jun 2025).
- Scaling and cost are under continuous optimization via AI assistance, automation, and crowd-sourcing with reduced guideline complexity (Zouhar et al., 2024, Wasti et al., 23 Jun 2025).
- Generalization to new domains, low-resource languages, or structurally distinct NLG tasks (hallucination detection, factual error identification) is ongoing, with proposed frameworks adopting the ESA schema for task adaptation (Zhang et al., 2024, Kasner et al., 11 Apr 2025).
- Integration with correction and semi-automatic post-editing pipelines is emerging as a best-practice in research and industry annotation tools (Wasti et al., 23 Jun 2025, Chen et al., 2020).
ESA has established itself as a critical protocol for bridging fine-grained error localization and efficient annotation, powering both robust human evaluation and the next generation of error-aware model training paradigms. Theoretical generality, cross-domain applicability, and empirical cost-efficiency continue to drive broad adoption and methodological innovation.