
Error Span Assessment (ESA)

Updated 16 April 2026
  • Error Span Assessment is a method for identifying contiguous error segments by annotating spans with specific types and severity levels.
  • It streamlines error analysis in machine translation and computational models, reducing annotation time while enhancing diagnostic accuracy.
  • ESA frameworks provide a structured workflow for both human and AI-assisted evaluations, enabling efficient export and model quality assessment.

Error Span Assessment (ESA) is a family of protocols and methodologies for localizing, quantifying, and categorizing errors in both human and machine-generated outputs. Across machine translation (MT), computational modeling, and algorithmic learning theory, ESA principles provide granular diagnostic information beyond scalar error metrics. ESA frameworks ask annotators or systems to identify contiguous segments ("spans") in a candidate output that manifest errors, optionally categorize those errors, assign severity ratings, and aggregate this information for system evaluation and downstream model training.

1. Formal Definition and Protocol Variants

In the modern MT context, Error Span Annotation (ESA) is defined by marking contiguous subsequences [i:j] of the hypothesis T = {t_1, t_2, ..., t_M} such that tokens t_i, ..., t_j are considered jointly erroneous. Each indicated span S = [i:j] is annotated with a categorical error type C and severity σ, where

(S, C, σ)

encodes span indices, category, and severity—typically "Minor" or "Major". Omission errors are annotated via a special [MISSING] token, and optionally a corresponding source-side span may be recorded for alignment (Wasti et al., 23 Jun 2025, Kocmi et al., 2024).
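The (S, C, σ) tuple above can be sketched as a small data structure. This is an illustrative schema, not an official one from the cited tools; field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MISSING = "[MISSING]"  # sentinel token used to mark omission errors

# One ESA annotation record: span indices, optional category, severity,
# and an optional aligned source-side span.
@dataclass
class ErrorSpan:
    span: Tuple[int, int]                          # inclusive token indices [i, j] in the hypothesis
    category: Optional[str]                        # e.g. "Mistranslation"; None in category-free ESA
    severity: str                                  # "Minor" or "Major"
    source_span: Optional[Tuple[int, int]] = None  # optional source-side alignment

ann = ErrorSpan(span=(3, 5), category="Mistranslation", severity="Major")
```

Category-free variants (Section 2) would simply leave `category` as `None` and keep only the severity.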

ESA in earlier domains (e.g., computational chemistry) refers to the aggregation of error components into a physically meaningful uncertainty interval or "error span" that accounts for both systematic and random errors. For instance: Error Span = [c - kσ_tot, c + kσ_tot], with k chosen to reflect the desired confidence level and σ_tot the total combined uncertainty (Simm et al., 2017).
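A minimal sketch of this interval construction, assuming the error components are combined in quadrature (a common convention; the paper's exact aggregation may differ):

```python
import math

def error_span(c, sigma_components, k=2.0):
    """Return the interval [c - k*sigma_tot, c + k*sigma_tot], where
    sigma_tot combines the given error components in quadrature."""
    sigma_tot = math.sqrt(sum(s * s for s in sigma_components))
    return (c - k * sigma_tot, c + k * sigma_tot)

# Example: components 0.3 and 0.4 combine to sigma_tot = 0.5,
# so with k = 2 the span around c = 10.0 is (9.0, 11.0).
lo, hi = error_span(10.0, [0.3, 0.4], k=2.0)
```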

Weighted SVM span-based error assessment extends the "span bound" and "span-rule"—upper bounds and estimators of prediction error—based on geometry in feature space and support vector contributions (Sarafis et al., 2018). These yield computationally efficient alternatives to cross-validation for error estimation and hyperparameter selection.

2. Error Taxonomy and Classification Schemes

ESA frameworks vary in the granularity of error classification. Machine translation protocols inspired by MQM prescribe a compact error taxonomy for each error span (Wasti et al., 23 Jun 2025):

  • Addition: Extra content not present in the source
  • Omission: Missing content from the source
  • Mistranslation: Incorrect rendering of source meaning
  • Untranslated: Source left verbatim in target
  • Grammar: Target-language grammatical violation
  • Spelling: Misspelled word(s)
  • Typography: Punctuation/capitalization/spacing errors
  • Unintelligible: Garbled or nonsensical output

Each span must be assigned a category and severity (Minor/Major). By contrast, minimalist ESA implementations forgo error categories entirely and collect only the severity level, motivated by annotation efficiency and reliability (Kocmi et al., 2024, Zouhar et al., 2024).

3. Annotation Workflows and Human-Computer Interaction

ESA annotation protocols prioritize the minimization of annotator effort and cognitive load. The typical workflow in systems such as TranslationCorrect comprises the following stages (Wasti et al., 23 Jun 2025, Zouhar et al., 2024):

  1. Presentation: Source and hypothesis are displayed in a specialized interface (e.g., "Database View" or Appraise tool).
  2. Suggestion: Automated error detection models (e.g., XCOMET, LLM-based assistants) may pre-fill candidate error spans, colored by category, with tooltips providing rationale and severity proposals.
  3. Correction: Human annotators review, accept, modify, or delete system-suggested spans, and may mark new error spans. Categories and severities are set interactively, sometimes with right-click context menus for rapid edits.
  4. Scoring: Where implemented, an overall sentence-level quality score is assigned via slider, often with anchor guidelines (e.g., 0% "no meaning preserved", 100% "perfect translation").
  5. Export: The protocol exports ESA-formatted records, typically as JSON or CSV, comprising source, hypothesis, corrected text, and all span annotations.
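The export step above might produce a record like the following. This is a hypothetical sketch: the field names and the example sentences are illustrative assumptions, not the exact TranslationCorrect schema:

```python
import json

# Hypothetical ESA export record: source, hypothesis, corrected text,
# a sentence-level score, and the span annotations.
record = {
    "source": "It rains because there are clouds.",
    "hypothesis": "Llueveen porque, hay nubes.",
    "corrected": "Llueve porque hay nubes.",
    "score": 62,  # sentence-level quality from the 0-100% slider
    "spans": [
        {"span": [1, 1], "category": "Spelling", "severity": "Minor"},
        {"span": [2, 2], "category": "Typography", "severity": "Minor"},
    ],
}
line = json.dumps(record)  # one JSON line per annotated segment
```

A CSV export would flatten the same fields, typically with one row per error span.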

Human-computer interaction design emphasizes color-coding for preattentive processing, dark themes to reduce fatigue, and minimization of mouse travel and modal dialogs (Wasti et al., 23 Jun 2025).

AI-assisted ESA variants, notably ESA^AI, employ high-recall automatic Quality Estimation (e.g., GEMBA, a GPT-4-based QE) to pre-fill error spans, yielding approximately a 56% reduction in per-span annotation time (71 s/error span → 31 s/error span) with no significant automation bias detected. Annotator agreement increases in the AI-assisted setting, and further budget savings (≈24%) are achievable by omitting segments with no predicted errors (Zouhar et al., 2024).

4. Evaluation Metrics and Statistical Agreement

ESA evaluation decomposes into two principal dimensions: error span detection and span classification (where applicable). A predicted span is counted as a true positive if it overlaps any gold span (strict-overlap matching). Main metrics (Wasti et al., 23 Jun 2025, Lyu et al., 13 Mar 2026):

  • Span-level Precision, Recall, and F1 under strict overlap:

    Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 · Precision · Recall / (Precision + Recall)

  • Category-wise Metrics: Analogously defined per error category.

For automated Error Span Detection, the SoftF1 metric is favored since it accounts for partial overlaps between predicted and reference spans.
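The strict-overlap counting above can be sketched as follows. This is a minimal illustration; official shared-task scorers handle span matching (and SoftF1 partial credit) more carefully:

```python
def overlaps(a, b):
    """True if inclusive index spans a=(i,j) and b=(k,l) share any token."""
    return a[0] <= b[1] and b[0] <= a[1]

def span_prf(pred, gold):
    """Span-level precision, recall, F1 under strict overlap:
    a predicted span is a true positive if it overlaps any gold span."""
    tp = sum(any(overlaps(p, g) for g in gold) for p in pred)
    precision = tp / len(pred) if pred else 0.0
    covered = sum(any(overlaps(g, p) for p in pred) for g in gold)
    recall = covered / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One predicted span hits a gold span, one misses; one gold span is uncovered.
p, r, f = span_prf(pred=[(0, 2), (10, 12)], gold=[(1, 3), (20, 22)])  # → 0.5, 0.5, 0.5
```

A SoftF1-style variant would replace the binary overlap test with a fractional overlap score, crediting partial matches.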

Empirical studies on MQM, ESA, and DA protocols (WMT23 En→De) report that ESA attains segment ranking correlations (Spearman ρ ≈ 0.987) on par with MQM but at 30% lower annotation time and with twice the ranking reliability of direct assessment (Kocmi et al., 2024). Span overlap between ESA and MQM is typically in the 70–85% range.

5. Applications in Machine Translation and Beyond

ESA is the standard protocol for fine-grained human evaluation in MT when both diagnostic error localization and system-level quality ranking are required (Kocmi et al., 2024). Annotator studies reveal that ESA halves the need for MQM experts, supports deployment by general bilingual speakers, and preserves system orderings across protocols.

The ESA format is machine-readable and directly usable for supervised training of error span detection models or for benchmarking automatic quality estimation systems. Automatic ESD models, such as those trained via Iterative Minimum Bayes Risk (MBR) Distillation, can achieve system-level and span-level performance (e.g., SPA = 0.864, SoftF1 = 0.933) that surpasses supervised baselines using only synthetic pseudo-labels, thus reducing reliance on costly human annotation (Lyu et al., 13 Mar 2026).

In computational chemistry, ESA entails decomposition and summation of all systematic and random error components in a predictive model to establish an uncertainty "error span" for observables, thereby supporting confidence reporting and calibration transfer (Simm et al., 2017).

In machine learning theory, the concept of an "error span" endows the span-rule and span-bound with efficient, geometrically motivated estimators of leave-one-out or generalization error, particularly for weighted SVMs (Sarafis et al., 2018).

6. Limitations, Strengths, and Future Directions

ESA trades the full granularity of MQM-style hierarchical error taxonomies for efficiency, at some cost in diagnostic power: protocols recording only "Minor"/"Major" severity cannot directly distinguish error types without a secondary pass (Kocmi et al., 2024). Segmenting non-space-delimited scripts also requires interface adaptations for character-level labeling.

Automated and semi-automated ESA opens opportunities for cost-efficient, high-recall quality estimation but is subject to the recall/precision tradeoffs of its underlying QE models. Iterative self-evolving distillation methods (e.g., MBR) suggest potential for reducing or eliminating human annotation entirely, though limitations remain in sample diversity and dependence on the base model's initial ESD ability (Lyu et al., 13 Mar 2026).

A plausible implication is that ESA, particularly when AI-assisted, may become the dominant human-in-the-loop evaluation standard across MT shared tasks. For entirely automated pipelines, ESA-style strict-overlap metrics and soft span utilities offer robust, comparable benchmarks for error localization models.

7. Illustrative Examples

Representative ESA-annotated MT segment:

  • Span [1:1], text segment "Todayen": Spelling, Minor; correction "Hoy"
  • Span [7:7], text segment "parce" (fr): Typography, Minor; correction "car"

These compact records focus on explicit span, error type, severity, and a minimal correction, and can be exported in JSON or CSV for training or evaluation purposes (Wasti et al., 23 Jun 2025).

In computational chemistry:

  • Given a computed value c and calibration-derived error components combined into σ_tot, ESA outputs the interval [c - kσ_tot, c + kσ_tot] for predictive reporting (Simm et al., 2017).

In weighted SVM hyperparameter selection:

  • The span-rule uses the geometry-derived span of each support vector to estimate leave-one-out error without retraining, with empirical superiority to k-fold CV in both efficiency and test error prediction (Sarafis et al., 2018).

This structure and protocol standardize error localization and severity scoring across domains, supporting transparent, diagnostic, and reproducible system evaluation and model selection.
