MQM-Annotated Datasets for MT Quality Evaluation
- MQM-annotated datasets are structured resources that evaluate MT outputs using hierarchical error categories and severity levels for granular quality assessment.
- They utilize standardized annotation schemas and rigorous workflows, including quality control and re-annotation, to enhance inter-annotator agreement and benchmark MT systems.
- These datasets support meta-evaluation and development of automatic metrics across diverse domains such as biomedical, news, and emotion translation, driving improvements in MT evaluation.
Multidimensional Quality Metrics (MQM)–annotated datasets are structured resources used for the fine-grained human assessment of machine translation (MT) outputs according to the MQM framework, which decomposes translation quality into hierarchical error categories, each with associated severity levels. These datasets are foundational for benchmarking, meta-evaluation, and the development of both manual and automatic MT evaluation methodologies, capturing not only overall quality but also detailed dimension-specific error distributions across a range of language pairs, domains, and genres.
1. Definition and Rationale
MQM-annotated datasets consist of corpora in which MT outputs are evaluated by expert annotators who mark error spans, assign hierarchical error categories (e.g., Accuracy: Mistranslation, Omission; Fluency: Grammar, Spelling), and label severities according to a scheme standardized in the MQM framework. The rationale for these datasets is to overcome the limitations of scalar or n-gram overlap metrics (e.g., BLEU), providing interpretable and granular error signals that track phenomena such as domain-specific terminology, morphosyntactic accuracy, and style errors, and thereby allowing more robust assessment of MT systems and the metrics that evaluate them (Zouhar et al., 2024, Sai et al., 2022).
2. Dataset Construction Methodologies
Annotation Schema
The core MQM annotation schema, derived from Lommel et al. (2014), consists of:
- Error categories: Structured hierarchically, typically covering Accuracy (Mistranslation, Addition, Omission, Untranslated), Fluency (Grammar, Punctuation, Spelling, Register), Terminology, Style, Locale conventions, among others.
- Severity levels: At minimum Minor and Major, with additional levels such as Neutral or Critical (the latter flagging errors that render a segment unusable or dangerous, e.g., in biomedical translation (Zouhar et al., 2024)).
Customizations to the schema are made for specific tasks or domains—for example, restricting annotation to the Accuracy dimension for emotion translation to focus on semantic preservation (Qian et al., 2023), or omitting rare categories such as Design & Markup in English–Korean evaluation (Park et al., 2024).
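Schematically, a single annotation under such a schema can be represented as a span-level record. The following sketch shows one possible encoding; the class and field names are illustrative and do not correspond to any particular dataset release:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    NEUTRAL = "neutral"
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"  # used in schemas such as the biomedical benchmark

@dataclass
class MQMError:
    """One annotated error span in an MT output segment."""
    dimension: str      # top-level category, e.g. "Accuracy" or "Fluency"
    subcategory: str    # e.g. "Mistranslation", "Omission", "Grammar"
    severity: Severity
    start: int          # character offsets of the span in the MT output
    end: int

# Example: a major omission spanning characters 17-24 of the hypothesis
err = MQMError("Accuracy", "Omission", Severity.MAJOR, 17, 24)
```

A segment-level annotation is then simply a list of such records attached to a source/hypothesis pair.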
Annotation Process
A typical workflow includes:
- Reference Creation: Professionally re-translating or revising references to ensure a clean comparison baseline (as in biomedical MQM (Zouhar et al., 2024)).
- Primary Annotation: Expert annotators mark error spans in context, following MQM guidelines, often working at the document level to leverage discourse context (Zouhar et al., 2024).
- Quality Control and Re-Annotation: Some datasets include a re-annotation phase, where errors in initial annotations—whether produced by humans or automatic systems—are reviewed and revised, which has been shown to increase error detection and agreement (Riley et al., 28 Oct 2025). In re-annotation, annotators treat prior error spans as suggestions and are instructed to modify, delete, or augment as required.
Annotator selection typically requires native or near-native proficiency, domain expertise, and prior calibration (e.g., MQM quizzes, pilot batches (Sai et al., 2022, Zouhar et al., 2024)). Inter-annotator agreement is monitored, with normalization or adjudication procedures (e.g., per-annotator z-normalization (Zouhar et al., 2024)) applied to counteract high variance.
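The per-annotator z-normalization step mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not the exact procedure of any cited dataset:

```python
from collections import defaultdict
from statistics import mean, pstdev

def z_normalize_per_annotator(scores):
    """scores: list of (annotator_id, segment_id, raw_score) triples.
    Returns the same triples with each annotator's scores standardized
    to zero mean and unit variance, counteracting rater-specific scales."""
    by_annotator = defaultdict(list)
    for ann, _seg, s in scores:
        by_annotator[ann].append(s)
    stats = {ann: (mean(vals), pstdev(vals) or 1.0)  # guard zero-variance raters
             for ann, vals in by_annotator.items()}
    return [(ann, seg, (s - stats[ann][0]) / stats[ann][1])
            for ann, seg, s in scores]
```

After normalization, a strict annotator and a lenient one contribute comparably scaled judgments to metric training.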
3. Representative MQM-Annotated Datasets
Biomedical Domain: Large Multilingual Benchmark
- Coverage: 11 language pairs (e.g., Pt→En, De↔En, Es↔En, Ru↔En, Fr↔En, Zh↔En)
- Size: 25,000 segments annotated with full MQM schema, including the “Critical” severity
- Characteristics: Biomedical abstracts from MEDLINE; high domain specificity with emphasis on terminology; severity distribution of 8% Critical, 44% Major, 31% Minor; 66% of errors are Fluency-related; 72% of segments error-free (Zouhar et al., 2024).
- Repository: https://github.com/amazon-science/bio-mqm-dataset
News Domain: Comparative Judgment MQM
- Language pairs: Zh→En, En→De (WMT2023 news)
- Annotations: Triply annotated under three protocols—point-wise MQM, side-by-side MQM (SxS MQM), and side-by-side relative ranking (SxS RR)
- Format: JSON per segment, segment-level system scores, context kept for document-level coherence
- Use: Analysis of agreement and protocol efficiency; SxS approaches significantly increase inter-translation consistency and agreement (Song et al., 25 Feb 2025)
- Repository: https://github.com/google/wmt-mqm-human-evaluation/tree/main/generalMT2023
Domain- and Language-Specific Datasets
- IndicMT-Eval: 7,000 fine-grained MQM annotations for five Indian languages (ta, gu, hi, mr, ml) (Sai et al., 2022); used for meta-evaluating 16 automatic metrics and training Indic-language-specific metrics (e.g., Indic-COMET).
- English–Korean MQM: 1,200 annotated segments, with scores along Accuracy, Fluency, and Style axes separately (Park et al., 2024); baseline MQM-predictor models trained and evaluated on this resource.
- Emotion Translation Evaluation: 5,500 Chinese→English microblog posts, annotated only for Accuracy errors affecting emotion preservation, with severity tailored to semantic shift (Critical, Major, Minor) (Qian et al., 2023).
A summary table of selected key datasets:
| Dataset/Domain | Lang. Pairs / Size | Key Features |
|---|---|---|
| Bio-MQM (Zouhar et al., 2024) | 11 pairs / ~25,000 segs | Domain-specific, Critical severity, z-norm |
| SxS MQM (Song et al., 25 Feb 2025) | 2 pairs / 481 segs, 3×3 annotations | Triply annotated, SxS judgment |
| IndicMT-Eval (Sai et al., 2022) | 5 Indic pairs / 7,000 | Fine-grained annotations, 7 MT systems, CC-0 |
| EN-KO MQM (Park et al., 2024) | 1 pair / 1,200 | Dimension-wise scores, style, fluency |
| C-E Emotion (Qian et al., 2023) | 1 pair / 5,500 | Emotion-specific, only Accuracy |
4. Inter-Annotator Agreement and Reliability
Inter-annotator agreement is a persistent challenge, reflecting the subjectivity and richness of the task. Reported agreement statistics (per dataset and protocol):
- Biomedical MQM: High inter-annotator variance in raw scores; per-annotator z-normalization applied to stabilize metric training (Zouhar et al., 2024).
- Comparative Protocols: Krippendorff’s α: MQM ~0.22–0.23, SxS MQM up to 0.36 (Song et al., 25 Feb 2025). Segment-level pairwise ranking agreement (PRA) and span-level F₁ are used to measure consistency pre/post re-annotation (Riley et al., 28 Oct 2025).
- IndicMT-Eval: Kendall’s τ ≈ 0.52–0.61 between two experts per segment (Sai et al., 2022).
- EN-KO MQM: Kendall’s τ = 0.54 (Accuracy), 0.57 (Fluency), 0.34 (Style) for primary vs. cross-validator (Park et al., 2024).
- Emotion (C-E): Cohen's κ for error existence: 0.669 (inter-annotator), 0.899 (intra-annotator) (Qian et al., 2023).
Re-annotation, side-by-side protocols, and segment-level normalization are documented strategies for increasing reliability and consistency.
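The agreement statistics reported above follow standard formulas. The sketch below implements Cohen's κ and a simple tie-agnostic Kendall's τ-a for two annotators; it is illustrative only, and published numbers may rely on library implementations with tie corrections (e.g., τ-b):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' categorical labels (same length)."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_exp = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
                for l in labels)
    return (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else 1.0

def kendall_tau_a(x, y):
    """Kendall's tau-a over all segment pairs: (concordant - discordant) / C(n,2).
    Tied pairs contribute zero to the numerator; no tie correction applied."""
    n = len(x)
    s = sum(
        ((x[i] > x[j]) - (x[i] < x[j])) * ((y[i] > y[j]) - (y[i] < y[j]))
        for i in range(n) for j in range(i + 1, n)
    )
    return s / (n * (n - 1) / 2)
```

Applied to binary error-existence labels, `cohens_kappa` yields values comparable to the κ figures above; `kendall_tau_a` over segment-level scores corresponds to the rank-agreement figures.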
5. Data Structure, Formats, and Access
MQM-annotated datasets are typically distributed as:
- Primary format: JSON or TSV, encoding for each segment: source text, reference (when available), MT output, annotated error spans (start, end, category, severity), and computed quality score.
- Protocols: Annotation guidelines, error hierarchy, and supporting code are often included for reproducibility (cf. (Song et al., 25 Feb 2025, Sai et al., 2022, Zouhar et al., 2024)).
- Licensing: Ranges from CC-0 (IndicMT-Eval), public research (Bio-MQM, C-E Emotion), to open-source repository terms (EN-KO MQM, SxS MQM).
- Metadata: Additional CSVs or indices may summarize segment statistics (length, token count, mean score).
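As an illustration of this kind of JSON encoding, the sketch below parses a minimal segment record and derives a weighted quality penalty. The field names and severity weights (Minor = 1, Major = 5, Critical = 25) follow common MQM conventions but vary across releases (minor punctuation errors, for instance, are often down-weighted):

```python
import json

# Illustrative severity weights; actual weights differ between MQM variants.
WEIGHTS = {"neutral": 0.0, "minor": 1.0, "major": 5.0, "critical": 25.0}

record = json.loads("""
{
  "source": "...",
  "reference": "...",
  "hypothesis": "...",
  "errors": [
    {"start": 3,  "end": 9,  "category": "Accuracy/Mistranslation", "severity": "major"},
    {"start": 14, "end": 15, "category": "Fluency/Punctuation",     "severity": "minor"}
  ]
}
""")

def mqm_penalty(record):
    """Sum severity weights over all annotated error spans (lower is better)."""
    return sum(WEIGHTS[e["severity"]] for e in record["errors"])
```

Here `mqm_penalty(record)` yields 6.0 (one major plus one minor error); segment scores in released datasets are typically derived from such per-span weights, sometimes normalized by segment length.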
6. Application to MT Metric Development and Evaluation
MQM-annotated benchmarks are crucial for:
- Direct meta-evaluation: Correlating automatic metric scores (BLEU, ChrF, BERTScore, COMET, BLEURT, Prism, etc.) with human MQM references at segment and system levels; correlation measures include Pearson’s r, Spearman’s ρ, and Kendall’s τ.
- Analysis under domain shift: Studies with biomedical MQM demonstrate that fine-tuned neural metrics (COMET, BLEURT) substantially degrade when evaluated on out-of-domain data, unlike surface-form or pre-trained metrics, highlighting the importance of domain-specific MQM corpora (Zouhar et al., 2024).
- Training/fine-tuning: Small amounts of in-domain MQM data (~1,000–6,000 judgments) can recover much lost performance for fine-tuned metrics (Zouhar et al., 2024).
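Segment-level meta-evaluation of an automatic metric against human MQM can be sketched as follows, with hypothetical scores and Pearson's r implemented directly for self-containment (MQM penalties are negated so that higher is better for both series):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: four segments scored by an automatic metric
# and annotated with human MQM penalties.
metric_scores = [0.81, 0.62, 0.93, 0.40]
mqm_penalties = [2.0, 6.0, 0.0, 11.0]
r = pearson_r(metric_scores, [-p for p in mqm_penalties])
```

The same pattern applies at the system level (correlating per-system averages) and with rank correlations (Spearman's ρ, Kendall's τ) in place of Pearson's r.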
Illustrative findings:
- Metrics show the highest correlation with human MQM on accuracy errors; fluency and style remain challenging, especially in morphologically rich or low-resource languages (Sai et al., 2022, Park et al., 2024).
- Comparative and collaborative protocols (side-by-side and re-annotation) enhance error detection and improve agreement, making them attractive for high-stakes and ambiguous MT settings (Song et al., 25 Feb 2025, Riley et al., 28 Oct 2025).
7. Limitations and Future Directions
Limitations include:
- Domain and language scope: Most resources lack coverage for low-resource languages, informal genres, or domain-specific text (e.g., social media, legal, literary). This constrains external validity and metric generalization (Sai et al., 2022, Qian et al., 2023).
- Annotation cost: Expert MQM annotation is resource-intensive; thus, most datasets remain limited in size relative to automatic corpora.
- Reference quality: The quality of reference translations can bias annotation outcomes (Sai et al., 2022).
Future research directions suggested in the literature:
- Expanding MQM resources to new language pairs and domains (notably low-resource, morphologically rich, and stylistically diverse settings) (Sai et al., 2022, Park et al., 2024).
- Leveraging collaborative annotation, re-annotation, and hybrid human–automatic pipelines to scale high-quality MQM data (Riley et al., 28 Oct 2025).
- Incorporating dimension-weighted or task-specific adaptations of MQM for specialized evaluation (emotion, sentiment, safety-critical domains) (Qian et al., 2023, Park et al., 2024).
- Exploring cross-lingual and transfer learning for MQM predictor models to reduce the annotation burden through knowledge sharing across languages (Park et al., 2024).
MQM-annotated datasets thus serve as robust, extensible resources for multidimensional MT evaluation, supporting rigorous human benchmarking as well as the development of next-generation automatic evaluation methodologies optimized for error awareness, domain robustness, and granular interpretability.