WMT25 Translation Evaluation Shared Task

Updated 16 April 2026

WMT25 is a centralized benchmark that systematically tests multilingual MT system outputs, quality estimation, and automated evaluation metrics.
It integrates robust data preprocessing, parameter-efficient model fine-tuning, and diverse evaluation strategies, including human-in-the-loop assessments, to address both high- and low-resource language challenges.
Outcomes from WMT25 highlight significant performance gains from synthetic data generation and quality-aware decoding, offering actionable insights for advancing MT systems.

The WMT25 Translation Evaluation Shared Task is a centralized benchmarking initiative within the Conference on Machine Translation (WMT), designed to drive improvements in both machine translation (MT) systems and their automated evaluation. It serves as an annual testbed for multilingual MT, quality estimation, and evaluation methods, covering low-resource, high-resource, and novel language directions. At WMT25, the evaluation shared task addressed both the ranking of MT system outputs and the advancement of automatic evaluation metrics, notably incorporating human-in-the-loop protocols and LLM-based assessment.

1. Task Structure and Participation

The WMT25 shared task was characterized by unprecedented scale and breadth, with 43 teams submitting systems (36 remaining after withdrawals/disqualifications), spanning both constrained (e.g., ≤20B parameters, public data only) and unconstrained categories (no model size/data restrictions) (Kocmi et al., 11 Aug 2025). Participants included academic groups, commercial entities, and large foundation model providers. The task comprised two primary foci:

General MT Evaluation: 32 language pairs, including high-resource (e.g., English↔Czech, English↔Chinese) and low-resource directions (e.g., English→Bhojpuri, English→Maasai), with test domains such as news, social media, speeches, and literary texts.
Translation Quality Estimation and Error Span Detection: Automatic and reference-based system-level and segment-level evaluation; dedicated subtracks for the prediction of error spans according to MQM or ESA annotation schemes (Juraska et al., 28 Oct 2025, Haq et al., 17 Sep 2025).

Systems had to translate ~37,000 words per test direction, organized as 100-word paragraphs grouped into documents. Output collection adopted a document-first approach, falling back to paragraphs if necessary.

2. Data Resources and Preprocessing Pipelines

Both the development of MT systems and the construction of evaluation pipelines leveraged carefully curated multilingual corpora:

Parallel Data: For high-resource languages, large-scale parallel corpora were available (e.g., OPUS, ParaCrawl, CCMatrix, WikiMatrix, EU-bookshop, MultiUN; up to billions of sentence pairs) (Gilabert et al., 18 Aug 2025). For low-resource settings (Slavic-language and regional tracks), parallel data were scarce, necessitating aggressive data filtering, similarity-based sentence retrieval, and generation of synthetic parallel data via back-translation or LLM-based translation (Saadi et al., 26 Sep 2025).
Monolingual and Synthetic Data: To augment data for scarce language pairs (e.g., Upper Sorbian, Lower Sorbian, and Ukrainian), monolingual corpora and synthetic parallel data generated by LLM translation were crucial. Synthetic pairs were produced by translating monolingual sentences using parameter-efficiently fine-tuned LLMs (e.g., LoRA-adapted Qwen2.5-3B) (Saadi et al., 26 Sep 2025). Quality and domain adaptation were maintained via retrieval-augmented filtering and in-domain finetuning.
Preprocessing: Corpus construction incorporated extensive cleaning—language detection, deduplication, normalization (e.g., punctuation, Unicode), and tokenization. Sentence embeddings (e.g., mean-pooled Qwen2.5-3B hidden states) facilitated similarity retrieval-based data curation, especially for low-resource development sets (Saadi et al., 26 Sep 2025).
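
The cleaning and similarity-retrieval steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline of any participating team: the function names are hypothetical, the token vectors stand in for LLM hidden states, and only Unicode normalization, exact deduplication, mean pooling, and cosine-similarity retrieval are shown.

```python
import math
import unicodedata


def normalize(text):
    """NFKC-normalize Unicode and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def deduplicate(sentences):
    """Drop case-insensitive exact duplicates, keeping first occurrences."""
    seen, kept = set(), []
    for sentence in sentences:
        cleaned = normalize(sentence)
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept


def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence embedding."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, candidates, k=2):
    """Return the ids of the k candidates most similar to the query.

    `candidates` maps a (hypothetical) sentence id to its embedding.
    """
    ranked = sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )
    return ranked[:k]
```

In practice the query embedding would come from a development-set sentence and the candidates from a large monolingual or parallel pool; an approximate nearest-neighbour index would replace the full sort at that scale.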

The preprocessing and data curation pipeline enabled robust system development even for challenging language pairs with minimal native resources.

3. Evaluation Metrics, Aggregation, and Human Judgement

Evaluation methodology at WMT25 prioritized metric diversity and robust statistical treatment, employing both reference-based and referenceless paradigms:

Reference-based Metrics:
- BLEU: Standard n-gram overlap metric with brevity penalty,
$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^N w_n \log p_n\right)$

where $BP$ is the brevity penalty, $p_n$ is the n-gram precision, and $w_n$ are weights, typically uniform ($w_n = 1/N$) (Saadi et al., 26 Sep 2025, Kocmi et al., 11 Aug 2025).
- chrF/chrF++: Generalized F-score over character n-grams (augmented with word n-grams for chrF++),

$\text{chrF} = \frac{(1+\beta^2) \cdot \text{Precision}_{\text{char}} \cdot \text{Recall}_{\text{char}}}{\beta^2 \cdot \text{Precision}_{\text{char}} + \text{Recall}_{\text{char}}}$

with β configurable, typically β=2, which weights recall more heavily than precision (Saadi et al., 26 Sep 2025, Kocmi et al., 11 Aug 2025).
- COMET and Derivatives: Supervised reference-based regression models trained to predict human adequacy scores; COMETKiwi variants support reference-free quality estimation (Juraska et al., 28 Oct 2025, Gilabert et al., 18 Aug 2025, Kocmi et al., 11 Aug 2025).
LLM-based Judgement:
- GEMBA-ESA: LLM-as-judge scoring pipeline prompting GPT-4.1 and CommandA to rate adequacy and fluency in a reference-free mode, resulting in a 0–100 scale judgement (Kocmi et al., 11 Aug 2025).
Metric Aggregation:
- Disparate scales managed via median–interpercentile scaling and cross-metric averaging:
$z^\text{(m)}_s = \frac{x^\text{(m)}_s - \mu_m}{D_m}; \quad \bar{z}_s = \frac{1}{|M|}\sum_m z^\text{(m)}_s$

where $x^\text{(m)}_s$ is system $s$'s average score under metric $m$, $\mu_m$ is the median of that metric, $D_m$ is the interquartile range (IQR), and $M$ is the set of metrics (Kocmi et al., 11 Aug 2025). Final ranks are linearly mapped such that best = 1, worst = N.
Human-in-the-loop Evaluation:
- The final official ranking is determined by human judges (error-span annotation) rather than automatic scoring, due to the known limitations and biases of metric optimization, re-ranking, and metric overfitting (Kocmi et al., 11 Aug 2025).
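
To make the chrF formula above concrete, the following is a simplified sketch. It assumes character n-grams of orders 1 to 6 with whitespace removed, and precision and recall averaged over the orders before the F-score is taken. It is not the reference implementation, and the function names are illustrative. Reported scores should come from an established tool such as sacreBLEU.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def f_beta(precision, recall, beta=2.0):
    """F-score with recall weighted beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF on a 0-100 scale (assumption: orders averaged uniformly)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    return 100 * f_beta(precision, recall, beta)
```

With β=2, a hypothesis that covers the whole reference but adds extra material is penalized less than one that is precise but incomplete, which is the intended recall bias.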

This diverse and statistically grounded evaluation protocol ensures resilience to metric-specific overfitting and enhances cross-system comparability.
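
The median and IQR scaling with cross-metric averaging given earlier in this section can be sketched as follows. The sketch assumes that every metric is oriented so that higher is better and that quartiles are computed with the inclusive method. The function names and the handling of a zero IQR are illustrative choices, not details taken from the findings paper.

```python
import statistics


def robust_scale(scores):
    """Scale one metric's per-system averages by median and IQR."""
    values = list(scores.values())
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Assumption: a metric that cannot separate systems contributes zero.
        return {system: 0.0 for system in scores}
    return {system: (x - median) / iqr for system, x in scores.items()}


def aggregate(metric_scores):
    """Average the scaled scores across metrics for each system."""
    scaled = [robust_scale(per_system) for per_system in metric_scores.values()]
    systems = scaled[0].keys()
    return {s: sum(z[s] for z in scaled) / len(scaled) for s in systems}


def linear_rank(averages):
    """Map averaged scores linearly so that best = 1 and worst = N."""
    n = len(averages)
    best, worst = max(averages.values()), min(averages.values())
    if best == worst:
        return {s: 1.0 for s in averages}
    return {s: 1 + (n - 1) * (best - z) / (best - worst) for s, z in averages.items()}
```

Because the scaling uses the median and IQR rather than the mean and standard deviation, a single outlier system shifts the scaled scores of the others far less.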

4. System Architectures and Training Paradigms

WMT25 submissions encompassed a spectrum from traditional NMT to advanced instruction-tuned LLMs, with a strong emphasis on:

Foundation Model Adaptation: Systems such as SALAMANDRATA (2B/7B), In2x (LLaMA-3/70B style), Gemma 3 (12B/27B) were adapted via continual pretraining (CPT) on parallel data followed by supervised instruction tuning (Gilabert et al., 18 Aug 2025, Pang et al., 20 Aug 2025, Juraska et al., 28 Oct 2025).
Parameter-efficient Fine-tuning: Widespread use of LoRA (Low-Rank Adaptation) enabled efficient training on large LLM backbones even in resource-constrained environments, by parameterizing updates with $\Delta W = AB,\, A \in \mathbb{R}^{d\times r},\, B \in \mathbb{R}^{r\times k},\, r\ll\min(d,k)$ (Saadi et al., 26 Sep 2025).
Joint Task Learning: Several systems, such as JGU Mainz’s Qwen2.5-3B-Instruct adaptation, performed joint multi-task training (e.g., MT + QA), leveraging a shared LLM backbone (Saadi et al., 26 Sep 2025).
Curriculum and Synthetic Data: Instructional datasets were constructed via clustering, labeling, creative rewriting, and chain-of-thought enhancement. Synthetic data generation pipelines, such as Magpie/Self-Instruct with critic LLM validation, bootstrapped resources for low-resource languages (Pang et al., 20 Aug 2025).
Reinforcement Learning (RL): Fine-tuning objectives included custom RL rewards, such as rule-based or generative principle-satisfaction rewards (e.g., r_rule, r_gen), optimized via GRPO (Group Relative Policy Optimization) (Pang et al., 20 Aug 2025).
Quality-aware Decoding: Minimum Bayes Risk (MBR) decoding and re-ranking with QE models (e.g., COMETKiwi) were prominent strategies for output selection, offering systematic gains in COMET and other human-correlated metrics (Gilabert et al., 18 Aug 2025).
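
The MBR selection step mentioned above can be illustrated with a small sketch. Each candidate is scored by its average utility against the other candidates, which serve as pseudo-references, and the highest-scoring one is returned. The token-level F1 utility here is only a stand-in assumption to keep the example self-contained. Submissions would typically use a learned metric such as COMET, and the function names are hypothetical.

```python
from collections import Counter


def token_f1(hypothesis, reference):
    """Whitespace-token F1, used here as a stand-in utility function."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Pick the candidate with the highest expected utility under MBR.

    Assumes the candidates are (approximately) samples from the model,
    so a uniform average over the other candidates estimates the
    expected utility.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = None, float("-inf")
    for i, hypothesis in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hypothesis, other) for other in others) / len(others)
        if score > best_score:
            best, best_score = hypothesis, score
    return best
```

The cost is quadratic in the number of candidates, so with a neural utility the candidate pool is usually kept small or pruned first.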

Through these methodologies, system performance improved markedly in both challenging low-resource and general settings.
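
As a rough illustration of why the low-rank update $\Delta W = AB$ is parameter-efficient, the sketch below compares the number of trainable parameters in a full update with that of the two factors, and applies the update to a tiny weight matrix. The `scale` argument and the example dimensions are assumptions for illustration. Practical LoRA implementations apply a scaling factor derived from the rank and a hyperparameter.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_counts(d, k, r):
    """Return (full update parameters, LoRA parameters) for a d x k weight."""
    return d * k, r * (d + k)


def apply_lora(weight, a, b, scale=1.0):
    """Return W + scale * (A @ B) with A of shape d x r and B of shape r x k."""
    delta = matmul(a, b)
    return [
        [w + scale * dw for w, dw in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

For a 4096 x 4096 projection with rank 8, `lora_param_counts(4096, 4096, 8)` gives 16,777,216 parameters for a full update against 65,536 for the two factors, a reduction by a factor of 256.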

5. Results, Ablations, and Analytical Insights

WMT25 produced extensive quantitative insights and methodological lessons:

Performance Gains:
- Parameter-efficient fine-tuning (LoRA) on small LLMs, coupled with synthetic/back-translated data, yielded absolute chrF++ gains exceeding 50 points for low-resource Slavic targets (Saadi et al., 26 Sep 2025).
- Instruction-tuning and quality-aware decoding (MBR, TRR) delivered up to +3.4 COMET points over CPT alone and a further +1.9 with MBR in the SALAMANDRATA pipeline (Gilabert et al., 18 Aug 2025).
- BLEU, chrF, and COMET improvements versus proprietary LLMs were observed in Japanese-related directions for In2x (e.g., BLEU en→ja: 35.2 vs. GPT-4.1’s 33.8; COMET en→ja: 0.72 vs. Gemini 0.68) (Pang et al., 20 Aug 2025).
Metric and Evaluation Analysis:
- Automatic metric aggregation reliably ranked systems but demonstrably favored those employing re-ranking and minimum-risk strategies, highlighting the importance of human evaluation for final system selection (Kocmi et al., 11 Aug 2025).
- Context-rich (long-span) training and evaluation (concatenated segments, weighted score aggregation) led to substantially higher Pearson correlations with human judgment across DA, SQM, and MQM labels, especially for MQM (e.g., +0.328 Pearson gain for COMET-22-LS vs. COMET-22) (Haq et al., 17 Sep 2025).
- Character-level F1 for generative error-span detection models approached or exceeded strong encoder-based baselines (e.g., GemSpanEval vs. XCOMET-XXL) in several language pairs (Juraska et al., 28 Oct 2025).
Ablations and Error Analyses:
- Adding in-domain QA fine-tuning can erode MT performance if not tuned with care (chrF++ drop of 0.9 on DSB, Lower Sorbian, vs. gain of 0.1 on HSB, Upper Sorbian) (Saadi et al., 26 Sep 2025).
- For zero-shot transfer to Bhojpuri, including Hindi data in CPT was found to be critical (BLEU: 9.32 with EN–HI data, 0.35 without) (Gilabert et al., 18 Aug 2025).
- LLM-based metric overfitting, paragraph-level scoring, and reference quality remain active challenges for metric reliability (Kocmi et al., 11 Aug 2025).
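
The character-level F1 used to compare error-span detectors can be sketched as below. This is a simplified illustration that treats each span as a half-open `(start, end)` character range and ignores error severity. The official scoring may differ in such details, and the convention of returning 1.0 when both predictions and gold annotations are empty is an assumption.

```python
def span_chars(spans):
    """Expand half-open (start, end) spans into a set of character offsets."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_f1(predicted_spans, gold_spans):
    """Character-level F1 between predicted and gold error spans."""
    predicted, gold = span_chars(predicted_spans), span_chars(gold_spans)
    if not predicted and not gold:
        return 1.0  # assumption: agreeing that there are no errors is a perfect match
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Counting characters rather than whole spans gives partial credit when a predicted span overlaps a gold span without matching its boundaries exactly.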

These analyses informed both methodological refinements and best practices.

6. Best Practices and Recommendations for Future WMT Evaluation

Key lessons and forward-looking guidance synthesized from WMT25 include:

Leverage parameter-efficient adaptation (e.g., LoRA) and synthetic data generation (10–100k examples) as first-order strategies for low-resource MT (Saadi et al., 26 Sep 2025).
Apply similarity-based retrieval over large monolingual and parallel corpora to construct high-yield, domain-tailored training sets (Saadi et al., 26 Sep 2025).
Utilize hybrid metric pipelines, combining reference-based, reference-less, and LLM-as-judge scores via robust scaling and cross-metric aggregation (Kocmi et al., 11 Aug 2025).
Conduct joint fine-tuning for multi-task LLMs (MT, QA, etc.), but iteratively tune in-domain adaptation rates to preserve task-specific performance (Saadi et al., 26 Sep 2025).
For evaluation, always report metric configurations, including the sacreBLEU tokenization settings used for BLEU and chrF, and document the use of re-ranking or MBR decoding (Kocmi et al., 11 Aug 2025, Gilabert et al., 18 Aug 2025).
Emphasize human-evaluation results (e.g., error-span annotation) in system papers, as automatic metrics may not reflect true translation quality (Kocmi et al., 11 Aug 2025).

Collectively, these lessons from WMT25 point to a multi-level approach to MT system development and validation that combines methodological rigor with practical adaptability across resource settings.