WMT2025 Shared Task Overview

Updated 8 September 2025
  • WMT2025 Shared Task is a comprehensive evaluation campaign that benchmarks state-of-the-art MT systems using multilingual data, advanced neural architectures, and innovative evaluation metrics.
  • The task employs techniques like transformer-based deep architectures, bidirectional training, and data diversification to push the boundaries of translation quality and system stability.
  • Submissions demonstrated high BLEU and COMET scores through integrated LLM-based learning and post-decoding methods, while also highlighting challenges in metric bias and contextual evaluation.

The WMT2025 Shared Task refers to the annual General Machine Translation (MT) evaluation campaign organized as part of the Conference on Machine Translation (WMT), focusing on benchmarking and advancing state-of-the-art MT systems across multiple language pairs, especially emphasizing multilingual and low-resource directions. The shared task involves submitting MT systems for evaluation—both through automatic metrics and human judgment—with an increasing emphasis on sophisticated training, adaptation, decoding, and evaluation techniques. The 2025 edition drew participation from a range of academic labs and industrial AI groups, with systems reflecting the integration of Transformer architectures, LLM-based translation, advanced adaptation methods, and post-decoding strategies.

1. Advances in Model Architectures and Training Strategies

Recent WMT shared tasks have demonstrated the convergence of robust neural machine translation (NMT) architectures and LLMs for enhanced translation. For example, HW-TSC’s submission for WMT24 (Wu et al., 23 Sep 2024) employed an exceptionally deep Transformer-big architecture (25-layer encoder, 6-layer decoder, self-attention with 16 heads, embedding dimension 1024, FFN dimension 4096, pre-layer normalization) as the backbone for their NMT system. This architecture was used in conjunction with a comprehensive set of training strategies:

  • Regularized Dropout (R-Drop): Introduced to minimize the divergence between multiple dropout realizations, using the symmetric KL term $KL(p_1 \Vert p_2) + KL(p_2 \Vert p_1)$ to enforce output distribution consistency (see the sketch after this list).
  • Bidirectional Training (BiT): Training on parallel data with both src→tgt and tgt→src directions enhances generalization and knowledge transfer.
  • Data Diversification (DD): Augmentation via synthetic data from both forward and backward teacher models increases training variability.
  • Forward and Back Translation (FT, BT): Synthetic parallel data using both source-side and target-side monolingual data, with tagged sources for diversity.
  • Alternated Training (AT): Cycles between synthetic and authentic data for stabilization.
  • Curriculum Learning (CL): Samples are introduced by difficulty, computed as $q(x, y) = [\log P(y|x; \theta_{in}) - \log P(y|x; \theta_{out})]/|y|$; harder examples are gradually introduced.
  • Transductive Ensemble Learning (TEL): Aggregates outputs from multiple models using available test sources.
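
As a concrete illustration of the R-Drop term above, here is a minimal PyTorch sketch; the two-forward-pass setup and the `alpha` weight are illustrative assumptions, not details taken from the HW-TSC system description.

```python
import torch.nn.functional as F

def r_drop_loss(logits1, logits2, alpha=1.0):
    """Symmetric KL consistency term between two dropout realizations.

    logits1 and logits2 are two forward passes of the same batch through
    the same model with dropout active; alpha is an assumed loss weight.
    """
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so the two calls below give KL(p1 || p2) + KL(p2 || p1).
    kl = F.kl_div(log_p2, log_p1, reduction="batchmean", log_target=True) \
       + F.kl_div(log_p1, log_p2, reduction="batchmean", log_target=True)
    return alpha * kl
```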

Parallel work on the LLM side typically includes continued pre-training (CPT) on relevant monolingual data, supervised fine-tuning (SFT) with high-quality parallel data selected by metrics like COMETKIWI, and contrastive preference optimization (CPO).
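
A minimal sketch of this kind of QE-based data selection, assuming the Unbabel `comet` package and its `Unbabel/wmt22-cometkiwi-da` checkpoint; the 50% keep fraction is an illustrative choice rather than a figure reported by any submission.

```python
from comet import download_model, load_from_checkpoint

def select_sft_pairs(pairs, keep_fraction=0.5):
    """Rank (source, translation) pairs with a reference-free QE model
    and keep the top-scoring fraction for supervised fine-tuning."""
    ckpt = download_model("Unbabel/wmt22-cometkiwi-da")  # gated model; requires HF access
    model = load_from_checkpoint(ckpt)
    data = [{"src": src, "mt": mt} for src, mt in pairs]
    scores = model.predict(data, batch_size=32, gpus=1).scores
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    cutoff = int(len(ranked) * keep_fraction)
    return [pair for pair, _ in ranked[:cutoff]]
```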

2. Multilingual and Low-Resource Expansion

The WMT25 shared task expanded coverage to new language pairs, including non-European and low-resource cases such as Chinese, Korean, Japanese, Arabic, and Bhojpuri. The BSC team, for example, introduced the SALAMANDRATA models (Gilabert et al., 18 Aug 2025), trained from scratch for multilingual translation across 38 European languages, and subsequently adapted for broader coverage:

  • Tokenizer and Vocabulary Adaptation: The original tokenizer was retrained on a mixture of European and new-language data, with embeddings for new tokens initialized as averages of the original embedding space (see the sketch after this list).
  • Continual Pre-training (CPT): Two stages (CPT-V1, CPT-V2) were used for massive parallel data coverage and careful balancing to avoid catastrophic forgetting.
  • Instruction Tuning: Supervised fine-tuning datasets (IT-V1, IT-V2) targeted more complex, context-aware translation objectives.
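
The embedding-initialization step in the first item can be sketched as follows; this is a generic version of the mean-initialization heuristic, not the BSC team's exact code.

```python
import torch

def extend_embedding_matrix(emb: torch.Tensor, num_new_tokens: int) -> torch.Tensor:
    """Append rows for newly added vocabulary items, each initialized to
    the mean of the existing embedding matrix."""
    mean_vector = emb.mean(dim=0, keepdim=True)                 # shape (1, d)
    new_rows = mean_vector.expand(num_new_tokens, -1).clone()   # shape (n_new, d)
    return torch.cat([emb, new_rows], dim=0)
```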

This process enables scalable, robust translation in low-resource settings, a direction also pursued by the In2x team focusing on Japanese (Pang et al., 20 Aug 2025), with a pipeline emphasizing creative and culturally faithful data construction, reward models tailored to task type (rule-based or generative), and a two-stage supervised and RL-based fine-tuning protocol.

3. Decoding and Post-Processing Strategies

Evaluation and output selection strategies play a major role in achieving competitive results under automatic ranking schemes:

  • Minimum Bayes Risk (MBR) Decoding: Used by HW-TSC (Wu et al., 23 Sep 2024), BSC (Gilabert et al., 18 Aug 2025), and others, MBR selects the translation from an N-best list that minimizes expected loss according to metrics like COMET. In practice, the candidate $y^*$ is chosen as $y^* = \operatorname{argmin}_{y \in N\text{-best}} E_{y'}[d(y, y')]$, where $d(\cdot, \cdot)$ is a distance measure (see the sketch after this list).
  • Quality Estimation (QE) Re-ranking: Employing reference-free metrics such as COMET-KIWI for ranking translation candidates, as described in the preliminary ranking report (Kocmi et al., 11 Aug 2025).
  • Tuned Re-ranking (TRR): BSC used TRR via COMET-KIWI to further optimize output selection after MBR.
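
A metric-agnostic sketch of MBR selection over an N-best list, following the standard recipe in which each candidate also serves as a pseudo-reference; `utility` stands in for a sentence-level metric such as COMET and is a placeholder, not an API from any specific toolkit.

```python
def mbr_select(candidates, utility):
    """Return the candidate with the highest expected utility against all
    other candidates (equivalently, the lowest expected distance)."""
    best, best_score = None, float("-inf")
    for hyp in candidates:
        pseudo_refs = [ref for ref in candidates if ref is not hyp]
        expected = sum(utility(hyp, ref) for ref in pseudo_refs) / max(len(pseudo_refs), 1)
        if expected > best_score:
            best, best_score = hyp, expected
    return best

# Usage with a hypothetical sentence-level scorer:
#   best = mbr_select(nbest, my_sentence_metric)
```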

As supported by preliminary rankings, these post-decoding strategies significantly affect performance as judged by automatic metrics but may lead to discrepancies compared to human judgments.

| Decoding Strategy | Metric Used | Role in WMT25 Submissions |
|---|---|---|
| MBR Decoding | COMET, COMET-22 | Final selection from N-best |
| QE Re-ranking | COMET-KIWI | Reference-free candidate tuning |
| TRR | COMET-KIWI | Tuned final output selection |

4. Evaluation Frameworks and Metric Bias

Automatic metrics have become central in system ranking, with ensemble approaches combining reference-free and learned scores. The WMT25 preliminary report (Kocmi et al., 11 Aug 2025) notes use of metrics such as GEMBA-ESA (GPT-4.1 and CommandA variants), MetricX-24-Hybrid-XL, XCOMET-XL, and surface-level chrF++ for very low-resource directions. Segment-level scores are normalized (median/inter-percentile scaling) before being combined into the AutoRank.
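
A hedged sketch of that per-segment scaling; the exact percentiles and the subsequent aggregation into AutoRank are assumptions here, since the report's precise formula is not reproduced above.

```python
import statistics

def median_ipr_normalize(scores, low_pct=0.25, high_pct=0.75):
    """Center metric scores on the median and scale by an inter-percentile
    range so that scores from different metrics become comparable."""
    ordered = sorted(scores)
    median = statistics.median(ordered)
    low = ordered[int(low_pct * (len(ordered) - 1))]
    high = ordered[int(high_pct * (len(ordered) - 1))]
    spread = (high - low) or 1.0
    return [(s - median) / spread for s in scores]
```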

A critical issue is metric bias—systems heavily tuned for automatic metrics via MBR or QE re-ranking can inflate their standings in automatic rankings, potentially diverging from ultimate human evaluations. Human protocols (e.g., Error Span Annotation) are highlighted as more reliable and nuanced, capturing adequacy, fluency, and context integrity—areas where automatic metrics can be limited.

5. Performance Results and System Comparisons

Submissions reported strong automatic metric results, especially when post-processing strategies were applied. HW-TSC (Wu et al., 23 Sep 2024) attained BLEU 59.34 and COMET 0.6928 with NMT alone; MBR over the hypotheses raised COMET to ≈0.7178 at a slight BLEU cost (≈58.88), and a combined NMT+LLM ensemble reached BLEU 56.41 and COMET 0.7234. BSC's SALAMANDRATA-7B saw up to 4 COMET points of improvement after instruction tuning, with post-decoding further closing gaps introduced by new-language adaptation.

In2x's Japanese-centric approach (Pang et al., 20 Aug 2025) demonstrated that, by balancing instructional data and deploying sophisticated RL reward design, the model rivaled mainstream-language performance, placing second in the en–ja direction and first in the unrestricted category.

6. Challenges, Future Directions, and Implications

Key challenges identified across submissions include balancing metric excellence with true translation quality (human-adaptive systems), mitigating catastrophic forgetting in vocabulary adaptation, and engineering robust RL pipelines for stylistically diverse outputs. Suggestions from preliminary evaluation (Kocmi et al., 11 Aug 2025) emphasize:

  • Refining automatic metrics for closer alignment with human judgment and less vulnerability to metric gaming.
  • Developing document-level evaluation frameworks to judge contextual consistency.
  • Improving robustness in evaluation for low-resource pairs.
  • Understanding and controlling bias introduced by re-ranking.
  • Systematic integration of human and automatic evaluation for future task design.

The open release of SALAMANDRATA models (both 2B and 7B variants) on Hugging Face is highlighted as enhancing reproducibility and future experimental scope.

7. Significance and Outlook

The WMT2025 Shared Task continues to serve as a focal point for the evaluation and advancement of general machine translation technology, with system architectures, training procedures, adaptation strategies, and decoding methods pushing the boundaries for multilingual and culturally sensitive translation. The growing reliance on LLMs, advanced selection strategies, and the interplay between automatic and human evaluation sets the stage for subsequent research in high-quality, robust, and contextually aware MT. The community’s increased openness to releasing models further democratizes experimentation and fosters rapid improvement in both academic and industrial translation systems.