- The paper introduces a prompting-based system that leverages LLM reasoning to provide detailed, explainable multi-dimensional evaluations of translations.
- It proposes several evaluation modes (single-step, two-step, and three-step interleaved) to reduce position bias and improve ranking accuracy.
- Experiments show strong alignment with human annotations, achieving robust score correlations that validate its transparent assessment approach.
TransEvalnia: Reasoning-based Evaluation and Ranking of Translations
TransEvalnia is a prompting-based system for machine translation (MT) evaluation and ranking that leverages the reasoning capabilities of LLMs to provide fine-grained, explainable assessments. The system is designed to address the limitations of traditional MT metrics, which typically yield only scalar scores without interpretability or actionable feedback, by producing detailed, multi-dimensional evaluations grounded in the Multidimensional Quality Metrics (MQM) framework.
Motivation and Context
As LLM-based translation systems approach or surpass human-level performance on certain tasks, the need for robust, nuanced evaluation methods becomes critical. Conventional metrics such as BLEU, while computationally efficient, are increasingly inadequate for distinguishing high-quality outputs or providing diagnostic feedback. Recent learned evaluators and rankers (e.g., COMET, MetricX, MT-Ranker) have improved correlation with human judgments, but most still lack transparent reasoning or fine-grained error analysis.
TransEvalnia is positioned to fill this gap by:
- Generating span-level, multi-dimensional evaluations (e.g., accuracy, terminology, linguistic conventions, audience appropriateness, hallucinations, missing content).
- Providing explicit rationales for each assessment, enabling users to audit and interpret the evaluation process.
- Supporting both ranking and scoring of candidate translations, with outputs that can be directly compared to human MQM-style annotations.
System Design and Methodology
TransEvalnia operates in several configurable modes, each designed to balance evaluation quality and robustness to position bias:
- Single-step Evaluation and Ranking: The LLM is prompted with the source text and all candidate translations, and asked to both evaluate and rank them in a single pass. This approach is efficient but susceptible to position bias, as the order of translations can influence the model's preferences.
- Two-step Evaluation and Ranking: Each translation is evaluated independently, and the resulting evaluations are then presented to the LLM for ranking. This decouples evaluation from ranking and reduces position bias (a minimal sketch of this flow follows the list).
- Three-step Interleaved Evaluation: Evaluations for each translation are interleaved at the span and dimension level before ranking, further mitigating position bias by distributing translation components throughout the input.
- No-reasoning Baseline: The LLM is prompted to rank translations without providing explicit reasoning. While this can yield high agreement with human rankings, it exacerbates position bias and lacks interpretability.
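As a rough illustration of the two-step mode, the sketch below evaluates each candidate in isolation and then ranks using only the resulting evaluations. The prompt wording, function names, and the `llm_complete` callable are placeholders, not the released TransEvalnia prompts or API.

```python
# Minimal sketch of the two-step evaluate-then-rank flow.
# `llm_complete` stands in for any chat-completion call (e.g., an Anthropic
# or vLLM client); prompts here are illustrative, not the released ones.
from typing import Callable, List

def evaluate_translation(llm_complete: Callable[[str], str],
                         source: str, translation: str) -> str:
    """Step 1: evaluate one candidate in isolation, span by span and
    dimension by dimension, so no other candidate can bias the result."""
    prompt = (
        "Evaluate the following translation of the source text. "
        "Segment the translation into spans and, for each span, assess "
        "accuracy, terminology, linguistic conventions, audience "
        "appropriateness, hallucinations, and missing content. "
        "Give a rationale and a 1-5 score for each dimension.\n\n"
        f"Source:\n{source}\n\nTranslation:\n{translation}\n"
    )
    return llm_complete(prompt)

def rank_from_evaluations(llm_complete: Callable[[str], str],
                          source: str, evaluations: List[str]) -> str:
    """Step 2: rank the candidates given only the per-candidate
    evaluations produced in step 1."""
    joined = "\n\n".join(
        f"Evaluation of translation {i + 1}:\n{e}"
        for i, e in enumerate(evaluations)
    )
    prompt = (
        "Based solely on the evaluations below, rank the translations "
        "from best to worst and briefly justify the ranking.\n\n"
        f"Source:\n{source}\n\n{joined}\n"
    )
    return llm_complete(prompt)
```

Because step 2 sees only evaluations rather than raw translations, the ranker cannot be swayed by surface features of whichever candidate happens to appear first.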
The system supports both open and proprietary LLMs, including Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct, and is evaluated across a diverse set of language pairs and genres, including challenging cases such as proverbs and haiku.
Evaluation Criteria
TransEvalnia's evaluation dimensions are a subset of MQM, tailored for both general and poetic texts:
- Accuracy: Faithfulness to the source meaning.
- Terminology: Correctness and appropriateness of terms.
- Linguistic Conventions: Fluency and grammaticality.
- Audience Appropriateness: Suitability for the target audience.
- Hallucinations: Unjustified content not present in the source.
- Missing Content: Omitted information.
- Emotional Content (poetry only): Replaces Linguistic Conventions when evaluating poetic texts.
Each translation is segmented into spans, and each span is evaluated along these dimensions, with both qualitative rationales and Likert-scale scores.
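A minimal sketch of what such a span-level record might look like is shown below; the class and field names are illustrative assumptions, not TransEvalnia's actual output schema.

```python
# Hypothetical container for span-level evaluations; field names are
# illustrative, not the system's real output format.
from dataclasses import dataclass, field
from typing import Dict, List

DIMENSIONS = [
    "accuracy", "terminology", "linguistic_conventions",
    "audience_appropriateness", "hallucinations", "missing_content",
]

@dataclass
class SpanEvaluation:
    span: str                   # the translated span under review
    rationales: Dict[str, str]  # dimension -> qualitative rationale
    scores: Dict[str, int]      # dimension -> Likert score (e.g., 1-5)

@dataclass
class TranslationEvaluation:
    source: str
    translation: str
    spans: List[SpanEvaluation] = field(default_factory=list)

    def overall_score(self) -> float:
        """Mean of all span/dimension scores; the paper's actual
        aggregation may differ."""
        all_scores = [s for span in self.spans for s in span.scores.values()]
        return sum(all_scores) / len(all_scores) if all_scores else 0.0
```

Keeping rationales and scores attached to individual spans makes it straightforward to aggregate per dimension or to audit any single judgment.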
Experimental Results
TransEvalnia is benchmarked against state-of-the-art systems (MT-Ranker, COMET, MetricX) on datasets with human MQM-style annotations, including WMT shared tasks and custom English-Japanese corpora. Key findings include:
- Ranking Accuracy: TransEvalnia matches or outperforms MT-Ranker on most datasets, except for WMT-2024 English-Spanish, where MT-Ranker has a slight edge, likely due to the abundance of training data for that language pair.
- Position Bias: The three-step interleaved approach consistently yields the lowest position bias across most datasets, as measured by bias inconsistency scores. However, no method fully eliminates position bias, and even state-of-the-art systems like MT-Ranker are affected.
- Human Meta-evaluation: Human raters (professional translators) agree with TransEvalnia's fine-grained evaluations at rates around 0.85, and with overall evaluations at 0.60–0.69, depending on the vendor and LLM used. Notably, LLM-generated translations (Sonnet, GPT-4o) are sometimes rated higher than human references, corroborating recent findings that LLMs can surpass human translators on certain tasks.
- Score Correlation: Spearman correlations between TransEvalnia's overall scores and human ratings are on par with inter-annotator agreement, indicating that the system's scores are as reliable as human judgments for ranking purposes (a brief correlation-check sketch follows this list).
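The correlation check referenced above can be illustrated with SciPy's `spearmanr`; the input arrays are placeholders for system scores and two sets of human ratings over the same items.

```python
# Sketch of the score-correlation comparison: system vs. human, measured
# against the inter-annotator baseline. Inputs are placeholders.
from scipy.stats import spearmanr

def correlation_report(system_scores, annotator_a, annotator_b):
    sys_vs_human, _ = spearmanr(system_scores, annotator_a)
    inter_annotator, _ = spearmanr(annotator_a, annotator_b)
    print(f"system vs. human rho: {sys_vs_human:.3f}")
    print(f"inter-annotator rho:  {inter_annotator:.3f}")
    # The paper's claim corresponds to sys_vs_human being roughly on par
    # with inter_annotator.
```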
Implementation Considerations
TransEvalnia is fully open-sourced, with code and data available for reproducibility. The system is modular, supporting different LLM backends and evaluation strategies. For practical deployment:
- Computational Requirements: Running evaluations with large LLMs (e.g., Qwen-2.5-72B) requires significant GPU resources, especially for batch processing or large-scale benchmarking.
- Prompt Engineering: The effectiveness of reasoning-based evaluation is sensitive to prompt design. The released prompts are tailored for each language pair and genre, and further adaptation may be necessary for new domains.
- Fine-tuning: Fine-tuning LLMs on MQM-labeled data (using methods such as LoRA) can improve score correlation and generalization across language pairs.
- Position Bias Mitigation: For high-stakes evaluation, the three-step interleaved approach is recommended, despite its added complexity, to minimize order effects (a simple order-swap consistency check is sketched after this list).
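The order-swap check mentioned in the last item can be sketched as follows. The paper's exact bias-inconsistency metric is not reproduced here; `rank_pair` is a hypothetical pairwise-ranking callable, and this simply measures how often the preference flips when candidate order is reversed.

```python
# Rough illustration of an order-swap consistency check for pairwise
# ranking. `rank_pair(source, a, b)` is a placeholder returning the index
# (0 or 1) of the preferred translation in the order presented.
from typing import Callable, List, Tuple

def position_inconsistency(rank_pair: Callable[[str, str, str], int],
                           pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (source, t1, t2) items whose preferred translation
    changes when the presentation order is swapped."""
    flips = 0
    for source, t1, t2 in pairs:
        forward = rank_pair(source, t1, t2)   # 0 -> t1, 1 -> t2
        backward = rank_pair(source, t2, t1)  # 0 -> t2, 1 -> t1
        # If the same presented position wins both times, the underlying
        # preference flipped with the order -> inconsistent.
        if forward == backward:
            flips += 1
    return flips / len(pairs) if pairs else 0.0
```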
Implications and Future Directions
TransEvalnia demonstrates that LLMs, when properly prompted, can serve as transparent, multi-dimensional evaluators for MT, providing actionable feedback that is both interpretable and aligned with human judgments. The system's open-source nature facilitates adoption and further research.
Several open challenges remain:
- Residual Position Bias: Interleaving mitigates position bias but does not eliminate it entirely. Further research into model architectures and input encoding strategies is warranted.
- Generalization to Other Tasks: The reasoning-based, multi-dimensional evaluation paradigm is applicable beyond MT, including summarization, dialogue, and code generation, provided appropriate evaluation rubrics are defined.
- Human-in-the-loop Evaluation: The explicit rationales produced by TransEvalnia enable human auditors to scrutinize and contest automated evaluations, supporting more robust and trustworthy MT system development.
In summary, TransEvalnia advances the state of MT evaluation by integrating LLM-based reasoning, fine-grained error analysis, and robust ranking strategies, setting a new standard for explainable and reliable translation assessment.