TransEvalnia: Reasoning-based Evaluation and Ranking of Translations (2507.12724v1)

Published 17 Jul 2025 in cs.CL

Abstract: We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al. 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system -- as well as MT-Ranker -- to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system's evaluation and reasoning, human assessments, as well as code is released.

Summary

  • The paper introduces a prompting-based system that leverages LLM reasoning to provide detailed, explainable multi-dimensional evaluations of translations.
  • It proposes single-step, two-step, and three-step interleaved evaluation modes to reduce position bias and improve ranking accuracy.
  • Experiments show strong alignment with human annotations, achieving robust score correlations that validate its transparent assessment approach.

TransEvalnia: Reasoning-based Evaluation and Ranking of Translations

TransEvalnia introduces a prompting-based system for machine translation (MT) evaluation and ranking that leverages the reasoning capabilities of LLMs to provide fine-grained, explainable assessments. The system is designed to address the limitations of traditional MT metrics, which typically yield only scalar scores without interpretability or actionable feedback, by producing detailed, multi-dimensional evaluations grounded in the Multidimensional Quality Metrics (MQM) framework.

Motivation and Context

As LLM-based translation systems approach or surpass human-level performance on certain tasks, the need for robust, nuanced evaluation methods becomes critical. Conventional metrics such as BLEU, while computationally efficient, are increasingly inadequate for distinguishing high-quality outputs or providing diagnostic feedback. More recent learned metrics and LLM-based evaluators (e.g., COMET, MetricX, MT-Ranker) correlate better with human judgments, but most still lack transparent reasoning or fine-grained error analysis.

TransEvalnia is positioned to fill this gap by:

  • Generating span-level, multi-dimensional evaluations (e.g., accuracy, terminology, linguistic conventions, audience appropriateness, hallucinations, missing content).
  • Providing explicit rationales for each assessment, enabling users to audit and interpret the evaluation process.
  • Supporting both ranking and scoring of candidate translations, with outputs that can be directly compared to human MQM-style annotations.

System Design and Methodology

TransEvalnia operates in several configurable modes, each designed to balance evaluation quality and robustness to position bias:

  1. Single-step Evaluation and Ranking: The LLM is prompted with the source text and all candidate translations, and asked to both evaluate and rank them in a single pass. This approach is efficient but susceptible to position bias, as the order of translations can influence the model's preferences.
  2. Two-step Evaluation and Ranking: Each translation is evaluated independently, and the resulting evaluations are then presented to the LLM for ranking. This decouples evaluation from ranking and reduces position bias (a sketch of this mode follows below).
  3. Three-step Interleaved Evaluation: Evaluations for each translation are interleaved at the span and dimension level before ranking, further mitigating position bias by distributing translation components throughout the input.
  4. No-reasoning Baseline: The LLM is prompted to rank translations without providing explicit reasoning. While this can yield high agreement with human rankings, it exacerbates position bias and lacks interpretability.

The system supports both open and proprietary LLMs, including Anthropic's Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct, and is evaluated across a diverse set of language pairs and genres, including challenging cases such as proverbs and haiku.
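
As a rough illustration of the two-step mode, the sketch below first elicits an independent, span-level evaluation for each candidate and then asks the model to rank on the basis of those evaluations alone. The `call_llm` wrapper, the prompt wording, and the dimension list are placeholders for illustration, not the released TransEvalnia prompts.

```python
# Minimal sketch of the two-step evaluate-then-rank mode.
# `call_llm` stands in for whatever chat-completion client is used
# (Claude-3.5-Sonnet, Qwen-2.5-72B-Instruct, ...); the prompt text is
# illustrative, not the released TransEvalnia prompts.
from typing import List

DIMENSIONS = [
    "Accuracy", "Terminology", "Linguistic Conventions",
    "Audience Appropriateness", "Hallucinations", "Missing Content",
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def evaluate_one(source: str, translation: str) -> str:
    """Step 1: independent, span-level, per-dimension evaluation of one candidate."""
    prompt = (
        f"Source text:\n{source}\n\nTranslation:\n{translation}\n\n"
        "Segment the translation into spans and assess each span on these "
        f"dimensions: {', '.join(DIMENSIONS)}. Give a short rationale and a "
        "1-5 score per dimension."
    )
    return call_llm(prompt)

def rank_from_evaluations(source: str, evaluations: List[str]) -> str:
    """Step 2: rank candidates from their evaluations only, decoupling
    ranking from the raw order in which the translations were presented."""
    blocks = "\n\n".join(
        f"Evaluation of translation {i + 1}:\n{e}"
        for i, e in enumerate(evaluations)
    )
    prompt = (
        f"Source text:\n{source}\n\n{blocks}\n\n"
        "Based solely on these evaluations, state which translation is best "
        "and give an overall 1-5 score for each."
    )
    return call_llm(prompt)

def two_step_rank(source: str, candidates: List[str]) -> str:
    evaluations = [evaluate_one(source, c) for c in candidates]
    return rank_from_evaluations(source, evaluations)
```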

Evaluation Criteria

TransEvalnia's evaluation dimensions are a subset of MQM, tailored for both general and poetic texts:

  • Accuracy: Faithfulness to the source meaning.
  • Terminology: Correctness and appropriateness of terms.
  • Linguistic Conventions: Fluency and grammaticality.
  • Audience Appropriateness: Suitability for the target audience.
  • Hallucinations: Unjustified content not present in the source.
  • Missing Content: Omitted information.
  • Emotional Content: For poetry, this dimension replaces Linguistic Conventions.

Each translation is segmented into spans, and each span is evaluated along these dimensions, with both qualitative rationales and Likert-scale scores.
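
The shape of a finished evaluation can be pictured with the container below; the field names are assumptions about structure for illustration, not the schema of the released data.

```python
# Illustrative container for a span-level evaluation; field names are
# assumptions, not the released data schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SpanAssessment:
    span: str        # the evaluated span of the translation
    dimension: str   # e.g. "Accuracy" or "Missing Content"
    rationale: str   # qualitative justification for the judgment
    score: int       # Likert-scale score, e.g. 1 (worst) to 5 (best)

@dataclass
class TranslationEvaluation:
    source: str
    translation: str
    assessments: List[SpanAssessment] = field(default_factory=list)
    dimension_scores: Dict[str, float] = field(default_factory=dict)  # per-dimension aggregates
    overall_score: float = 0.0  # single summary score for the whole translation
```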

Experimental Results

TransEvalnia is benchmarked against state-of-the-art systems (MT-Ranker, COMET, MetricX) on datasets with human MQM-style annotations, including WMT shared tasks and custom English-Japanese corpora. Key findings include:

  • Ranking Accuracy: TransEvalnia matches or outperforms MT-Ranker on most datasets, except for WMT-2024 English-Spanish, where MT-Ranker has a slight edge, likely due to the abundance of training data for that language pair.
  • Position Bias: The three-step interleaved approach yields the lowest position bias on most datasets, as measured by bias-inconsistency scores (see the sketch after this list). However, no method fully eliminates position bias, and even state-of-the-art systems such as MT-Ranker are affected.
  • Human Meta-evaluation: Human raters (professional translators) agree with TransEvalnia's fine-grained evaluations at rates around 0.85, and with overall evaluations at 0.60–0.69, depending on the vendor and LLM used. Notably, LLM-generated translations (Sonnet, GPT-4o) are sometimes rated higher than human references, corroborating recent findings that LLMs can surpass human translators on certain tasks.
  • Score Correlation: Spearman correlations between TransEvalnia's overall scores and human ratings are on par with inter-annotator agreement, indicating that the system's scores are as reliable as human judgments for ranking purposes.
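
One simple way to probe the position sensitivity and score reliability reported above is sketched below: each pair of candidates is ranked twice, once per presentation order, and disagreements are counted; a Spearman correlation is then computed between aligned system and human scores. The `rank_pair` callable is a placeholder for any of the ranking modes, not part of the released code.

```python
# Sketch of a position-bias (inconsistency) check and a score-correlation check.
# `rank_pair(source, a, b)` is a placeholder returning 0 if the first presented
# translation wins and 1 otherwise; it is not part of the released code.
from typing import Callable, List, Sequence, Tuple
from scipy.stats import spearmanr

def inconsistency_rate(
    pairs: Sequence[Tuple[str, str, str]],      # (source, translation_a, translation_b)
    rank_pair: Callable[[str, str, str], int],
) -> float:
    """Fraction of pairs whose winner depends on presentation order."""
    inconsistent = 0
    for source, a, b in pairs:
        first = rank_pair(source, a, b)          # a presented first
        second = rank_pair(source, b, a)         # b presented first
        # Consistent behavior means the same translation wins in both runs,
        # i.e. the returned indices differ; equal indices mean the winner
        # was whichever translation happened to sit in that position.
        if first == second:
            inconsistent += 1
    return inconsistent / len(pairs)

def score_correlation(system_scores: List[float], human_scores: List[float]) -> float:
    """Spearman correlation between system-assigned and human-assigned scores."""
    rho, _ = spearmanr(system_scores, human_scores)
    return rho
```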

Implementation Considerations

TransEvalnia is fully open-sourced, with code and data available for reproducibility. The system is modular, supporting different LLM backends and evaluation strategies. For practical deployment:

  • Computational Requirements: Running evaluations with large LLMs (e.g., Qwen-2.5-72B) requires significant GPU resources, especially for batch processing or large-scale benchmarking.
  • Prompt Engineering: The effectiveness of reasoning-based evaluation is sensitive to prompt design. The released prompts are tailored for each language pair and genre, and further adaptation may be necessary for new domains.
  • Fine-tuning: Fine-tuning LLMs on MQM-labeled data (using parameter-efficient methods such as LoRA) can improve score correlation and generalization across language pairs; a configuration sketch follows this list.
  • Position Bias Mitigation: For high-stakes evaluation, the three-step interleaved approach is recommended, despite increased complexity, to minimize order effects.
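
As a hedged illustration of the LoRA route mentioned above, the snippet below attaches a LoRA adapter to a causal LM using the Hugging Face peft library; the base model, target modules, and hyperparameters are placeholders, not the settings used in the paper.

```python
# Sketch of a LoRA fine-tuning setup with Hugging Face peft; the base model
# and hyperparameters are illustrative placeholders, not the paper's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-72B-Instruct"   # assumed; substitute the evaluation LLM of choice
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

lora_config = LoraConfig(
    r=16,                                   # low-rank adapter dimension
    lora_alpha=32,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # typical attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Train on MQM-labeled (source, translation, evaluation) examples with a standard
# supervised fine-tuning loop, e.g. transformers' Trainer or TRL's SFTTrainer.
```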

Implications and Future Directions

TransEvalnia demonstrates that LLMs, when properly prompted, can serve as transparent, multi-dimensional evaluators for MT, providing actionable feedback that is both interpretable and aligned with human judgments. The system's open-source nature facilitates adoption and further research.

Several open challenges remain:

  • Residual Position Bias: While interleaving mitigates position bias, it is not fully eliminated. Further research into model architectures and input encoding strategies is warranted.
  • Generalization to Other Tasks: The reasoning-based, multi-dimensional evaluation paradigm is applicable beyond MT, including summarization, dialogue, and code generation, provided appropriate evaluation rubrics are defined.
  • Human-in-the-loop Evaluation: The explicit rationales produced by TransEvalnia enable human auditors to scrutinize and contest automated evaluations, supporting more robust and trustworthy MT system development.

In summary, TransEvalnia advances the state of MT evaluation by integrating LLM-based reasoning, fine-grained error analysis, and robust ranking strategies, setting a new standard for explainable and reliable translation assessment.