M-Prometheus: A Suite of Open Multilingual LLM Judges (2504.04953v1)

Published 7 Apr 2025 in cs.CL and cs.AI

Abstract: The use of LLMs for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data instead of translated data. We release our models, training dataset, and code.

Summary

  • The paper introduces the M-Prometheus suite, offering multilingual LLM judges that assess long-form texts with both direct scores and pairwise feedback.
  • It uses finetuned Qwen2.5-Instruct models and diverse multilingual datasets to outperform prior LLM judges on several benchmarks.
  • Findings highlight improved practical utility in quality-aware decoding and the value of native training data for robust cross-lingual evaluation.

This paper introduces M-Prometheus (2504.04953), a suite of open-weight multilingual LLM judges designed to evaluate long-form text outputs in multiple languages. It addresses the limitation of most existing "LLM-as-a-judge" models, which are primarily optimized for English, hindering the development and evaluation of multilingual capabilities in other LLMs.

Core Contribution: M-Prometheus Suite

  • Models: The suite includes models with 3B, 7B, and 14B parameters, finetuned from the Qwen2.5-Instruct series.
  • Capabilities:
    • Provides both direct assessment (DA) scores (1-5) and pairwise comparison (PWC) feedback.
    • Evaluates outputs in over 20 languages (including Arabic, Chinese, French, German, Hindi, Japanese, Korean, Portuguese, Spanish, Vietnamese, etc.).
    • Supports both reference-based and reference-free evaluation.
  • Input/Output Format: Models accept multilingual instructions, model responses, and optional reference answers, along with an English scoring rubric. They output detailed feedback followed by a final judgment (a score or a preference indicator), in English by default, though they can be prompted to respond in other languages. (See Appendix A.1 and A.2 of the paper for format details and examples.)
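
A minimal, hedged usage sketch for direct assessment is shown below. The Hugging Face repository name and the exact prompt wording are assumptions for illustration; the paper's Appendix A.1 gives the actual template the judges were trained with.

```python
# Minimal direct-assessment (DA) sketch. The checkpoint ID and the prompt
# wording below are assumptions for illustration only; see the paper's
# Appendix A.1 for the exact template the judges were trained with.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/M-Prometheus-7B"  # assumed Hugging Face repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

user_prompt = (
    "###Task Description:\n"
    "Evaluate the response to the instruction below against the score rubric, "
    "write feedback, then give an integer score from 1 to 5 as '[RESULT] <score>'.\n\n"
    "###Instruction:\nExplique brièvement la photosynthèse.\n\n"
    "###Response:\nLa photosynthèse convertit la lumière du soleil en énergie chimique ...\n\n"
    "###Score Rubric:\nIs the explanation accurate, complete, and fluent in the input language?\n\n"
    "###Feedback:"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[1]:], skip_special_tokens=True))
# Expected output shape: free-form feedback followed by "[RESULT] <score>".
```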

Training Methodology & Data

The training recipe is inspired by Prometheus 2 but adapted for multilingual evaluation. Key components include:

  1. Base Data: Prometheus 2's English Feedback and Preference Collections (generated using GPT-4).
  2. Natively Multilingual Data:
    • M-FEEDBACK COLLECTION (DA): Synthetically generated DA instances (instruction, responses of varying quality, feedback, score, optional reference) for 5 non-English languages (French, Portuguese, Greek, Chinese, Hindi) using Claude 3.5 Sonnet. Generating the data natively in each language helps ensure fluency and avoids translation artifacts.
    • M-PREFERENCE COLLECTION (PWC): Synthesized PWC instances derived from the M-FEEDBACK COLLECTION, again using Claude 3.5 Sonnet to generate the preference feedback.
  3. Machine Translation (MT) Evaluation Data:
    • Synthetically generated DA instances for MT evaluation across 8 language pairs (en-de, en-cs, en-es, en-uk, en-ru, en-zh, en-ja, en-hi) using Claude 3.5 Sonnet.
    • Includes source text, candidate translations (scored 1-5), optional reference translation, and rubrics.
    • This data helps train models for both reference-based and reference-free MT evaluation and improves general multilingual capabilities.
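
To make the shape of this synthetic training data concrete, the sketch below shows one plausible direct-assessment record and one pairwise-comparison record as Python dicts. The field names are illustrative assumptions, not the released datasets' schema.

```python
# Illustrative shape of the synthetic evaluation data. Field names are
# assumptions for illustration; consult the released collections for the
# actual schema.

# One direct-assessment (DA) instance from the natively multilingual data.
da_instance = {
    "language": "fr",
    "instruction": "Rédigez un court paragraphe sur l'importance du sommeil.",
    "response": "Le sommeil est essentiel à la santé, car il permet ...",
    "reference_answer": "Un paragraphe modèle rédigé nativement en français ...",  # optional
    "score_rubric": "Is the paragraph accurate, well structured, and fluent?",
    "feedback": "La réponse est correcte mais manque de détails sur les effets ...",
    "score": 3,  # integer in [1, 5]
}

# One pairwise-comparison (PWC) instance, derived by pairing two responses of
# different quality to the same instruction and adding preference feedback.
pwc_instance = {
    "language": "fr",
    "instruction": da_instance["instruction"],
    "response_A": "Le sommeil est important.",
    "response_B": "Le sommeil est essentiel à la santé, car il permet ...",
    "feedback": "La réponse B couvre davantage d'aspects et justifie ...",
    "verdict": "B",
}
```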

Performance & Evaluation

M-Prometheus models were evaluated across several benchmarks:

  • General Multilingual Capabilities: Outperforms state-of-the-art open judges (Prometheus 2, Glider, Hercule) and even the much larger proprietary GPT-4o on MM-Eval (a benchmark with mostly native, non-translated data). Achieves strong performance on M-RewardBench (translated benchmark). Excels in categories like Safety, Linguistics, and detecting Language Hallucinations.
  • English Capabilities: Retains or slightly improves performance compared to the Qwen2.5 backbone models on RewardBench.
  • Literary Machine Translation: Significantly outperforms other open judges and baseline models on LitEval-Corpus (reference-free DA), demonstrating strong capabilities on challenging cross-lingual tasks.
  • Practical Utility (QAD): When used for Quality-Aware Decoding (best-of-n sampling) on M-ArenaHard, M-Prometheus significantly improves the output quality of a base model (Qwen2.5-3B-Instruct) in French, Chinese, and Hindi, achieving high win rates (e.g., the 7B judge reaches a 66.37% win rate averaged across languages). This directly demonstrates the practical benefit of using these judges to improve generation.
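
The QAD setup amounts to best-of-n sampling with the judge as the scorer: sample several candidates from the base model, score each with the judge, and keep the highest-scoring one. A minimal sketch follows, assuming two hypothetical helpers (generate and judge_score) that wrap the base model and an M-Prometheus judge.

```python
# Best-of-n sampling with a judge (quality-aware decoding), sketched under two
# assumed helpers (not part of any released API):
#   generate(prompt, n)        -> list of n sampled candidates from the base model
#   judge_score(prompt, text)  -> direct-assessment score in [1, 5] from the judge
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], List[str]],
    judge_score: Callable[[str, str], float],
    n: int = 16,
) -> str:
    """Sample n candidates and return the one the judge scores highest."""
    candidates = generate(prompt, n)
    return max(candidates, key=lambda text: judge_score(prompt, text))
```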

Key Findings & Implementation Insights (Ablations)

The paper provides valuable insights for training multilingual judges:

  1. Backbone Model Choice is Crucial: The underlying instruction-tuned model significantly impacts final judge performance. Qwen2.5-Instruct proved superior to Mistral-7B, EuroLLM-9B, and Aya-Expanse-8B in their experiments. Simply using a backbone with more multilingual pretraining data doesn't guarantee the best judge.
  2. Natively Multilingual Data > Translated Data: Training on data generated directly in target languages is significantly more effective than translating English training data. Translated data yielded worse results on several benchmarks compared to English-only or natively multilingual training. This contradicts findings from the Hercule paper, potentially due to differences in evaluation benchmarks.
  3. MT Evaluation Data Transfer: Including cross-lingual MT evaluation data positively transfers to general multilingual evaluation tasks, particularly for identifying language mixing (hallucinations). Conversely, training on general multilingual data also improves MT evaluation performance.
  4. Native Data for Practical Utility: Judges trained on natively multilingual data showed the best performance in the practical QAD task, indicating this data type is key for building judges that can effectively improve model outputs at inference time.
  5. Language Coverage Trade-offs: Increasing the number of non-English languages in the training data (from 3 to 5) improved performance on most intrinsic benchmarks but slightly decreased performance on the practical QAD task.
  6. Model Scale: Performance generally increases with scale, especially on general benchmarks and LitEval. However, diminishing returns are observed, particularly for QAD where the 7B model outperformed the 14B model. Finetuning provides the largest relative benefit for smaller (3B) models.

Practical Applications & Resources

  • Evaluating Multilingual LLMs: Use M-Prometheus for DA or PWC evaluation of chatbots, instruction-following models, or MT systems in various languages.
  • Improving Generation: Apply M-Prometheus with techniques like Quality-Aware Decoding (QAD) or best-of-n sampling at inference time to select higher-quality outputs from multilingual models.
  • Data Filtering/Distillation: Potentially use M-Prometheus scores to filter training datasets or to distill knowledge from stronger models during training (a simple filtering recipe is sketched below).
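
For the filtering use case, one simple recipe (an assumption, not a procedure described in the paper) is to keep only examples whose judge-assigned DA score clears a threshold:

```python
# Hypothetical data-filtering recipe: retain only examples whose judge
# direct-assessment score meets a threshold. judge_score(prompt, text) is the
# same assumed helper as in the best-of-n sketch above.
def filter_by_judge(examples, judge_score, threshold=4):
    """Keep examples whose response the judge scores at or above `threshold`."""
    return [
        ex for ex in examples
        if judge_score(ex["instruction"], ex["response"]) >= threshold
    ]
```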

The authors release the M-Prometheus models (3B, 7B, 14B) and the M-FEEDBACK COLLECTION and M-PREFERENCE COLLECTION training datasets on Hugging Face, along with the code required to reproduce their experiments, enabling practical adoption by developers.
