Emergent Mind


Despite the utility of Large Language Models (LLMs) across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained by the coverage of existing benchmarks or requires extensive human annotation. This underscores the urgency of methods for scalable meta-evaluation that can effectively, reliably, and efficiently evaluate the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist human annotators in discerning the most capable LLMs as evaluators, which significantly eases their workload in cases that used to require large-scale annotations during meta-evaluation. We release the code for our framework, which is publicly available at: \url{https://github.com/GAIR-NLP/scaleeval}.


  • The paper introduces ScaleEval, a framework for the meta-evaluation of LLMs using an agent-debate approach to streamline validation processes.

  • ScaleEval reduces the need for comprehensive human-annotated benchmarks, offering a scalable alternative for evaluating LLMs across various tasks.

  • Through experiments, ScaleEval displayed high agreement rates with human judgments in diverse scenarios, suggesting its efficacy in closely mirroring human expert evaluations.

  • Findings indicate the potential for ScaleEval to significantly reduce human annotation burdens while also highlighting the need for future improvements in LLM evaluators' robustness to prompt modifications.


LLMs have been integral in pushing the boundaries of what's achievable in natural language processing and generative AI. Their versatility and capability to adapt to various tasks have led to significant interest in employing these models not just as solution generators but also as evaluators of content across numerous domains. However, the challenge of efficiently and accurately validating the effectiveness of LLMs as evaluators remains. This paper introduces ScaleEval, a novel framework designed to meta-evaluate LLMs using an agent-debate approach, aiming to streamline the process and reduce reliance on extensive human annotation.

Meta-Evaluation Challenges and ScaleEval's Approach

Traditionally, evaluating LLMs necessitates comprehensive human-annotated benchmarks, which are both costly and time-consuming to create. As the application of LLMs spans a growing number of tasks, generating specific benchmarks for each becomes impractical. ScaleEval proposes a solution by enabling scalable meta-evaluation through an innovative mechanism that leverages agent debates, thus reducing the human annotation burden significantly.

This multi-agent discussion system involves deploying multiple LLM agents in rounds of discussion on given prompts, evaluating the responses generated by LLMs under investigation. Herein lies the flexibility of ScaleEval: it allows users to define their criteria and scenarios, adapting the evaluation process to a wide range of contexts.
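The multi-round discussion described above can be sketched as a simple loop; note that the function names, prompt format, and round structure here are illustrative assumptions for exposition, not ScaleEval's actual API.

```python
def debate_evaluate(agents, llm_fn, criteria, question, responses, rounds=2):
    """Sketch of a multi-round agent debate over candidate responses.

    agents:    list of agent identifiers (e.g. model names)
    llm_fn:    callable (agent, prompt) -> str, wrapping a real LLM API call
    criteria:  user-defined evaluation criteria for the scenario
    responses: dict mapping a label (e.g. "A", "B") to a candidate response
    """
    transcript = []
    for rnd in range(1, rounds + 1):
        for agent in agents:
            # Each agent sees the criteria, the question, all candidate
            # responses, and the discussion so far before giving a verdict.
            prompt = (
                f"Criteria: {criteria}\n"
                f"Question: {question}\n"
                + "\n".join(f"Response {k}: {v}" for k, v in responses.items())
                + "\nDiscussion so far:\n" + "\n".join(transcript)
                + "\nWhich response better satisfies the criteria, and why?"
            )
            transcript.append(f"{agent} (round {rnd}): {llm_fn(agent, prompt)}")
    # A human annotator need only read the transcript when agents disagree
    # after the final round, which is where the annotation savings come from.
    return transcript
```

In practice `llm_fn` would wrap a chat-completion call to each provider; the per-agent loop makes it straightforward to mix models from different vendors in one debate.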

Experiments and Findings

The experiments conducted to test ScaleEval's efficacy reveal its potential in closely mirroring human expert judgments across different scenarios, including but not limited to brainstorming, coding, and math problems. The agent-debate approach demonstrates high example-level and system-level agreement rates with human annotations, suggesting that ScaleEval can reliably substitute for extensive human judgment in many instances.
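The two agreement metrics mentioned above can be illustrated with a minimal sketch: example-level agreement counts per-comparison matches between the framework's verdicts and human verdicts, while system-level agreement asks whether both judges induce the same overall ranking of systems. The win-counting rule below is an assumed simplification, not necessarily the paper's exact procedure.

```python
from collections import Counter

def example_agreement(framework_verdicts, human_verdicts):
    """Fraction of individual pairwise comparisons where the two judges agree."""
    matches = sum(f == h for f, h in zip(framework_verdicts, human_verdicts))
    return matches / len(framework_verdicts)

def system_ranking(verdicts):
    """Rank systems by how often each one wins a pairwise comparison."""
    wins = Counter(verdicts)
    return sorted(wins, key=wins.get, reverse=True)

def system_agreement(framework_verdicts, human_verdicts):
    """1.0 if both judges induce the same system ranking, else 0.0."""
    return float(system_ranking(framework_verdicts) == system_ranking(human_verdicts))
```

The gap between the two metrics is informative: an evaluator can disagree with humans on many individual examples yet still rank systems in the same order.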

Further exploration into the capabilities and limitations of LLMs as evaluators underlines the variability in their performance depending on the scenario and the type of prompt used. Notably, modifications to the evaluation prompts, such as masking the criteria or replacing them with gibberish, degrade the LLM evaluators' ability to maintain their evaluative accuracy, indicating areas for future improvement.
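The robustness probes described above amount to perturbing the criteria portion of the evaluation prompt; a minimal sketch of two such perturbations follows. Both helpers are hypothetical illustrations of the idea, not functions from the ScaleEval codebase.

```python
import random

def mask_criteria(criteria: str, mask_token: str = "[MASKED]") -> str:
    """Hide the stated criterion entirely behind a mask token."""
    return mask_token

def gibberish_criteria(criteria: str, seed: int = 0) -> str:
    """Replace each word of the criterion with random letters of equal length,
    preserving the prompt's shape while destroying its meaning."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    return " ".join(
        "".join(rng.choice(letters) for _ in word) for word in criteria.split()
    )
```

If an evaluator's verdicts barely change under either perturbation, that suggests it was never really conditioning on the criteria, which is exactly the failure mode this kind of probe is designed to expose.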

Implications and Future Directions

The introduction of ScaleEval opens new pathways for the meta-evaluation of LLMs, offering a scalable alternative to traditional benchmarking methods. Its adaptability to various scenarios and criteria without the need for extensive bespoke datasets is a significant step forward.

Moreover, the findings highlight the nuanced understanding required in selecting and configuring LLMs as evaluators, pointing to the importance of ongoing research in this area. Future developments could focus on enhancing LLM evaluator robustness to prompt modifications and further reducing the need for human intervention.


ScaleEval represents a significant contribution to the domain of LLM evaluation, addressing the critical challenge of scalability in meta-evaluation. By leveraging the agent-debate mechanism, it opens up new possibilities for efficiently validating and improving LLMs as evaluators across a broad spectrum of tasks. As the research community continues to explore the vast potentials of generative AI, tools like ScaleEval will be indispensable in ensuring these models not only generate high-quality outputs but can also reliably assess the quality of content across diverse applications.


