
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate (2401.16788v1)

Published 30 Jan 2024 in cs.CL and cs.AI

Abstract: Despite the utility of LLMs across a wide range of tasks and scenarios, developing a method for reliably evaluating LLMs across varied contexts continues to be challenging. Modern evaluation approaches often use LLMs to assess responses generated by LLMs. However, the meta-evaluation conducted to assess the effectiveness of these LLMs as evaluators is typically constrained by the coverage of existing benchmarks or requires extensive human annotation. This underscores the urgency of methods for scalable meta-evaluation that can effectively, reliably, and efficiently evaluate the performance of LLMs as evaluators across diverse tasks and scenarios, particularly in potentially new, user-defined scenarios. To fill this gap, we propose ScaleEval, an agent-debate-assisted meta-evaluation framework that leverages the capabilities of multiple communicative LLM agents. This framework supports multi-round discussions to assist human annotators in discerning the most capable LLMs as evaluators, which significantly eases their workload in cases that used to require large-scale annotations during meta-evaluation. We release the code for our framework, which is publicly available at: \url{https://github.com/GAIR-NLP/scaleeval}.

Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

Introduction

LLMs have been integral in pushing the boundaries of what's achievable in natural language processing and generative AI. Their versatility and capability to adapt to various tasks have led to significant interest in employing these models not just as solution generators but also as evaluators of content across numerous domains. However, the challenge of efficiently and accurately validating the effectiveness of LLMs as evaluators remains. This paper introduces ScaleEval, a novel framework designed to meta-evaluate LLMs using an agent-debate approach, aiming to streamline the process and reduce reliance on extensive human annotation.

Meta-Evaluation Challenges and ScaleEval's Approach

Traditionally, evaluating LLMs necessitates comprehensive human-annotated benchmarks, which are both costly and time-consuming to create. As the application of LLMs spans a growing number of tasks, generating specific benchmarks for each becomes impractical. ScaleEval proposes a solution by enabling scalable meta-evaluation through an innovative mechanism that leverages agent debates, thus reducing the human annotation burden significantly.

This multi-agent discussion system involves deploying multiple LLM agents in rounds of discussion on given prompts, evaluating the responses generated by LLMs under investigation. Herein lies the flexibility of ScaleEval: it allows users to define their criteria and scenarios, adapting the evaluation process to a wide range of contexts.
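To make the mechanism concrete, the following is a minimal sketch of how such a multi-round agent debate could be orchestrated for a single pairwise comparison. The `query_llm` helper, agent names, prompt wording, and escalation rule are illustrative assumptions, not the ScaleEval API; the actual implementation is in the released repository.

```python
# Sketch of a multi-round agent debate for meta-evaluation (illustrative only;
# function names and prompts are assumptions, not the ScaleEval API).
from typing import Callable, List


def debate_judgment(
    query_llm: Callable[[str, str], str],   # (agent_name, prompt) -> reply
    agents: List[str],                       # e.g. ["agent_1", "agent_2", "agent_3"]
    criterion: str,                          # user-defined evaluation criterion
    prompt: str,                             # the original task prompt
    response_a: str,
    response_b: str,
    max_rounds: int = 3,
) -> str:
    """Agents discuss which response better satisfies the criterion.

    Returns "A", "B", or "TIE". If the agents never converge within
    max_rounds, the case is escalated to a human annotator instead of
    being decided automatically.
    """
    transcript = ""
    for round_idx in range(max_rounds):
        votes = {}
        for agent in agents:
            reply = query_llm(agent, (
                f"Criterion: {criterion}\n"
                f"Task prompt: {prompt}\n"
                f"Response A: {response_a}\nResponse B: {response_b}\n"
                f"Discussion so far:\n{transcript}\n"
                "State which response is better (A, B, or TIE) and justify briefly."
            ))
            transcript += f"[Round {round_idx + 1}] {agent}: {reply}\n"
            # Crude vote extraction: assumes the reply begins with A/B/TIE.
            votes[agent] = reply.strip().split()[0].upper()
        if len(set(votes.values())) == 1:    # unanimous -> accept the verdict
            return votes[agents[0]]
    return "NEEDS_HUMAN_ANNOTATION"          # persistent disagreement -> escalate
```

Only the unresolved cases reach a human annotator, which is the source of the claimed reduction in annotation workload.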

Experiments and Findings

The experiments conducted to test ScaleEval's efficacy reveal its potential to closely mirror human expert judgments across different scenarios, including but not limited to brainstorming, coding, and math problems. The agent-debate approach demonstrates high example-level and system-level agreement rates with human annotations, suggesting that ScaleEval can reliably substitute for extensive human judgment in many instances.
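As a rough illustration of the two agreement measures, the sketch below computes them from pairwise verdicts; the data layout and the aggregation by majority vote are assumptions for illustration, not the paper's exact evaluation script.

```python
# Sketch of example-level and system-level agreement with human labels
# (assumed data layout; not the paper's exact evaluation code).
from collections import Counter


def example_level_agreement(framework_labels, human_labels):
    """Fraction of individual comparisons where the framework matches humans."""
    matches = sum(f == h for f, h in zip(framework_labels, human_labels))
    return matches / len(human_labels)


def system_level_agreement(framework_labels, human_labels):
    """Whether the overall winner, aggregated over all comparisons, is the same."""
    def winner(labels):
        return Counter(labels).most_common(1)[0][0]
    return winner(framework_labels) == winner(human_labels)


# Hypothetical pairwise verdicts ("A", "B", "TIE") for five comparisons:
fw = ["A", "A", "B", "TIE", "A"]
hu = ["A", "B", "B", "A", "A"]
print(example_level_agreement(fw, hu))   # 0.6
print(system_level_agreement(fw, hu))    # True (both pick "A" overall)
```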

Further exploration into the capabilities and limitations of LLMs as evaluators underlines the variability in their performance depending on the scenario and the type of prompt used. Interestingly, modifications to the criteria prompts, such as masking their content or replacing it with gibberish, reveal a limitation in the LLM evaluators' ability to maintain evaluative accuracy, indicating areas for future improvement.
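The kinds of perturbation referred to here can be sketched as simple transformations of the criteria text; the specific functions below are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of criteria perturbations used to probe LLM-evaluator robustness
# (illustrative transformations; not the paper's exact implementation).
import random
import string


def mask_criterion(criterion: str, mask_token: str = "[MASKED]") -> str:
    """Hide the criterion entirely, so the evaluator must infer what to judge."""
    return mask_token


def gibberish_criterion(criterion: str, seed: int = 0) -> str:
    """Replace each word of the criterion with random letters of the same
    length, preserving surface shape while destroying meaning."""
    rng = random.Random(seed)
    return " ".join(
        "".join(rng.choices(string.ascii_lowercase, k=len(word)))
        for word in criterion.split()
    )


print(mask_criterion("helpfulness and factual accuracy"))       # [MASKED]
print(gibberish_criterion("helpfulness and factual accuracy"))  # e.g. "xqzpalmvtes kfg owtbnqp hcazuride"
```

Evaluator accuracy that drops sharply under such perturbations suggests the judgments lean on the literal criteria wording rather than a robust understanding of the task.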

Implications and Future Directions

The introduction of ScaleEval opens new pathways for the meta-evaluation of LLMs, offering a scalable alternative to traditional benchmarking methods. Its adaptability to various scenarios and criteria without the need for extensive bespoke datasets is a significant step forward.

Moreover, the findings highlight the nuanced understanding required in selecting and configuring LLMs as evaluators, pointing to the importance of ongoing research in this area. Future developments could focus on enhancing LLM evaluator robustness to prompt modifications and further reducing the need for human intervention.

Conclusion

ScaleEval represents a significant contribution to the domain of LLM evaluation, addressing the critical challenge of scalability in meta-evaluation. By leveraging the agent-debate mechanism, it opens up new possibilities for efficiently validating and improving LLMs as evaluators across a broad spectrum of tasks. As the research community continues to explore the vast potentials of generative AI, tools like ScaleEval will be indispensable in ensuring these models not only generate high-quality outputs but can also reliably assess the quality of content across diverse applications.

Acknowledgments are due to the team for their pioneering effort and to the broader community for their continued engagement and feedback, which will undoubtedly shape future iterations of ScaleEval and similar endeavors in AI and machine learning.

Authors (4)
  1. Steffi Chern
  2. Ethan Chern
  3. Graham Neubig
  4. Pengfei Liu