Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions (2405.20267v4)

Published 30 May 2024 in cs.CL

Abstract: As LLMs continuously evolve, there is an urgent need for a reliable evaluation method that delivers trustworthy results promptly. Currently, static benchmarks suffer from inflexibility and unreliability, leading users to prefer human voting platforms like Chatbot Arena. However, human evaluations require significant manual effort. To address this, we propose the Auto-Arena, an innovative framework that automates the entire evaluation process using LLM-powered agents. Firstly, an LLM examiner generates questions. Then, two LLM candidates engage in a multi-round peer battle based on individual questions, aiming at revealing their true performance differences. Finally, a committee of LLM judges collaboratively discusses and decides the winner, reducing bias and enhancing fairness. During the peer battles, we observe intriguing scenarios where the LLM candidates display competitive behaviors and even learn from the opponents. In our extensive experiments involving 15 recent LLMs, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks without any manual efforts. As a result, Auto-Arena offers a promising alternative to current human evaluation platforms for evaluating LLMs automatically.

Auto-Arena of LLMs: Automating Evaluations with Agent Peer Battles and Committee Discussions

The rapid development and deployment of LLMs pose a daunting challenge for anyone tasked with evaluating their capabilities in a timely manner. Traditional static benchmarks suffer from dataset contamination and may not adequately capture the dynamic nature of LLM performance. Human evaluations, while thorough, demand significant manual effort and are often slow to accommodate new models. In response to these challenges, the paper introduces Auto-Arena, a framework designed to automate LLM evaluation through agent peer battles and committee discussions.

The Auto-Arena framework is structured into three sequential stages: question generation, peer battles, and committee discussions, all operated by LLM agents, thereby eliminating the need for human intervention in the evaluation process. The design intends to offer an evaluation system that mimics human-like assessment while overcoming the limitations of static datasets and the biases inherent in model evaluations.

Framework Components and Methodology

  1. Question Generation: The process begins with an examiner LLM tasked with designing diverse and complex queries. These questions, spanning domains such as writing, roleplay, extraction, reasoning, math, and more, form the basis of the peer battles. Generating questions with an LLM minimizes contamination risk by avoiding reliance on static datasets, which may have already leaked into the candidates' training data.
  2. Peer Battles: At the heart of Auto-Arena is a peer-battle mechanism where two LLMs engage in multiple rounds of debate over the proposed query. Through structured interactions, including criticizing each other's responses and posing follow-up questions, LLMs reveal performance gaps, allowing evaluators to observe capabilities beyond initial responses. This debate format not only enhances the evaluation of LLMs’ comprehensiveness and adaptability but also uncovers nuanced differences in performance otherwise masked by one-off responses.
  3. Committee Discussions: Following the peer battles, a panel of LLM judges, selected from top-ranking models, evaluates the outcomes. The committee's task is to mimic a peer review process, bringing together diverse evaluations that mitigate single-model biases. In contentious scenarios or when performance levels are closely matched, the committee approach facilitates more balanced and representative decision-making. This final adjudication stage is designed to parallel the consensus-building process observed in human evaluations, further aligning LLM assessments with human standards. A minimal end-to-end sketch of all three stages follows this list.
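The sketch below strings the three stages together in Python. It is a minimal illustration under stated assumptions, not the paper's implementation: the `chat(model, prompt)` helper is a hypothetical stand-in for whatever API client is used, the prompts are simplified placeholders rather than the paper's actual templates, and the committee is reduced here to independent votes with a majority rule, whereas Auto-Arena's judges additionally discuss before deciding.

```python
# Minimal sketch of the three Auto-Arena stages described above.
# `chat(model, prompt)` is a hypothetical helper that sends one prompt to a
# named LLM and returns its text reply; swap in your own API client.
from collections import Counter
from typing import Callable

Chat = Callable[[str, str], str]  # (model_name, prompt) -> reply text


def generate_question(chat: Chat, examiner: str, domain: str) -> str:
    """Stage 1: the examiner LLM writes one open-ended question for a domain."""
    prompt = (
        f"You are an exam writer. Compose one challenging, self-contained "
        f"question in the domain of '{domain}'. Return only the question."
    )
    return chat(examiner, prompt)


def peer_battle(chat: Chat, cand_a: str, cand_b: str,
                question: str, rounds: int = 2) -> list[dict]:
    """Stage 2: both candidates answer, then alternately criticize the
    opponent and raise follow-ups over several rounds."""
    transcript = [
        {"speaker": cand_a, "text": chat(cand_a, question)},
        {"speaker": cand_b, "text": chat(cand_b, question)},
    ]
    for _ in range(rounds):
        for attacker in (cand_a, cand_b):
            history = "\n\n".join(f"[{t['speaker']}] {t['text']}" for t in transcript)
            prompt = (
                f"Original question: {question}\n\nDebate so far:\n{history}\n\n"
                "Criticize your opponent's latest answer, improve on it, "
                "and pose one follow-up question."
            )
            transcript.append({"speaker": attacker, "text": chat(attacker, prompt)})
    return transcript


def committee_verdict(chat: Chat, judges: list[str], cand_a: str, cand_b: str,
                      question: str, transcript: list[dict]) -> str:
    """Stage 3 (simplified): each judge reads the battle and votes; majority wins."""
    history = "\n\n".join(f"[{t['speaker']}] {t['text']}" for t in transcript)
    votes = []
    for judge in judges:
        prompt = (
            f"Question: {question}\n\nPeer battle transcript:\n{history}\n\n"
            f"Which candidate performed better overall, '{cand_a}' or '{cand_b}'? "
            "Answer with exactly one of the two names."
        )
        reply = chat(judge, prompt)
        votes.append(cand_a if cand_a.lower() in reply.lower() else cand_b)
    return Counter(votes).most_common(1)[0][0]
```

A fuller version would also pass each judge the other judges' provisional verdicts for a discussion round before the final vote, mirroring the committee stage described above.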

Experimental Findings and Analysis

The paper reports extensive experiments involving 15 recent LLMs. A notable result is the framework's high correlation with human preference data, benchmarked against platforms like Chatbot Arena: Auto-Arena aligns with human evaluations more closely (a 92.14% Spearman correlation) than traditional static benchmarks and prior model-based approaches, indicating its potential as a more accurate reflection of LLM capabilities.
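For concreteness, the kind of rank-correlation figure quoted here is straightforward to compute. The sketch below uses made-up placeholder scores (not the paper's numbers) and SciPy's spearmanr to compare an automatic leaderboard against a human-preference leaderboard.

```python
# Illustrative rank-correlation check between an automatic leaderboard and a
# human-preference leaderboard. Scores are placeholders, not the paper's data.
from scipy.stats import spearmanr

auto_scores = {"model_a": 1102, "model_b": 1045, "model_c": 987, "model_d": 930}
human_elo   = {"model_a": 1250, "model_b": 1180, "model_c": 1120, "model_d": 1015}

models = sorted(auto_scores)  # fixed ordering so both score vectors line up
rho, p_value = spearmanr([auto_scores[m] for m in models],
                         [human_elo[m] for m in models])
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")
```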

The research also highlights how the peer battles themselves improve evaluation reliability, reporting a 46.4% increase in alignment with human preferences after the battles. This supports the hypothesis that an interactive, multi-round evaluation surfaces capability differences that one-off responses mask. Furthermore, committee discussions raised agreement metrics by as much as 20%, reaffirming the value of collaborative judging.
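As a rough illustration of what an agreement metric over committee votes can look like (the votes below are fabricated and the paper's exact metric may differ), one simple choice is the average fraction of judge pairs that pick the same winner per battle:

```python
# Toy inter-judge agreement computation; votes are fabricated examples.
from itertools import combinations

def pairwise_agreement(votes: dict[str, list[str]]) -> float:
    """Average fraction of judge pairs that pick the same winner per battle."""
    judges = list(votes)
    pairs = list(combinations(judges, 2))
    n_battles = len(next(iter(votes.values())))
    per_battle = [
        sum(votes[a][i] == votes[b][i] for a, b in pairs) / len(pairs)
        for i in range(n_battles)
    ]
    return sum(per_battle) / n_battles

# Hypothetical votes from three judges over three battles, before and after discussion.
before = {"judge1": ["A", "B", "A"], "judge2": ["B", "B", "A"], "judge3": ["A", "A", "B"]}
after  = {"judge1": ["A", "B", "A"], "judge2": ["A", "B", "A"], "judge3": ["A", "B", "B"]}
print(f"agreement before discussion: {pairwise_agreement(before):.2f}")
print(f"agreement after discussion:  {pairwise_agreement(after):.2f}")
```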

The paper also examines the scalability and adaptability of Auto-Arena to non-English languages and specific domains, exemplified by an extension to Chinese LLM evaluation. This adaptability positions Auto-Arena as a globally applicable evaluation tool rather than one limited to English-centric benchmarks.

Implications and Future Directions

The Auto-Arena framework represents a significant stride toward autonomous, reliable LLM evaluation. Its architecture addresses current issues with static benchmarks and biased single-model judgments while introducing a scalable system that can readily adapt to assess new models as they are released.

Future research inspired by Auto-Arena could focus on strengthening the evaluative capabilities of LLM judges, for example through ensemble methods or by incorporating domain-specialized LLMs into committee roles. Studying the competitive behavior and in-battle learning that candidates exhibit during peer battles could also inform new training paradigms.

In conclusion, Auto-Arena is a forward-looking approach to evaluating a rapidly evolving landscape of LLMs. By automating evaluation through structured peer interactions and committee assessments, it sets a high bar for the development of robust, responsive, and fair LLM benchmarking tools.

Authors (6)
  1. Ruochen Zhao (15 papers)
  2. Wenxuan Zhang (75 papers)
  3. Yew Ken Chia (24 papers)
  4. Deli Zhao (66 papers)
  5. Lidong Bing (144 papers)
  6. Weiwen Xu (19 papers)