
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

(2306.05685)
Published Jun 9, 2023 in cs.CL and cs.AI

Abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
Figure: Comparison of user dialogues with two AI assistants, assessed by GPT-4 for response quality.

Overview

  • The study introduces two benchmarks, MT-bench and Chatbot Arena, and uses them to test how well strong LLM judges such as GPT-4 agree with human preferences.

  • MT-bench challenges chatbots across various use cases, while Chatbot Arena leverages user interactions to collect data on chatbot performance in real-world scenarios.

  • Strong LLM judges such as GPT-4 agree with human evaluations more than 80% of the time, the same level of agreement observed between humans. The study also proposes strategies that mitigate some, though not all, of the judges' biases.

  • The research encourages the use of LLMs for scalable chatbot evaluations and proposes future developments, including a hybrid evaluation framework and the use of open-source models like Vicuna-13B.

Introduction

LLM-based chat assistants can hold human-like dialogues across a wide range of subjects. Measuring how well they perform remains difficult, especially in open-ended interactions where quality is defined by human preference rather than by a fixed answer key. This study explores automating chatbot evaluation by using strong LLMs such as GPT-4 as judges, and checks how closely their verdicts track human preferences using two new benchmarks: MT-bench and Chatbot Arena.

MT-Bench and Chatbot Arena

MT-Bench

MT-bench is a set of open-ended, multi-turn questions designed to challenge a chatbot's conversational and instruction-following abilities. It provides a structured way to evaluate chat assistants across common use cases, with categories ranging from writing to math, and is meant to test aspects of these models that standard benchmarks do not capture.

Chatbot Arena

Chatbot Arena serves as a crowdsourced evaluation platform where users interact with pairs of anonymous chatbots. Through direct user engagement, it captures a wide spectrum of human preferences in real-world scenarios, thus collecting diverse data on chatbot performance.
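
As a rough illustration of how pairwise votes from a platform like this might be aggregated, the sketch below tallies per-model win rates. The battle records, the model names, and the choice to count a tie as half a win are all assumptions made for the example; this summary does not describe the platform's actual ranking method.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (model shown as A, model shown as B, vote),
# where vote is "a", "b", or "tie".
Battle = Tuple[str, str, str]


def win_rates(battles: Iterable[Battle]) -> Dict[str, float]:
    """Return each model's share of wins, counting a tie as half a win."""
    wins: Dict[str, float] = defaultdict(float)
    games: Dict[str, int] = defaultdict(int)
    for model_a, model_b, vote in battles:
        games[model_a] += 1
        games[model_b] += 1
        if vote == "a":
            wins[model_a] += 1.0
        elif vote == "b":
            wins[model_b] += 1.0
        elif vote == "tie":
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            raise ValueError(f"unexpected vote: {vote!r}")
    return {model: wins[model] / games[model] for model in games}


# Made-up votes, for illustration only.
battles = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
    ("model-y", "model-x", "b"),
]
rates = win_rates(battles)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.2f}")
```

A plain win rate ignores how strong each opponent was, so it is only a starting point for comparing models that faced different opponents.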

LLM as a Judge

The study proposes using state-of-the-art LLMs as judges to assess chat assistant performance. Strong models such as GPT-4 are trained in ways that align them closely with human preferences, which makes them plausible stand-ins for human raters. The study systematically compares this "LLM-as-a-judge" approach against the gold standard of human evaluation and finds that GPT-4 agrees with human evaluations more than 80% of the time, the same level of agreement observed between humans.
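
To make the agreement figure concrete, the sketch below computes a simple agreement rate: the fraction of comparisons on which a judge and a human cast the same vote. This is a simplified definition chosen for illustration, and the vote lists are made up; the paper's exact agreement metric may differ in how it samples raters and handles ties.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of comparisons where the judge and the human cast the same vote."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Made-up votes on five pairwise comparisons ("A", "B", or "tie").
judge = ["A", "B", "tie", "A", "B"]
human = ["A", "B", "A", "A", "B"]
rate = agreement_rate(judge, human)
print(f"agreement: {rate:.0%}")
```

Here the two raters disagree only on the third comparison, so they agree on four of five votes.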

Overcoming Limitations

The study identifies several limitations of the LLM-as-a-judge approach: position bias, verbosity bias, self-enhancement bias, and limited reasoning ability. It proposes strategies that mitigate some of them, including swapping the positions of the two answers, few-shot judging, chain-of-thought judging, and reference-guided judging. With these in place, LLM judges still reach high agreement with human evaluations.
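
The position-swapping strategy can be sketched as follows: query the judge twice with the answer order reversed, and declare a winner only when both passes prefer the same answer. The `judge` callable, its return labels, and the two stub judges are hypothetical stand-ins for a real LLM call, and treating an inconsistent pair of verdicts as a tie is one conservative choice rather than the only possible one.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie", requiring the same verdict in both orders."""
    first_pass = judge(question, answer_a, answer_b)
    second_pass = judge(question, answer_b, answer_a)
    # Translate positional verdicts back to the underlying answers.
    verdict_1 = {"first": "A", "second": "B", "tie": "tie"}[first_pass]
    verdict_2 = {"first": "B", "second": "A", "tie": "tie"}[second_pass]
    return verdict_1 if verdict_1 == verdict_2 else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with extreme position bias: always prefers the first slot."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge that is position-independent: prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


biased_result = judge_with_swap(always_first, "q", "x", "y")
stable_result = judge_with_swap(prefers_longer, "q", "short", "a much longer answer")
print(biased_result, stable_result)
```

The position-biased stub contradicts itself once the order is swapped, so its verdict collapses to a tie, while the position-independent stub yields the same winner in both orders. Note that the second stub deliberately exhibits verbosity bias, which swapping alone does not address.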

Implications and Future Developments

The findings underscore the potential of using LLMs as scalable alternatives to human evaluations, especially where acquiring human preferences is resource-intensive. By releasing the benchmarks and datasets publicly, this research paves the way for future explorations into automated and nuanced chatbot evaluations.

Furthermore, the study suggests a hybrid evaluation framework that combines capability-based benchmarks with preference-based ones, and urges the adoption of LLM-as-a-judge in future LLM benchmarks. Its exploration of a fine-tuned Vicuna-13B, an open-source model, as a judge points toward more cost-effective and accessible evaluation methods for budget-conscious and open-source projects.

Conclusion

This paper studies the use of LLMs as judges of chatbot performance and shows a path to automating and scaling the evaluation process. The proposed benchmarks, MT-bench and Chatbot Arena, together with the analysis of the LLM-as-a-judge approach, mark a significant step forward in chatbot evaluation research. With agreement between strong LLM judges and human preferences exceeding 80%, the study supports wider use of LLM judges while emphasizing that both human-centric and capability-based benchmarks matter in shaping the evolution of chat assistants.
