WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
Abstract
The paper introduces WildBench, an automated evaluation framework designed to benchmark LLMs on diverse and challenging queries sourced from real-world user interactions. WildBench comprises 1,024 tasks meticulously selected from over one million human-chatbot conversation logs, ensuring a realistic and broad spectrum of user queries. The framework uses two evaluation metrics, WB-Reward and WB-Score, which leverage advanced LLMs as judges to provide systematic, interpretable, and reliable automatic assessments of model outputs. Importantly, WildBench demonstrates strong correlations with human-voted Elo ratings from Chatbot Arena, highlighting its efficacy and alignment with human-based evaluation.
Introduction
LLMs have become deeply integrated into numerous real-world applications thanks to their impressive generalization capabilities, yet evaluating their performance reliably and cost-effectively remains a significant challenge. Traditional benchmarks such as MMLU primarily assess reasoning abilities through multiple-choice questions and fail to capture the open-ended nature of real-world user queries. Human evaluations, such as those conducted by Chatbot Arena, offer valuable insights, but they are labor-intensive, cannot deliver results in real time, and lack data transparency.
WildBench addresses these challenges by providing a comprehensive, automated, and dynamic evaluation framework that better reflects the wide range of user queries encountered in real-world settings. This is achieved through meticulous task selection from the WildChat dataset and evaluation strategies based on both pairwise comparison and individual scoring.
Methodology
Data Curation:
The WildBench dataset is curated from the WildChat project, which contains over one million human-chatbot conversations, ensuring a rich diversity of tasks such as writing assistance, coding, and data analysis. Candidate tasks undergo rigorous filtering to maintain quality and relevance, including steps to remove nonsensical and redundant queries. Advanced LLMs then rate task difficulty so that easy tasks can be excluded while preserving a natural distribution of task categories.
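To make the curation step concrete, below is a minimal Python sketch of a filtering pipeline in the spirit of the paper: deduplicate queries, drop near-empty ones, and keep only tasks an LLM-based rater deems sufficiently difficult. The `curate_tasks` function, the `rate_difficulty` callable, and the thresholds are illustrative assumptions, not the authors' actual implementation.

```python
from typing import Callable, Iterable


def curate_tasks(
    conversations: Iterable[dict],
    rate_difficulty: Callable[[str], int],
    min_difficulty: int = 3,
    min_length: int = 20,
) -> list[dict]:
    """Filter raw chat logs into a benchmark task pool.

    `rate_difficulty` stands in for an LLM judge returning an integer
    difficulty (e.g. 1 = trivial, 5 = very hard); the threshold and the
    exact filters are illustrative, not the paper's recipe.
    """
    seen_prompts = set()
    curated = []
    for conv in conversations:
        prompt = conv["user_query"].strip()
        # Drop near-empty or duplicate queries (a crude stand-in for the
        # paper's nonsense/redundancy filtering).
        if len(prompt) < min_length or prompt.lower() in seen_prompts:
            continue
        seen_prompts.add(prompt.lower())
        # Keep only tasks the LLM-based rater deems sufficiently difficult.
        if rate_difficulty(prompt) >= min_difficulty:
            curated.append(conv)
    return curated


# Hypothetical usage with a toy difficulty rater:
# pool = curate_tasks(logs, rate_difficulty=lambda q: 5 if "prove" in q else 2)
```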
Evaluation Metrics:
- WB-Reward: Based on pairwise comparisons against three baseline models of varying strength to ensure robustness. A length-penalty method mitigates the bias toward longer responses.
- WB-Score: An individual scoring method that is faster and more cost-effective than pairwise comparison. Raw scores are rescaled to better differentiate model performance. (A minimal sketch of both metrics follows this list.)
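The sketch below illustrates how the two metrics could be computed. The five-level verdict-to-reward mapping, the length-penalty threshold `k`, and the `(s - 5) * 2` rescaling of WB-Score are assumptions made for illustration; the paper describes the scheme at this level of detail, but the exact constants may differ.

```python
# Assumed reward values and length-penalty threshold; treat as illustrative.
REWARD = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def wb_reward(verdict: str, model_len: int, baseline_len: int, k: int = 500) -> float:
    """Pairwise reward with a simple length penalty: a 'slight' win earned
    by a response more than `k` characters longer than its opponent is
    demoted to a tie."""
    reward = REWARD[verdict]
    if abs(reward) == 0.5:
        winner_is_longer = (
            (reward > 0 and model_len - baseline_len > k)
            or (reward < 0 and baseline_len - model_len > k)
        )
        if winner_is_longer:
            return 0.0
    return reward


def wb_score(raw_scores: list[float]) -> float:
    """Mean of per-task 1-10 judge scores, rescaled so a mediocre score of 5
    maps to 0 (the (s - 5) * 2 adjustment is an assumption)."""
    return sum((s - 5) * 2 for s in raw_scores) / len(raw_scores)
```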
Results and Analysis
The paper reports the performance of various models on WildBench and finds that its evaluations correlate strongly with human judgments from Chatbot Arena. On top-ranking models, WB-Reward achieves a Pearson correlation of 0.98 with Chatbot Arena Elo ratings, while WB-Score reaches 0.95, outperforming other benchmarks such as Arena-Hard and AlpacaEval 2.0.
Interestingly, WildBench reveals that models with similar output lengths can have markedly different performance scores, indicating that response quality, rather than length, is the critical factor in evaluations. This robustness to length bias is a notable strength of the proposed framework.
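As an illustration of how such a correlation can be checked, the small sketch below computes the Pearson correlation between per-model benchmark scores and Chatbot Arena Elo ratings over the models common to both; the input dictionaries are placeholders the reader would fill with real leaderboard data.

```python
from scipy.stats import pearsonr


def benchmark_arena_correlation(
    benchmark_scores: dict[str, float],
    arena_elo: dict[str, float],
) -> float:
    """Pearson correlation between a benchmark metric (e.g. WB-Reward)
    and Chatbot Arena Elo, computed over models present in both."""
    shared = sorted(set(benchmark_scores) & set(arena_elo))
    xs = [benchmark_scores[m] for m in shared]
    ys = [arena_elo[m] for m in shared]
    r, _p_value = pearsonr(xs, ys)
    return r
```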
Implications
WildBench provides a robust, realistic, and dynamic benchmark for evaluating LLMs. Its strong correlation with human evaluations indicates that it can serve as a reliable proxy for human judgment, enabling more cost-effective and scalable assessments. This has significant practical implications, allowing developers to efficiently benchmark and improve LLMs in ways that are aligned with real-world user expectations and needs.
Future Directions
Future work could include:
- Human-in-the-Loop Benchmarking: Integrating human feedback to refine LLM judge decisions.
- Multiple Judges Mixture: Aggregating decisions from multiple LLM judges for a more reliable overall ranking (see the sketch after this list).
- Dynamic Updating Leaderboards: Implementing mechanisms to automatically update the leaderboard with new models and tasks.
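As a rough sketch of the multiple-judges idea, one simple aggregation rule is a majority vote over per-task verdicts, falling back to a tie when judges split evenly; this rule is an assumption, since the paper leaves the aggregation mechanism to future work.

```python
from collections import Counter


def aggregate_verdicts(verdicts: list[str]) -> str:
    """Majority vote over per-judge pairwise verdicts for one task.
    An even split between the top labels falls back to 'tie'; this is
    one possible aggregation rule, not the paper's."""
    counts = Counter(verdicts)
    top_label, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return "tie"
    return top_label
```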
Conclusion
WildBench provides a comprehensive solution for automated LLM evaluation using real-world user queries. With its strong alignment with human judgment and robust mitigation of length bias, it sets a new standard for practical and reliable LLM benchmarking, paving the way for continuous improvements in LLM development and application.