WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
Abstract
The paper introduces WildBench, an automated evaluation framework designed to benchmark LLMs on diverse and challenging queries sourced from real-world user interactions. WildBench comprises 1,024 tasks meticulously selected from over one million human-chatbot conversation logs, ensuring a realistic and broad spectrum of user queries. The framework uses two evaluation metrics, WB-Reward and WB-Score, which leverage advanced LLMs as judges to provide systematic, interpretable, and reliable automatic assessments of model outputs. Importantly, WildBench demonstrates strong correlations with human-voted Elo ratings from Chatbot Arena, highlighting its efficacy and alignment with human-based evaluation.
Introduction
LLMs have become deeply integrated into numerous real-world applications thanks to their impressive generalization capabilities, yet evaluating their performance reliably and cost-effectively remains a significant challenge. Traditional benchmarks such as MMLU primarily assess reasoning abilities through multiple-choice questions and fail to capture the open-ended nature of real-world user queries. Human evaluations, such as those conducted by Chatbot Arena, offer valuable insights, but they are labor-intensive, cannot deliver results in real time, and lack data transparency.
WildBench addresses these challenges by providing a comprehensive, automated, and dynamic evaluation framework that better reflects the wide range of user queries encountered in real-world settings. This is achieved through meticulous task selection from the WildChat dataset and evaluation strategies based on both pairwise comparison and individual scoring.
Methodology
Data Curation:
The WildBench dataset is curated from the WildChat project, which contains over one million human-chatbot conversations, ensuring a rich diversity of tasks such as writing assistance, coding, and data analysis. Candidate tasks undergo rigorous filtering to maintain quality and relevance, including steps to remove nonsensical and redundant queries. Advanced LLMs then rate task difficulty so that easy tasks can be excluded while preserving a natural distribution of task categories.
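To make the curation step concrete, below is a minimal Python sketch of a filtering pipeline in the spirit of the paper: deduplicate queries, drop near-empty ones, and keep only tasks an LLM-based rater deems sufficiently difficult. The `curate_tasks` function, the `rate_difficulty` callable, and the thresholds are illustrative assumptions, not the authors' actual implementation.

```python
from typing import Callable, Iterable


def curate_tasks(
    conversations: Iterable[dict],
    rate_difficulty: Callable[[str], int],
    min_difficulty: int = 3,
    min_length: int = 20,
) -> list[dict]:
    """Filter raw chat logs into a benchmark task pool.

    `rate_difficulty` stands in for an LLM judge returning an integer
    difficulty (e.g. 1 = trivial, 5 = very hard); the threshold and the
    exact filters are illustrative, not the paper's recipe.
    """
    seen_prompts = set()
    curated = []
    for conv in conversations:
        prompt = conv["user_query"].strip()
        # Drop near-empty or duplicate queries (a crude stand-in for the
        # paper's nonsense/redundancy filtering).
        if len(prompt) < min_length or prompt.lower() in seen_prompts:
            continue
        seen_prompts.add(prompt.lower())
        # Keep only tasks the LLM-based rater deems sufficiently difficult.
        if rate_difficulty(prompt) >= min_difficulty:
            curated.append(conv)
    return curated


# Hypothetical usage with a toy difficulty rater:
# pool = curate_tasks(logs, rate_difficulty=lambda q: 5 if "prove" in q else 2)
```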
Evaluation Metrics:
- WB-Reward: Based on pairwise comparisons against three baseline models of varying strength to ensure robustness. A length-penalty method mitigates the bias toward longer responses.
- WB-Score: An individual scoring method that is faster and more cost-effective than pairwise comparison. Raw scores are rescaled to better differentiate model performance. (A minimal sketch of both metrics follows this list.)
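The sketch below illustrates how the two metrics could be computed. The five-level verdict-to-reward mapping, the length-penalty threshold `k`, and the `(s - 5) * 2` rescaling of WB-Score are assumptions made for illustration; the paper describes the scheme at this level of detail, but the exact constants may differ.

```python
# Assumed reward values and length-penalty threshold; treat as illustrative.
REWARD = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def wb_reward(verdict: str, model_len: int, baseline_len: int, k: int = 500) -> float:
    """Pairwise reward with a simple length penalty: a 'slight' win earned
    by a response more than `k` characters longer than its opponent is
    demoted to a tie."""
    reward = REWARD[verdict]
    if abs(reward) == 0.5:
        winner_is_longer = (
            (reward > 0 and model_len - baseline_len > k)
            or (reward < 0 and baseline_len - model_len > k)
        )
        if winner_is_longer:
            return 0.0
    return reward


def wb_score(raw_scores: list[float]) -> float:
    """Mean of per-task 1-10 judge scores, rescaled so a mediocre score of 5
    maps to 0 (the (s - 5) * 2 adjustment is an assumption)."""
    return sum((s - 5) * 2 for s in raw_scores) / len(raw_scores)
```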
Results and Analysis
The paper reports the performance of various models on WildBench and finds that its evaluations correlate strongly with human judgments from Chatbot Arena. On top-ranking models, WB-Reward achieves a Pearson correlation of 0.98 with Chatbot Arena Elo ratings, while WB-Score reaches 0.95, outperforming other benchmarks such as Arena-Hard and AlpacaEval 2.0.
Interestingly, WildBench reveals that models with similar output lengths can have markedly different performance scores, indicating that response quality, rather than length, is the critical factor in evaluations. This robustness to length bias is a notable strength of the proposed framework.
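As an illustration of how such a correlation can be checked, the small sketch below computes the Pearson correlation between per-model benchmark scores and Chatbot Arena Elo ratings over the models common to both; the input dictionaries are placeholders the reader would fill with real leaderboard data.

```python
from scipy.stats import pearsonr


def benchmark_arena_correlation(
    benchmark_scores: dict[str, float],
    arena_elo: dict[str, float],
) -> float:
    """Pearson correlation between a benchmark metric (e.g. WB-Reward)
    and Chatbot Arena Elo, computed over models present in both."""
    shared = sorted(set(benchmark_scores) & set(arena_elo))
    xs = [benchmark_scores[m] for m in shared]
    ys = [arena_elo[m] for m in shared]
    r, _p_value = pearsonr(xs, ys)
    return r
```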
Implications
WildBench provides a robust, realistic, and dynamic benchmark for evaluating LLMs. Its strong correlation with human evaluations indicates that it can serve as a reliable proxy for human judgment, enabling more cost-effective and scalable assessments. This has significant practical implications, allowing developers to efficiently benchmark and improve LLMs in ways that are aligned with real-world user expectations and needs.
Future Directions
Future work could include:
- Human-in-the-Loop Benchmarking: Integrating human feedback to refine LLM judge decisions.
- Multiple Judges Mixture: Aggregating decisions from multiple LLM judges for a more reliable overall ranking (see the sketch after this list).
- Dynamic Updating Leaderboards: Implementing mechanisms to automatically update the leaderboard with new models and tasks.
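As a rough sketch of the multiple-judges idea, one simple aggregation rule is a majority vote over per-task verdicts, falling back to a tie when judges split evenly; this rule is an assumption, since the paper leaves the aggregation mechanism to future work.

```python
from collections import Counter


def aggregate_verdicts(verdicts: list[str]) -> str:
    """Majority vote over per-judge pairwise verdicts for one task.
    An even split between the top labels falls back to 'tie'; this is
    one possible aggregation rule, not the paper's."""
    counts = Counter(verdicts)
    top_label, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return "tie"
    return top_label
```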
Conclusion
WildBench provides a comprehensive solution for automated LLM evaluation using real-world user queries. With its strong alignment with human judgment and robust mitigation of length bias, it sets a new standard for practical and reliable LLM benchmarking, paving the way for continuous improvements in LLM development and application.