From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
The paper "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline" addresses significant challenges in evaluating the capabilities of LLMs amidst their rapid advancement. Existing static benchmarks are inadequate for consistently distinguishing between competitive models and often misalign with user preferences, underscoring the need for a refined approach to model evaluation. This research introduces BenchBuilder, a dynamic pipeline that extracts high-quality benchmarks from crowdsourced data, specifically leveraging real-world user interactions from platforms like Chatbot Arena.
Problem Statement and Objective
The objective is twofold: first, to create an evaluation benchmark that can confidently separate models by performance, and second, to ensure that its rankings align with human preferences. Live crowdsourcing platforms, while rich in natural prompts and feedback, lack quality control, which limits their reliability in distinguishing state-of-the-art models. BenchBuilder was devised to overcome these limitations by filtering for sophisticated, domain-specific prompts, producing benchmarks that remain relevant and challenging.
Methodology
BenchBuilder operates by first identifying seven key qualities indicative of high-quality prompts: specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world application. An LLM annotator then scores each prompt against these criteria, so that only prompts meeting a high standard enter the benchmark. This selection is supported by topic modeling to maintain a diverse distribution of tasks.
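The filtering step can be pictured as a simple scoring loop. The sketch below is illustrative only: the seven criteria come from the paper, but the rubric wording, the pass threshold, and the call_llm helper are assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of BenchBuilder-style prompt filtering (illustrative only).
# The seven criteria are from the paper; the rubric text, scoring scheme,
# threshold, and call_llm() helper are assumptions, not the authors' setup.

CRITERIA = [
    "specificity",
    "domain knowledge",
    "complexity",
    "problem-solving",
    "creativity",
    "technical accuracy",
    "real-world application",
]

RUBRIC = (
    "For the user prompt below, answer Yes/No for each criterion: "
    + ", ".join(CRITERIA)
    + ".\nPrompt:\n{prompt}"
)

def call_llm(annotator_prompt: str) -> list[bool]:
    """Placeholder for an LLM annotator call; returns one boolean per criterion."""
    raise NotImplementedError("wire up your LLM client here")

def quality_score(user_prompt: str) -> int:
    """Number of criteria the prompt satisfies (0-7)."""
    return sum(call_llm(RUBRIC.format(prompt=user_prompt)))

def filter_prompts(prompts: list[str], threshold: int = 6) -> list[str]:
    """Keep prompts that satisfy at least `threshold` of the seven criteria."""
    return [p for p in prompts if quality_score(p) >= threshold]
```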
Further, BenchBuilder pairs the curated prompts with a fully automated LLM judge that assesses model responses, so the benchmark can be refreshed and re-scored without human raters. This automation sharply reduces the cost and complexity of employing human evaluators.
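In this automated setup, each candidate answer is compared against a fixed baseline answer by a strong judge model. The following sketch shows the general shape of such a pairwise-judging loop; the judge template, verdict labels, score mapping, and judge_llm helper are assumptions, not the authors' exact prompt or scoring scheme.

```python
# Hedged sketch of an automated pairwise LLM-judge loop (illustrative only).

JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare assistant A and assistant B on the "
    "user question and output one label: A>>B, A>B, A=B, B>A, or B>>A.\n\n"
    "Question: {question}\n\nAssistant A: {answer_a}\n\nAssistant B: {answer_b}"
)

# Map verdicts to a win score for the candidate model (assistant B here).
VERDICT_SCORE = {"A>>B": 0.0, "A>B": 0.25, "A=B": 0.5, "B>A": 0.75, "B>>A": 1.0}

def judge_llm(judge_prompt: str) -> str:
    """Placeholder for a call to a strong judge model."""
    raise NotImplementedError("wire up your LLM client here")

def score_model(questions, baseline_answers, candidate_answers) -> float:
    """Average win score of the candidate against the baseline."""
    scores = []
    for q, a, b in zip(questions, baseline_answers, candidate_answers):
        verdict = judge_llm(
            JUDGE_TEMPLATE.format(question=q, answer_a=a, answer_b=b)
        )
        scores.append(VERDICT_SCORE.get(verdict.strip(), 0.5))
    return sum(scores) / len(scores)
```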
Key Findings and Results
Applying BenchBuilder to data collected from Chatbot Arena produced Arena-Hard-Auto v0.1, a benchmark of 500 challenging user prompts. Arena-Hard-Auto v0.1 achieves 3x tighter confidence intervals than existing benchmarks such as MT-Bench, alongside an 89.1% agreement rate with human preference rankings, indicating that it can discriminate between strong models and align with user expectations at minimal cost and without human labelers.
The paper further introduces metrics for measuring benchmark effectiveness: Separability with Confidence, Agreement with Confidence Interval, and Pair Rank Brier Score, which together quantify how accurately and confidently a benchmark ranks models.
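To make two of these metrics concrete, the sketch below computes separability as the fraction of model pairs whose bootstrapped score intervals do not overlap, and a simplified pairwise Brier score against human-preference outcomes. Function names, the bootstrap settings, and the exact aggregation are assumptions; the paper's formal definitions give the precise metrics.

```python
# Hedged sketch of two benchmark-quality metrics (illustrative only):
# separability with confidence, and a simplified Pair Rank Brier Score.
import numpy as np

def bootstrap_ci(scores: np.ndarray, n_boot: int = 1000, alpha: float = 0.05):
    """95% bootstrap confidence interval of a model's mean benchmark score."""
    rng = np.random.default_rng(0)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def separability(per_prompt_scores: dict[str, np.ndarray]) -> float:
    """Fraction of model pairs whose confidence intervals do not overlap."""
    names = list(per_prompt_scores)
    cis = {m: bootstrap_ci(per_prompt_scores[m]) for m in names}
    separated, total = 0, 0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            (lo_a, hi_a), (lo_b, hi_b) = cis[names[i]], cis[names[j]]
            separated += int(hi_a < lo_b or hi_b < lo_a)
            total += 1
    return separated / total

def pair_rank_brier(pred_prob_a_wins: np.ndarray, human_a_wins: np.ndarray) -> float:
    """Mean squared error between the benchmark's predicted P(model A beats
    model B) and the human-preference outcome, over all model pairs."""
    return float(np.mean((pred_prob_a_wins - human_a_wins) ** 2))
```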
Implications and Future Directions
The implications of BenchBuilder are substantial both theoretically and practically. The pipeline provides a scalable method to generate high-quality benchmarks, facilitating the continual assessment and improvement of LLMs. It also sets a precedent for other researchers to use live data sources in evaluating and advancing AI models.
Theoretically, the paper suggests a paradigm shift towards dynamic benchmarks that evolve with model capabilities, reducing the risk of overfitting and leakage endemic to static datasets. Practically, this research empowers developers by providing tools that require minimal intervention while significantly enhancing the reliability of model evaluations.
Looking ahead, future research could focus on expanding the scope of BenchBuilder and Arena-Hard by incorporating multi-turn dialogues and non-English queries, further widening its applicability and robustness. Additionally, exploration into combining human and machine evaluations could refine the correlation between LLM judgments and authentic human preferences, enhancing the interpretability and trustworthiness of automated systems.
In summary, this paper makes a valuable contribution to LLM evaluation, proposing a novel and effective approach for building high-quality benchmarks from live user interactions, and addressing the needs of a rapidly evolving AI landscape.