BENCHAGENTS: Automated Benchmark Creation with Agent Interaction (2410.22584v1)

Published 29 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BENCHAGENTS, a framework that methodically leverages LLMs to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. BENCHAGENTS decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.

Authors (5)
  1. Natasha Butt (2 papers)
  2. Varun Chandrasekaran (39 papers)
  3. Neel Joshi (26 papers)
  4. Besmira Nushi (38 papers)
  5. Vidhisha Balachandran (31 papers)

Summary

An Overview of BenchAgents: A Multi-Agent Framework for Automated Benchmark Creation

The paper "BenchAgents: Automated Benchmark Creation with Agent Interaction" contributes to LLM development and evaluation by introducing BenchAgents, a framework that automates benchmark creation through the interactive deployment of multiple LLM agents. The approach addresses the inefficiencies of traditional benchmark creation, which typically relies on slow and costly human annotation.

BenchAgents leverages LLMs in a structured multi-agent system to streamline benchmark generation, particularly for complex generative tasks. The framework partitions benchmark creation into four stages, each managed by a dedicated agent: planning, generation, verification, and evaluation. These agents, designated P-Agent, G-Agent, V-Agent, and E-Agent respectively, interact with one another and incorporate human-in-the-loop feedback from benchmark developers, allowing for dynamic adjustment and quality control.
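To make the four-stage flow concrete, the sketch below shows one way the agents could be chained in Python. The `call_llm` helper, the prompt strings, and the `developer_feedback` hook are illustrative placeholders standing in for the agents and the human-in-the-loop step, not the paper's actual prompts or implementation.

```python
# Minimal sketch of the four-stage BenchAgents flow; call_llm and the prompt
# strings are placeholders for whichever model backs each agent.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the LLM backing an agent."""
    raise NotImplementedError("plug in an LLM client here")

def run_benchagents(task_description: str,
                    developer_feedback: Callable[[str], str]) -> Dict[str, str]:
    # P-Agent: turn the task description into a plan of parameters and constraints.
    plan = call_llm(f"Plan benchmark parameters and constraints for: {task_description}")
    # Human-in-the-loop: the benchmark developer reviews and revises the plan.
    plan = developer_feedback(plan)

    # G-Agent: generate diverse benchmark instances that follow the plan.
    instances = call_llm(f"Generate diverse benchmark prompts following this plan:\n{plan}")

    # V-Agent: verify the generated data and keep only instances that pass.
    verified = call_llm(f"Filter out unclear, infeasible, or inconsistent instances:\n{instances}")

    # E-Agent: define evaluation metrics for the verified benchmark.
    metrics = call_llm(f"Define evaluation metrics for these instances:\n{verified}")

    return {"plan": plan, "data": verified, "metrics": metrics}
```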

The P-Agent initiates the process by generating a high-level plan based on task descriptions and optional seed prompts provided by developers. This plan details the parameters and constraints needed for data generation, offering guidance for subsequent agents. The G-Agent utilizes this plan to programmatically generate diverse benchmark data, while the V-Agent applies a suite of verification checks—spanning clarity, completeness, consistency, feasibility, and complexity—to ensure data quality. Finally, the E-Agent establishes evaluation metrics to judge model performance against the generated benchmarks.
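A hypothetical sketch of a V-Agent-style verification pass follows, assuming each check is posed as a yes/no question to an LLM judge; the check wording paraphrases the criteria named above and is not taken from the paper.

```python
# Hypothetical V-Agent-style verification: each named check is posed to an
# LLM judge as a yes/no question; an instance is kept only if all checks pass.
VERIFICATION_CHECKS = {
    "clarity": "Is the task prompt unambiguous and clearly worded?",
    "completeness": "Does the prompt contain all information needed to attempt it?",
    "consistency": "Are the stated constraints free of contradictions?",
    "feasibility": "Does at least one valid answer satisfy all constraints?",
    "complexity": "Is the task non-trivial for a strong language model?",
}

def verify_instance(instance: str, judge) -> dict:
    """Run every check with an LLM judge; return a per-check pass/fail map."""
    results = {}
    for name, question in VERIFICATION_CHECKS.items():
        reply = judge(f"{question}\n\nTask instance:\n{instance}\nAnswer yes or no.")
        results[name] = reply.strip().lower().startswith("yes")
    return results

def filter_verified(instances, judge):
    """Keep only instances that pass all verification checks."""
    return [x for x in instances if all(verify_instance(x, judge).values())]
```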

Two benchmarks, BA-Calendar and BA-Text, are built with the framework, targeting calendar scheduling and constrained text generation, respectively. Evaluations on these benchmarks expose common failure modes and inconsistencies in constraint satisfaction across state-of-the-art LLMs, with performance degrading notably when models must satisfy multiple constraints at once.
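As an illustration of the constraint-level scoring such benchmarks enable, the sketch below checks a proposed meeting slot for a BA-Calendar-style instance against two assumed constraints (minimum duration and participant availability) and reports both a strict all-constraints score and the fraction satisfied. The constraint set and metric names are assumptions for illustration, not the benchmark's exact definition.

```python
# Illustrative constraint-level scoring for a BA-Calendar-style instance:
# the constraints (minimum duration, everyone available) and metric names
# are assumptions for illustration, not the benchmark's exact definition.
from datetime import time

def minutes(t: time) -> int:
    return t.hour * 60 + t.minute

def check_slot(start: time, end: time,
               availability: dict, required_minutes: int) -> dict:
    """Return a pass/fail map over the assumed constraints for a proposed slot."""
    duration_ok = minutes(end) - minutes(start) >= required_minutes
    fits_everyone = all(
        any(w_start <= start and end <= w_end for w_start, w_end in windows)
        for windows in availability.values()
    )
    return {"meets_duration": duration_ok, "within_all_availabilities": fits_everyone}

# Example: a 10:00-10:30 slot for a 30-minute meeting with two participants.
checks = check_slot(
    time(10, 0), time(10, 30),
    {"alice": [(time(9, 0), time(11, 0))], "bob": [(time(10, 0), time(12, 0))]},
    required_minutes=30,
)
all_constraints_pass = all(checks.values())               # strict score
fraction_satisfied = sum(checks.values()) / len(checks)   # partial-credit score
```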

The implications of BenchAgents are significant for both theoretical exploration and practical application in AI model evaluation. By automating and diversifying benchmark creation, the framework not only alleviates the bottlenecks of manual data curation but also enhances the robustness and scalability of evaluations for emerging LLM capabilities. Researchers can leverage this tool to derive fine-grained insights into model behavior, facilitating advancements in model design and training paradigms.

Furthermore, BenchAgents points toward more agile and adaptable benchmarking procedures that could extend to a broader range of NLP tasks. Future work may deepen the use of multi-agent systems and hybrid approaches that combine LLM capabilities with traditional code execution, as outlined in BenchAgents.

Overall, the paper lays out a well-constructed methodology for benchmark generation, clarifying the intricacies of LLM evaluation and addressing key challenges of quality, diversity, and benchmark adaptability. BenchAgents offers the research community a scalable way to keep pace with the evolving demands of LLM assessment.