BENCHAGENTS: Automated Benchmark Creation with Agent Interaction (2410.22584v1)

Published 29 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BENCHAGENTS, a framework that methodically leverages LLMs to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. BENCHAGENTS decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.

Authors (5)
  1. Natasha Butt (2 papers)
  2. Varun Chandrasekaran (39 papers)
  3. Neel Joshi (26 papers)
  4. Besmira Nushi (38 papers)
  5. Vidhisha Balachandran (31 papers)

Summary

An Overview of BenchAgents: A Multi-Agent Framework for Automated Benchmark Creation

The paper "BenchAgents: Automated Benchmark Creation with Agent Interaction" contributes to LLM development and evaluation by introducing BenchAgents, a framework that automates benchmark creation through the interactive deployment of multiple LLM agents. The approach addresses the inefficiencies of traditional benchmark creation, which typically relies on slow and costly human annotation.

BenchAgents leverages LLMs in a structured multi-agent system to streamline benchmark generation, particularly for complex generative tasks. The framework partitions benchmark creation into four stages, each managed by a dedicated agent: planning, generation, verification, and evaluation. These agents, designated P-Agent, G-Agent, V-Agent, and E-Agent respectively, interact with one another and incorporate human-in-the-loop feedback from benchmark developers, allowing for dynamic adjustment and quality control.
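To make the four-stage flow concrete, the sketch below shows one way the agents could be chained in Python. The `call_llm` helper, the prompt strings, and the `developer_feedback` hook are illustrative placeholders standing in for the agents and the human-in-the-loop step, not the paper's actual prompts or implementation.

```python
# Minimal sketch of the four-stage BenchAgents flow; call_llm and the prompt
# strings are placeholders for whichever model backs each agent.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the LLM backing an agent."""
    raise NotImplementedError("plug in an LLM client here")

def run_benchagents(task_description: str,
                    developer_feedback: Callable[[str], str]) -> Dict[str, str]:
    # P-Agent: turn the task description into a plan of parameters and constraints.
    plan = call_llm(f"Plan benchmark parameters and constraints for: {task_description}")
    # Human-in-the-loop: the benchmark developer reviews and revises the plan.
    plan = developer_feedback(plan)

    # G-Agent: generate diverse benchmark instances that follow the plan.
    instances = call_llm(f"Generate diverse benchmark prompts following this plan:\n{plan}")

    # V-Agent: verify the generated data and keep only instances that pass.
    verified = call_llm(f"Filter out unclear, infeasible, or inconsistent instances:\n{instances}")

    # E-Agent: define evaluation metrics for the verified benchmark.
    metrics = call_llm(f"Define evaluation metrics for these instances:\n{verified}")

    return {"plan": plan, "data": verified, "metrics": metrics}
```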

The P-Agent initiates the process by generating a high-level plan based on task descriptions and optional seed prompts provided by developers. This plan details the parameters and constraints needed for data generation, offering guidance for subsequent agents. The G-Agent utilizes this plan to programmatically generate diverse benchmark data, while the V-Agent applies a suite of verification checks—spanning clarity, completeness, consistency, feasibility, and complexity—to ensure data quality. Finally, the E-Agent establishes evaluation metrics to judge model performance against the generated benchmarks.
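A hypothetical sketch of a V-Agent-style verification pass follows, assuming each check is posed as a yes/no question to an LLM judge; the check wording paraphrases the criteria named above and is not taken from the paper.

```python
# Hypothetical V-Agent-style verification: each named check is posed to an
# LLM judge as a yes/no question; an instance is kept only if all checks pass.
VERIFICATION_CHECKS = {
    "clarity": "Is the task prompt unambiguous and clearly worded?",
    "completeness": "Does the prompt contain all information needed to attempt it?",
    "consistency": "Are the stated constraints free of contradictions?",
    "feasibility": "Does at least one valid answer satisfy all constraints?",
    "complexity": "Is the task non-trivial for a strong language model?",
}

def verify_instance(instance: str, judge) -> dict:
    """Run every check with an LLM judge; return a per-check pass/fail map."""
    results = {}
    for name, question in VERIFICATION_CHECKS.items():
        reply = judge(f"{question}\n\nTask instance:\n{instance}\nAnswer yes or no.")
        results[name] = reply.strip().lower().startswith("yes")
    return results

def filter_verified(instances, judge):
    """Keep only instances that pass all verification checks."""
    return [x for x in instances if all(verify_instance(x, judge).values())]
```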

Two benchmarks, BA-Calendar and BA-Text, are built with the framework, targeting calendar scheduling and constrained text generation, respectively. Evaluations on these benchmarks expose common failure modes and inconsistencies in constraint satisfaction across state-of-the-art LLMs, with performance degrading notably when models must satisfy multiple constraints at once.
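As an illustration of the constraint-level scoring such benchmarks enable, the sketch below checks a proposed meeting slot for a BA-Calendar-style instance against two assumed constraints (minimum duration and participant availability) and reports both a strict all-constraints score and the fraction satisfied. The constraint set and metric names are assumptions for illustration, not the benchmark's exact definition.

```python
# Illustrative constraint-level scoring for a BA-Calendar-style instance:
# the constraints (minimum duration, everyone available) and metric names
# are assumptions for illustration, not the benchmark's exact definition.
from datetime import time

def minutes(t: time) -> int:
    return t.hour * 60 + t.minute

def check_slot(start: time, end: time,
               availability: dict, required_minutes: int) -> dict:
    """Return a pass/fail map over the assumed constraints for a proposed slot."""
    duration_ok = minutes(end) - minutes(start) >= required_minutes
    fits_everyone = all(
        any(w_start <= start and end <= w_end for w_start, w_end in windows)
        for windows in availability.values()
    )
    return {"meets_duration": duration_ok, "within_all_availabilities": fits_everyone}

# Example: a 10:00-10:30 slot for a 30-minute meeting with two participants.
checks = check_slot(
    time(10, 0), time(10, 30),
    {"alice": [(time(9, 0), time(11, 0))], "bob": [(time(10, 0), time(12, 0))]},
    required_minutes=30,
)
all_constraints_pass = all(checks.values())               # strict score
fraction_satisfied = sum(checks.values()) / len(checks)   # partial-credit score
```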

The implications of BenchAgents are significant for both theoretical exploration and practical application in AI model evaluation. By automating and diversifying benchmark creation, the framework not only alleviates the bottlenecks of manual data curation but also enhances the robustness and scalability of evaluations for emerging LLM capabilities. Researchers can leverage this tool to derive fine-grained insights into model behavior, facilitating advancements in model design and training paradigms.

Furthermore, BenchAgents points toward more agile and adaptable benchmarking procedures that could extend to a broader range of NLP tasks. Future work may deepen the use of multi-agent systems and hybrid approaches that combine LLM capabilities with traditional code execution, as outlined in BenchAgents.

Overall, the paper lays out a well-constructed methodology for benchmark generation, clarifying the intricacies of LLM evaluation and addressing key challenges of quality, diversity, and benchmark adaptability. BenchAgents offers the research community a scalable way to keep pace with the evolving demands of LLM assessment.