BENCHAGENTS: Automated Benchmark Creation with Agent Interaction (2410.22584v1)
Abstract: Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BENCHAGENTS, a framework that methodically leverages LLMs to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. BENCHAGENTS decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
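The abstract describes a four-stage agent pipeline (planning, generation, data verification, and evaluation) with human-in-the-loop feedback from the benchmark developer. The sketch below illustrates one way such a pipeline could be wired together against a generic chat-completion callable; every name here (`Benchmark`, `build_benchmark`, the agent functions, the `LLM` type) is a hypothetical illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a four-agent benchmark-creation pipeline
# (planner -> generator -> verifier -> evaluator), not BENCHAGENTS' real API.
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str, str], str]  # (system_prompt, user_prompt) -> completion text


@dataclass
class Benchmark:
    plan: str
    prompts: list[str]
    verified_prompts: list[str]
    metrics: list[str]


def plan_tasks(llm: LLM, capability: str, feedback: str = "") -> str:
    # Planning agent: proposes task parameters and constraints; the benchmark
    # developer's human-in-the-loop feedback is folded into the prompt.
    return llm(
        "You design evaluation benchmarks.",
        f"Capability: {capability}\nDeveloper feedback: {feedback}\n"
        "Propose a diverse set of task parameters and constraints.",
    )


def generate_prompts(llm: LLM, plan: str, n: int) -> list[str]:
    # Generation agent: expands the plan into concrete benchmark prompts.
    raw = llm("You write benchmark prompts.",
              f"Plan:\n{plan}\nWrite {n} prompts, one per line.")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def verify_prompts(llm: LLM, prompts: list[str]) -> list[str]:
    # Verification agent: drops prompts judged ambiguous or unsatisfiable.
    return [
        p for p in prompts
        if llm("You check benchmark data quality.",
               f"Answer yes or no: is this prompt well-posed?\n{p}")
        .lower().startswith("yes")
    ]


def design_metrics(llm: LLM, plan: str) -> list[str]:
    # Evaluation agent: proposes metrics for scoring model outputs.
    raw = llm("You design evaluation metrics.",
              f"Plan:\n{plan}\nList metrics, one per line.")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def build_benchmark(llm: LLM, capability: str, feedback: str = "", n: int = 50) -> Benchmark:
    plan = plan_tasks(llm, capability, feedback)
    prompts = generate_prompts(llm, plan, n)
    return Benchmark(plan, prompts, verify_prompts(llm, prompts), design_metrics(llm, plan))
```

Any function mapping a (system, user) prompt pair to a completion string can be passed as `llm`, which keeps the sketch independent of a specific model provider.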
Authors: Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran