Evaluating LLM Capabilities with BigCodeBench
The paper presents BigCodeBench, a rigorously constructed benchmark designed to evaluate how well large language models (LLMs) handle complex, practical programming tasks. It focuses on two critical capabilities: invoking diverse function calls from domain-specific libraries and following intricate instructions that require compositional reasoning.
Construction and Importance of BigCodeBench
The authors identified a gap in existing benchmarks such as HumanEval and MBPP, which primarily feature short, self-contained algorithmic challenges that models have begun to saturate. These tasks do not adequately reflect the complexity of realistic programming work. BigCodeBench was assembled to bridge this gap: it comprises 1,140 tasks that require function calls from 139 libraries across seven domains. The benchmark requires LLMs to carry out real-world programming tasks drawn from practical settings such as web development and data analysis.
BigCodeBench stands out because it tests not only algorithmic understanding but also how effectively LLMs can integrate and apply external libraries to solve problems, mirroring practical software engineering scenarios. Each task is graded by a rigorous test suite (on average 5.6 test cases per task with 99% branch coverage), so solutions are checked for functional correctness rather than surface-level plausibility.
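To give a concrete sense of the task format, below is a hypothetical, BigCodeBench-style problem written purely for illustration (it is not taken from the benchmark): the solution must compose calls from several Python libraries, and it is graded by unit tests. The entry-point name, task, and test case are assumptions made for this sketch.

```python
# Hypothetical BigCodeBench-style task (illustrative only; not an actual benchmark item).
import csv
import io
import statistics
import unittest
from collections import Counter


def task_func(csv_text, column):
    """Parse CSV text and return (most frequent value, mean value length) for a column,
    composing calls from csv, io, collections, and statistics."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [row[column] for row in rows]
    most_common = Counter(values).most_common(1)[0][0]
    mean_len = statistics.mean(len(v) for v in values)
    return most_common, mean_len


class TestTaskFunc(unittest.TestCase):
    # Benchmark tasks are verified by per-task test suites of roughly this kind.
    def test_basic(self):
        text = "name,city\nAda,Paris\nBob,Paris\nEve,Oslo\n"
        winner, mean_len = task_func(text, "city")
        self.assertEqual(winner, "Paris")
        self.assertAlmostEqual(mean_len, (5 + 5 + 4) / 3)


if __name__ == "__main__":
    unittest.main()
```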
Evaluation of LLMs
The benchmark assesses both instruction-tuned and base LLMs, measuring performance with the unbiased Pass@k metric. A key finding is that even the strongest models remain far from human-level on BigCodeBench, with the best models scoring approximately 60% compared to human performance at 97%. This gap underscores both the difficulty of the tasks and the areas where LLMs still need to improve.
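Pass@k is typically computed with the unbiased estimator introduced alongside HumanEval: generate n samples per task, count the c samples that pass all tests, and estimate the probability that at least one of k samples would pass. The sketch below shows that estimator; the n and c values in the usage example are made up for illustration.

```python
import numpy as np


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: total samples generated for a task, c: samples passing all tests, k: budget.
    Returns the estimated probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# Illustrative values: 200 samples for one task, 35 of them correct.
# The benchmark score averages this quantity over all tasks.
print(pass_at_k(200, 35, 1))   # 0.175
print(pass_at_k(200, 35, 10))  # ~0.86
```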
Observations and Findings
- Model Scaling and Performance: Higher parameter counts generally correlate with better performance, consistent with the scaling behavior observed in LLM training.
- Closed vs. Open Models: Proprietary models like those from OpenAI and Anthropic generally outperform open-source alternatives, with GPT-4o leading in performance.
- Domain-Specific Challenges: LLMs excel in domains such as computation and cryptography but struggle with others like networking, indicating areas where models could benefit from domain-specific tuning.
- Instruction Following: A notable challenge for LLMs is accurately following detailed instructions for complex tasks. Instruction tuning improves performance, but there remains a significant gap to bridge.
Future Work and Implications
The research underscores the necessity for enhanced LLMs with better generalization abilities and improved instruction adherence. The authors propose continuous development of BigCodeBench to include emerging libraries and tasks, and suggest exploring more dynamic environments where LLMs function as agents interacting with different tools and services.
By addressing these gaps, the paper not only highlights the current capabilities and limitations of LLMs in software engineering applications but also lays a foundation for future advances in the field. The introduction of BigCodeBench is poised to guide researchers and developers toward building more robust and versatile LLMs capable of tackling real-world software engineering challenges.