BigCodeBench-Lite Pro Benchmark
- BigCodeBench-Lite Pro is a benchmark designed to assess LLMs on multi-stage, self-invoking code generation tasks by evaluating their ability to reuse helper functions.
- It employs a two-stage process where models first solve a base task and then address a related, more complex problem by invoking the base solution, mirroring real-world code reuse practices.
- Experimental results show a drop in pass@1 accuracy of typically 20–40 percentage points, underscoring challenges in compositional reasoning and orchestration in code synthesis.
BigCodeBench-Lite Pro is a benchmark designed to evaluate LLMs on self-invoking code generation tasks. In contrast to traditional code generation benchmarks, which present single-stage programming problems, BigCodeBench-Lite Pro requires models to first generate a function solving an initial ("base") task and then construct a solution to a related, more complex problem that explicitly necessitates invoking the base solution as a component. The benchmark is positioned as a diagnostic testbed probing models' ability to coordinate and reuse code in multi-stage scenarios, mirroring real-world programming workflows involving helper functions and orchestration. BigCodeBench-Lite Pro is released in conjunction with HumanEval Pro and MBPP Pro as part of a unified research effort to probe and quantify progressive reasoning and compositional capabilities in state-of-the-art LLMs (Yu et al., 2024).
1. Problem Definition and Rationale
BigCodeBench-Lite Pro is defined as a two-stage extension of the BigCodeBench-Lite dataset. For every problem, models first solve a "base" task and then, in a "Pro" task, use that solution to solve a more challenging, semantically linked problem. The explicit invocation of the base solution distinguishes this benchmark from previous datasets, where each problem is atomic and independent.
The central motivation for introducing BigCodeBench-Lite Pro is to bridge the gap between synthetic code generation benchmarks and practical programming, where modular design and code reuse are standard. While existing benchmarks like BigCodeBench-Lite assess correctness on medium-difficulty, stand-alone tasks, the Pro extension tests for the integration and correct invocation of generated helpers under increasing complexity and problem decomposition.
BigCodeBench-Lite Pro leverages the 57 medium-difficulty problems (defined by a 50–70% solve rate on BigCodeBench) from BigCodeBench-Lite, embedding each in a self-invoking, compositional framework that requires multi-stage reasoning and function orchestration (Yu et al., 2024).
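The base/Pro relationship described in this section can be sketched with a toy pair of functions. This is only an illustration of the self-invoking structure: the task, the function names `base_task` and `pro_task`, and the inputs are hypothetical and are not drawn from the benchmark.

```python
# Hypothetical base task: return the sorted unique values of a list.
def base_task(values: list[int]) -> list[int]:
    return sorted(set(values))


# Hypothetical Pro task: deduplicate several groups and merge the results.
# The Pro solution must call the base solution rather than re-implement it.
def pro_task(groups: list[list[int]]) -> list[int]:
    merged: list[int] = []
    for group in groups:
        merged.extend(base_task(group))  # explicit call to the base solution
    return base_task(merged)  # the base solution is reused a second time


if __name__ == "__main__":
    print(pro_task([[3, 1, 3], [2, 1]]))
```

A model that solves `base_task` correctly can still fail `pro_task`, for example by calling the helper with the wrong argument shape, which is the kind of failure the Pro stage is meant to expose.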
2. Construction Methodology
The methodology for constructing BigCodeBench-Lite Pro mirrors that of HumanEval Pro and MBPP Pro. Formally, each base problem is mapped by a deterministic transformation to a Pro problem whose solution calls the base solution, often multiple times.
The construction pipeline proceeds in three key stages:
- Automatic Generation: Deepseek-V2.5 (an open-source LLM) is prompted to propose:
  - a new, harder problem requiring explicit calls to the base solution;
  - a candidate solution implementing the composed logic;
  - representative input sets.
- Execution and Validation: The candidate solution is executed in a sandbox with candidate inputs, capturing output or failures.
- Human Review and Iteration: Task description, candidate solution, and test inputs are manually reviewed and iteratively refined to ensure 100% correctness of the ground-truth solution. All test cases, including edge cases, must pass and maintain semantic ties to the original problem.
Key criteria during construction include strict complexity enhancement (the Pro task must be strictly harder), semantic relevance (the helper invocation must be natural and essential), and rigorous manual vetting.
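The execution-and-validation stage above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the helper name `run_candidate`, the record format, and the toy candidate are all hypothetical, and plain `exec` is used only for brevity. It provides no isolation, so a real sandbox would be needed for untrusted model output.

```python
def run_candidate(source: str, func_name: str, inputs: list[tuple]) -> list[dict]:
    """Execute candidate source, call func_name on each input, record output or failure."""
    namespace: dict = {}
    try:
        # ASSUMPTION: exec stands in for a sandbox here; it does NOT isolate the code.
        exec(source, namespace)
    except Exception as exc:  # e.g. SyntaxError in the generated code
        return [{"args": None, "ok": False, "error": type(exc).__name__}]

    func = namespace.get(func_name)
    if not callable(func):
        return [{"args": None, "ok": False, "error": "NameError"}]

    records = []
    for args in inputs:
        try:
            records.append({"args": args, "ok": True, "output": func(*args)})
        except Exception as exc:
            records.append({"args": args, "ok": False, "error": type(exc).__name__})
    return records


# Hypothetical candidate: the Pro function calls the base function once per row.
CANDIDATE = '''
def base(xs):
    return sum(xs) / len(xs)

def pro(rows):
    return [base(row) for row in rows]
'''

# Two candidate inputs: a normal case and an edge case with an empty row.
RECORDS = run_candidate(CANDIDATE, "pro", [([[1, 2], [3, 5]],), ([[]],)])

if __name__ == "__main__":
    for record in RECORDS:
        print(record)
```

Failures captured this way (here, the empty row triggers a `ZeroDivisionError` in the base function) are the kind of signal a human reviewer would use when refining the task description, solution, or test inputs.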
3. Dataset Content and Exemplars
BigCodeBench-Lite Pro comprises 57 self-invoking tasks, each derived from a unique BigCodeBench-Lite base problem. The underlying problem domains span geometry, array manipulations, graph primitives, and numeric algorithms.
A representative example (BigCodeBench case 355, Voronoi plotting) illustrates the construction approach:
- Base task: Given an $n \times 2$ NumPy array of 2D points, compute and plot its Voronoi diagram. The function returns a pair: the `scipy.spatial.Voronoi` object and the corresponding Matplotlib `Axes`.
- Pro task: Divide the original point set into three equal-size sorted subsets, compute the Voronoi diagram for each via the base function, and overlay all three diagrams in one figure.
The canonical Pro solution explicitly invokes the base function on each subset, then merges the renderings. Test cases are constructed to cover edge cases (three points, nine structured/random points, etc.), and correctness is asserted both by exception safety and by checking invariants (e.g., number of regions, no QhullError).
This task structure is repeated across all instances, ensuring that each requires nontrivial, semantically grounded invocation of the base solution (Yu et al., 2024).
4. Evaluation Protocol
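BigCodeBench-Lite Pro is evaluated primarily via pass@k accuracy, the probability that at least one of $k$ independently generated model outputs passes all test cases for a given problem. Let $n \ge k$ be the number of samples generated per problem and $c$ the number of those samples that pass all tests. For $k = 1$, this simplifies to:
$$\text{pass@}1 = \mathbb{E}_{\text{problems}}\left[\frac{c}{n}\right]$$
For general $k$, the standard unbiased estimator is:
$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$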
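As a numerical illustration of pass@k, the sketch below implements the commonly used unbiased estimator, where `n` is the number of samples drawn for a problem and `c` is the number that pass all tests. The function names and the per-problem `(n, c)` input format are assumptions made for this example. Whether the benchmark's own evaluation script is written this way is not stated in the text.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # With one greedy sample per problem, pass@1 is the fraction of problems solved.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 1)], k=1))
```

With greedy decoding there is a single deterministic sample per problem (`n = 1`), so pass@1 reduces to the share of problems whose one output passes every test.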
Evaluation protocol settings are as follows:
- Open-source models: greedy decoding.
- Proprietary/API models: temperature = 0.2, top_p = 0.95.
- Prompt formats: zero-shot ("Write a Python file ... second problem requires calls to the first"), one-shot (with worked example), and chain-of-thought ("Let's think step by step ...") variants.
The evaluated models include proprietary systems (GPT-4-Turbo, GPT-4o, Claude-3.5-Sonnet, o1-mini) and open-source models (Deepseek-V2.5, DeepseekCoder, Qwen2.5-Coder in base/instruct variants and various sizes, Magicoder, Codestral, OpenCoder, Yi-Coder, LLaMA3 variants), among others.
5. Experimental Results
Empirical results indicate a pronounced performance gap between the original and Pro benchmarks. Table 1 summarizes pass@1 scores across a representative set of models:
| Model | BCB-Lite pass@1 | BCB-Lite-Pro pass@1 |
|---|---|---|
| GPT-4o | 64.9% | 52.6% |
| GPT-4-Turbo | 61.4% | 52.6% |
| Claude-3.5-Sonnet | 73.7% | 50.9% |
| Deepseek-V2.5 | 80.7% | 50.9% |
| Qwen2.5-Coder-1.5B-base | 50.9% | 15.8% |
| Qwen2.5-Coder-1.5B-instruct | 50.9% | 10.5% |
| OpenCoder-8B-base | 56.1% | 10.5% |
| DeepseekCoder-6.7B-base | 59.6% | 35.1% |
| DeepseekCoder-6.7B-instruct | 56.1% | 35.1% |
| Qwen2.5-Coder-7B-instruct | 64.9% | 35.1% |
| DeepseekCoder-33B-instruct | 80.7% | 43.9% |
| Qwen2.5-Coder-32B-instruct | 80.7% | 52.6% |
| Codestral-22B | 78.9% | 54.4% |
Key findings:
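- All evaluated models lose pass@1 accuracy when moving from standard BCB-Lite to the Pro version. In Table 1 the absolute drop ranges from 8.8 percentage points (GPT-4-Turbo) to 45.6 (OpenCoder-8B-base), and for most models it falls between 20 and 40 points.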
- Larger models maintain higher absolute accuracy on Pro tasks, but the proportional drop is consistently sizable.
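- Instruction tuning yields at most marginal improvements for Pro tasks, in contrast to its outsized impact on the original benchmarks. In Table 1, the instruct variants of Qwen2.5-Coder-1.5B and DeepseekCoder-6.7B score no higher on Pro than their base counterparts (10.5% vs. 15.8% and 35.1% vs. 35.1%, respectively).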
This suggests that current SOTA LLMs, including instruction-tuned and large backbone models, struggle more with compositional orchestration and function reuse than with atomic problem-solving in code generation (Yu et al., 2024).
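The size of the gap can be checked directly from Table 1. The sketch below copies the reported pass@1 values and computes each model's absolute drop (in percentage points) and relative drop (as a share of its base score). The names `TABLE_1`, `drop`, and `DROPS` are introduced here for illustration only.

```python
# (BCB-Lite pass@1, BCB-Lite-Pro pass@1) in percent, copied from Table 1.
TABLE_1 = {
    "GPT-4o": (64.9, 52.6),
    "GPT-4-Turbo": (61.4, 52.6),
    "Claude-3.5-Sonnet": (73.7, 50.9),
    "Deepseek-V2.5": (80.7, 50.9),
    "Qwen2.5-Coder-1.5B-base": (50.9, 15.8),
    "Qwen2.5-Coder-1.5B-instruct": (50.9, 10.5),
    "OpenCoder-8B-base": (56.1, 10.5),
    "DeepseekCoder-6.7B-base": (59.6, 35.1),
    "DeepseekCoder-6.7B-instruct": (56.1, 35.1),
    "Qwen2.5-Coder-7B-instruct": (64.9, 35.1),
    "DeepseekCoder-33B-instruct": (80.7, 43.9),
    "Qwen2.5-Coder-32B-instruct": (80.7, 52.6),
    "Codestral-22B": (78.9, 54.4),
}


def drop(base: float, pro: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent of the base score)."""
    absolute = round(base - pro, 1)
    relative = round(100 * (base - pro) / base, 1)
    return absolute, relative


DROPS = {model: drop(base, pro) for model, (base, pro) in TABLE_1.items()}

if __name__ == "__main__":
    # List models from the smallest to the largest absolute drop.
    for model, (absolute, relative) in sorted(DROPS.items(), key=lambda kv: kv[1][0]):
        print(f"{model:30s} -{absolute:4.1f} pts  ({relative:4.1f}% relative)")
```

Sorting by absolute drop makes it easy to see which models sit outside the typical range at either end.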
6. Failure Mode Taxonomy and Analysis
Systematic failure analysis reveals six dominant error classes in generated Pro solutions:
| Error Type | Description |
|---|---|
| AssertionError | Runs, but fails one or more problem-specific assertions |
| NameError | Reference to undefined variable/function (helper misuse) |
| ValueError | Incorrect unpacking, arguments, or values |
| IndexError | Out-of-bounds array/list indexing |
| TypeError | Type mismatch operations |
| OtherError | SyntaxError, ZeroDivisionError, KeyError, etc. |
Primary observations:
- AssertionError is the most frequent, indicating that models synthesize code that executes but fails to achieve correctness under compound requirements.
- NameError ranks second, frequently resulting from misnaming or omitting required helper function invocations.
- ValueError, IndexError, and TypeError collectively highlight persistent difficulties in input reshaping and edge case handling in multi-step invocation contexts.
The occurrence of these error types in BigCodeBench-Lite Pro mirrors those in HumanEval Pro and MBPP Pro, pointing to a generalizable challenge in current LLM-based code synthesis when extended to self-invocation scenarios.
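The taxonomy above can be applied mechanically by running a generated solution against its tests and bucketing the exception that ends the run. The sketch below is one way to do this, not the authors' analysis code: the function name `classify_failure`, the `"Pass"` label, and the choice to match on the exact exception class name (so subclasses such as `KeyError` fall into `OtherError`) are assumptions.

```python
from typing import Callable

# The five named classes from the table; everything else is OtherError.
NAMED_CLASSES = {"AssertionError", "NameError", "ValueError", "IndexError", "TypeError"}


def classify_failure(run: Callable[[], object]) -> str:
    """Run a zero-argument callable and return its taxonomy label."""
    try:
        run()
    except Exception as exc:
        # ASSUMPTION: match on the exact class name, not on inheritance.
        name = type(exc).__name__
        return name if name in NAMED_CLASSES else "OtherError"
    return "Pass"


def bad_unpack() -> None:
    # Unpacking three values into two names raises ValueError.
    first, second = (1, 2, 3)


if __name__ == "__main__":
    print(classify_failure(lambda: undefined_helper()))  # helper never defined
    print(classify_failure(bad_unpack))
    print(classify_failure(lambda: [][1]))
    print(classify_failure(lambda: 1 / 0))
```

Calling a helper under the wrong name, as in the first example, is the pattern the text associates with NameError in self-invoking tasks.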
7. Significance and Research Implications
BigCodeBench-Lite Pro establishes a reproducible, challenge-focused benchmark for diagnosing compositional failure modes in LLMs' program synthesis capabilities. Its automated, LLM-driven construction methodology, underpinned by rigorous execution-based test generation and manual vetting, ensures ground-truth correctness and a precise measure of multi-stage reasoning competence. The drop in pass@1 accuracy, typically 20–40 points across diverse LLM families, underscores self-invoking code generation as a frontier problem in code reasoning research. A plausible implication is the need for architectural and training paradigm innovations targeting composition, code reuse, and orchestration, domains where current LLMs still underperform relative to isolated code generation tasks (Yu et al., 2024).