MBPP Pro: Advanced Code Generation Benchmark
- MBPP Pro is an advanced benchmark that evaluates large language models on compositional code generation, emphasizing self-invocation and function-chaining.
- It employs a two-stage process with a base problem and a self-invoking task, ensuring rigorous assessment through automated generation and human validation.
- Performance tests reveal significant drops in zero-shot evaluations, highlighting current limitations in chaining and code reuse, with modest improvements via 1-shot prompting.
MBPP Pro is an advanced code generation benchmark designed to evaluate LLMs on self-invoking code generation, a task that tests the models’ capabilities in progressive reasoning, problem decomposition, and the correct reuse of generated subroutines. Each MBPP Pro instance extends a standard MBPP problem by adding a more complex “pro” problem that requires reusing the base solution as a subroutine. This design exposes weaknesses in current LLMs’ function-chaining and self-invocation abilities, providing a more rigorous assessment beyond traditional single-function benchmarks (Yu et al., 2024).
1. Definition and Structure of MBPP Pro
MBPP Pro extends the Mostly Basic Python Problems (MBPP) benchmark by introducing a two-stage, self-invoking code generation paradigm for each benchmark item:
- Base Problem: A standard MBPP function-generation task, with a canonical function signature and associated unit tests.
- Self-invoking Problem: A new, more complex task necessitating the invocation of the base solution, typically as a subroutine, often requiring multiple calls.
- Joint Test Suite: A consolidated suite evaluating both the correctness of the base function and its use within the pro function.
MBPP Pro measures not only the ability to produce correct base implementations but also the capacity to use those implementations as callable building blocks in larger compositions. For example, the base problem might require counting the primes up to $n$, while the pro problem demands applying this function to a list of integers, returning a list of counts via repeated calls to the base function.
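The self-invocation requirement can be made concrete with a small static check. The sketch below is an illustration only and is not described as part of the benchmark's tooling: `calls_function` and `SAMPLE_SOURCE` are hypothetical names, and the check only recognises direct calls by name.

```python
import ast

# Hypothetical sample: a base solution plus a pro solution that reuses it.
SAMPLE_SOURCE = '''
def count_primes(n):
    return sum(1 for i in range(2, n + 1)
               if all(i % d for d in range(2, int(i ** 0.5) + 1)))

def count_primes_list(nums):
    return [count_primes(x) for x in nums]
'''


def calls_function(source: str, caller: str, callee: str) -> bool:
    """Return True if the function `caller` contains a direct call to `callee`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == caller:
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id == callee):
                    return True
    return False


print(calls_function(SAMPLE_SOURCE, "count_primes_list", "count_primes"))
```

A check like this only confirms that the base function is invoked somewhere in the pro function; whether it is invoked correctly still has to be established by running the tests.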
2. Dataset Construction Pipeline
The MBPP Pro benchmark is produced through a structured, three-stage pipeline:
```
# Stage 1: generate a self-invoking problem from each base instance
for each (base_problem, base_solution) in MBPP:
    prompt = self_invoking_prompt(base_problem, base_solution)
    new_problem, new_solution, test_inputs = DeepseekV2.5(prompt)
    record (base_problem, new_problem, new_solution, test_inputs)

# Stage 2: execute candidates; repair failing solutions by hand
for each candidate in generated_candidates:
    outputs = run_in_sandbox(candidate.new_solution, candidate.test_inputs)
    if runtime_error(outputs):
        human_fix(candidate.new_solution)

# Stage 3: turn recorded outputs into assertions; review by hand
for each candidate in generated_candidates:
    tests = [f"assert {call} == {out}" for call, out in zip(candidate.calls, outputs)]
    human_review_and_refine(tests)
    save final_problem, final_solution, tests
```
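In LaTeX-style pseudocode:
\begin{algorithmic}[1]
\For{each base instance $(\mathit{base\_problem}, \mathit{base\_solution})$ in MBPP}
\State $(\mathit{new\_problem}, \mathit{new\_solution}, \mathit{test\_inputs}) \gets \mathrm{DeepseekV2.5}(\mathrm{prompt}(\mathit{base\_problem}, \mathit{base\_solution}))$
\State $\mathit{outputs} \gets \mathrm{run\_in\_sandbox}(\mathit{new\_solution}, \mathit{test\_inputs})$
\If{runtime error in $\mathit{outputs}$}
\State $\mathit{new\_solution} \gets \mathrm{human\_fix}(\mathit{new\_solution})$
\EndIf
\State $\mathit{tests} \gets \{\texttt{assert}\ \mathit{call}_i = \mathit{out}_i\}_i$
\State human\_review($\mathit{tests}$)
\State store $(\mathit{base\_problem}, \mathit{new\_problem}, \mathit{new\_solution}, \mathit{tests})$
\EndFor
\end{algorithmic}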
This pipeline supports reproducibility and correctness by combining automated generation (using DeepSeek-V2.5) with subsequent human validation.
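The execution and test-generation stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: `build_assert_tests` and `DEMO_SRC` are hypothetical names, and plain `exec` stands in for a real sandbox, which it is not.

```python
def build_assert_tests(solution_src, func_name, test_inputs):
    """Run a candidate solution on each input and emit `assert` strings.

    Inputs that raise are collected separately so they can be repaired by hand.
    """
    namespace = {}
    # Assumption: exec() replaces the sandbox here; it offers no isolation.
    exec(solution_src, namespace)
    func = namespace[func_name]

    tests, needs_fix = [], []
    for args in test_inputs:
        call = f"{func_name}({', '.join(repr(a) for a in args)})"
        try:
            out = func(*args)
        except Exception as exc:
            needs_fix.append((call, type(exc).__name__))
            continue
        tests.append(f"assert {call} == {out!r}")
    return tests, needs_fix


# Hypothetical candidate solution used only for this demonstration.
DEMO_SRC = "def double_all(xs):\n    return [2 * x for x in xs]\n"
tests, needs_fix = build_assert_tests(
    DEMO_SRC, "double_all", [([1, 2],), ([],), (None,)]
)
print(tests)
print(needs_fix)
```

Keeping failing inputs in a separate list mirrors the split between automated execution and manual repair: nothing is silently dropped, and a reviewer sees exactly which calls raised.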
3. Dataset Statistics and Complexity
MBPP Pro maintains a consistent size with the original MBPP:
| Property | MBPP | MBPP Pro |
|---|---|---|
| Number of problems | 500 | 500 |
| Problems per example | 1 | 2 (base and self-invoking) |
| Test cases per example | 3–5 | 3–5 |
| Mean solution lines | ≈ 7 | ≈ 11–12 |
- Solution Length: Canonical pro solutions increase by approximately 60% in line count relative to base solutions, reflecting the added compositional complexity.
- IO Size: The typical MBPP base test involves small lists or single integers, while MBPP Pro test inputs frequently incorporate nested lists or dictionaries, resulting in roughly a 1.5× increase in mean input size.
4. Representative Example
A canonical MBPP Pro instance illustrates the benchmark’s compositional demand:
Base Problem (ID 087):
- Function: `def count_primes(n: int) -> int:`
- Task: Count primes ≤ n.
Pro Problem:
- “Given a list of integers, return a list of counts of primes up to each integer in the list. You must call your `count_primes` function on each element.”
Canonical Solutions:
```python
def count_primes(n):
    # Base problem: count the primes <= n by trial division.
    def is_prime(k):
        if k < 2:
            return False
        for i in range(2, int(k**0.5) + 1):
            if k % i == 0:
                return False
        return True
    return sum(1 for i in range(2, n + 1) if is_prime(i))


def count_primes_list(nums):
    # Pro problem: reuse the base solution once per element.
    return [count_primes(x) for x in nums]
```
Test Cases:
```python
assert count_primes_list([10, 5, 1]) == [4, 3, 0]
assert count_primes_list([]) == []
assert count_primes_list([2, 3, 11]) == [1, 2, 5]  # primes <= 3 are 2 and 3
```
5. Evaluation Metrics and Protocol
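The primary evaluation metric is pass@k, computed with the same unbiased estimator used for HumanEval and MBPP:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the number of completions sampled per problem and $c$ is the number of correct generations among them. Evaluations are performed in both zero-shot and 1-shot settings. For a subset of models, pass@5 and pass@10 are also reported, using sampling with top_p=0.95.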
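For a single problem with n sampled completions of which c are correct, the standard unbiased pass@k estimate is 1 − C(n−c, k)/C(n, k), averaged over problems. The following sketch computes it with exact integer binomials; the function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


print(pass_at_k(10, 3, 1))
print(mean_pass_at_k([(10, 10), (10, 0)], 1))
```

With k = 1 the estimate reduces to c/n, the fraction of correct samples, which is why pass@1 under greedy decoding (n = 1) is simply the share of problems solved.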
6. Model Performance and Error Analysis
Experimental results demonstrate a pronounced decline from MBPP to MBPP Pro, particularly in zero-shot pass@1:
| Model | MBPP pass@1 | MBPP Pro pass@1 | Absolute Drop |
|---|---|---|---|
| o1-mini | 93.9% | 68.3% | −25.6 pp |
| GPT-4o | 86.8% | 70.9% | −15.9 pp |
| GPT-4-Turbo | 85.7% | 69.3% | −16.4 pp |
| Claude-3.5-sonnet | 91.0% | 66.4% | −24.6 pp |
Across more than 20 models, the average absolute drop is 10–15 percentage points. Even the best models, which achieve >90% on MBPP, typically fall below 75% on MBPP Pro. 1-shot prompting yields a mean improvement of approximately 10 percentage points for MBPP Pro, but absolute performance remains well below original MBPP levels.
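The Absolute Drop column is the difference between the two pass@1 scores in percentage points. A short check, using only the four rows quoted in the table above, reproduces it; `SCORES`, `absolute_drop`, and `DROPS` are illustrative names.

```python
# pass@1 scores in percent, copied from the table: (MBPP, MBPP Pro)
SCORES = {
    "o1-mini": (93.9, 68.3),
    "GPT-4o": (86.8, 70.9),
    "GPT-4-Turbo": (85.7, 69.3),
    "Claude-3.5-sonnet": (91.0, 66.4),
}


def absolute_drop(base: float, pro: float) -> float:
    """Drop in percentage points, rounded to one decimal like the table."""
    return round(base - pro, 1)


DROPS = {model: absolute_drop(*pair) for model, pair in SCORES.items()}

for model, drop in DROPS.items():
    print(f"{model}: -{drop} pp")
```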
7. Failure Modes and Recommendations
MBPP Pro systematically reveals distinct categories of execution failure, summarized below:
| Error Type | Description |
|---|---|
| AssertionError | Code compiles but fails the test suite |
| NameError | Invocation of undefined variable/function |
| ValueError | Incorrect number/type of values for unpacking |
| IndexError | Out-of-bounds list indexing |
| TypeError | Invalid operand types (e.g., string**int) |
| OtherError | KeyError, SyntaxError, ZeroDivisionError, etc. |
Figure 1 in (Yu et al., 2024) shows AssertionErrors are responsible for approximately 50% of MBPP Pro failures, with NameErrors and TypeErrors forming the majority of the remainder. “Chaining errors”—where the base function is correctly solved but incorrectly called—are a failure mode unique to self-invoking tasks.
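The error categories in the table can be assigned mechanically by running a candidate against its tests and inspecting the exception raised. The sketch below is a hypothetical illustration, not the paper's evaluation harness: `classify_run` and the sample strings are assumed names, and `exec` is used without any sandboxing. The `BAD` candidate shows a chaining error, where the base function is correct but the pro function calls a name that does not exist.

```python
# Exception classes tracked individually; everything else is "OtherError".
TRACKED = (AssertionError, NameError, ValueError, IndexError, TypeError)


def classify_run(candidate_src: str, test_src: str) -> str:
    """Execute candidate and tests; return "Pass" or an error category."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # assumption: no sandbox in this sketch
        exec(test_src, namespace)
    except Exception as exc:
        for cls in TRACKED:
            if isinstance(exc, cls):
                return cls.__name__
        return "OtherError"
    return "Pass"


BASE = (
    "def count_primes(n):\n"
    "    return sum(1 for i in range(2, n + 1)\n"
    "               if all(i % d for d in range(2, int(i ** 0.5) + 1)))\n"
)
GOOD = BASE + "def count_primes_list(nums):\n    return [count_primes(x) for x in nums]\n"
# Chaining error: correct base function, wrong name at the call site.
BAD = BASE + "def count_primes_list(nums):\n    return [prime_count(x) for x in nums]\n"
TEST = "assert count_primes_list([10, 5, 1]) == [4, 3, 0]"

print(classify_run(GOOD, TEST))
print(classify_run(BAD, TEST))
```

Using `isinstance` rather than comparing class names means subclasses are grouped with their parent category, for example `UnboundLocalError` under `NameError`.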
The authors report that Chain-of-Thought (CoT) prompting produces modest but consistent reductions in AssertionErrors and NameErrors (~10%), and that current instruction fine-tuning yields only marginal benefit on self-invoking tasks. This suggests that future training data and architectural priors need an explicit focus on “function-chaining” and progressive reasoning. Dedicated fine-tuning curricula that target self-invocation, together with better tracking of previously generated code, are specifically recommended. The release of BigCodeBench-Lite Pro provides evidence that these trends generalize beyond MBPP.
MBPP Pro thus provides a reproducible, challenging framework for evaluating and advancing LLMs in multi-function, compositional code generation scenarios, highlighting the current gap in chaining and code reuse abilities (Yu et al., 2024).