MBPP Pro: Advanced Code Generation Benchmark

Updated 7 June 2026
  • MBPP Pro is an advanced benchmark that evaluates large language models on compositional code generation, emphasizing self-invocation and function-chaining.
  • It employs a two-stage process with a base problem and a self-invoking task, ensuring rigorous assessment through automated generation and human validation.
  • Performance tests reveal significant drops in zero-shot evaluations, highlighting current limitations in chaining and code reuse, with modest improvements via 1-shot prompting.

MBPP Pro is an advanced code generation benchmark designed to evaluate LLMs on self-invoking code generation, a task that tests the models’ capabilities in progressive reasoning, problem decomposition, and the correct reuse of generated subroutines. Each MBPP Pro instance extends a standard MBPP problem by adding a more complex “pro” problem that requires reusing the base solution as a subroutine. This design exposes weaknesses in current LLMs’ function-chaining and self-invocation abilities, providing a more rigorous assessment beyond traditional single-function benchmarks (Yu et al., 2024).

1. Definition and Structure of MBPP Pro

MBPP Pro extends the Mostly Basic Python Problems (MBPP) benchmark by introducing a two-stage, self-invoking code generation paradigm for each benchmark item:

  • Base Problem: A standard MBPP function-generation task, with a canonical function signature and associated unit tests.
  • Self-invoking Problem: A new, more complex task necessitating the invocation of the base solution, typically as a subroutine, often requiring multiple calls.
  • Joint Test Suite: A consolidated suite evaluating both the correctness of the base function and its use within the pro function.

MBPP Pro measures not only the ability to produce correct base implementations but also the capacity to use prior outputs as callable elements in larger compositions. For example, the base problem might require counting primes up to $n$, while the pro problem demands applying this function to a list, returning a list of counts via repeated calls to the base function.
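The three components listed above can be pictured as a single record. The following is a minimal sketch, not the benchmark's actual storage format: the class name `ProInstance`, its field names, and the `check` method are all hypothetical, and `exec` is used without any sandboxing.

```python
from dataclasses import dataclass, field


@dataclass
class ProInstance:
    """Hypothetical container for one item: a base task plus its self-invoking task."""
    base_problem: str    # natural-language statement of the base task
    base_solution: str   # source code of the base function
    pro_problem: str     # statement of the self-invoking task
    pro_solution: str    # source code that calls the base function
    tests: list = field(default_factory=list)  # joint suite of assert statements

    def check(self) -> bool:
        """Load both solutions into one namespace, then run every assert."""
        namespace = {}
        exec(self.base_solution, namespace)
        exec(self.pro_solution, namespace)
        for test in self.tests:
            exec(test, namespace)  # raises AssertionError if a test fails
        return True


example = ProInstance(
    base_problem="Count primes <= n.",
    base_solution=(
        "def count_primes(n):\n"
        "    return sum(all(k % d for d in range(2, int(k ** 0.5) + 1))\n"
        "               for k in range(2, n + 1))\n"
    ),
    pro_problem="Return the prime count for each integer in a list.",
    pro_solution=(
        "def count_primes_list(nums):\n"
        "    return [count_primes(x) for x in nums]\n"
    ),
    tests=[
        "assert count_primes(10) == 4",
        "assert count_primes_list([10, 5, 1]) == [4, 3, 0]",
    ],
)
```

Because both solutions share one namespace, the joint suite can test the base function directly and through the pro function in the same run.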

2. Dataset Construction Pipeline

The MBPP Pro benchmark is produced through a structured, three-stage pipeline:

# Stage 1: generate a self-invoking candidate for every base problem
for each (base_problem, base_solution) in MBPP:
    prompt = self_invoking_prompt(base_problem, base_solution)
    new_problem, new_solution, test_inputs = DeepseekV2.5(prompt)
    record (base_problem, new_problem, new_solution, test_inputs) in generated_candidates

# Stage 2: execute each candidate solution; humans repair the ones that crash
for each candidate in generated_candidates:
    candidate.outputs = run_in_sandbox(candidate.new_solution, candidate.test_inputs)
    if runtime_error(candidate.outputs):
        human_fix(candidate.new_solution)

# Stage 3: turn the recorded input/output pairs into assert-based tests
for each candidate in generated_candidates:
    tests = [f"assert {call} == {out}" for call, out in zip(candidate.test_inputs, candidate.outputs)]
    human_review_and_refine(tests)
    save final_problem, final_solution, tests

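In LaTeX-style pseudocode:

\begin{algorithmic}[1]
\For{each base instance $(\mathit{base\_problem}, \mathit{base\_solution})$ in MBPP}
  \State $\mathit{prompt} \gets \mathrm{self\_invoking\_prompt}(\mathit{base\_problem}, \mathit{base\_solution})$
  \State $(\mathit{new\_problem}, \mathit{new\_solution}, \mathit{test\_inputs}) \gets \mathrm{DeepseekV2.5}(\mathit{prompt})$
  \If{runtime error in $\mathrm{run\_in\_sandbox}(\mathit{new\_solution}, \mathit{test\_inputs})$}
    \State $\mathrm{human\_fix}(\mathit{new\_solution})$
  \EndIf
  \State $\mathit{tests} \gets$ assert statements built from $\mathit{test\_inputs}$ and the sandbox outputs
  \State human\_review($\mathit{tests}$)
  \State store $(\mathit{new\_problem}, \mathit{new\_solution}, \mathit{tests})$
\EndFor
\end{algorithmic}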

This pipeline supports reproducibility and correctness by combining automated generation (using DeepSeek-V2.5) with subsequent human validation.
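The final stage of the pipeline, deriving assert statements from a reference solution's own outputs, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' tooling: `build_assert_tests` and the `square`/`sum_of_squares` functions are hypothetical, and the sandbox the source describes is replaced by plain `exec`/`eval`, which should never be used on untrusted code.

```python
def build_assert_tests(solution_src, calls):
    """Execute a reference solution and turn each call string into an assert line.

    Assumption: each entry of `calls` is a Python expression such as "f([1, 2])".
    No sandboxing is performed here.
    """
    namespace = {}
    exec(solution_src, namespace)          # define the reference functions
    tests = []
    for call in calls:
        expected = eval(call, namespace)   # run the reference on one input
        tests.append(f"assert {call} == {expected!r}")
    return tests


reference = '''
def square(x):
    return x * x

def sum_of_squares(nums):
    # self-invoking: reuses square() for every element
    return sum(square(n) for n in nums)
'''

generated = build_assert_tests(
    reference, ["sum_of_squares([1, 2, 3])", "sum_of_squares([])"]
)
```

Tests produced this way only record what the reference solution does, which is why the pipeline follows generation with a human review step.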

3. Dataset Statistics and Complexity

MBPP Pro maintains a consistent size with the original MBPP:

| Property | MBPP | MBPP Pro |
| --- | --- | --- |
| Number of problems | 500 | 500 |
| Problems per example | 1 | 2 (base and self-invoking) |
| Test cases per example | 3–5 | 3–5 |
| Mean solution lines | ≈ 7 | ≈ 11–12 |

  • Solution Length: Canonical pro solutions increase by approximately 60% in line count relative to base solutions, reflecting the added compositional complexity.
  • IO Size: The typical MBPP base test involves small lists or single integers, while MBPP Pro test inputs frequently incorporate nested lists or dictionaries, resulting in roughly a 1.5× increase in mean input size.

4. Representative Example

A canonical MBPP Pro instance illustrates the benchmark’s compositional demand:

Base Problem (ID 087):

  • Function: def count_primes(n: int) -> int:
  • Task: Count primes ≤ n.

Pro Problem:

  • “Given a list of integers, return a list of counts of primes up to each integer in the list. You must call your count_primes function on each element.”

Canonical Solutions:

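def count_primes(n):
    # Base problem: count the primes <= n by trial division.
    def is_prime(k):
        if k < 2:
            return False
        for i in range(2, int(k ** 0.5) + 1):
            if k % i == 0:
                return False
        return True

    return sum(1 for i in range(2, n + 1) if is_prime(i))


def count_primes_list(nums):
    # Pro problem: reuse count_primes for every element of the input list.
    return [count_primes(x) for x in nums]

Tests: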

assert count_primes_list([10, 5, 1]) == [4, 3, 0]
assert count_primes_list([]) == []
assert count_primes_list([2, 3, 11]) == [1, 2, 5]

5. Evaluation Metrics and Protocol

The primary evaluation metric is the “pass@k” formula, consistent with HumanEval and MBPP:

$$\mathrm{pass}@k = 1 - \prod_{i=0}^{k-1}\frac{n-c-i}{n-i}$$

where $n$ is the number of completions and $c$ is the number of correct generations. Evaluations are performed in both zero-shot and 1-shot settings. For a subset of models, pass@5 and pass@10 are reported ($n = 20$, $T = 0.2$, top_p = 0.95).
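The product form of the estimator translates directly into code. The function below is a minimal sketch of that formula; the name `pass_at_k` and the input validation are choices made here, not taken from the benchmark's evaluation harness.

```python
def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled completions, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: any k draws include a correct one
    prob_all_fail = 1.0
    for i in range(k):
        # probability that the (i+1)-th draw is also incorrect, without replacement
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail
```

With $n = 20$ samples and $c = 5$ correct, `pass_at_k(20, 5, 1)` reduces to $c/n = 0.25$, and larger $k$ can only raise the estimate.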

6. Model Performance and Error Analysis

Experimental results demonstrate a pronounced decline from MBPP to MBPP Pro, particularly in zero-shot pass@1:

| Model | MBPP pass@1 | MBPP Pro pass@1 | Absolute Drop |
| --- | --- | --- | --- |
| o1-mini | 93.9% | 68.3% | −25.6 pp |
| GPT-4o | 86.8% | 70.9% | −15.9 pp |
| GPT-4-Turbo | 85.7% | 69.3% | −16.4 pp |
| Claude-3.5-sonnet | 91.0% | 66.4% | −24.6 pp |

Across more than 20 models, the average absolute drop is 10–15 percentage points. Even the best models, which achieve >90% on MBPP, typically fall below 75% on MBPP Pro. 1-shot prompting yields a mean improvement of approximately 10 percentage points for MBPP Pro, but absolute performance remains well below original MBPP levels.
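The "Absolute Drop" column is simply the MBPP Pro score minus the MBPP score, expressed in percentage points. As a small worked check, the snippet below recomputes it for the four models listed above; the variable names are illustrative only.

```python
# (MBPP pass@1, MBPP Pro pass@1) in percent, copied from the table above
scores = {
    "o1-mini": (93.9, 68.3),
    "GPT-4o": (86.8, 70.9),
    "GPT-4-Turbo": (85.7, 69.3),
    "Claude-3.5-sonnet": (91.0, 66.4),
}

# absolute drop in percentage points (negative means MBPP Pro is lower)
drops = {model: round(pro - base, 1) for model, (base, pro) in scores.items()}

for model, drop in drops.items():
    print(f"{model}: {drop:+.1f} pp")
```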

7. Failure Modes and Recommendations

MBPP Pro systematically reveals distinct categories of execution failure, summarized below:

| Error Type | Description |
| --- | --- |
| AssertionError | Code compiles but fails the test suite |
| NameError | Invocation of undefined variable/function |
| ValueError | Incorrect number/type of values for unpacking |
| IndexError | Out-of-bounds list indexing |
| TypeError | Invalid operand types (e.g., string**int) |
| OtherError | KeyError, SyntaxError, ZeroDivisionError, etc. |

Figure 1 of Yu et al. (2024) shows that AssertionErrors account for approximately 50% of MBPP Pro failures, with NameErrors and TypeErrors making up most of the remainder. “Chaining errors”, in which the base function is solved correctly but called incorrectly, are a failure mode unique to self-invoking tasks.
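The error buckets above can be reproduced with a small harness that runs a candidate against its tests and records the first exception raised. This is a sketch under assumptions: `classify_failure` is a hypothetical helper, not the authors' evaluation code, and it runs candidates with plain `exec` instead of a sandbox. The example also shows a chaining error in which the base function is correct but the pro function calls a misspelled name.

```python
BUCKETS = (AssertionError, NameError, ValueError, IndexError, TypeError)


def classify_failure(candidate_src, tests):
    """Return 'Pass' or the name of the error bucket for the first failure."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the candidate's functions
        for test in tests:
            exec(test, namespace)        # run one assert statement
    except Exception as err:
        for bucket in BUCKETS:
            if isinstance(err, bucket):
                return bucket.__name__
        return "OtherError"              # KeyError, SyntaxError, ZeroDivisionError, ...
    return "Pass"


base = (
    "def count_primes(n):\n"
    "    return sum(all(k % d for d in range(2, int(k ** 0.5) + 1))\n"
    "               for k in range(2, n + 1))\n"
)
good_pro = base + "def count_primes_list(nums):\n    return [count_primes(x) for x in nums]\n"
# chaining error: the base function is fine, but it is called under the wrong name
bad_call = base + "def count_primes_list(nums):\n    return [count_prime(x) for x in nums]\n"
# logic error: the call is valid, but the result is off by one
bad_logic = base + "def count_primes_list(nums):\n    return [count_primes(x) + 1 for x in nums]\n"

tests = ["assert count_primes_list([10, 5, 1]) == [4, 3, 0]"]
```

Here `good_pro` passes, `bad_call` lands in the NameError bucket, and `bad_logic` lands in the AssertionError bucket.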

The authors report that Chain-of-Thought (CoT) prompting produces modest, consistent reductions in AssertionErrors and NameErrors (~10%), and that current instruction fine-tuning yields only marginal benefit for self-invoking tasks. This suggests that future training data and architectural priors should explicitly target “function-chaining” and progressive reasoning. Dedicated fine-tuning curricula targeting self-invocation and better tracking of previously generated code are specifically recommended. The release of BigCodeBench-Lite Pro provides evidence that these trends generalize beyond MBPP.

MBPP Pro thus provides a reproducible, challenging framework for evaluating and advancing LLMs in multi-function, compositional code generation scenarios, highlighting the current gap in chaining and code reuse abilities (Yu et al., 2024).
