MBPP Pro: Advanced Code Generation Benchmark

Updated 7 June 2026

MBPP Pro is an advanced benchmark that evaluates large language models on compositional code generation, emphasizing self-invocation and function-chaining.
It employs a two-stage process with a base problem and a self-invoking task, ensuring rigorous assessment through automated generation and human validation.
Performance tests reveal significant drops in zero-shot evaluations, highlighting current limitations in chaining and code reuse, with modest improvements via 1-shot prompting.

MBPP Pro is an advanced code generation benchmark designed to evaluate LLMs on self-invoking code generation, a task that tests the models’ capabilities in progressive reasoning, problem decomposition, and the correct reuse of generated subroutines. Each MBPP Pro instance extends a standard MBPP problem by adding a more complex “pro” problem that requires reusing the base solution as a subroutine. This design exposes weaknesses in current LLMs’ function-chaining and self-invocation abilities, providing a more rigorous assessment beyond traditional single-function benchmarks (Yu et al., 2024).

1. Definition and Structure of MBPP Pro

MBPP Pro extends the Mostly Basic Python Problems (MBPP) benchmark with a two-stage, self-invoking code generation paradigm. Each benchmark item has three components:

Base Problem: A standard MBPP function-generation task, with a canonical function signature and associated unit tests.
Self-invoking Problem: A new, more complex task necessitating the invocation of the base solution, typically as a subroutine, often requiring multiple calls.
Joint Test Suite: A consolidated suite evaluating both the correctness of the base function and its use within the pro function.

MBPP Pro measures not only the ability to produce correct base implementations but also the capacity to reuse them as callable components in larger compositions. For example, the base problem might require counting primes up to $n$, while the pro problem applies this function to a list and returns a list of counts via repeated calls to the base function.
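One way to picture an item and its joint test suite is the minimal sketch below. The `ProInstance` container, its field names, and the `square` / `sum_of_squares` pair are all hypothetical illustrations, not the benchmark's actual schema or data. Plain `exec` stands in for a proper sandbox.

```python
from dataclasses import dataclass, field


@dataclass
class ProInstance:
    """Hypothetical container for one benchmark item (field names are assumptions)."""
    base_solution: str                               # source code solving the base problem
    pro_solution: str                                # source code that calls the base function
    tests: list[str] = field(default_factory=list)   # assert statements for both functions


def passes_joint_suite(instance: ProInstance) -> bool:
    """Define both functions in one namespace, then execute every assert."""
    namespace: dict = {}
    try:
        exec(instance.base_solution, namespace)
        exec(instance.pro_solution, namespace)
        for test in instance.tests:
            exec(test, namespace)
    except Exception:
        # Any error, in the base function, the pro function, or a test, fails the item.
        return False
    return True


# Illustrative item, not taken from MBPP.
example = ProInstance(
    base_solution="def square(x):\n    return x * x\n",
    pro_solution="def sum_of_squares(xs):\n    return sum(square(x) for x in xs)\n",
    tests=[
        "assert square(3) == 9",
        "assert sum_of_squares([1, 2, 3]) == 14",
    ],
)
print(passes_joint_suite(example))  # True
```

Because both solutions share one namespace, the pro function can only succeed if the base function it calls is defined and correct, which is the dependency the joint suite is meant to check.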

2. Dataset Construction Pipeline

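The MBPP Pro benchmark is produced through a structured, three-stage pipeline: candidate generation with an LLM, sandboxed execution with human repair of failing solutions, and test construction with human review.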

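# Stage 1: generate a self-invoking candidate for each base problem
generated_candidates = []
for each (base_problem, base_solution) in MBPP:
    prompt = self_invoking_prompt(base_problem, base_solution)
    new_problem, new_solution, test_inputs = DeepseekV2.5(prompt)
    append (base_problem, new_problem, new_solution, test_inputs) to generated_candidates

# Stage 2: execute each candidate solution in a sandbox; humans repair failures
for each candidate in generated_candidates:
    candidate.outputs = run_in_sandbox(candidate.new_solution, candidate.test_inputs)
    if runtime_error(candidate.outputs):
        human_fix(candidate.new_solution)

# Stage 3: freeze the observed outputs into assert-style tests, then review
for each candidate in generated_candidates:
    tests = [f"assert {call} == {out}" for call, out in zip(candidate.test_inputs, candidate.outputs)]
    human_review_and_refine(tests)
    save (candidate.new_problem, candidate.new_solution, tests)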

This pipeline supports reproducibility and correctness by combining automated generation (using DeepSeek-V2.5) with subsequent human validation.
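The test-construction step of the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: `build_assert_tests` is a hypothetical helper name, plain `exec`/`eval` stands in for the sandbox (a real pipeline would isolate execution in a separate process with time and memory limits), and the `double` / `double_all` functions are made-up examples rather than benchmark data.

```python
def build_assert_tests(solution_src: str, calls: list[str]) -> list[str]:
    """Run each call against the solution and freeze its output into an assert."""
    namespace: dict = {}
    exec(solution_src, namespace)            # define the candidate functions
    tests = []
    for call in calls:
        try:
            result = eval(call, namespace)   # execute one test input
        except Exception as exc:
            # In the described pipeline, a runtime error is routed to a human for repair.
            raise RuntimeError(f"{call} raised {type(exc).__name__}") from exc
        # repr() keeps strings quoted so the generated assert is valid Python.
        tests.append(f"assert {call} == {result!r}")
    return tests


# Illustrative solution, not taken from MBPP.
solution = (
    "def double(x):\n"
    "    return 2 * x\n"
    "def double_all(xs):\n"
    "    return [double(x) for x in xs]\n"
)
for line in build_assert_tests(solution, ["double_all([1, 2])", "double_all([])"]):
    print(line)
# assert double_all([1, 2]) == [2, 4]
# assert double_all([]) == []
```

Tests built this way only record what the solution does, so they inherit any bug in it; that is presumably why the pipeline adds a human review pass before the tests are saved.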

3. Dataset Statistics and Complexity

MBPP Pro maintains a consistent size with the original MBPP:

Property	MBPP	MBPP Pro
Number of problems	500	500
Problems per example	1	2 (base and self-invoking)
Test cases per example	3–5	3–5
Mean solution lines	≈ 7	≈ 11–12

Solution Length: Canonical pro solutions increase by approximately 60% in line count relative to base solutions, reflecting the added compositional complexity.
IO Size: The typical MBPP base test involves small lists or single integers, while MBPP Pro test inputs frequently incorporate nested lists or dictionaries, resulting in roughly a 1.5× increase in mean input size.

4. Representative Example

A canonical MBPP Pro instance illustrates the benchmark’s compositional demand:

Base Problem (ID 087):

Function: def count_primes(n: int) -> int:
Task: Count primes ≤ n.

Pro Problem:

“Given a list of integers, return a list of counts of primes up to each integer in the list. You must call your count_primes function on each element.”

Canonical Solutions:

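def count_primes(n: int) -> int:
    """Count the primes less than or equal to n."""
    def is_prime(k: int) -> bool:
        if k < 2:
            return False
        # Trial division up to the integer square root of k.
        for i in range(2, int(k ** 0.5) + 1):
            if k % i == 0:
                return False
        return True
    return sum(1 for i in range(2, n + 1) if is_prime(i))


def count_primes_list(nums: list[int]) -> list[int]:
    """Apply count_primes to each element (the self-invoking step)."""
    return [count_primes(x) for x in nums]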

Tests:

assert count_primes_list([10, 5, 1]) == [4, 3, 0]
assert count_primes_list([]) == []
assert count_primes_list([2, 3, 11]) == [1, 2, 5]

5. Evaluation Metrics and Protocol

The primary evaluation metric is pass@k, computed with the same unbiased estimator used for HumanEval and MBPP:

$\mathrm{pass}@k = 1 - \prod_{i=0}^{k-1}\frac{n-c-i}{n-i}$

where $n$ is the number of completions sampled per problem and $c$ is the number of those completions that pass all tests. Evaluations are performed in both zero-shot and 1-shot settings. For a subset of models, pass@5 and pass@10 are also reported ($n=20$, $T=0.2$, top_p=0.95).
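The product form above translates directly into code. The sketch below is a minimal illustration (the function name `pass_at_k` is an assumption) and cross-checks the product against the equivalent combinatorial form $1 - \binom{n-c}{k} / \binom{n}{k}$.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Product form: 1 - prod_{i=0}^{k-1} (n - c - i) / (n - i)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail


# With n = 20 samples of which c = 5 are correct:
print(pass_at_k(20, 5, 1))   # 0.25, i.e. c / n when k = 1
print(pass_at_k(20, 5, 5))
print(pass_at_k(20, 5, 10))

# Cross-check against the combinatorial form of the same estimator.
closed_form = 1.0 - comb(20 - 5, 5) / comb(20, 5)
assert abs(pass_at_k(20, 5, 5) - closed_form) < 1e-12
```

For $k=1$ the estimator reduces to $c/n$, the fraction of sampled completions that are correct.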

6. Model Performance and Error Analysis

Experimental results demonstrate a pronounced decline from MBPP to MBPP Pro, particularly in zero-shot pass@1:

Model	MBPP pass@1	MBPP Pro pass@1	Absolute Drop
o1-mini	93.9%	68.3%	−25.6 pp
GPT-4o	86.8%	70.9%	−15.9 pp
GPT-4-Turbo	85.7%	69.3%	−16.4 pp
Claude-3.5-sonnet	91.0%	66.4%	−24.6 pp

Across more than 20 models, the average absolute drop is 10–15 percentage points. Even the best models, which achieve >90% on MBPP, typically fall below 75% on MBPP Pro. 1-shot prompting yields a mean improvement of approximately 10 percentage points for MBPP Pro, but absolute performance remains well below original MBPP levels.
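The "Absolute Drop" column is a simple difference in percentage points. The short sketch below recomputes it from the four rows of the table above; the dictionary layout and the helper name `absolute_drop` are illustrative assumptions.

```python
# Zero-shot pass@1 scores (%) copied from the table: (MBPP, MBPP Pro).
scores = {
    "o1-mini": (93.9, 68.3),
    "GPT-4o": (86.8, 70.9),
    "GPT-4-Turbo": (85.7, 69.3),
    "Claude-3.5-sonnet": (91.0, 66.4),
}


def absolute_drop(mbpp: float, mbpp_pro: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(mbpp - mbpp_pro, 1)


drops = {model: absolute_drop(*pair) for model, pair in scores.items()}
for model, drop in drops.items():
    print(f"{model}: -{drop} pp")

mean_drop = round(sum(drops.values()) / len(drops), 1)
print(f"mean over these four rows: -{mean_drop} pp")  # -20.6 pp
```

The mean over these four rows (20.6 pp) is larger than the 10–15 pp average quoted for the full set of more than 20 models, so the rows shown appear to be among the larger drops.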

7. Failure Modes and Recommendations

MBPP Pro systematically reveals distinct categories of execution failure, summarized below:

Error Type	Description
AssertionError	Code compiles but fails the test suite
NameError	Invocation of undefined variable/function
ValueError	Incorrect number/type of values for unpacking
IndexError	Out-of-bounds list indexing
TypeError	Invalid operand types (e.g., string**int)
OtherError	KeyError, SyntaxError, ZeroDivisionError, etc.

Figure 1 of Yu et al. (2024) shows that AssertionErrors account for approximately 50% of MBPP Pro failures, with NameErrors and TypeErrors making up most of the remainder. “Chaining errors”, in which the base function is solved correctly but called incorrectly, are a failure mode unique to self-invoking tasks.
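A harness that buckets failures into the categories in the table could look like the minimal sketch below. It is an illustration rather than the authors' evaluation code: `classify_outcome` is a hypothetical name, plain `exec` stands in for a sandbox, and the `double` / `double_all` snippets are made-up examples of a correct solution, a logic error, and two chaining errors.

```python
from collections import Counter

# Exception classes tracked individually; everything else becomes "OtherError".
TRACKED = (AssertionError, NameError, ValueError, IndexError, TypeError)


def classify_outcome(generated_src: str, tests: list[str]) -> str:
    """Return 'Pass' or the category of the first error raised."""
    namespace: dict = {}
    try:
        exec(generated_src, namespace)   # SyntaxError surfaces here
        for test in tests:
            exec(test, namespace)
    except Exception as exc:
        for cls in TRACKED:
            if isinstance(exc, cls):
                return cls.__name__
        return "OtherError"
    return "Pass"


BASE = "def double(x):\n    return 2 * x\n"
candidates = {
    "correct": BASE + "def double_all(xs):\n    return [double(x) for x in xs]\n",
    "wrong logic": BASE + "def double_all(xs):\n    return [double(x) + 1 for x in xs]\n",
    # Chaining errors: the base function is fine but is called incorrectly.
    "misspelled call": BASE + "def double_all(xs):\n    return [duoble(x) for x in xs]\n",
    "wrong arity": BASE + "def double_all(xs):\n    return [double(x, 2) for x in xs]\n",
}
tests = ["assert double(2) == 4", "assert double_all([1, 2]) == [2, 4]"]

outcomes = {name: classify_outcome(src, tests) for name, src in candidates.items()}
for name, outcome in outcomes.items():
    print(f"{name}: {outcome}")
# correct: Pass
# wrong logic: AssertionError
# misspelled call: NameError
# wrong arity: TypeError
print(Counter(outcomes.values()))
```

In the last two candidates the base test `double(2) == 4` passes and only the pro function fails, which illustrates how a chaining error can appear as a NameError or TypeError rather than an AssertionError.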

The authors report that Chain-of-Thought (CoT) prompting produces modest but consistent reductions in AssertionErrors and NameErrors (~10%), and that current instruction fine-tuning yields only marginal benefit on self-invoking tasks. This suggests that future training data and architectural priors should focus explicitly on “function-chaining” and progressive reasoning. The authors specifically recommend dedicated fine-tuning curricula that target self-invocation, along with better tracking of previously generated code. The release of BigCodeBench-Lite Pro provides evidence that these trends generalize beyond MBPP.

MBPP Pro thus provides a reproducible, challenging framework for evaluating and advancing LLMs in multi-function, compositional code generation scenarios, highlighting the current gap in chaining and code reuse abilities (Yu et al., 2024).

References

Yu et al. (2024). HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation.
