
EvalPlus: Code Synthesis Evaluation

Updated 16 February 2026
  • EvalPlus is a framework for rigorously evaluating LLMs by automatically expanding code test suites to detect subtle errors in generated code.
  • It leverages LLM-driven seed generation combined with type-aware mutations to augment benchmarks such as HumanEval and MBPP.
  • Its enhanced evaluation reveals significant drops in pass rates, offering more accurate model comparisons and exposing overestimated LLM performance.

EvalPlus is a rigorous code synthesis evaluation framework designed to overcome the test insufficiency and corresponding overestimation of LLM performance on code generation tasks. By providing automated, large-scale augmentation of test suites for existing programming benchmarks such as HumanEval and MBPP, EvalPlus enables significantly more stringent assessment of the functional correctness of LLM-generated code. Its architecture combines LLM-driven seed generation and type-aware mutation techniques to systematically expose subtle errors and improve robustness evaluation, driving substantial shifts in empirical pass rates and model ranking outcomes (Liu et al., 2023).

1. Motivation and Foundational Goals

The impetus for EvalPlus arises from the recognized limitations of contemporary LLM-for-code benchmarks such as HumanEval and MBPP, which rely on small, hand-authored test suites (typically fewer than ten tests per problem). This restricted coverage allows non-trivial bugs—including off-by-one errors, inadequate handling of corner cases, and performance anomalies—to evade detection, resulting in inflated pass@k scores and potentially incorrect comparative rankings among LLMs. EvalPlus aims to address these deficiencies through four primary goals:

  • Automatically augment any code-generation benchmark with a large and diverse set of new test cases, thoroughly probing both common and rare code behaviors.
  • Fuse LLM-driven generation of high-quality seed inputs with systematic, type-aware mutation (akin to structured fuzzing) to create test suites that scale by an order of magnitude.
  • Provide a mechanism for test-suite reduction (via approximate set cover) to yield minimally sized, maximally effective subsets for rapid evaluation.
  • Re-benchmark state-of-the-art LLMs under these extended suites to reveal true functional correctness and facilitate accurate, reliable model comparison (Liu et al., 2023).

2. System Architecture and Workflow

EvalPlus operates through a staged process that transforms minimal hand-written test collections into large, automatically derived test suites. The workflow consists of the following pipeline:

  1. Prompt Construction: Each problem’s specification (function signature and docstring), ground-truth reference solution, and original test cases are assembled into a prompt for an LLM (ChatGPT).
  2. Seed Generation: The prompt instructs the LLM to synthesize a set of "interesting corner-case" inputs, filtered to respect explicit preconditions.
  3. Type-Aware Mutation: Starting from the LLM-generated seeds, a mutation engine applies type-specific transformations to produce a large corpus of additional test inputs. Each mutated candidate is validated on the ground-truth; only those which yield valid behaviors (no assertion failures, contract violations, or type errors) are retained.
  4. Differential Testing: The complete set of seeds and mutants forms the augmented test suite. Both the ground-truth and each LLM-generated sample are executed over all test inputs; correctness is measured by strict agreement.
  5. Test-Suite Reduction (Optional): A greedy set-cover algorithm can condense the augmented suite into a minimal subset that preserves most of the fault-detection efficacy, enabling efficient downstream evaluation.
  6. Benchmark Release: The process outputs an enhanced evaluation set, metrics, and annotation artifacts for open use.
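Steps 3–4 above can be pictured as a minimal differential-testing loop. The function names and the toy problem below are illustrative, not EvalPlus's actual API:

```python
# Minimal sketch of EvalPlus-style differential testing: run the ground-truth
# oracle and a candidate over every augmented input and require exact
# agreement on each one.

def differential_test(ground_truth, candidate, inputs):
    """Return True iff the candidate matches the oracle on all test inputs."""
    for args in inputs:
        expected = ground_truth(*args)
        try:
            actual = candidate(*args)
        except Exception:
            return False  # crashes count as failures
        if actual != expected:
            return False
    return True

# Toy problem: the oracle is correct; the candidate mishandles the empty
# list -- exactly the kind of corner case the generated seeds aim to expose.
oracle = lambda xs: sum(xs) / len(xs) if xs else 0.0
buggy = lambda xs: sum(xs) / len(xs) if xs else None

print(differential_test(oracle, oracle, [([1, 2],), ([],)]))  # True
print(differential_test(oracle, buggy, [([1, 2],), ([],)]))   # False
```

Strict agreement means a candidate that is right on typical inputs but wrong on a single corner case is counted as a failure, which is precisely how the augmented suites lower pass rates.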

Table 1. Basic Type-Aware Mutation Operators

Type | Mutation Strategy
int, float | x ± 1
bool | random selection from {True, False}
NoneType | None
str | remove/repeat/replace a random substring
List | remove or repeat an element; replace x[i] with Mutate(x[i])
Tuple | Tuple(Mutate(list(x)))
Set | set(Mutate(list(x)))
Dict | remove a key-value pair; update a value; insert Mutate(k): Mutate(v)

The mutation engine also incorporates an "ingredient" strategy, harvesting fragments (e.g., substrings, numeric values) from observed seeds and reusing them during mutation to increase diversity (Liu et al., 2023).
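A minimal sketch of the Table 1 operators, assuming a simple recursive dispatch on runtime type. This is a simplification of EvalPlus's mutation engine, showing only one variant per type:

```python
import random

# Illustrative type-aware mutation: dispatch on the runtime type of the
# input and perturb it accordingly (one operator per type for brevity).

def mutate(x):
    if isinstance(x, bool):                        # check bool before int
        return random.choice([True, False])        # (bool is an int subtype)
    if isinstance(x, (int, float)):
        return x + random.choice([-1, 1])          # x ± 1
    if x is None:
        return None                                # NoneType stays None
    if isinstance(x, str):
        if not x:
            return x
        i = random.randrange(len(x))
        return x[:i] + x[i] * 2 + x[i + 1:]        # repeat a random character
    if isinstance(x, list):
        if not x:
            return x
        i = random.randrange(len(x))
        return x[:i] + [mutate(x[i])] + x[i + 1:]  # replace x[i] with Mutate(x[i])
    if isinstance(x, tuple):
        return tuple(mutate(list(x)))              # Tuple(Mutate(list(x)))
    if isinstance(x, set):
        return set(mutate(list(x)))                # set(Mutate(list(x)))
    if isinstance(x, dict):                        # assumes mutated keys stay hashable
        return {mutate(k): mutate(v) for k, v in x.items()}
    return x

random.seed(0)
print(mutate((1, "ab", [True])))  # a perturbed tuple of the same shape
```

Each mutated candidate would then be executed on the ground truth and kept only if it raises no assertion, contract, or type errors, as described in step 3 of the pipeline.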

3. Test Generation and Benchmark Augmentation

EvalPlus’s augmentation has been concretely demonstrated in the creation of HumanEval+, an extended version of HumanEval. The key quantitative characteristics are:

  • Original HumanEval: 164 Python problems, averaging 9.6 hand-written test cases per problem.
  • EvalPlus: For each problem, ≈90 ChatGPT-generated seeds, plus roughly 1,000 mutants synthesized within a fixed one-hour time budget.
  • After filtering and deduplication, the final test count averages 764.1 per problem—a roughly 80-fold increase—yielding ~125,000 total tests for HumanEval+.

EvalPlus further corrects known errors in existing datasets, fixing 18 erroneous ground-truth solutions and annotating 83 tasks with explicit precondition "contracts" to enforce input validity (Liu et al., 2023).
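An input "contract" can be pictured as precondition assertions prepended to the reference solution, so that invalid mutated inputs are rejected before they enter the suite. The problem and contract below are illustrative, not drawn from HumanEval+ itself:

```python
# Hedged sketch of a precondition contract: assertions guard the reference
# solution, and a filter keeps only inputs the contract accepts.

def median(xs):
    assert isinstance(xs, list), "contract: input must be a list"
    assert len(xs) > 0, "contract: input must be non-empty"
    assert all(isinstance(v, (int, float)) for v in xs), "contract: numeric elements"
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def satisfies_contract(fn, args):
    """Keep only inputs the contract accepts (AssertionError => reject)."""
    try:
        fn(*args)
        return True
    except AssertionError:
        return False

print(satisfies_contract(median, ([3, 1, 2],)))  # True
print(satisfies_contract(median, ([],)))         # False: violates the contract
```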

4. Correctness Metrics and Evaluation Formulas

EvalPlus retains the established unbiased pass@k metric, consistent with the Codex baseline:

  • Let n code samples be generated for a given task, with c of them passing all tests.
  • The estimated pass@k is:

\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}

This evaluates the probability that at least one out of k samples is functionally correct, given the augmented suite.
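The estimator can be written out directly; the function name below is ours, not part of the EvalPlus tooling:

```python
from math import comb

# Unbiased pass@k estimator (standard Codex formulation): the probability
# that a random size-k subset of n samples contains at least one of the
# c correct ones.

def pass_at_k(n, c, k):
    """Estimate pass@k given n samples with c correct; requires n >= k."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 0, 5))            # 0.0 -- no correct samples
print(pass_at_k(10, 10, 1))           # 1.0 -- every sample correct
print(round(pass_at_k(10, 3, 1), 3))  # 0.3 -- pass@1 reduces to c/n
```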

The delta pass rate for a model M and a particular k is defined as:

\Delta\text{pass@k} = \text{pass@k}_{\text{HumanEval}} - \text{pass@k}_{\text{HumanEval+}}

This measures the reduction in observed accuracy resulting from the expanded suite, quantifying the extent of false positives undetected by the original tests (Liu et al., 2023, Hu et al., 20 Jan 2025).
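As a worked instance using the Section 5 numbers, GPT-4's 88.4% on HumanEval and 76.2% on HumanEval+ yield a delta of 12.2 points:

```python
# Trivial helper (our own naming) applying the delta definition above,
# with both pass rates in percentage points.

def delta_pass_at_k(base, plus):
    """Drop in pass@k (points) when moving to the augmented suite."""
    return base - plus

print(round(delta_pass_at_k(88.4, 76.2), 1))  # 12.2
```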

5. Experimental Results and Empirical Impact

Evaluation of 26 LLMs (OpenAI and open-source) demonstrates substantial declines in pass rates under HumanEval+. Selected results for pass@1⋆ (greedy decoding) are summarized below:

Model | HumanEval | HumanEval+ | Δ (pts)
GPT-4 | 88.4% | 76.2% | –12.2
ChatGPT | 73.2% | 63.4% | –9.8
WizardCoder-CodeLlama (34B) | 73.2% | 64.6% | –8.6
Phind-CodeLlama (34B) | 71.3% | 67.1% | –4.2
CodeLlama (34B) | 51.8% | 42.7% | –9.1
StarCoder (15B) | 34.1% | 29.3% | –4.8
CodeGen-16B | 32.9% | 26.8% | –6.1

The maximum observed Δ is –23.1 points (pass@1⋆), –19.3 points (pass@1), –24.9 points (pass@10), and –28.9 points (pass@100). Notably, these augmented evaluations reveal previously undetected mis-rankings; for example, WizardCoder-CodeLlama and Phind-CodeLlama, which trailed ChatGPT under HumanEval, now outperform it when assessed under HumanEval+ (Liu et al., 2023).

External researchers adopting EvalPlus-derived suites ("MBPP-EvalPlus", "HumanEval-EvalPlus") confirm their stringency: for instance, QualityFlow observed drops in pass@1 from 94.2% to 79.9% on MBPP, and from 98.8% to 89.6% on HumanEval, with corresponding shifts in recognized state-of-the-art performance (Hu et al., 20 Jan 2025).

6. Analytical Insights and Implications for LLM Benchmarking

The application of EvalPlus surfaces several insights relevant to both evaluation methodology and LLM development:

  • Testing insufficiency in original benchmarks systematically overstates LLM code generation correctness and can distort model rankings.
  • Even the ground-truth solutions were found to be incorrect in roughly 11% of tasks (18 of 164), errors that emerged only under the expanded testing.
  • Automated hybrid (LLM+mutation) test generation scales evaluation suites to required coverage levels without prohibitive manual effort.
  • Greedy set-cover can reduce augmented suites to around 16 tests per task while retaining most of the error-detection power—enabling practical rapid-testing for research iteration.
  • Future benchmark design should incorporate automated test expansion, explicit input contracts, provision of minimal but high-coverage suites, and consideration of differential testing across multiple oracles (Liu et al., 2023).
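The greedy set-cover reduction mentioned above can be sketched as follows. This is our own minimal version; the kill-matrix representation is an assumption, not EvalPlus's data format:

```python
# Greedy set cover over a "kill matrix": repeatedly pick the test that
# fails the largest number of not-yet-covered faulty samples, stopping
# when no remaining test adds coverage.

def greedy_reduce(kill_matrix):
    """kill_matrix: {test_id: set of faulty-sample ids this test catches}.
    Returns a small list of tests preserving the combined fault detection."""
    covered, chosen = set(), []
    while True:
        best = max(kill_matrix, key=lambda t: len(kill_matrix[t] - covered))
        gain = kill_matrix[best] - covered
        if not gain:
            break
        chosen.append(best)
        covered |= gain
    return chosen

# Toy matrix: t2 alone catches samples {a, b}, so t0, t1, and t3 are redundant.
kills = {"t0": {"a"}, "t1": {"b"}, "t2": {"a", "b"}, "t3": set()}
print(greedy_reduce(kills))  # ['t2']
```

Greedy selection is the standard approximation for set cover (the exact problem is NP-hard), which is why the reduced suites retain most, though not all, of the full suites' fault-detection power.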

7. Adoption, Open-Source Availability, and Future Directions

EvalPlus has been open-sourced, with tooling, datasets, and LLM-generated code available for community use at https://github.com/evalplus/evalplus. Its methodological integration into subsequent program synthesis workflows—such as QualityFlow’s plug-in of HumanEval-EvalPlus and MBPP-EvalPlus—demonstrates both its portability and its potential status as a de facto rigorous evaluation standard for the functional assessment of LLMs in code generation domains (Hu et al., 20 Jan 2025, Liu et al., 2023).
