HumanEval+: Enhanced LLM Code Evaluation

Updated 16 January 2026

HumanEval+ is a benchmark for LLM code generation that significantly expands test cases and enforces strict correctness criteria.
It uses a hybrid of LLM-based seed input generation and mutation-based genetic fuzzing to systematically cover edge cases.
Empirical results show notable drops in model performance, prompting refined model ranking and training methodologies in LLM research.

HumanEval+ is a rigorous enhancement of the widely adopted HumanEval benchmark designed to provide a more discriminative, functionally comprehensive, and robust evaluation of code generation by LLMs. Unlike the original HumanEval—which evaluates generated Python code against a limited suite of unit tests and often overestimates model correctness—HumanEval+ systematically multiplies test coverage, addresses corner-case behaviors, and establishes a more demanding functional yardstick. Since its introduction by Liu et al. (NeurIPS 2023) via the EvalPlus framework, HumanEval+ has become a critical reference point for measuring genuine code robustness, influencing dataset construction, model development, and ranking methodologies in LLM research (Liu et al., 2023, Koohestani et al., 7 Mar 2025, Luo et al., 2023). Its core philosophy of aggressive test-suite augmentation, edge-case targeting, and rigorous correctness criteria has been broadly adopted, spurring further specialization and extension (e.g., MHPP, mHumanEval, Qiskit HumanEval, CoCoNUT).

1. Motivation and Benchmark Evolution

HumanEval+ was introduced explicitly to rectify the limitations of the original HumanEval, whose small, heavily “happy-path” test suites (≈7–8 tests per problem) let code LLMs pass by matching superficial docstring cues without producing solutions that are functionally correct in general. The key objectives driving HumanEval+ include:

Massively expanded test coverage: Each of the original 164 HumanEval problems receives a roughly 80× increase in test cases (close to two orders of magnitude), with reported averages of ~616–774 tests per problem (Liu et al., 2023, Koohestani et al., 7 Mar 2025, Luo et al., 2023).
Explicit boundary and negative input testing: The suite now stresses rare corner cases, off-by-one errors, boundary conditions (empty structures, negative numbers), and explicit exception handling.
Correction of legacy errors: All test harness bugs, misplaced assertions, import faults, and reference solution mistakes are systematically fixed.
Generalization over docstring-memorization: By saturating the input space, HumanEval+ seeks to discriminate between code that is robustly correct on unseen inputs vs. code that simply passes the few visible unit tests.

This transition renders pass@k scores considerably more demanding, with empirical pass@1 scores for many models dropping by 8–32 percentage points relative to their original HumanEval marks.
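The gap between a happy-path suite and an augmented one can be illustrated with a toy case. The sketch below is not taken from the benchmark: `buggy_max`, `BASE_TESTS`, and `EXTRA_TESTS` are hypothetical names and data chosen only to show how a plausible-looking solution passes a small suite yet fails once boundary inputs (here, all-negative lists) are added.

```python
from typing import Callable, List, Tuple

def buggy_max(xs: List[int]) -> int:
    """Hypothetical model output: looks right, but seeds the running max with 0."""
    best = 0  # bug: wrong for lists whose elements are all negative
    for x in xs:
        if x > best:
            best = x
    return best

# (input, expected) pairs; both suites are illustrative, not actual benchmark tests.
BASE_TESTS: List[Tuple[List[int], int]] = [([1, 2, 3], 3), ([5, 3, 9, 2], 9)]
EXTRA_TESTS: List[Tuple[List[int], int]] = [([-4, -2, -9], -2), ([-1], -1), ([0, -5], 0)]

def passes(fn: Callable[[List[int]], int], tests: List[Tuple[List[int], int]]) -> bool:
    """All-or-nothing check: a single failing test rejects the solution."""
    return all(fn(list(inp)) == expected for inp, expected in tests)

base_ok = passes(buggy_max, BASE_TESTS)                 # small happy-path suite
plus_ok = passes(buggy_max, BASE_TESTS + EXTRA_TESTS)   # suite with boundary inputs
print(base_ok, plus_ok)  # True False
```

The same completion counts as correct under the small suite and as incorrect under the augmented one, which is the mechanism behind the score drops described above.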

2. Benchmark Construction and Test Augmentation

HumanEval+ is constructed by instrumenting each reference solution for statement and branch coverage, then iteratively generating additional test cases using a hybrid LLM- and mutation-based genetic fuzzing procedure (the EvalPlus framework) (Liu et al., 2023):

LLM-based seed generation: ChatGPT (or similar) proposes initial “interesting” or “corner-case” inputs, filtered algorithmically via explicit preconditions embedded in the ground-truth function contract.
Type-aware mutation-based expansion: Seed inputs are recursively mutated in a type-respecting fashion (e.g., flipping booleans, adding/removing list elements, perturbing strings) to probe input diversity.
Oracle-driven differential testing: Candidate solutions are validated via strict agreement with reference outputs across the expanded suite.
Test-suite reduction (optional): Greedy set-cover selection ensures that a compact “HumanEval+-mini” variant (~16 tests/problem) maintains 99% of coverage and bug-detecting power.
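The mutation and differential-testing steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the EvalPlus implementation: the mutation rules, the helper names (`mutate`, `expand`, `agrees`), the toy task (sorted unique elements), and the precondition (lists of at most 8 items) are all hypothetical.

```python
import random
from typing import Any, Callable, List

def mutate(value: Any, rng: random.Random) -> Any:
    """Return a mutated copy of `value` that keeps its type (illustrative rules only)."""
    if isinstance(value, bool):           # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]   # drop one character
        return value + rng.choice("abc")       # append one character
    if isinstance(value, list):
        out = list(value)
        op = rng.choice(["add", "remove", "mutate"]) if out else "add"
        if op == "add":
            out.append(rng.choice(out) if out else 0)  # duplicate an element (0 if empty)
        elif op == "remove":
            out.pop(rng.randrange(len(out)))
        else:
            i = rng.randrange(len(out))
            out[i] = mutate(out[i], rng)               # recurse into one element
        return out
    return value  # unknown types are left unchanged

def expand(seeds: List[list], rounds: int, rng: random.Random,
           precondition: Callable[[list], bool]) -> List[list]:
    """Grow the input pool by mutating random members, keeping only valid new inputs."""
    pool = [list(s) for s in seeds]
    for _ in range(rounds):
        child = mutate(rng.choice(pool), rng)
        if precondition(child) and child not in pool:
            pool.append(child)
    return pool

def reference(xs: List[int]) -> List[int]:
    """Ground-truth solution for the toy task: sorted unique elements."""
    return sorted(set(xs))

def candidate(xs: List[int]) -> List[int]:
    """Hypothetical model output with a bug: only adjacent duplicates are dropped."""
    out: List[int] = []
    for x in xs:
        if not out or x != out[-1]:
            out.append(x)
    return sorted(out)

def agrees(cand: Callable, ref: Callable, inp: list) -> bool:
    """Differential test: the candidate must match the reference output exactly."""
    try:
        return cand(list(inp)) == ref(list(inp))
    except Exception:
        return False  # a crash counts as a disagreement

rng = random.Random(0)
seeds = [[1, 1, 2], [3, 2, 1], []]   # stand-ins for LLM-proposed seed inputs
inputs = expand(seeds, rounds=200, rng=rng, precondition=lambda xs: len(xs) <= 8)
failing = [inp for inp in inputs if not agrees(candidate, reference, inp)]
print(len(inputs), "inputs;", len(failing), "expose a disagreement")
```

The candidate agrees with the reference on all three seeds; it is only rejected if the mutated pool contains an input with non-adjacent duplicates, such as `[1, 2, 1]`.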

Resulting datasets exhibit near-saturated statement and branch coverage ($C_+ \approx 0.98$, up from $C_0 \approx 0.58$ for HumanEval). Each solution is considered correct only if it passes all generated tests; partial solution acceptance is explicitly precluded.
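The optional test-suite reduction step can be approximated by the classic greedy set-cover heuristic. The sketch below assumes each test is summarized by the set of "requirements" it satisfies (for example, covered branches or faulty solutions it detects); the function name `greedy_reduce` and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

def greedy_reduce(covers: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick tests until every requirement covered by the full suite is covered."""
    remaining: Set[str] = set().union(*covers.values())
    chosen: List[str] = []
    while remaining:
        # Pick the test covering the most still-uncovered requirements;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(covers), key=lambda t: len(covers[t] & remaining))
        chosen.append(best)
        remaining -= covers[best]
    return chosen

# Hypothetical requirement sets for five tests of one problem.
covers = {
    "t1": {"branch_a", "branch_b"},
    "t2": {"branch_b"},
    "t3": {"branch_c", "bug_x"},
    "t4": {"branch_a", "branch_b", "branch_c"},
    "t5": {"bug_x"},
}
reduced = greedy_reduce(covers)
print(reduced)  # ['t4', 't3']
```

Two of the five tests retain everything the full suite covers in this toy example. Greedy selection is a heuristic, so it yields a small suite rather than a provably minimal one.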

3. Evaluation Metrics and Methodology

HumanEval+ uses the unbiased pass@k metric first introduced in the Codex paper:

$\mathrm{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$

where, for a single problem (the benchmark-level score is the mean over all problems):

$n$ is the number of completions per problem,
$c$ is the count of completions passing all tests,
$k$ is the number of draws (typically 1, 5, 10, or 100).

For $k=1$, the estimator reduces to $c/n$, the fraction of sampled completions that pass every test; under greedy decoding, where a single deterministic completion is produced ($n=1$), each problem simply scores 0 or 1. This strict, all-or-nothing criterion amplifies the discriminative power of HumanEval+, revealing substantial model performance disparities invisible on the original benchmark.
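The estimator translates directly into code. The following is a minimal sketch using exact integer binomials from the standard library; the function names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are illustrative assumptions, not part of any official harness.

```python
from math import comb
from typing import Sequence, Tuple

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k completions fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: Sequence[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; `results` holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3, i.e. c/n
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167

# Three hypothetical problems with 10 samples each: 3, 0, and 10 correct.
print(round(benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1), 4))  # 0.4333
```

Because `c` counts only completions that pass every test, moving from HumanEval to HumanEval+ can lower `c` for the same set of samples and therefore lowers pass@k without any change to the model.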

Empirical scoring protocols enforce the “zero-shot” format (no few-shot exemplars), random seed stability, and, where possible, batch execution to minimize runtime stochasticity (Liu et al., 2023, Koohestani et al., 7 Mar 2025, Luo et al., 2023).

4. Empirical Findings and Model Ranking

HumanEval+ exposes previously unnoticed weaknesses in models and often inverts relative rankings. Key observations include:

Universal score drop: All models exhibit a pronounced pass@k decline; e.g., GPT-4 pass@1 decreases from 88.4% (HumanEval) to 76.2% (HumanEval+) (Liu et al., 2023), while code-davinci-002 drops from 54.2% to 22.7% (Koohestani et al., 7 Mar 2025).
Mis-order correction: Models that appeared tied with, or even weaker than, ChatGPT on HumanEval (e.g., WizardCoder-34B, Phind-34B) now outperform it on HumanEval+ (Liu et al., 2023).
Difficulty stratification: HumanEval+ allows definition of a per-problem “difficulty index,” $d_i = -\log_{10}(\mathrm{pass@1}_i)$, empirically showing a shift from $\bar{d}_0 \simeq 0.20$ (HumanEval) to $\bar{d}_+ \simeq 0.58$ (HumanEval+).
State-of-the-art results: Models and methods such as WizardCoder (59.8% pass@1) (Luo et al., 2023), FlowGenScrum (75.2% pass@1) (Lin et al., 2024), Scattered Forest Search (67.1% pass@1) (Light et al., 2024), and XFT (64.6% pass@1 for a 1.3B model) (Ding et al., 2024) have explicitly used HumanEval+ as a central evaluation yardstick.
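The difficulty index defined above is simple to compute from per-problem pass@1 values. The sketch below uses made-up pass rates, not benchmark data; the function names are hypothetical, and returning infinity for a problem that no completion solves is an assumption of this sketch, since the source does not say how that case is handled.

```python
import math
from typing import Sequence

def difficulty(pass_at_1: float) -> float:
    """d_i = -log10(pass@1_i); infinite when no completion passes (assumed convention)."""
    if not 0.0 <= pass_at_1 <= 1.0:
        raise ValueError("pass@1 must lie in [0, 1]")
    return math.inf if pass_at_1 == 0.0 else -math.log10(pass_at_1)

def mean_difficulty(pass_rates: Sequence[float]) -> float:
    """Average difficulty over problems (infinite if any problem is unsolved)."""
    return sum(difficulty(p) for p in pass_rates) / len(pass_rates)

# Hypothetical per-problem pass@1 values for three problems.
rates = [1.0, 0.5, 0.1]
print([round(difficulty(p), 3) for p in rates[1:]])  # [0.301, 1.0]
print(round(mean_difficulty(rates), 3))              # 0.434
```

Each tenfold drop in pass@1 adds one unit of difficulty, so a problem solved by every completion scores 0 and one solved by 10% of completions scores 1.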

A compact illustration:

Model	HumanEval pass@1	HumanEval+ pass@1	Δ (pp)
GPT-4	88.4%	76.2%	–12.2
ChatGPT (GPT-3.5)	73.2%	63.4%	–9.8
WizardCoder-34B	73.2%	64.6%	–8.6
CodeLlama-34B	52.0%	43.1%	–8.9
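The Δ column and the tie-breaking effect can be recomputed from the table's own figures. The snippet below only restates the four rows above; the variable and function names are illustrative.

```python
from typing import List, Tuple

# (model, HumanEval pass@1 %, HumanEval+ pass@1 %), copied from the table above.
rows: List[Tuple[str, float, float]] = [
    ("GPT-4", 88.4, 76.2),
    ("ChatGPT (GPT-3.5)", 73.2, 63.4),
    ("WizardCoder-34B", 73.2, 64.6),
    ("CodeLlama-34B", 52.0, 43.1),
]

def delta_pp(base: float, plus: float) -> float:
    """Change in percentage points, rounded to one decimal place."""
    return round(plus - base, 1)

deltas = {name: delta_pp(base, plus) for name, base, plus in rows}
print(deltas["GPT-4"], deltas["WizardCoder-34B"])  # -12.2 -8.6

# Ranking by HumanEval+ separates the two models tied at 73.2% on HumanEval.
ranking_plus = [name for name, _, plus in sorted(rows, key=lambda r: r[2], reverse=True)]
print(ranking_plus)
```

ChatGPT and WizardCoder-34B are indistinguishable on HumanEval but ordered on HumanEval+, which is the kind of ranking correction described in the observations above.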

5. Extensions, Variants, and Specialized HumanEval+ Benchmarks

HumanEval+ serves as the foundation for further benchmark specialization and diversification:

MHPP: Introduces 140+ newly curated, contamination-free problems, explicitly stratified by challenge type (Distraction, Redefinition, Shortcut, Commonsense, Cornercase, Complexity, Codesense), pursuing the HumanEval+ goals of finer granularity and stronger discrimination on a new problem set (Dai et al., 2024).
mHumanEval: Scales prompts to 204 natural languages, retaining the original problem set but enabling true assessment of multilingual code generation, revealing steep performance drops for low-resource languages (Raihan et al., 2024).
Qiskit HumanEval: Adapts the rigorous functional-correctness paradigm to quantum programming via Qiskit, with a three-tiered difficulty taxonomy (Basic, Intermediate, Advanced) (Vishwakarma et al., 2024, Kheiri et al., 16 Jul 2025).
CoCoNUT: Concentrates on execution trace matching and advanced control-flow constructs—recursion, parallelism, OOP—surfacing model deficiencies not captured by functional-only tests (Beger et al., 27 Jan 2025).
HumanEvalNext: BenchFrame-based enhancement with new problems, corrected errors, and further escalated coverage and difficulty (Koohestani et al., 7 Mar 2025).

These variants aim to capture dimensions of robustness, reasoning depth, and real-world challenge coverage unreachable by the original HumanEval design.

6. Methodological Impact and Implications

The adoption of HumanEval+ has led to a number of methodological advances in LLM evaluation and training:

Prevention of metric inflation: Single-digit pass@k differences between models on HumanEval can conceal correctness gaps of tens of percentage points once the same models are measured on HumanEval+.
Automated test generation: EvalPlus-type LLM+fuzzer augmentation is now a recommended protocol for building or refurbishing code synthesis benchmarks (Liu et al., 2023).
Ranking rectification: Benchmarkers are cautioned against drawing comparative claims from original HumanEval scores; HumanEval+ is now preferred in serious evaluation pipelines for discriminating among competitive models.
Prompting and modularization: HumanEval+ has driven the evolution of prompt engineering—methods like MoT, modular graph-based prompting, and agent-based protocols consistently show higher reliability on HumanEval+ than on narrower testbeds (Pan et al., 16 Mar 2025, Lin et al., 2024).

7. Limitations, Open Problems, and Future Directions

Despite its strengths, HumanEval+ exhibits limitations acknowledged in the literature:

Semantic and logical coverage: While HumanEval+ dramatically elevates test density, it remains structurally tied to reference solutions; it is possible to overfit to the augmented test harness without deeply learning algorithmic semantics (Liu et al., 2023, Koohestani et al., 7 Mar 2025).
End-to-end complexity: The benchmark centers on function-level correctness and does not capture broader program synthesis requirements (e.g., type system, stateful context, interactive or temporal constraints).
Language and problem diversity: HumanEval+ is fixed to 164 Python problems; extension to broader algorithmic, domain-specific, or multi-language scenarios is a recognized need (addressed by mHumanEval, Qiskit HumanEval, MHPP) (Dai et al., 2024, Raihan et al., 2024, Vishwakarma et al., 2024).
Execution-only feedback: HumanEval+ assesses input-output conformance but not higher-order program properties (e.g., asymptotic complexity, resource consumption) except in special cases (Zhang et al., 12 Aug 2025).

Active research agendas include scaling automated test-set creation to new domains, incorporating formal verification and symbolic feedback (e.g., CodeGrad (Zhang et al., 12 Aug 2025)), increasing modularity (MoT (Pan et al., 16 Mar 2025)), and integrating HumanEval+ into continual, open-source leaderboards (Liu et al., 2023).

In summary, HumanEval+ has established itself as a reference standard for rigorous, high-coverage, discriminative evaluation of function-level code generation by LLMs. Its paradigm of extensive test augmentation and enforced edge-case handling, together with its empirical impact on model ranking and training methodology, makes it central to contemporary LLM-for-code research. Ongoing efforts are expanding its scope, granularity, and connections to end-to-end and multilingual code-generation benchmarks.