HumanEval Benchmark Overview
- HumanEval is a benchmark suite featuring 164 Python problems with natural-language prompts and hidden unit tests to evaluate code generation.
- It measures functional correctness using pass@k metrics, with an average of 7.7 tests per problem, highlighting both capabilities and overfitting risks.
- Extensions like HumanEval+ and HumanEval-X address limitations by expanding test cases, integrating multilingual capabilities, and mitigating data contamination.
The HumanEval benchmark is a widely adopted evaluation suite for measuring the functional correctness of code generated by LLMs given structured natural-language specifications. Originating as the core program synthesis benchmark for OpenAI Codex, it has since become the de facto standard for assessing LLM code-generation capabilities, inspiring a lineage of multilingual, domain-specific, and protocol-augmented derivatives.
1. Core Design and Evaluation Protocol
HumanEval consists of 164 hand-authored Python programming problems, each defined by a function signature, a natural-language docstring, and a hidden suite of reference unit tests. The design tests whether an LLM can map a concise English specification to a correct, executable function body that passes all supplied test cases (Yu et al., 2024, Raihan et al., 2024, Koohestani et al., 7 Mar 2025).
The canonical task structure is as follows (a minimal problem sketch appears after this list):
- Input: Function signature and natural-language docstring describing desired behavior.
- Output: A Python function body generated by the model.
- Test suite: On average, 7.7 unit tests per problem, designed to automatically validate functional correctness.
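Concretely, each problem is distributed as a record whose prompt is fed to the model and whose test suite is executed against the returned function body. The sketch below is illustrative only, assuming the field names of the public human-eval release (`task_id`, `prompt`, `entry_point`, `test`); the toy problem and the `passes` helper are not part of the benchmark itself.

```python
# Illustrative HumanEval-style problem record (field names follow the public
# human-eval release; the toy problem itself is invented for illustration).
problem = {
    "task_id": "Example/0",
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}

def passes(problem: dict, completion: str) -> bool:
    """Execute prompt + model completion, then run the hidden tests."""
    namespace: dict = {}
    try:
        exec(problem["prompt"] + completion, namespace)        # build the function
        exec(problem["test"], namespace)                       # define check()
        namespace["check"](namespace[problem["entry_point"]])  # run the tests
        return True
    except Exception:
        return False

# A model completion is just the function body continuing the prompt:
print(passes(problem, "    return a + b\n"))  # True
```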
The principal performance metric is pass@k, which estimates the probability that at least one of k sampled model outputs passes all associated unit tests for a given problem. For $n$ samples per problem with $c$ correct completions, the unbiased per-problem pass@k estimator is:

$$\widehat{\text{pass@}k} \;=\; 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$
Aggregate pass@k is the mean over all problems. In practice, pass@1 (greedy decoding) is universally reported, with higher k (e.g., pass@10, pass@100) used to probe stochastic sampling (Li et al., 2024, Yu et al., 2024, Raihan et al., 2024, Dunham et al., 2024).
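In code, this estimator is usually computed in a numerically stable product form rather than with explicit binomial coefficients; a minimal sketch (following the product formulation commonly used, e.g., in the original Codex evaluation code) is:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn and c = samples that pass all tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Aggregate pass@k is the mean over problems, e.g. for (n, c) pairs per task:
results = [(20, 3), (20, 0), (20, 20)]
print(np.mean([pass_at_k(n, c, k=1) for n, c in results]))
```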
2. Scope, Composition, and Identified Limitations
HumanEval was explicitly conceived as an “original” code-synthesis benchmark for Python, focused on function-level synthesis tasks characterized by:
- Domain coverage: Elementary algorithms, data structure manipulation, and utility routines (e.g., string parsing, sorting, mathematical operations).
- Input/output simplicity: Small inputs (numbers, strings, lists), straightforward outputs without complex side effects.
- Test coverage: An average of 7.7 hand-written test cases per function; many problems admit trivial or overfit solutions due to coverage gaps.
- Difficulty profile: Problems are concentrated in a narrow difficulty band of mostly “easy” tasks, with minimal edge-case or robustness constraints (Yadav et al., 2024, Koohestani et al., 7 Mar 2025, Zhang et al., 2024).
Critically, subsequent meta-evaluations exposed structural limitations:
| Limitation | Description |
|---|---|
| Test suite sparsity | Small number of unit tests permits overfitting and fails to guard against edge-case failures |
| Lack of diversity | Overrepresentation of introductory concepts; little coverage of advanced/real-world APIs |
| No multi-language | All problems/solutions/unit tests are Python-specific |
| Under-specified IO | No file, database, or user interaction; exclusively direct value transformation |
| Overfitting risk | Known cases of model memorization/data leakage via public code corpora or rephrasings |
The simplicity and structural homogeneity of HumanEval have led to pass@1 inflation for leading models, with scores as high as 91% now reported for top proprietary LLMs under zero-shot evaluation (Dunham et al., 2024, Li et al., 2024). However, performance drops markedly on more complex or real-world tasks, highlighting the overfitting and limited challenge posed by the original set (Zhang et al., 2024, Yu et al., 2024, Yadav et al., 2024).
3. Derivative and Augmented Benchmarks
Several direct extensions have sought to improve HumanEval's coverage, robustness, and linguistic/technical scope:
| Benchmark | Language(s) | Extension Focus | Test Cases per Problem | Notable Features | Reference |
|---|---|---|---|---|---|
| HumanEval+ | Python | Test-suite expansion | ×80 original | ~80× more tests; improved bug-catching and edge-case detection | (Koohestani et al., 7 Mar 2025) |
| HumanEval-MINI | Python | Test-suite expansion | ×47 original | Substantial but smaller augmentation | (Koohestani et al., 7 Mar 2025) |
| HE-Eval | Python | Test-suite expansion | ×14 original | — | (Koohestani et al., 7 Mar 2025) |
| InstructHumanEval | Python | Instructional prompt augmentation | ~7.7 | Docstrings recast as natural-language instructions | (Koohestani et al., 7 Mar 2025) |
| HumanEval-X | Python, C++, Java, JavaScript, Go | Multilingual (PL only) | ~10–15 | Full manual translation of problems, solutions, and tests; enables direct functional correctness evaluation across 5 PLs | (Zheng et al., 2023) |
| mHumanEval | Python | Massively multilingual (NL only) | 3 (original) | Docstrings in 204 NLs; expert/human and high-quality MT variants | (Raihan et al., 2024) |
| HumanEval-XL | 23 NLs × 12 PLs | Fully parallel multilingual | 8.33 | 22,080 prompts; cross-lingual generalization; BERTScore/CometKiwi filtering | (Peng et al., 2024) |
| HumanEval_T | Python | Data-leakage-resilient variants | Varied | Combinatorial template instantiation ensuring lexical/semantic diversity | (Bradbury et al., 2024) |
| HumanEval-V | Python + Visual | Multimodal coding tasks | 8–16 | Paired diagrams; visual reasoning required; functional evaluation | (Zhang et al., 2024) |
| Qiskit HumanEval | Python/Qiskit | Quantum code generation | 8–15 | Quantum circuits, SDK API; solves circuit synthesis & manipulation | (Vishwakarma et al., 2024) |
| HumanEval Pro | Python | Compositional/self-invoking | 10–15 | Base + derived tasks; enforces hierarchical solution construction | (Yu et al., 2024) |
Test-suite multipliers (e.g., HumanEval+ at ~80× the original tests) systematically increase evaluation rigor by detecting fragile or opportunistic solutions, while HumanEval-X, mHumanEval, and HumanEval-XL embed the problems in multiple programming languages, multiple natural languages, or both, respectively.
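To see why test-suite expansion matters, consider a toy problem (not from HumanEval itself) where a fragile solution passes a sparse original-style suite but is caught by a single added edge case, in the spirit of HumanEval+:

```python
def median(xs):
    """Fragile candidate: correct for odd-length lists only."""
    return sorted(xs)[len(xs) // 2]

# Sparse, original-style tests: all odd-length, so the bug goes unnoticed.
assert median([3, 1, 2]) == 2
assert median([5]) == 5

# One augmented edge case (even length) exposes the error:
try:
    assert median([1, 2, 3, 4]) == 2.5
    print("passed augmented suite")
except AssertionError:
    print("caught by augmented test")  # this branch runs
```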
Notably, mHumanEval emphasizes functionally equivalent prompt translation using high-fidelity MT and human annotation, whereas HumanEval-X provides fully hand-written reference solutions and tests for each target language.
HumanEval_T formally introduces parameterized task templates and combinatorial test design to generate large families of semantically equivalent but lexically/structurally distinct variants, directly addressing data contamination and model memorization threats (Bradbury et al., 2024).
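As a rough illustration of the template idea (the concrete template format and parameter space used by HumanEval_T may differ), a parameterized task can be instantiated combinatorially into lexically distinct but semantically equivalent prompts:

```python
from itertools import product

# Hypothetical parameterized template: surface names vary, while the
# underlying task ("keep items above a threshold") stays the same.
TEMPLATE = '''def {fn}({arg}, {thr}):
    """Return the elements of {arg} strictly greater than {thr}."""
'''

fn_names  = ["filter_above", "keep_greater", "select_over"]
arg_names = ["values", "nums"]
thr_names = ["threshold", "limit"]

variants = [
    TEMPLATE.format(fn=fn, arg=arg, thr=thr)
    for fn, arg, thr in product(fn_names, arg_names, thr_names)
]
print(len(variants))   # 12 lexically distinct prompts for one semantic task
print(variants[0])
```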
HumanEval Pro systematically constructs higher-order, self-invoking tasks that require integrating base-problem solutions as subroutines, exposing the deficiency of LLMs in progressive, compositional code reasoning (Yu et al., 2024).
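The self-invoking structure can be pictured as a base problem whose solution must be reused inside a derived problem; the pairing below is a toy illustration of the pattern, not an actual HumanEval Pro item:

```python
# Base problem: the kind of function original HumanEval asks for.
def is_prime(n: int) -> bool:
    """Return True if n is a prime number."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

# Derived (self-invoking) problem: the model must call its own base solution.
def count_primes_below(limit: int) -> int:
    """Return how many primes are strictly smaller than limit."""
    return sum(1 for n in range(2, limit) if is_prime(n))

assert count_primes_below(10) == 4  # 2, 3, 5, 7
```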
4. Evaluation Methodologies and Metric Details
Across all HumanEval-family benchmarks, functional correctness is the dominant criterion: solutions must pass all reference unit tests to be counted as correct. Metrics include:
- pass@k: The standard functional metric, estimating success probability for at least one of k outputs per problem.
- Parsing Success Rate: For multimodal/augmented benchmarks, the fraction of syntactically valid (non-crashing) completions.
- Category-wise pass@1: When problems span distinct domains (e.g., HumanEval-V), results disaggregated by category.
- Cross-lingual pass@k: Aggregated or per-language/family pass@k rates (for multilingual settings).
When stochastic generation is used, pass@k is computed over multiple samples per problem with the unbiased estimator given in Section 1.
Aggregate metrics (mean across all tasks) are standard. For multilingual HumanEval-X and mHumanEval, composite metrics support cross-language budget allocation and stratified analysis (Zheng et al., 2023, Raihan et al., 2024, Peng et al., 2024).
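The auxiliary metrics are simple to compute from per-sample records; the sketch below assumes a minimal record format (task category, completion source, and test outcome) that is not prescribed by any of the benchmarks themselves:

```python
import ast
from collections import defaultdict

def parses(completion_source: str) -> bool:
    """Parsing success: the completion is syntactically valid Python."""
    try:
        ast.parse(completion_source)
        return True
    except SyntaxError:
        return False

# Hypothetical per-sample records: (category, completion, passed_all_tests)
records = [
    ("string", "def f(s):\n    return s[::-1]\n", True),
    ("string", "def f(s) return s\n",             False),  # syntax error
    ("math",   "def f(n):\n    return n * 2\n",   True),
]

parse_rate = sum(parses(src) for _, src, _ in records) / len(records)

per_category = defaultdict(list)
for category, _, passed in records:
    per_category[category].append(passed)
categorywise_pass1 = {c: sum(v) / len(v) for c, v in per_category.items()}

print(f"parsing success rate: {parse_rate:.2f}")   # 0.67
print(categorywise_pass1)                          # {'string': 0.5, 'math': 1.0}
```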
5. Data Contamination, Overfitting, and Benchmark Integrity
HumanEval's ubiquity has made it particularly susceptible to data contamination in LLM pre-training corpora via direct inclusion, code paraphrase, or translation. Studies using LLM-based decontamination have identified contamination rates of 8–19% in major public and synthetic datasets, with 13B-parameter models fine-tuned on undetected rephrased samples achieving pass@1 above GPT-4 level (Yang et al., 2023). String-based decontaminators fail to catch even simple paraphrases or cross-lingual duplicates.
Contaminated evaluation results artificially inflate pass@k and can mask true model generalization gaps—prompting community calls for embedding-based retrieval plus LLM-based “semantic judge” procedures, and the introduction of template-based, combinatorially instantiated variants (as in HumanEval_T) for robust and leakage-resilient benchmarking. Related work recommends the adoption of ephemeral, one-time “exam” style test sets (Yang et al., 2023, Bradbury et al., 2024).
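A minimal sketch of the embedding-plus-judge idea follows, assuming a generic `embed` function (any sentence-embedding model could stand in) and an `llm_judge` callable that returns True when two snippets are semantically equivalent; both names are placeholders, not part of any cited pipeline:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_contamination(benchmark, corpus, embed, llm_judge, threshold=0.85):
    """Flag training samples that are near-duplicates of benchmark problems.

    embed:     text -> vector   (placeholder for any embedding model)
    llm_judge: (a, b) -> bool   (placeholder semantic-equivalence check)
    """
    bench_vecs = [embed(p) for p in benchmark]
    flagged = []
    for sample in corpus:
        v = embed(sample)
        for prob, pv in zip(benchmark, bench_vecs):
            # Cheap retrieval step: only ask the judge about close pairs.
            if cosine(v, pv) >= threshold and llm_judge(sample, prob):
                flagged.append((sample, prob))
                break
    return flagged
```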
6. Empirical Characterization and Comparative Results
State-of-the-art zero-shot pass@1 on original HumanEval has risen from 13% (CODEX-300M) and <30% (CodeGEN, Codex-12B) to over 85% (GPT-4-1106, Claude Opus, Reactor Mk.1) in recent closed-source LLMs (Li et al., 2024, Dunham et al., 2024). Sample results:
| Model | pass@1 (%) | pass@10 (%) |
|---|---|---|
| CODEX-300M | 13.2 | 20.4 |
| CODEGEN-Mono 6.1B | 26.1 | 42.3 |
| GPT-3.5-turbo-0301 | 72.2 | 89.0 |
| GPT-4-1106-preview | 85.7 | 98.2 |
| Reactor Mk.1 | 91.0 | — |
Despite near-saturation on HumanEval, performance on more challenging benchmarks such as NaturalCodeBench, PythonSaga, HumanEval Pro, or HumanEval-V is significantly lower, with typical pass@1 drops of 10–60 percentage points (Zhang et al., 2024, Yu et al., 2024, Yadav et al., 2024, Zhang et al., 2024). This exposes a pronounced mismatch between leaderboard results on HumanEval-style “toy” problems and true robustness on realistic or compositional tasks.
7. Impact, Ongoing Developments, and Future Directions
The HumanEval suite and its extensions have fundamentally shaped code-generation LLM evaluation but are now recognized as insufficient for comprehensive system assessment and advancement. Current trends and recommendations include:
- Test suite and domain expansion: Systematically augmenting test coverage (HumanEval+) and problem diversity (HumanEval-X, PythonSaga).
- Multilingual integration: Support for multiple NLs and PLs as in mHumanEval, HumanEval-XL, and HumanEval-X.
- Data leakage mitigation: Combating model training contamination with template- and combinatorial-based benchmarks and LLM-powered decontamination pipelines.
- Compositional and multimodal tasks: Launching compositional (HumanEval Pro) and visually grounded (HumanEval-V) coding tasks to probe reasoning abilities beyond basic function synthesis.
The community increasingly advocates for open and transparent benchmark construction, deeper analysis of test coverage and difficulty stratification, and dynamic, continually refreshed test pools to ensure statistical validity and true generalization assessment (Koohestani et al., 7 Mar 2025, Bradbury et al., 2024, Yang et al., 2023, Yadav et al., 2024).