HumanEvalPack: Multi-Task Code Benchmark
- HumanEvalPack is an execution-based, multilingual benchmark that evaluates instruction-tuned code LMs on synthesis, repair, and explanation tasks.
- It extends the original HumanEval benchmark by incorporating systematic bug-fixing and code explanation modalities across six programming languages with deterministic test suites.
- Empirical evaluations show that error-aware prompting and mixed pretraining data significantly enhance repair and explanation performance.
HumanEvalPack is an execution-based multi-task, multilingual benchmark designed to rigorously evaluate the capabilities of instruction-tuned code LMs. Building on the original HumanEval Python benchmark, HumanEvalPack introduces systematic bug-fixing (repair), code explanation, and synthesis tasks spanning six programming languages, each problem paired with a deterministic test suite to assess functional correctness. The benchmark is structured to illuminate model generalization beyond synthesis, emphasizing practical scenarios faced by real-world developers and toolchains.
1. Benchmark Design and Structure
HumanEvalPack was introduced as part of the OctoPack project (Muennighoff et al., 2023). It extends the scope of HumanEval—originally a code-synthesis suite of 164 Python problems—by incorporating two additional evaluation modalities:
- Code Repair (HumanEvalFix, NL+C→C): Models receive a function with a carefully injected bug (maintaining syntactic validity but failing at least one test), plus associated tests. The objective is to produce a repaired function that passes all reference test cases in a single decoding pass. Each original problem is translated with an analogous bug and test harness for six distinct programming languages (Python, JavaScript, Java, Go, C++, Rust), yielding 984 repair samples.
- Code Explanation (HumanEvalExplain, NL+C→NL→C): Models are tasked to generate a concise natural-language explanation (docstring) for a correct function. Subsequently, they regenerate code using only the explanation, with correctness assessed via execution on the test harness. Explanations are constrained to the original docstring’s character length, promoting brevity and semantic fidelity.
- Code Synthesis (HumanEvalSynthesize, NL→C): The canonical HumanEval task, generating function bodies from natural-language docstrings for each of the six languages.
By grounding evaluation in robust, manually translated problem sets and retaining original, decontaminated unit tests, HumanEvalPack facilitates cross-linguistic and multi-modal assessment under standardized execution.
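The HumanEvalExplain round-trip described above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual API: `generate` stands in for any LM call, and the prompt wording is assumed.

```python
# Hypothetical sketch of the HumanEvalExplain round-trip: the model
# first describes a reference solution, the description is truncated to
# the original docstring's character cap, and code is then regenerated
# from that description alone. All names here are illustrative.

def explain_then_regenerate(generate, solution: str, signature: str,
                            docstring_len: int):
    explanation = generate(
        f"Provide a concise explanation of the following code:\n{solution}"
    )[:docstring_len]                      # enforce the length cap
    code = generate(
        f"Write code matching this explanation:\n{explanation}\n{signature}"
    )
    return explanation, code               # code is then scored by execution
```

The length cap matters: it prevents the model from smuggling the full implementation through the explanation channel.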
2. Task Modalities and Input–Output Specifications
Each task modality is explicitly defined in terms of its input artifacts, instructions, and output requirements:
- Repair tasks: Input consists of the buggy function and reference tests. Models are prompted with task-specific instructions to “Fix bugs in <function name>” and must output corrected code. Bug injection is manual, targeting subtle implementation-level discrepancies observable only through precise functional testing.
- Explanation tasks: The input is a correct, undocumented function. The output must be a natural-language description within the original docstring’s character cap. The subsequent step, code regeneration from the model’s own explanation, is scored by execution; lexical comparison metrics (BLEU, METEOR) show limited discriminative power in this context, with BLEU-2/3/4 zeroed and METEOR consistently low (Muennighoff et al., 2023).
- Synthesis tasks: Inputs are the function signature and behavioral docstring; outputs are function bodies. Prompting strictly adheres to standardized format (“Write a <lang> function <name> …”).
All combinations are evaluated through deterministic execution—any completion must pass every test case for the problem to be considered solved, corresponding to pass@k metrics.
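The execution-based check described above can be sketched as follows. This is a minimal illustration in the spirit of HumanEval-style scoring, not the benchmark's reference harness; the function and parameter names are assumptions.

```python
# Minimal sketch of execution-based scoring: a completion counts as
# solved only if every assertion in the problem's deterministic test
# harness passes. Names (candidate_src, test_src) are illustrative.

def passes_all_tests(candidate_src: str, test_src: str) -> bool:
    env: dict = {}
    try:
        exec(candidate_src, env)   # define the candidate function
        exec(test_src, env)        # run the full test suite
    except Exception:
        return False               # any exception or failed assert = unsolved
    return True
```

A production harness would additionally sandbox execution and enforce timeouts, since generated code is untrusted.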
3. Evaluation Protocols and Metrics
HumanEvalPack employs execution-based metrics:
- pass@k: The probability that at least one of k sampled completions passes all test cases. For n samples per problem of which c are correct, the unbiased estimator is pass@k = E[1 − C(n−c, k) / C(n, k)] (sampling without replacement). For k = 1, this simplifies to c/n.
- fix@1 (function-level repair): For the repair variant, fix@1 = N_fixed / N_total, where N_total is the total number of problems (164 for Python) and N_fixed is the count of functions fully corrected in a single pipeline pass (Twist, 24 Nov 2025).
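The two metrics above can be computed in a few lines. The product form below is the standard numerically stable way to evaluate the pass@k expression without forming large binomial coefficients; this is a sketch, not the benchmark's reference implementation, and the default of 164 problems assumes the Python split.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k)/C(n, k), as a stable product."""
    if n - c < k:          # every size-k sample must contain a correct solution
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

def fix_at_1(n_fixed: int, n_total: int = 164) -> float:
    """Function-level repair rate: fraction of problems fixed in one pass."""
    return n_fixed / n_total
```

For k = 1 the product reduces to (n − c)/n, so pass@1 = c/n as stated above.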
All sampled generations are performed zero-shot, without in-context examples or chain-of-thought augmentation. Decoding settings in recent evaluations have standardized on temperature = 0.2, top_p = 1.0 or 0.95, and a generation length of up to 2048 tokens for repair (Hasan et al., 3 Jul 2025, Twist, 24 Nov 2025). Statistical significance of language-wise variation is established via one-way ANOVA; recent SLM studies report no significant difference in mean pass@k across Python, JavaScript, Java, or C++ (Hasan et al., 3 Jul 2025).
4. Empirical Results and Model Comparison
HumanEvalPack has benchmarked a wide range of models—both open-source, permissive SLMs (0.4–10B params) and closed/proprietary LLMs (e.g., GPT-4). Model performance is summarized below:
| Model/Language | Repair pass@1 (%) | Explanation pass@1 (%) | Synthesis pass@1 (%) |
|---|---|---|---|
| OctoCoder (Py) | 30.4 | 35.1 | 46.2 |
| OctoGeeX (Py) | 28.1 | 30.4 | 44.7 |
| GPT-4 (avg) | 47.8 | 52.1 | 78.3 |
| OctoCoder (avg) | 27.0 | 24.5 | 35.5 |
| OctoGeeX (avg) | 24.4 | 22.9 | 30.9 |
Pass@1 for SLMs on HumanEvalPack typically spans 0.23–0.25 for Python, JS, Java and 0.16 for C++; SLMs show no significant language gaps (Hasan et al., 3 Jul 2025).
Summary-mediated repair pipelines, in which LLMs first generate a code summary (with error-awareness or intent emphasis) and only then attempt a repair, exhibit consistent but modest gains over direct repair prompts. Across eight production-grade LLMs, error-aware summaries achieve up to 64.63% fix@1, averaging +5 points over direct repair, with gains ranging from +1.22% to +10.37% (Twist, 24 Nov 2025). Diagnostic summaries systematically clarify intended logic and aid correction of value-misuse bugs or excess branches, but struggle with missing or deeply nested algorithmic logic.
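The two-stage chain described above can be sketched as follows. This is a hedged illustration of the general summary-mediated pattern: `llm` is a placeholder for any text-completion call, and the prompt wording is assumed, not the exact prompts of the cited work.

```python
# Sketch of summary-mediated repair: the model first writes an
# error-aware diagnosis of the buggy function, then repairs it
# conditioned on that diagnosis. `llm` and all prompt text are
# illustrative placeholders.

def summary_mediated_fix(llm, buggy_src: str, tests: str) -> str:
    summary = llm(
        "Summarize what this function is intended to do and diagnose "
        f"where its behavior likely deviates:\n{buggy_src}\nTests:\n{tests}"
    )
    return llm(
        f"Using this diagnosis:\n{summary}\n"
        f"Fix bugs in the function below so that all tests pass:\n{buggy_src}"
    )
```

The intermediate summary acts as an explicit fault-localization step, which is where the reported gains on value-misuse bugs plausibly originate.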
5. Cross-Language and Modal Generalization
HumanEvalPack is engineered to expose strengths and weaknesses in multilingual and multi-modal model capabilities:
- Language Coverage: Six languages—Python, JavaScript, Java, Go, C++, Rust—each with 164 problem instances per task. Average problem difficulty and input lengths are consistent across languages, save for Rust, which yields shorter solutions.
- Generalization: SLMs and instruction-tuned LMs maintain pass@k scores across languages without collapse, demonstrating effective transfer. Empirical studies showed no statistically significant difference in model accuracy per language (Hasan et al., 3 Jul 2025).
- Resource Scaling: Transitioning from 1.5B-parameter SLMs to 7B models yields ~0.1 gain in pass@1 at 3–4× VRAM cost. Mid-sized SLMs in 1.5–3B range deliver up to ~0.6 pass@1 on code-based tasks using sub-12 GB VRAM, representing an efficiency/performance inflection (Hasan et al., 3 Jul 2025).
6. Data Curation and Pretraining Insights
Key ablations reveal sensitivity to instruction-tuning corpus composition (Muennighoff et al., 2023):
- CommitPackFT (filtered Git commits) inclusion is essential for repair skills: Absence of this data constrains fix performance to ≲23% for StarCoder, whereas inclusion lifts performance to 30.4%, near state-of-the-art among permissive models.
- Natural-language (NL) targets in instruction data matter: Strong code-explanation (NL output) performance is only realized in models trained on sufficient NL supervision (e.g., OASST, CommitPackFT+OASST mixtures).
- Code synthesis proficiency benefits equally from all instruction datasets. The highest aggregate performance for synthesis, repair, and explanation manifested in mixed pretraining regimens combining commit data and NL tasks.
A plausible implication is that high-quality, real-world code-editing data and NL supervision are both necessary for robust, generalizable instruction-tuned code LMs for HumanEvalPack-style evaluation.
7. Current Limitations and Future Directions
HumanEvalPack reveals several informative bottlenecks in automated program repair and code understanding:
- Limitations: Even with error-aware prompting or summary mediation, LMs are predominantly effective at correcting value-misuse and redundant logic but struggle with structural omissions and arithmetic edge cases. There is no evidence of statistically significant gains from few-shot prompting for base SLMs in this setting (Twist, 24 Nov 2025, Hasan et al., 3 Jul 2025).
- Statistical Evaluation: Recent work does not report formal significance tests or confidence intervals on function-level repair; results are deterministic, single-run percentages under controlled decoding parameters.
- Research Gaps: Proposed improvements include hybrid summaries, adaptive prompt tuning, richer fault localization signals, multi-round or k-shot repair, and extension to larger project-level bugs for amplified correction yield (Twist, 24 Nov 2025).
- Practical Recommendations: For code generation pipelines with strict resource constraints, small- and mid-sized SLMs (1–3B params) can be deployed confidently across all HumanEvalPack languages, offering competitive functional correctness and efficiency without risk of language-specific collapse (Hasan et al., 3 Jul 2025).
Overall, HumanEvalPack stands as a rigorous, flexible, and multi-faceted benchmark for evaluating the generalization, repair, and explanation capabilities of contemporary code LMs, incentivizing both methodological and empirical advancement in code intelligence research (Muennighoff et al., 2023, Twist, 24 Nov 2025, Hasan et al., 3 Jul 2025).