HumanEval-X Benchmark

Updated 3 October 2025
  • HumanEval-X Benchmark is a multilingual evaluation suite that extends the original HumanEval to assess code generation and translation tasks across five programming languages.
  • It features manually rewritten problem–solution pairs in Python, C++, Java, JavaScript, and Go, ensuring consistent, language-specific evaluation using rigorous test cases.
  • The benchmark employs unbiased pass@k metrics and dynamic budget allocation strategies to reliably measure model performance and real-world coding efficiency.

HumanEval-X Benchmark is a multilingual evaluation suite for LLMs targeting code generation and translation. It systematically extends the HumanEval benchmark—which originally focused solely on Python—to include C++, Java, JavaScript, and Go, resulting in an expansive set of hand-crafted problem–solution pairs for rigorous functional correctness testing across languages. HumanEval-X supports both code generation and code translation tasks, incorporates unbiased statistical estimators for model evaluation, and provides the infrastructure and methodology for benchmarking multilingual code models. The construction, usage, and implications of HumanEval-X are pivotal in current research on code synthesis, model generalization, and productivity analysis.

1. Construction and Multilingual Scope

HumanEval-X was derived by manually rewriting all 164 original HumanEval (Python) problems—including prompts, canonical solutions, and unit test suites—for four additional languages (C++, Java, JavaScript, Go), totaling 820 language-problem pairs (Zheng et al., 2023). Each problem specification comprises:

  • Language-specific declarations (including necessary imports or libraries)
  • Docstrings clearly describing the task with explicit input/output examples
  • Canonical reference (solution) implementations
  • Multiple test cases for checking functional correctness

The benchmark was deliberately designed for cross-language consistency: every problem maintains content and task parity across target languages, with implementations and test cases tailored to language-specific syntax and idioms. This manual curation ensures high fidelity and reliability for both code generation and translation evaluation.
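For concreteness, a single problem entry can be pictured as a record bundling the components listed above. The field names below are illustrative assumptions chosen for readability, not necessarily the exact schema of the released dataset:

```python
# Illustrative HumanEval-X-style problem record (Java variant).
# Field names are assumptions for readability, not the official schema.
problem_java = {
    "task_id": "Java/0",                 # language + original HumanEval problem index
    "declaration": (                     # language-specific imports and function signature
        "import java.util.*;\n"
        "class Solution {\n"
        "    public boolean hasCloseElements(List<Double> numbers, double threshold) {\n"
    ),
    "docstring": (                       # task description with explicit input/output examples
        "// Check if any two numbers in the list are closer to each other\n"
        "// than the given threshold.\n"
        "// >>> hasCloseElements(Arrays.asList(1.0, 2.0, 3.0), 0.5) == false\n"
    ),
    "canonical_solution": "...",         # reference implementation (elided here)
    "test": "...",                       # unit tests checking functional correctness
}
```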

2. Evaluation Tasks and Metrics

HumanEval-X supports two primary evaluation tasks:

  • Code Generation: Models receive a prompt containing a function declaration and docstring in a specific language, and must synthesize working code that passes all test cases.
  • Code Translation: Models are given source language implementations and function declarations in both source and target languages. The docstring is omitted to prevent memorization and force "real" translation. The synthesized code must replicate functionality across languages.
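Using the same illustrative fields, the two prompt formats can be sketched as follows; these helper names are assumptions for exposition, not an official loader API, and the exact concatenation order varies by language:

```python
def generation_prompt(problem: dict) -> str:
    # Code generation: declaration plus docstring in the target language;
    # the model must complete the function body so that all tests pass.
    return problem["declaration"] + problem["docstring"]

def translation_prompt(problem_src: dict, problem_tgt: dict) -> str:
    # Code translation: source-language implementation plus declarations in
    # both source and target languages; the docstring is omitted so the model
    # cannot simply recall a memorized solution from the task description.
    return (
        problem_src["declaration"]
        + problem_src["canonical_solution"]
        + "\n"
        + problem_tgt["declaration"]
    )
```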

The key performance metric is pass@k: the expected probability that at least one out of k generated samples passes all associated test cases. For comprehensive evaluation, models sample n = 200 candidate solutions per task. Pass@k is computed using an unbiased estimator:

$\text{pass}@k = \mathbb{E}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$

where n is the total number of samples, c the number of correct samples, and k the evaluation budget. Temperature and nucleus-sampling parameters are tuned separately for pass@1, pass@10, and pass@100.
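A small, numerically stable implementation of this estimator, using the standard product form rather than evaluating the binomial coefficients directly, might look like the following sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations, c of which are correct,
    passes all test cases."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 - C(n-c, k)/C(n, k), computed as a running product to avoid huge binomials
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: n = 200 generations, 45 of them correct, evaluation budget k = 10
# print(pass_at_k(200, 45, 10))
```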

HumanEval-X also introduces budget allocation across languages for multilingual models. Three strategies (“Best Single”, “Uniform”, “Weighted”) distribute the total sampling budget k among languages; Weighted allocation samples each language in proportion to its representation in the model’s training corpus, which empirically increases the likelihood of obtaining a correct solution when results are evaluated collectively. A minimal sketch of these strategies follows.
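The sketch below is one plausible reading of how the three strategies split a total budget k across languages; the corpus shares are placeholder values, not CodeGeeX’s actual training distribution:

```python
# Sketch of the three budget-allocation strategies over a total budget k.
# Corpus shares are illustrative placeholders, not CodeGeeX's actual mix.
corpus_share = {"python": 0.35, "cpp": 0.25, "java": 0.20, "javascript": 0.12, "go": 0.08}

def allocate(k: int, strategy: str, best_lang: str = "python") -> dict:
    langs = list(corpus_share)
    if strategy == "best_single":
        # Spend the whole budget on the single strongest language.
        return {lang: (k if lang == best_lang else 0) for lang in langs}
    if strategy == "uniform":
        # Split the budget evenly across languages (naive integer division).
        return {lang: k // len(langs) for lang in langs}
    if strategy == "weighted":
        # Sample each language in proportion to its share of the training corpus
        # (naive rounding; a real allocator would rebalance so counts sum to k).
        return {lang: round(k * corpus_share[lang]) for lang in langs}
    raise ValueError(f"unknown strategy: {strategy}")

# e.g. allocate(100, "weighted") -> {"python": 35, "cpp": 25, "java": 20, "javascript": 12, "go": 8}
```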

3. Model Benchmarking and Performance

CodeGeeX (13B parameters, trained on 850B tokens over 23 languages) was benchmarked against GPT-J‑6B, GPT‑NeoX‑20B, InCoder‑6.7B, and CodeGen‑Multi (6B, 16B) (Zheng et al., 2023). Metrics reported for code generation (Python):

| Model | pass@1 (%) | pass@10 (%) | pass@100 (%) |
|---|---|---|---|
| CodeGeeX | 22.89 | 39.57 | 60.92 |
| CodeGen-Multi-16B | lower | lower | lower |

Comparable or superior performance was recorded for CodeGeeX in the other languages as well, with average improvements of 0.4–1.7% over CodeGen‑Multi-16B.

For code translation (20 language pairs), CodeGeeX and its fine-tuned variant (CodeGeeX-13B-FT) were evaluated. Fine-tuning yielded notable gains—especially in harder translation directions. However, the paper observed directional asymmetry: translation performance A → B is negatively correlated with B → A, exposing non-uniform multilingual proficiency.

Radar plots and granular error analyses in the original paper highlight performance disparities by language and sampling configuration. Results support that CodeGeeX consistently outperforms comparable multilingual baselines in both generation and translation across all five evaluated languages (Zheng et al., 2023).

4. Decontamination and Integrity Issues

Recent work demonstrates that benchmarks similar to HumanEval-X are susceptible to “contamination”: benchmark problems (or rephrased variants) may inadvertently enter LLM training corpora, inflating evaluation metrics and undermining reliability (Yang et al., 2023). Conventional n-gram matching and embedding-similarity methods fail to catch subtle rephrasings. LLM-based decontaminators, which first filter candidate pairs by embedding similarity and then apply LLM semantic matching, reliably detect overlapped or rephrased benchmark data.
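A two-stage pipeline of that shape can be sketched as below, where `embed` and `llm_says_rephrased` are hypothetical stand-ins for whichever embedding model and LLM judge a particular implementation uses:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_contaminated(train_samples, benchmark_problems, embed, llm_says_rephrased,
                      sim_threshold: float = 0.9):
    """Two-stage decontamination sketch: a cheap embedding-similarity filter
    narrows candidates, then an LLM judge decides whether a training sample
    is a copy or a rephrasing of a benchmark problem."""
    bench_vecs = [(p, embed(p)) for p in benchmark_problems]
    flagged = []
    for sample in train_samples:
        v = embed(sample)
        for problem, pv in bench_vecs:
            if cosine(v, pv) >= sim_threshold and llm_says_rephrased(sample, problem):
                flagged.append((sample, problem))
                break
    return flagged
```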

Empirical findings indicate 8–18% overlap between HumanEval and common training sets (RedPajama, StarCoder), and synthetic datasets show 12.8% contamination. This directly compromises the integrity of benchmarking for HumanEval-X-style tasks, and suggests the need for dynamic, one-time test assignment and robust decontamination tooling.

5. Impact on Coding Efficiency and Real-World Development

CodeGeeX was integrated as extensions for major IDEs (Visual Studio Code, JetBrains, Cloud Studio) supporting code generation, completion, translation, and explanation (Zheng et al., 2023). Usage statistics:

  • 4.7 billion tokens generated weekly for tens of thousands of users
  • >83.4% of surveyed users reported improved coding efficiency
  • Average >250 API calls per active user per weekday

These findings suggest HumanEval-X pass@k correlates with tangible productivity improvements, reinforcing the benchmark’s relevance for evaluating real-world LLM utility.

6. Limitations, Extensions, and Future Research Directions

HumanEval-X advances rigorous multilingual functional evaluation, but several open issues remain:

  • Asymmetry in translation reveals uneven model generalization—future work should deepen multilingual reasoning and explore language-specific fine-tuning.
  • Benchmarks such as CRUXEval-X (Xu et al., 23 Aug 2024) propose more scalable, automated, test-guided construction pipelines for multi-language code-reasoning evaluation (predicting program outputs and inputs rather than only synthesizing code).
  • Integration of efficiency metrics (eff@k) (Qiu et al., 10 Jun 2024), human-centric subjective evaluation (Guo et al., 2 Jun 2025), and coverage expansion across domains (Zhu et al., 23 Aug 2024) and natural languages (Raihan et al., 19 Oct 2024) can further enhance evaluation depth.
  • Dynamic, template-based benchmarking (Bradbury et al., 2 Dec 2024) is needed to mitigate data leakage and support longitudinal fairness in performance assessment.

These points collectively indicate that HumanEval-X, while foundational, is part of an evolving ecosystem of benchmarks addressing correctness, efficiency, robustness against contamination, and multilingual capability.

7. Significance Within the Benchmarking Landscape

HumanEval-X establishes a multilingual reference for code generation and translation tasks, introducing robust evaluation protocols, allocation strategies, and practical metrics. Its construction methodology and the accompanying findings from THUDM directly influenced subsequent multilingual and reasoning-centric benchmarks. By quantitatively and qualitatively assessing model outputs in multiple programming languages, HumanEval-X facilitates fair and reproducible model comparison and has become a standard for both academic and industrial model validation in code synthesis research.

Efforts to extend coverage, address contamination, improve scalability, and add nuanced metrics reflect both the influence and the recognized limitations of HumanEval-X. Its design and usage guide ongoing development of comprehensive, trustworthy, and actionable model evaluation frameworks for diverse computational programming settings.
