HumanEvalNext: Multilingual Code Benchmark

Updated 16 January 2026
  • HumanEvalNext is a next-generation code benchmark that overcomes limitations of the original HumanEval by expanding to 204 natural languages and 25 programming languages.
  • It uses an ensemble translation and quality assurance pipeline combining automated metrics and expert review to ensure high-fidelity multilingual code evaluations.
  • Empirical findings demonstrate that multilingual pre-trained models perform well in high-resource languages, while performance drops in lower-resource settings highlight areas for improvement.

HumanEvalNext refers to the lineage of next-generation code generation benchmarks that address critical limitations of the original OpenAI HumanEval dataset—most notably, its monolingual English/Python scope, limited diversity, and minimal test coverage—by introducing expanded linguistic, programming, and evaluative breadth. The mHumanEval benchmark exemplifies this evolution, transforming HumanEval into a multilingual, multi-programming-language resource with rigorous quality assurance, and motivates a paradigm shift in how modern code LLMs are assessed under realistic, global use conditions (Raihan et al., 2024).

1. Motivation: Limitations of Original HumanEval

The original HumanEval suite rapidly became the de facto evaluation resource for natural language-to-code generation, combining 164 hand-curated Python programming tasks, English docstrings, and only three test cases per task. This structure introduced three foundational deficiencies (Raihan et al., 2024):

  • Task diversity and overfitting risk: The small, fixed set of tasks and test cases permitted models to memorize patterns, leading to inflated pass rates not indicative of true generalization.
  • Linguistic scope: All prompts are given in English, with solutions in Python. This is unrepresentative for code LLMs deployed in global or multilingual contexts.
  • Test coverage: Three test cases fail to exercise edge or rare-case logic, leading to a systematic overestimate of model correctness. Empirical analysis (Liu et al., 2024) showed that augmenting with additional tests substantially decreases measured pass rates, indicating lower real-world code robustness.
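
The test-coverage deficiency is easy to reproduce: a subtly wrong solution can clear a handful of fixed tests while failing on inputs those tests never probe. A minimal sketch on a hypothetical task (not from HumanEval), "sum the even numbers in a list":

```python
def sum_even_buggy(xs):
    # Bug: sums elements at even *indices* rather than even *values*.
    return sum(xs[::2])

# Three sparse, HumanEval-style tests: the buggy solution passes all of them.
sparse_tests = [([2, 1, 4, 3], 6), ([], 0), ([2], 2)]
assert all(sum_even_buggy(xs) == want for xs, want in sparse_tests)

# Augmented edge cases, in the spirit of Liu et al. (2024): the bug surfaces.
edge_tests = [([1, 2], 2), ([1, 3, 5], 0)]
assert any(sum_even_buggy(xs) != want for xs, want in edge_tests)

print("sparse tests passed; augmented edge cases exposed the bug")
```

The three sparse tests happen to place even values at even indices, so the pass rate they report says nothing about the function's general correctness.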

These shortcomings underscore the necessity for benchmark families that can meaningfully test cross-lingual and multi-programming-language generation, with more rigorous evaluative depth.

2. Design Principles of mHumanEval (“HumanEvalNext”)

mHumanEval represents the primary realization to date of a “HumanEvalNext” benchmark (Raihan et al., 2024). It retains the original 164 HumanEval tasks but introduces multiplicative diversity via:

  • Natural Language (NL) axis: 204 languages (from FLORES-200), encompassing high-resource (e.g., English, Spanish), mid-resource, and low-resource (e.g., Gondi, Tamasheq) families.
  • Programming Language (PL) axis: 25 languages including Python, C++, Java, JavaScript, Ruby, and, in a first for code LLM benchmarks, MATLAB, Visual Basic, Fortran, and COBOL.

This expansion enables systematic assessment of LLM performance when prompted in non-English, low-resource natural languages and targets code synthesis in diverse PLs.
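
The scale of this two-axis expansion can be sanity-checked with simple arithmetic; the counts come from the figures above, while the notion of a flat evaluation grid is an assumption about how cells are enumerated:

```python
N_TASKS, N_NLS, N_PLS = 164, 204, 25  # task, natural-language, and PL counts above

# One docstring translation per (task, NL) pair...
prompt_variants = N_TASKS * N_NLS        # 33456 multilingual prompts
# ...each of which can target any of the supported PLs.
evaluation_cells = prompt_variants * N_PLS

print(prompt_variants, evaluation_cells)  # 33456 836400
```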

| Dimension | HumanEval (original) | mHumanEval (HumanEvalNext) |
| --- | --- | --- |
| NL coverage | 1 (English) | 204 (FLORES-200) |
| PL coverage | 1 (Python) | 25 (Python, C++, Java, etc.) |
| Test cases per problem | 3 | 3 (per NL × PL; future: more) |
| Quality assurance | Minimal | Automated + expert + metric-based |

3. Translation and Quality Assurance Pipeline

Each original English docstring is subjected to an ensemble translation protocol (Raihan et al., 2024):

  • Generation: For each NL, five translations with GPT-4o, one with Meta’s NLLB, one with Google Translate (if supported), and round-trip consistency checks, yielding up to 13 candidates per docstring-language pair.
  • Quality Selection: Candidates are scored by both BERTScore (contextual semantic overlap) and COMETKiwi (reference-free adequacy/judgment metric), averaged to select the highest-quality translation. For NLs missing COMETKiwi support, BERTScore is used exclusively.
  • Human Oversight: For 15 representative NLs (covering all FLORES-200 resource classes), native-speaking translators with programming expertise provide gold-standard human docstrings, double-checked by expert programmers for objective/task fidelity.

Expert review in these languages reveals that high-quality MT (with automated selection) approaches human annotation standards, with F1-BERTScore and COMETKiwi variances within ±0.02–0.03, supporting scalability to hundreds of NLs.
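
The metric-averaging selection step can be sketched as follows. The scores below are illustrative placeholders, not real BERTScore/COMETKiwi outputs, and the function and candidate names are hypothetical:

```python
def select_translation(candidates, bertscore, cometkiwi=None):
    """Pick the highest-quality candidate translation for one docstring.

    `bertscore` and `cometkiwi` map candidate -> score. When COMETKiwi does
    not support the target NL, pass `cometkiwi=None` and BERTScore is used
    alone, mirroring the fallback described above.
    """
    def quality(c):
        if cometkiwi is None:
            return bertscore[c]
        return (bertscore[c] + cometkiwi[c]) / 2  # simple average of both metrics

    return max(candidates, key=quality)

cands = ["gpt4o_v1", "nllb", "google"]
bs = {"gpt4o_v1": 0.91, "nllb": 0.88, "google": 0.86}  # illustrative scores
ck = {"gpt4o_v1": 0.83, "nllb": 0.90, "google": 0.80}

print(select_translation(cands, bs, ck))  # "nllb": average 0.89 beats 0.87, 0.83
print(select_translation(cands, bs))      # "gpt4o_v1": BERTScore-only fallback
```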

4. Test Suite and Evaluation Protocol

As with HumanEval, each problem in mHumanEval currently employs the original three test cases, but applies them across every NL × PL combination; the 204 NLs alone yield 33,456 prompt variants over the 164 tasks, each of which can be paired with any of the 25 PLs. The benchmark’s designers acknowledge that this coverage remains a bottleneck for robust functional assessment and plan to augment with edge/path-coverage and randomized test sets in future iterations (Raihan et al., 2024).

Evaluation uses the standard Pass@k metric for zero-shot generation, typically reporting Pass@1 due to computational constraints:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \right]$$

where $n$ is the number of completions sampled per task, $c$ the number of those completions that pass the unit tests, and $k$ the sample size of interest (Chen et al., 2021).
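
In practice the estimator is computed with the numerically stable product form from Chen et al. (2021) rather than raw binomial coefficients; a minimal implementation:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), for n samples with c correct.

    Uses the stable product form 1 - prod_{i=n-c+1..n} (1 - k/i) to avoid
    overflow in the binomial coefficients (Chen et al., 2021).
    """
    if n - c < k:
        return 1.0  # fewer than k failing samples: every size-k draw contains a pass
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail

print(pass_at_k(10, 3, 1))  # ≈ 0.3: equals c/n when k = 1
print(pass_at_k(10, 3, 5))
```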

Code submissions are extracted via regular expression, executed in a subprocess sandbox, and assessed against reference tests. Prompt templates are tailored per model to match their best-practice interaction format.
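
That extraction-and-execution loop can be approximated as below. The fence-matching regex and the plain-subprocess "sandbox" are assumptions for illustration; a subprocess with a timeout isolates crashes and hangs, but is not a security sandbox:

```python
import re
import subprocess
import sys
import tempfile

def extract_code(completion: str) -> str:
    """Pull the first fenced code block out of a model completion."""
    m = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    return m.group(1) if m else completion  # fall back to the raw completion

def passes_tests(code: str, tests: str, timeout: float = 5.0) -> bool:
    """Run candidate code plus its reference tests in a separate process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0  # nonzero means an assertion failure or crash
    except subprocess.TimeoutExpired:
        return False                 # infinite loops count as failures

completion = "Here is a solution:\n```python\ndef add(a, b):\n    return a + b\n```"
print(passes_tests(extract_code(completion), "assert add(2, 3) == 5"))  # True
```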

5. Empirical Findings on Model Robustness and Failure Modes

Six leading LLMs—GPT-4o, Claude 3.5, GPT-3.5, DeepSeek-Coder, WizardCoder, and Aya—were benchmarked across all 204 NLs with Python as the target PL (Raihan et al., 2024). Key quantitative outcomes:

| Model | Mean Pass@1 (all 204 NLs, Python) |
| --- | --- |
| GPT-4o | 0.738 |
| Claude 3.5 | 0.739 |
| GPT-3.5 | 0.360 |
| DeepSeek-Coder | 0.229 |
| WizardCoder | 0.098 |
| Aya | 0.445 |

  • Multilingual capability tracks model pretraining: GPT-4o and Claude 3.5, trained on large multilingual text+code corpora, retain Pass@1 ≳ 0.8 in high-resource languages and ≳ 0.6 in low-resource NLs. English-centric models (e.g., GPT-3.5) drop to ≈0.2 in low-resource NLs; specialist code models (WizardCoder) effectively collapse outside English–Python.
  • Translation failure modes: Systematic mistranslation of programming keywords (e.g., “return” → “subiza”) results in syntactic errors. Semantic drift can invert problem meaning, e.g., Zulu translation for “detect prime numbers” becomes “find significant digits.”
  • Test coverage remains a central weakness: Three-case test suites frequently miss nuanced errors. This suggests that broader and more adversarial test generation is essential for future “HumanEvalNext” releases.

6. Broader Implications, Best Practices, and Future Directions

The mHumanEval methodology demonstrates that robust, scalable generation of multilingual/multi-PL code evaluation suites is feasible with current MT and QA technology, augmented by expert review for crucial NLs (Raihan et al., 2024). The observed equivalence between expert translations and the top-ranked automated translations indicates that metric-driven MT selection can serve as the backbone of future large-scale benchmarks.

A plausible implication is that the “HumanEvalNext” paradigm should prioritize, in addition to coverage and diversity:

  • More systematic, formal test augmentation for each problem to minimize pass@k inflation;
  • Continual integration of new languages and PLs as practical LLM deployment scenarios diversify;
  • Diagnosing and correcting translation artifacts that alter code semantics at a functional level.
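
The first of these directions can be realized cheaply with randomized differential testing: generate random inputs and compare a candidate against a trusted reference solution, surfacing counterexamples that a fixed three-case suite would miss. A sketch on a hypothetical task ("maximum absolute value of a list"); both functions are invented for illustration:

```python
import random

def reference_max_abs(xs):
    """Trusted reference solution for the hypothetical task."""
    return max(abs(x) for x in xs)

def candidate_max_abs(xs):
    """Model-generated candidate: plausible-looking, but wrong for negatives."""
    return abs(max(xs))

def find_counterexample(ref, cand, trials=200, seed=0):
    """Randomized differential testing of `cand` against `ref`."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        if cand(xs) != ref(xs):
            return xs  # an input where the candidate diverges from the reference
    return None

print(find_counterexample(reference_max_abs, candidate_max_abs))
```

Any returned counterexample (e.g., a list whose most negative element dominates) can then be frozen into the benchmark's fixed test suite, directly reducing pass@k inflation.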

The field is poised to evolve these benchmarks into multidimensional, scenario-realistic evaluation suites capturing the true capacity of next-generation code-generation models to operate globally, under realistic, polyglot prompts and outputs.
