HumanEval-XL Code Generation Benchmark
- HumanEval-XL is a multilingual code generation benchmark that evaluates LLMs on 80 parallel coding problems across 23 natural languages and 12 programming languages.
- It uses a four-step translation and validation pipeline with GPT-4 and BERTScore to ensure semantic accuracy and technical integrity in prompt instances.
- Empirical results highlight GPT-4’s superior pass@1 performance, data resource tier impacts, and PL-specific challenges in cross-lingual code synthesis.
HumanEval-XL is a massively multilingual code generation benchmark specifically designed to enable rigorous evaluation of LLMs on cross-lingual natural language to code generation tasks. It establishes comprehensive coverage across 23 natural languages and 12 programming languages, providing a structured platform for assessing the cross-lingual generalization abilities of multilingual LLMs (Peng et al., 26 Feb 2024).
1. Dataset Scale and Structure
HumanEval-XL spans 23 natural languages (NLs) from 11 language families and 12 programming languages (PLs): Python, Java, Go, Kotlin, PHP, Ruby, Scala, JavaScript, C#, Perl, Swift, and TypeScript. The dataset is organized as a three-way parallel grid over 80 base coding problems: every (NL, PL) pair receives the same 80 parallel prompt instances, for a total of 23 × 12 × 80 = 22,080 prompts, each associated with an average of 8.33 carefully constructed test cases.
The prompt structure for each instance is as follows (a schematic example follows the list):
- Natural language description (in one of the 23 NLs)
- Function signature in the target PL
- An array of unit tests (typically 6–10)
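Concretely, each instance can be pictured as a record combining these three components. The sketch below is illustrative only; the field names are assumptions and do not reflect the released dataset schema.

```python
# Illustrative sketch of a single HumanEval-XL prompt instance.
# Field names are hypothetical; the actual JSON schema may differ.
example_instance = {
    "task_id": "HumanEval-XL/0",            # one of the 80 base problems
    "natural_language": "Spanish",          # NL of the task description
    "programming_language": "python",       # target PL for code generation
    "prompt": (
        '"""Devuelve la suma de dos enteros."""\n'
        "def add(a: int, b: int) -> int:\n"
    ),
    "tests": [                              # typically 6-10 unit tests
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
    ],
}
```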
The linguistic distribution is shown below:
| Family | Natural Languages |
|---|---|
| Afro-Asiatic | Arabic, Hebrew |
| Austro-Asiatic | Vietnamese |
| Austronesian | Indonesian, Malay, Tagalog |
| Indo-European (Germanic) | English, Dutch, German, Afrikaans |
| Indo-European (Romance) | Portuguese, Spanish, French, Italian |
| Indo-European (Hellenic) | Greek |
| Indo-European (Iranian) | Persian |
| Indo-European (Slavic) | Russian, Bulgarian |
| Sino-Tibetan | Chinese |
| Turkic | Turkish |
| Uralic | Estonian, Finnish, Hungarian |
Each NL-PL pair is consistently represented with 80 problems to ensure strict parallelism across the dataset.
2. Data Collection and Translation Pipeline
Data construction proceeds via a four-step expansion pipeline:
- NL Extraction: The English problem description is isolated by stripping away all code and language-specific elements.
- Translation and Back-Translation: GPT-4 is employed to translate English descriptions into each of the 23 NLs; subsequent back-translation into English validates preservation of semantics.
- Automatic Quality Assurance: For each NL instance, BERTScore is computed between the original and back-translated English. Only translations yielding BERTScore ≥ 0.95 are retained, with up to three regeneration attempts for each failed case.
- Manual Heuristic Review: Random samples are spot-checked for both linguistic fluency and technical correctness, with problematic translations being discarded if irreparable.
This procedure results in parallel multilingual prompt collections that maintain semantic equivalence and technical clarity across all language pairs.
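The automatic QA step can be sketched as follows. The translate and back_translate callables stand in for the GPT-4 prompting used by the authors (they are hypothetical placeholders); only the BERTScore threshold and retry logic mirror the pipeline described above.

```python
from bert_score import score  # pip install bert-score

def validate_translation(original_en, translate, back_translate,
                         threshold=0.95, max_attempts=3):
    """Keep a translation only if its back-translation stays semantically close.

    `translate` / `back_translate` are hypothetical wrappers around GPT-4 calls;
    they are not part of any released API.
    """
    for _ in range(max_attempts):
        candidate = translate(original_en)        # EN -> target NL
        round_trip = back_translate(candidate)    # target NL -> EN
        # BERTScore F1 between the original English and the back-translation
        _, _, f1 = score([round_trip], [original_en], lang="en")
        if f1.item() >= threshold:
            return candidate                      # passes automatic QA
    return None                                   # left for manual review or discard
```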
3. Evaluation Metrics
HumanEval-XL employs standard, functionally grounded evaluation metrics:
- pass@k: For each prompt, $n$ code samples are generated, of which $c$ pass all unit tests; pass@k is then computed as

  $$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],$$

  where $\binom{n}{k}$ is the binomial coefficient (the number of $k$-element combinations from $n$ items). This metric estimates the probability that at least one of $k$ generated samples passes all unit tests; a numerically stable implementation is sketched below.
- Exact Match: The proportion of generations whose code matches the reference solution byte-for-byte.

All HumanEval-XL experiments report pass@1, with samples drawn via temperature-based nucleus sampling (top-p = 0.95). No additional bespoke metrics were introduced.
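For reference, the unbiased estimator above can be computed per problem with the standard numerically stable product form introduced with the original HumanEval benchmark; the function below is a generic sketch, not code from the HumanEval-XL repository.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.

    n: total samples generated, c: samples passing all unit tests, k: budget.
    Uses 1 - prod(1 - k/i) for i in (n-c+1, ..., n) to avoid large binomials.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n = 10 samples of which c = 3 pass, pass_at_k(10, 3, 1) evaluates to 0.3, matching the intuitive 3/10 success rate for a single draw.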
4. Benchmark Use Cases and Empirical Trends
HumanEval-XL enables controlled, parallel evaluation of LLMs by holding the programming language and test cases constant while varying only the natural language prompt. This property facilitates precise measurement of NL-to-code generalization and language transfer.
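A minimal sketch of this protocol is shown below: the PL and the unit tests stay fixed while the NL of the prompt varies. The run_model and run_tests helpers are hypothetical placeholders, not part of the released harness.

```python
# Hypothetical evaluation loop: pass@1 per natural language for one fixed PL.
# prompts[nl][i] is problem i phrased in language nl; tests[i] is the shared
# unit-test suite for problem i, identical across all 23 NLs.
def pass_at_1_by_language(prompts, tests, run_model, run_tests):
    results = {}
    for nl, nl_prompts in prompts.items():
        passed = sum(
            run_tests(run_model(prompt), tests[problem_id])
            for problem_id, prompt in nl_prompts.items()
        )
        results[nl] = passed / len(nl_prompts)  # pass@1 for this NL
    return results
```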
Key empirical trends observed using HumanEval-XL include:
- Model Ranking: GPT-4 consistently achieves the highest pass@1 scores across all NLs and PLs, with margins of 20–30 percentage points over leading code-specialized models.
- Pretraining Effects: CodeGen2-16B outperforms GPT-3.5 in every PL except Python, underscoring the impact of code-centric pretraining. Encoder-decoder architectures (e.g., CodeT5+) lag behind decoder-only models.
- NL Resource Tier Effects: When NLs are grouped by data availability (resource class), pass@1 for "Class 5" (e.g., English, Spanish, Chinese) is ~5 percentage points higher than for "Class 3" (e.g., Tagalog, Estonian) on Python tasks.
- PL Difficulty: Python tasks are consistently easier, while Scala and Go are the most challenging for all model classes.
5. Comparative Context: HumanEval-XL, HumanEval-X, and mHumanEval
| Benchmark | #Natural Languages | #Programming Languages | Prompt Generation |
|---|---|---|---|
| HumanEval | 1 | 1 (Python) | English-only, manual |
| HumanEval-X | 5 | 5 | Manual rewrites |
| HumanEval-XL | 23 | 12 | Automatic + manual QA |
| mHumanEval | 204 (15 human-curated) | 25 | MT engines + expert selection |
HumanEval-XL occupies a midpoint between the highly controlled (but narrow) HumanEval-X (Zheng et al., 2023) and the much broader, MT-driven mHumanEval (Raihan et al., 19 Oct 2024). The latter extends prompt coverage to 204 NLs and 25 PLs, using a battery of MT engines, expert curation for 15 NLs, and multi-metric prompt validation. Both HumanEval-XL and mHumanEval use parallel task structures and BERTScore-based QA, but HumanEval-XL maintains a higher proportion of human-in-the-loop review across all NLs, while mHumanEval prioritizes scale and MT diversity.
6. Practical Access, Licensing, and Workflow
HumanEval-XL is publicly hosted at https://github.com/FloatAI/humaneval-xl and is permissively licensed for research and redistribution. The turnkey evaluation workflow comprises:
```bash
git clone https://github.com/FloatAI/humaneval-xl.git
pip install -r requirements.txt
python eval.py --lang <NL> --pl <PL> --model <MODEL_NAME>
```
Each prompt instance specifies both NL and PL to enable fine-grained cross-lingual and cross-PL analysis.
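Sweeping the full evaluation grid can then be scripted around this CLI. The snippet below is a hedged sketch: the language identifiers and model name are placeholders, and the expected argument values should be checked against the repository documentation.

```python
import itertools
import subprocess

# Placeholder subsets of the 23 NLs and 12 PLs; adjust identifiers to match
# the values accepted by eval.py in the repository.
natural_languages = ["English", "Spanish", "Chinese"]
programming_languages = ["python", "java"]

for nl, pl in itertools.product(natural_languages, programming_languages):
    subprocess.run(
        ["python", "eval.py", "--lang", nl, "--pl", pl, "--model", "my-model"],
        check=True,
    )
```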
7. Implications and Outlook
HumanEval-XL establishes a rigorous, parallelized framework for evaluating LLM generalization from diverse natural languages into multiple programming languages. Its systematic construction pipeline and validation protocol position it as a core benchmark for future research in NL→code generation. The resource-tier performance divergence and the gap between code-specialized and general LLMs suggest directions for targeted pretraining, data augmentation, and hybrid evaluation strategies. A plausible implication is that further expansion to low-resource NLs and more diverse PLs, combined with broader unit test coverage, may reveal deeper insights into LLMs' true cross-lingual program synthesis capacity. Integration with the methodologies of larger-scale benchmarks such as mHumanEval could propel the next wave of multilingual code generation assessment (Peng et al., 26 Feb 2024, Raihan et al., 19 Oct 2024).