
PolyHumanEval: Multilingual Code Translation Benchmark

Updated 16 February 2026
  • The paper introduces PolyHumanEval, a multilingual benchmark that rigorously evaluates function-level code translation with handcrafted, semantically equivalent test cases.
  • It employs innovative methods such as intermediary translation with a Go pivot and self-training on filtered, model-generated data to significantly boost computational accuracy.
  • Evaluation across various LLMs shows that techniques like signature priming and few-shot exemplars improve accuracy, revealing performance asymmetries between Python-to-other and other-to-Python translations.

PolyHumanEval is a large-scale, multilingual benchmark designed to rigorously evaluate LLMs on function-level code translation tasks across a diverse suite of programming languages. Developed as an extension of the HumanEval corpus, PolyHumanEval addresses critical limitations of prior code translation benchmarks by enforcing semantic equivalence across languages, providing uniform executable test cases, and covering both prevalent and emerging programming languages. It introduces an extensive evaluation infrastructure, new LLM optimization methods for translation, and empirical findings on model limitations and progress, establishing itself as a standard for systematic assessment and advancement in automated code translation (Tao et al., 2024).

1. Motivation and Scope

PolyHumanEval was constructed in response to several deficiencies in existing code-translation benchmarks, such as CodeXGLUE, AVATAR, and G-TransEval, which were primarily limited to a small set of languages (e.g., Python, Java) and lacked truly parallel, semantically equivalent data. Existing cross-lingual datasets rarely provided ported solutions that guarantee identical functionality or test coverage in all target languages. Furthermore, the scarcity of parallel multilingual code in LLM pre-training corpora meant that LLMs' translation capabilities were underexplored and inadequately measured.

PolyHumanEval's explicit goal is rigorous, multilingual evaluation of LLM function-level translation—particularly the translation of full function bodies, including signature, documentation, and code logic, such that outputs are both syntactically and semantically correct in the target language and pass a standardized suite of input-output tests (Tao et al., 2024).

2. Benchmark Construction and Dataset Design

PolyHumanEval extends the HumanEval dataset, which contains 164 canonical Python programming problems, each with a solution and thorough tests. The extension process included the following key elements:

  • Handcrafted Parallel Solutions: For each problem, expert engineers manually ported and peer-reviewed solutions into 13 additional languages: C++, C#, Dart, Go, Java, JavaScript, Kotlin, PHP, Ruby, Rust, Scala, Swift, and TypeScript, resulting in functionally equivalent code for a total of 14 languages.
  • Uniform Test Harnesses: Metadata for each problem—spanning function signature and representative test cases—was used to generate compilable, runnable test files for each language via a custom script, ensuring every solution is validated against semantically equivalent, language-adapted tests.
  • Data Structure: For each language–problem pair, the benchmark includes the function signature, the fully handwritten solution, and a test harness. This enables per-language and cross-language functional validation (Tao et al., 2024).

A key methodological choice in the dataset’s construction was peer-reviewed, handcrafted porting rather than automatic translation, addressing semantic drift and language-specific idioms.
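The per-language harness generation can be pictured as template rendering over problem metadata. The sketch below is a simplified, hypothetical version (the benchmark's released scripts and metadata schema differ); it emits a runnable Go test file that panics on the first failing case.

```python
# Hypothetical sketch of per-language test-harness generation from problem
# metadata. The benchmark's actual script and metadata schema differ.
GO_TEMPLATE = """package main

import "fmt"

{solution}

func main() {{
{checks}
}}
"""

def render_go_harness(solution_src: str, cases: list[dict]) -> str:
    """Render a runnable Go test file: call the function on each input and
    panic if the output differs from the expected value."""
    checks = []
    for i, case in enumerate(cases):
        args = ", ".join(case["inputs"])  # inputs already given as Go literals
        checks.append(
            f'\tif got := {case["entry_point"]}({args}); got != {case["expected"]} {{\n'
            f'\t\tpanic(fmt.Sprintf("case {i}: got %v", got))\n'
            f'\t}}'
        )
    return GO_TEMPLATE.format(solution=solution_src, checks="\n".join(checks))

harness = render_go_harness(
    "func Add(a int, b int) int { return a + b }",
    [{"entry_point": "Add", "inputs": ["2", "3"], "expected": "5"}],
)
```

Because the checks are derived from a single metadata record, every language's harness enforces the same input-output behavior, which is what makes cross-language results comparable.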

3. Evaluation Protocol and Metrics

PolyHumanEval evaluates code translation via "computational accuracy" (CA): the percentage of LLM-translated functions that, when executed under the same test logic as the reference solution, produce correct outputs on every test case. Formally:

$$\mathrm{Acc} = \frac{\#\ \text{of correctly translated functions}}{\text{total}\ \#\ \text{of functions}} \times 100\%$$

To eliminate annotation or coverage bias, every translated function is run in a test harness derived from the metadata for strict enforcement of functional equivalence (Tao et al., 2024).
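The metric reduces to a pass-rate computation; a minimal sketch, where the boolean verdicts are assumed to come from executing each translated function's test harness:

```python
# Minimal sketch of computational accuracy (CA). The pass/fail verdicts are
# assumed to come from running each translated function's test harness.
def computational_accuracy(passed: list[bool]) -> float:
    """CA = (# of correctly translated functions / total # of functions) x 100."""
    return 100.0 * sum(passed) / len(passed)

# e.g. 110 of HumanEval's 164 problems pass after translation
ca = computational_accuracy([True] * 110 + [False] * 54)
```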

The evaluation is conducted across all pairs of source and target languages for the full suite of problems, enabling robust cross-lingual analysis. Additional protocol details:

  • Prompt Engineering: Experiments assessed the effect of providing the target function signature in prompts and included both zero-shot and few-shot (2-shot) setups.
  • Test Generation Tools: The benchmark’s released scripts allow fully automated creation of new test cases for further benchmarking or fine-tuning.
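Signature priming and few-shot setups can be sketched as prompt assembly. The template below is hypothetical (the paper's exact wording differs), but it shows the key move: the prompt ends with the target-language signature, so the model need only complete the function body.

```python
# Hypothetical prompt builder for signature-primed, few-shot translation.
# The exact template used in the paper may differ.
def build_prompt(src_lang: str, tgt_lang: str, src_code: str,
                 tgt_signature: str, exemplars=()) -> str:
    parts = []
    for ex_src, ex_tgt in exemplars:  # optional 2-shot exemplars
        parts.append(f"### {src_lang}\n{ex_src}\n### {tgt_lang}\n{ex_tgt}\n")
    # Signature priming: the prompt ends mid-function in the target language.
    parts.append(f"### {src_lang}\n{src_code}\n### {tgt_lang}\n{tgt_signature}")
    return "\n".join(parts)

prompt = build_prompt(
    "Python", "Go",
    "def add(a, b):\n    return a + b",
    "func add(a int, b int) int {",
)
```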

4. Experimental Framework and Baseline Analyses

PolyHumanEval's experimental suite covers four prominent open-source code LLM families: CodeLlama (7B, 13B, 34B; standard, Python-further-trained, and instruction-tuned variants), StarCoder (base and tuned), CodeGen2.5 (several variants), and CodeGeeX2-6B. Over 110,000 code translation instances were evaluated, spanning 14 source languages × 13 target languages per source × 164 problems, across multiple models and prompt settings.

Findings from baseline settings included:

  • Pre-training and Tuning Effects: Further pre-training on Python benefits X→Python translation for smaller models, but degrades Python→X performance. Instruction tuning on natural language tasks can hurt translation accuracy due to conflicting objectives.
  • Few-Shot Improvements: Supplying the target signature (priming) increases CA by ~3.5 points. Two-shot exemplars also improve CA. For example, CodeLlama-13B yields 66.93% CA for Python→X in the 2-shot, signature-primed configuration, with per-language performance ranging from 58.54% (Dart) to 77.44% (Kotlin).
  • Performance Asymmetries: LLMs generally perform better on “X→Python” than “Python→X,” with notable variance among target languages (Tao et al., 2024).

5. Methods for Enhancing LLM Code Translation

Two complementary LLM optimization strategies were introduced and empirically assessed:

  • Intermediary Translation: Translation is decomposed into two stages via a pivot style or language. The process involves first translating source code to an intermediate style (e.g., from Python list comprehensions to basic for-loops) or to a “pivot” programming language, then translating the intermediate output to the final target language. Go was empirically found to be the most effective “lingua franca” across the evaluated languages. The Go pivot strategy yielded an average gain of +4.73 CA points (baseline 66.93% → 71.66%) for Python→Other translation, with per-language boosts up to +11.58.
  • Self-Training on Model-Generated Parallel Data: The model generates Python functions and test cases, filters for correctness, translates to the target language, filters again for passing cases, and aggregates passing pairs for LoRA-based fine-tuning. Using filtered (Pass@5) data for Python→Go, CA increased by +8.53 points (baseline 58.54% → 67.07%) and by +4.5 points averaged across all directions.
  • Combined Optimization: Jointly applying Pass@5 self-training and Go-pivot intermediary translation raised average CA for Python→Other tasks from the 66.93% baseline to 74.77% (Tao et al., 2024).
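The two-stage pivot strategy composes two ordinary translation calls. In the sketch below, `translate` is a hypothetical stub standing in for an LLM translation call; only the data flow is real.

```python
# Sketch of Go-pivot intermediary translation. `translate` is a hypothetical
# stub; in practice it would be an LLM call using prompts as in Section 3.
def translate(src_lang: str, tgt_lang: str, code: str) -> str:
    # Stand-in for a real model call: tag the code with the translation step.
    return f"// {tgt_lang} translation of {src_lang} input\n{code}"

def pivot_translate(src_lang: str, tgt_lang: str, code: str,
                    pivot: str = "Go") -> str:
    """Stage 1: source -> pivot language; Stage 2: pivot -> target language."""
    intermediate = translate(src_lang, pivot, code)
    return translate(pivot, tgt_lang, intermediate)

out = pivot_translate("Python", "Dart", "def f(x): return x * 2")
```

The same decomposition applies to style pivots, e.g. rewriting Python list comprehensions as plain for-loops in a first pass before translating to the target language.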

6. Empirical Results and Insights

Key empirical outcomes highlighted by PolyHumanEval include:

  • Baseline Results: For CodeLlama-13B, 2-shot translation with target signature yields 66.93% average CA for Python→Other, with highest per-language CA for Kotlin (77.44%) and lowest for Dart (58.54%).
  • Translation Optimization: Both intermediary language and self-training methods demonstrably enhance functional fidelity, with the Go pivot consistently serving as the optimal translation bridge.
  • Persisting Gaps: Despite these gains, Python-to-other-language translation accuracy remains well below 100%, and LLMs' comprehension and generation proficiency varies substantially across languages.
  • Scaling Considerations: The benchmark scope remains function-level and does not extend to larger codebases or multi-file projects (Tao et al., 2024).

7. Limitations and Future Directions

Current limitations include language imbalance (LLMs are generally stronger on “X→Python”), function-level context restriction, and benchmark coverage limited to 14 languages. Suggested avenues for future work are:

  • Automatic Pivot Selection: Employ language embeddings or performance-based predictors to dynamically select pivots.
  • Data Diversity: Expand synthetic data generation for self-training to encompass broader API coverage, longer code snippets, and richer programming paradigms.
  • Real-World Applicability: Generalize PolyHumanEval to open-source repositories, facilitate cross-file and modular evaluation, and develop additional metrics for readability, compilation success, API correctness, and human evaluation.
  • Comprehensive Metrics: Track secondary properties such as code clarity or maintainability alongside computational accuracy (Tao et al., 2024).

PolyHumanEval provides a rigorously validated, large-scale benchmark and demonstrates practical LLM improvement strategies, enabling systematic research and progress in automated, multilingual code translation.
