
mHumanEval-Bangla Benchmark

Updated 17 April 2026
  • mHumanEval-Bangla is a benchmark adapting HumanEval to Bangla, featuring native instructions and standardized Python test suites.
  • It employs a rigorous two-stage translation and expert review process using metrics like BERTScore and CometKiwi for high technical fidelity.
  • The BanglaCodeAct framework uses an iterative Thought–Code–Observation loop to achieve up to 94% pass@1 on key models.

mHumanEval-Bangla is a Bangla-translated and Bangla-native extension of the HumanEval benchmark designed to advance the evaluation of code generation systems for low- and mid-resource languages. Originating from the broader mHumanEval project, which supports 204 natural languages, mHumanEval-Bangla facilitates the benchmarking of LLMs on Bangla-to-Python code generation tasks. The benchmark standardizes the evaluation of LLMs' abilities to generate correct Python programs from natural-language specifications written in Bangla, a mid-resource language, and forms the core testbed for contemporary Bangla NL2Code research efforts (Islam et al., 27 Nov 2025, Raihan et al., 2024).

1. Dataset Composition and Curation

mHumanEval-Bangla comprises 164 programming problems, matching the size and coverage of the original HumanEval benchmark (Raihan et al., 2024). Each problem consists of a Bangla-language instruction, a canonical Python function signature, and a concise set (3–5) of unit tests. These instructions are presented in grammatically correct, technically precise Bangla, achieved through a two-stage translation and validation pipeline:

  • Multiple machine translation systems (GPT-4o, NLLB, Google Translate) produce candidate Bangla versions for each English prompt.
  • For 15 focus languages, including Bangla, native speakers with CS/IT backgrounds create expert translations, followed by a secondary review for technical fidelity.
  • Candidates are evaluated using BERTScore (reference-based, leveraging round-trip translation) and CometKiwi (reference-free, human-judgment-learned). The selected prompt for each problem maximizes the average of these metrics, yielding average values of BERTScore ≈ 0.942 and CometKiwi ≈ 0.849 for Bangla (Raihan et al., 2024).
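The selection rule in the last step can be sketched in a few lines of Python. This is a minimal illustration that assumes an unweighted average of the two metrics; the candidate records, field names, and scores below are hypothetical placeholders, not values from the benchmark.

```python
from statistics import mean

# Hypothetical candidates for one English prompt. The scores are
# illustrative placeholders, not measurements from mHumanEval.
candidates = [
    {"system": "gpt-4o", "bertscore": 0.95, "cometkiwi": 0.84},
    {"system": "nllb", "bertscore": 0.93, "cometkiwi": 0.82},
    {"system": "google-translate", "bertscore": 0.94, "cometkiwi": 0.86},
]


def combined_score(candidate):
    """Unweighted average of the two translation-quality metrics."""
    return mean([candidate["bertscore"], candidate["cometkiwi"]])


def select_best(candidates):
    """Return the candidate translation with the highest combined score."""
    return max(candidates, key=combined_score)


best = select_best(candidates)
print(best["system"], round(combined_score(best), 3))
```

Averaging a reference-based and a reference-free metric means a candidate has to score well on both to be selected, so a high score on one cannot compensate for a poor score on the other.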

Each task retains the original canonical Python reference solution and test suite from HumanEval, ensuring cross-language evaluation comparability.
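A task record of this kind might be represented as follows. The sketch borrows the `entry_point`, `canonical_solution`, and `test` field names from the original HumanEval release; `BanglaTask`, `prompt_bn`, `verify_canonical`, and the toy task are hypothetical, and the canonical solution is stored as a complete function here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class BanglaTask:
    task_id: str
    prompt_bn: str           # Bangla instruction with the function signature
    entry_point: str         # name of the function under test
    canonical_solution: str  # reference Python implementation
    test: str                # source code defining check(candidate)


def verify_canonical(task: BanglaTask) -> bool:
    """Run the reference solution against the task's own unit tests."""
    namespace = {}
    try:
        exec(task.canonical_solution, namespace)
        exec(task.test, namespace)
        namespace["check"](namespace[task.entry_point])
    except Exception:
        return False
    return True


# Toy task for illustration only; it is not part of the benchmark.
toy_task = BanglaTask(
    task_id="toy/0",
    prompt_bn='def add(a, b):\n    """দুটি সংখ্যার যোগফল ফেরত দিন।"""\n',
    entry_point="add",
    canonical_solution="def add(a, b):\n    return a + b\n",
    test=(
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
)

print(verify_canonical(toy_task))
```

Because only the instruction text changes between languages, the same solution and test strings can be reused to score any language variant of a task.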

2. Evaluation Metrics and Protocol

Performance on mHumanEval-Bangla is measured with the pass@k metric, which is standard in code generation evaluation. For a given task, pass@k estimates the probability that at least one of k samples generated by the model passes all unit tests:

$$\mathrm{pass}@k = 1 - \prod_{i=0}^{k-1} \frac{n - c - i}{n - i}$$

where $n$ is the number of generated samples and $c$ is the number of correct samples. In standard practice, pass@1 is reported; with a single sample per task, it equals the proportion of tasks whose sample passes every test (Islam et al., 27 Nov 2025, Raihan et al., 2024).
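The estimator can be computed directly from the product form. The sketch below is a plain-Python illustration; the function names and the per-task counts in the usage lines are hypothetical.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    prob_all_fail = 1.0
    for i in range(k):
        prob_all_fail *= (n - c - i) / (n - i)
    return 1.0 - prob_all_fail


def benchmark_pass_at_k(per_task_counts, k: int) -> float:
    """Average the per-task estimates over (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in per_task_counts]
    return sum(scores) / len(scores)


# Hypothetical counts for three tasks, 10 samples each.
counts = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(counts, k=1))
print(benchmark_pass_at_k(counts, k=5))
```

For k = 1 the product has a single factor, so the per-task estimate reduces to c / n.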

The unit test suites remain unaugmented relative to the original HumanEval. This design decision allows direct comparison with prior work but also inherits the coverage limitations of HumanEval's test harness (Raihan et al., 2024).

3. BanglaCodeAct: Agent-Based Framework for Bangla NL2Code

BanglaCodeAct is an open-source agent-based framework that structures Bangla-to-Python code generation as a sequence of multi-agent interactions. It implements an iterative “Thought–Code–Observation” loop with the following agents per turn (Islam et al., 27 Nov 2025):

  • Thought Agent: Produces a Bangla-language reasoning trace, outlining the algorithmic approach.
  • Code Agent: Emits a Python implementation encapsulated in a <code>…</code> block, accompanied by the suite of Bangla-translated test assertions.
  • Observation Agent: Executes the candidate code in a sandboxed REPL; records whether all tests pass or returns the exception trace.

The loop repeats until all tests pass or the iteration budget (T = 10) is exhausted. Iterative self-correction enables dynamic refinement based on execution feedback.
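Before the Observation Agent can execute anything, the Python source has to be pulled out of the Code Agent's `<code>…</code> `block. A minimal sketch of that parsing step is shown here; the helper name `extract_code` and the choice to keep the last block when several are present are assumptions, not details confirmed by the source.

```python
import re

# Non-greedy match so that several <code> blocks in one response stay separate.
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def extract_code(response: str):
    """Return the last <code>...</code> block in a response, or None."""
    matches = CODE_BLOCK.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


reply = "পরিকল্পনা ...\n<code>\ndef add(a, b):\n    return a + b\n</code>"
print(extract_code(reply))
```

Returning `None` when no block is found lets the caller treat a malformed response as a failed turn and feed that back as an observation.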

A representative pseudocode for the iterative process is provided below:

procedure BANG_LA_CODE_ACT(prompt, tests, max_iter=10):
    history ← []
    for t in 1..max_iter:
        # Thought: reason about the task in Bangla, conditioned on earlier turns
        thought ← LLM.generate(
            system="You are an assistant reasoning in Bangla.",
            user=[history, "<step> দেখুন পরিকল্পনা:" + prompt]
        )
        # Code: turn the reasoning trace into a Python implementation
        code_snippet ← LLM.generate(
            system="Now write Python code based on the above thought.",
            user=thought
        )
        # Observation: execute the candidate against the unit tests
        result ← safe_run(code_snippet, tests, timeout=5s)
        history.append(("<thought>", thought),
                       ("<code>", code_snippet),
                       ("<obs>", result))
        if result.status == PASS:
            return code_snippet
    return FAILURE
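The `safe_run` step in the pseudocode could be approximated as below. This is a sketch under stated assumptions: the source does not describe the sandbox, so the `RunResult` type, the status strings, and the use of a child Python process are all hypothetical. A child process with a timeout isolates crashes and infinite loops but is not a security boundary; untrusted code would need a container or similar isolation.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunResult:
    status: str  # "PASS", "FAIL", or "TIMEOUT"
    detail: str  # stderr of the failed run, empty on success


def safe_run(code: str, tests: str, timeout_s: float = 5.0) -> RunResult:
    """Run candidate code followed by its assertions in a child interpreter."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return RunResult("TIMEOUT", f"exceeded {timeout_s} s")
    if proc.returncode == 0:
        return RunResult("PASS", "")
    # A failed assertion or uncaught exception leaves its traceback on stderr.
    return RunResult("FAIL", proc.stderr.strip())


good = safe_run("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")
bad = safe_run("def add(a, b):\n    return a - b", "assert add(2, 3) == 5")
print(good.status, bad.status)
```

Returning the traceback text rather than a bare boolean gives the next Thought step something concrete to reason about.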

The underlying LLM for BanglaCodeAct is Qwen3-8B (8B parameters, 64k subword SentencePiece tokenizer), selected for its multilingual training corpus, including Bangla, and operated in zero-shot or few-shot mode via engineered prompts. Inference employs vLLM with tensor parallelism and robust multi-turn safe execution primitives.

4. Empirical Results and Comparative Analysis

Experimental evaluation on mHumanEval-Bangla demonstrates the efficacy of the agent-based iterative correction procedure. Table 1 summarizes key findings on the dev and test sets:

Model Method dev set test set
Qwen3-8B BanglaCodeAct 94.0% 71.6%
Qwen3-8B Self-Consistency 88.0%
Qwen3-8B Majority Voting 66.0%
Qwen3-8B Few-Shot Prompting 46.0%
Qwen3-8B Zero-Shot Prompting 36.0%
Llama-3.1-8B Few-Shot Prompting 77.0%
DeepSeek-Coder BanglaCodeAct 73.8%
TigerLLM-1B-it Zero-Shot Prompting 11.0%

Statistical significance is confirmed: the difference between BanglaCodeAct (94.0%) and the next best baseline (88.0%) is significant at $p < 0.01$ under a two-sided bootstrap with $B = 10\,000$ resamples. For Qwen3-8B with BanglaCodeAct, pass@5 further increases to 98.2% on the development set, indicating the high coverage achievable with multiple samples (Islam et al., 27 Nov 2025).
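A two-sided bootstrap test of this kind can be sketched as follows. The source does not state which bootstrap variant was used, so this is one common choice, a paired bootstrap over per-task pass/fail outcomes, and not necessarily the authors' procedure. The function name and the outcome vectors are hypothetical.

```python
import random


def paired_bootstrap_pvalue(outcomes_a, outcomes_b, num_resamples=10_000, seed=0):
    """Two-sided paired bootstrap test on the difference in mean pass rate.

    outcomes_a and outcomes_b hold 0/1 results for the same tasks, in the
    same order. Returns (observed difference, p-value).
    """
    if len(outcomes_a) != len(outcomes_b) or not outcomes_a:
        raise ValueError("outcome lists must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(outcomes_a)
    diffs = [a - b for a, b in zip(outcomes_a, outcomes_b)]
    observed = sum(diffs) / n
    extreme = 0
    for _ in range(num_resamples):
        sample = rng.choices(diffs, k=n)  # resample tasks with replacement
        resampled = sum(sample) / n
        # Centring on the observed difference approximates the null distribution.
        if abs(resampled - observed) >= abs(observed):
            extreme += 1
    return observed, extreme / num_resamples


# Hypothetical per-task outcomes for two methods on ten tasks.
method_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
method_b = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
diff, p_value = paired_bootstrap_pvalue(method_a, method_b)
print(f"difference = {diff:.2f}, p = {p_value:.4f}")
```

Pairing by task matters because both methods are scored on the same problems, so per-task differences carry less variance than two independent samples would.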

For comparison, leading proprietary models on mHumanEval-Bangla (Claude 3.5-Opus, GPT-4o) both yield pass@1 of approximately 80%. This establishes the open-source agent-based architecture as competitive for Bangla NL2Code (Raihan et al., 2024).

5. Design Insights and Error Analysis

Ablation experiments reveal that removing the Thought agent and using a direct code-execute loop degrades pass@1 from 94.0% to 80.5%, indicating the importance of intermediate reasoning traces. Limiting the number of self-correction iterations to 3 yields 85.3% (vs. 94.0% at 10 iterations), illustrating that deeper correction loops are beneficial. Reducing model capacity (e.g., substituting Qwen2.5-7B) diminishes pass@1 to 78.4% (Islam et al., 27 Nov 2025).

Error analysis identifies two primary classes of failures: instruction ambiguity (e.g., imprecision in Bangla prompt semantics) and performance bottlenecks on tasks with deeply nested logic or recursion. Successful solutions typically converge in 4 iterations, while failures often reach the 10-iteration ceiling.

Qualitative examples demonstrate stepwise error localization and correction. For instance, in a string manipulation task, the initial code passes the slicing tests but fails the character-specific ones; iterative refinement then yields the correct implementation.

6. Comparative Position and Best Practices

mHumanEval-Bangla sets a precedent for multilingual code generation evaluation, providing a fully localized, expert-validated testbed for Bangla. Code generation quality in Bangla (≈0.79–0.94 pass@1 depending on system) is consistently lower than in English (≈0.92–0.94), highlighting the relative performance gap for mid-resource languages. Errors stem predominantly from insufficient Bangla exposure during pretraining rather than from the translation process itself (Raihan et al., 2024).

Best practices for extending such multilingual benchmarks include combining multi-engine translation pipelines, automatic translation-quality metrics (round-trip BERTScore and reference-free CometKiwi), and expert human review of technical prompts. Canonical solutions and test harnesses in a fixed programming language (Python) standardize cross-language comparison.

7. Limitations and Future Directions

Key limitations of mHumanEval-Bangla and associated frameworks include:

  • Model size ceiling: Only ≤8B parameter models were evaluated due to hardware availability; larger LLMs may further close the performance gap.
  • Dependency on high-quality unit tests: The approach presumes comprehensive tests for all Bangla code tasks, but such coverage may not be available in real-world scenarios.
  • Semantic ambiguity: Underspecified or context-dependent Bangla instructions remain a persistent challenge.

Potential future enhancements include integrating retrieval-augmented techniques for grounding, expanding the suite with harder or more diverse programming tasks, and introducing human-in-the-loop clarification for ambiguous instructions (Islam et al., 27 Nov 2025).

In summary, mHumanEval-Bangla provides a rigorous, reproducible platform for evaluating Bangla NL2Code systems, and agent-based iterative self-correction significantly enhances performance on this benchmark (Islam et al., 27 Nov 2025, Raihan et al., 2024).
