MBPP-Bangla: Bangla Code Evaluation Benchmark
- MBPP-Bangla is a benchmark for evaluating code synthesis from Bangla natural language prompts, developed from translated MBPP tasks with validated programming intent.
- The benchmark employs a multi-phase construction with expert translation and adaptation of reference solutions across five programming languages.
- Empirical results with TigerCoder LLMs reveal substantial pass@K gains, underscoring the benefits of language-targeted, low-resource model fine-tuning.
MBPP-Bangla defines a code generation evaluation benchmark specifically crafted for the Bangla language, drawing from the original Mostly Basic Python Programs (MBPP) corpus. Developed for use in conjunction with the new TigerCoder suite of LLMs for Bangla code generation (Raihan et al., 11 Sep 2025), MBPP-Bangla quantifies a model's ability to interpret and generate code from Bangla natural language prompts, a capability not robustly assessed by preexisting resources.
1. Motivation and Benchmark Construction
The central motivation for MBPP-Bangla is to create a rigorous, language-specific evaluation resource targeting the intersection of natural language understanding and code synthesis for low-resource languages (primarily Bangla). Most code generation benchmarks rely on English prompts; when these are translated, especially into underrepresented languages, model performance often degrades sharply. MBPP-Bangla provides native Bangla speakers and Bangla NLP researchers with a controlled, reproducible, and technically challenging evaluation bed.
The construction pipeline is multi-phased:
- Extraction of 974 diverse programming tasks from the canonical MBPP English dataset, covering a wide array of computational patterns (algorithms, data structures, arithmetic, string manipulation, and file I/O).
- Each natural language task prompt is independently translated into Bangla by highly proficient bilingual speakers, then validated by technical experts for both semantic accuracy and programming intent.
- For each task, reference code solutions are adapted for five programming languages (Python, Java, JavaScript, Ruby, C++), with manual interventions to ensure idiomatic and functional correctness relative to the original author’s specification.
- Problems are further labeled by topic to allow for granular downstream analysis of area-specific model competence.
2. Benchmark Structure and Data Format
MBPP-Bangla is distributed in a JSONLines format; each instance includes:
- A unique task identifier
- The Bangla instruction/prompt (meticulously translated from English)
- Canonical reference solutions in five programming languages
- The original suite of test cases (ported and validated across languages)
- A categorical topic annotation
This structure enables LLMs to be evaluated not only on Python (the MBPP default) but also on a spectrum of target languages, incentivizing truly language-agnostic code synthesis.
MBPP-Bangla Problem Record Table (excerpt; pseudostructure):

| Field | Description | Example Value |
|---|---|---|
| id | Unique task identifier | "bangla_0032" |
| prompt_bn | Bangla-language program instruction | "একটি ফাংশন লেখ যা দুটি সংখ্যার গ.সা.গু নির্ণয় করবে" ("Write a function that computes the GCD of two numbers") |
| solutions | Code solutions for each language | { "python": "...", "java": "...", ... } |
| test_cases | Unit tests to validate correctness | [ { "input": "...", "output": "..." }, ... ] |
| topic | Problem category label | "Math" |
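To make the record layout concrete, a minimal loading sketch is shown below. The filename `mbpp_bangla.jsonl` and the exact field names are assumptions taken from the pseudostructure above, not guaranteed to match the official release.

```python
import json

# Hypothetical filename; the released dataset may use a different name or split layout.
DATASET_PATH = "mbpp_bangla.jsonl"

def load_tasks(path):
    """Yield one MBPP-Bangla task record per JSONLines entry."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for task in load_tasks(DATASET_PATH):
    # Field names follow the pseudostructure table above (assumed).
    print(task["id"], task["topic"])
    print("Prompt (Bangla):", task["prompt_bn"])
    print("Languages with reference solutions:", sorted(task["solutions"]))
    print("Number of test cases:", len(task["test_cases"]))
    break  # inspect only the first record
```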
3. Evaluation Metrics and Protocols
Evaluation on MBPP-Bangla leverages the Pass@K metric, which measures the probability that at least one of K generated programs passes all provided unit tests for a given task. The formal definition (the standard unbiased estimator) is:

$$
\text{pass@}K \;=\; \mathbb{E}_{\text{tasks}}\left[\, 1 - \frac{\binom{n-c}{K}}{\binom{n}{K}} \,\right]
$$

where $n$ is the total number of programs generated per task and $c$ is the number that pass all tests. Pass rates are reported at $K=1$ (single-shot), $K=10$ (practical shortlist), and $K=100$ (exhaustive sampling), providing a spectrum between realistic user interactions and upper-bound model capability.
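For concreteness, this estimator is commonly computed with the numerically stable product form popularized by the original HumanEval/MBPP evaluations; the sketch below is a generic implementation, not code taken from the MBPP-Bangla release.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single task.

    n -- total number of programs sampled for the task
    c -- number of samples that pass all unit tests
    k -- number of samples the user is allowed to inspect
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a product to avoid large binomials
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per task, 37 of which pass all tests
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(200, 37, k):.3f}")
```

Benchmark-level Pass@K is then the mean of these per-task estimates across all 974 tasks.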
4. Empirical Findings: TigerCoder LLMs on MBPP-Bangla
Application of MBPP-Bangla in the TigerCoder project reveals several salient empirical results:
- The TigerCoder-1B model (1B parameters), fine-tuned on Bangla code instruction datasets, outperforms much larger multilingual LLMs (with up to 27 times more parameters) by 4–8 percentage points in pass@1, a strong indication that domain- and language-targeted data curation can outweigh raw scale for Bangla code generation.
- TigerCoder-9B achieves even larger gains: roughly 11–18 percentage points of absolute improvement in pass@1 (with similar trends at pass@10 and pass@100) compared with state-of-the-art open and proprietary code models, underscoring the value of benchmarking with MBPP-Bangla in low-resource language settings.
- Detailed analytics by topic and programming language reveal that performance disparities (e.g., lower scores on advanced data processing vs. high scores on string tasks) are well preserved across translation, reflecting both the original MBPP structure and the efficacy of the translation and curation process.
Performance Summary Table (Pass@1 Example):

| Model | Parameters | Pass@1 (%) | Absolute Gain vs. Baseline (pp) |
|---|---|---|---|
| TigerCoder-1B | 1B | X | +4–8 |
| TigerCoder-9B | 9B | Y | +11–18 |
| Multilingual LLM (baseline) | 27B | (lower) | — |

(Specific scores "X" and "Y" are as reported in the cited data.)
5. Design Considerations and Benchmarking Insights
MBPP-Bangla’s unique dual-focus—on both natural language understanding in Bangla and precise code synthesis—surfaces critical issues:
- Many multilingual LLMs exhibit strong performance degradation when Bangla instructions are supplied, often failing to parse instructions accurately or defaulting to English code documentation.
- High-quality, topic-diverse, and technically precise natural language prompts are crucial for discriminating true model competence from mere memorization or English-centric pattern matching; in practice this discrimination rests on executing candidate programs against the ported test cases (see the sketch after this list).
- The benchmark’s cross-language solution mapping enables researchers to assess transfer and generalization across programming languages, an aspect not typically captured in other monolingual testbeds.
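As a rough illustration of the functional-evaluation step referenced above, the sketch below runs a candidate Python program together with assert-style tests in a separate process with a timeout. It assumes the ported test cases can be rendered as executable assertions; that rendering, and the absence of full sandboxing, are simplifications rather than documented properties of the official harness.

```python
import multiprocessing

def _run_candidate(code: str, test_code: str, queue) -> None:
    """Execute candidate code plus its tests in a fresh namespace."""
    try:
        namespace = {}
        exec(code, namespace)        # define the candidate function(s)
        exec(test_code, namespace)   # assert-style tests raise on failure
        queue.put(True)
    except Exception:
        queue.put(False)

def passes_all_tests(code: str, test_code: str, timeout: float = 5.0) -> bool:
    """Return True if the candidate passes its tests within the timeout.

    A production harness would add real sandboxing (resource limits,
    filesystem/network isolation); this sketch only isolates the process.
    """
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run_candidate, args=(code, test_code, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return False
    return (not queue.empty()) and queue.get()

# Toy candidate and a hypothetical assert-style rendering of its test cases
candidate = "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a"
tests = "assert gcd(12, 18) == 6\nassert gcd(7, 13) == 1"
if __name__ == "__main__":
    print(passes_all_tests(candidate, tests))  # expected: True
```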
A plausible implication is that targeted fine-tuning on appropriately translated and validated code generation datasets can compensate for reduced model parameter count in low-resource language settings, optimizing both research cost and deployment practicalities.
6. Impact and Research Trajectory
MBPP-Bangla, in tandem with the release of domain-tuned TigerCoder models, constitutes a substantive advance in the infrastructure for Bangla code generation research. Its introduction:
- Provides the first openly available, large-scale evaluation benchmark of this type for Bangla code LLMs, addressing a significant gap in language-inclusive NLP and code synthesis evaluation.
- Demonstrates that careful benchmark design, comprising linguistically and technically sound translations, solution verification, and domain labeling, is essential for credible multilingual and low-resource evaluation.
- Sets a baseline for reproducible model comparison and encourages further dataset expansion, method development, and cross-lingual technology transfer for programming in low-resource settings.
The benchmark and associated models are released open-source, facilitating both further benchmarking research and practical integration in educational, professional, and broader NLP domains for Bangla programmers and learners.