Bangla Code Instruction Datasets

Updated 14 September 2025
  • Bangla code instruction datasets are specialized corpora designed to train and evaluate LLMs for Bangla code generation using expert seeding, synthetic expansion, and translation methods.
  • They incorporate rigorous quality control measures, including syntax validation, BERTScore diversity checks, and runtime execution, ensuring high data fidelity.
  • Fine-tuning approaches such as LoRA on models like BongLLaMA and TigerCoder demonstrate significant performance improvements on benchmarks like MBPP-Bangla.

Bangla code instruction datasets are specialized language resources constructed to advance the adaptation and evaluation of LLMs for code generation and programmatic reasoning in the Bangla language. These corpora are engineered with domain-specific prompts, reference code, and rigorous quality validation at scale, aiming to overcome the scarcity of high-quality Bangla programming data and to reduce performance gaps between English-centric and Bangla code generation systems.

1. Composition and Taxonomy of Bangla Code Instruction Datasets

The principal Bangla code instruction datasets, as exemplified by "TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla" (Raihan et al., 11 Sep 2025) and the BongLLaMA work (Zehady et al., 28 Oct 2024), are constructed via complementary strategies to maximize diversity and practicality. The "Bangla-Code-Instruct" suite comprises three distinct partitions, each with 100,000 instruction–code pairs:

| Dataset Name | Creation Source | Validation Method |
|---|---|---|
| Bangla-Code-Instruct-SI | Human seed + Self-Instruct | Syntax + Execution + Diversity |
| Bangla-Code-Instruct-Syn | Synthetic (LLM-generated) | Syntax + BERTScore Diversity |
| Bangla-Code-Instruct-TE | High-quality translation | CometKiwi QE + BERTScore F1 |
  • Bangla-Code-Instruct-SI starts from 5,000 expert-authored seed prompts, expanded using GPT-4o, with paired Python code verified for syntax and execution in a controlled environment (Python 3.13, 10 s timeout, 16 GB RAM). Instructions with cosine similarity ≥ 0.95 to already-accepted entries are filtered out to preserve diversity (a sketch of such a filter follows this list). This produces "semi-natural" prompts tightly coupled to authentic programming workflows.
  • Bangla-Code-Instruct-Syn is generated by large proprietary LLMs (GPT-4o, Claude-3.5-Sonnet), with a BERTScore F1 threshold (< 0.7) enforcing intra-dataset diversity; code is validated as above.
  • Bangla-Code-Instruct-TE translates English instruction–code pairs (notably from Evol-Instruct) into Bangla via NLLB-200. Only translations with CometKiwi QE scores > 0.85 and BERTScore F1 > 0.95 are retained, thereby enforcing high linguistic and semantic fidelity.
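
The similarity filter above can be sketched as follows. This is a minimal illustration rather than the authors' released pipeline: the embedding model is an assumption (the papers do not name the encoder), while the 0.95 threshold comes from the Bangla-Code-Instruct-SI description.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed multilingual encoder; the papers do not specify which model
# produces the embeddings used for the cosine-similarity check.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def diversity_filter(instructions, threshold=0.95):
    """Greedily keep instructions whose cosine similarity to every
    already-kept instruction stays below the threshold."""
    kept, kept_embs = [], []
    for text in instructions:
        emb = model.encode(text, convert_to_tensor=True)
        # Reject near-duplicates of anything accepted so far.
        if any(util.cos_sim(emb, e).item() >= threshold for e in kept_embs):
            continue
        kept.append(text)
        kept_embs.append(emb)
    return kept
```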

Additionally, the MBPP-Bangla benchmark (Raihan et al., 11 Sep 2025) comprises 974 programming tasks, each annotated and adjudicated by native-speaker experts, with translated prompts and multi-language reference solutions (Python, Java, JavaScript, Ruby, C++). The benchmark is distributed in JSON Lines format, encompassing task metadata, reference code, test cases, and error-analysis instrumentation; a minimal loading sketch follows.
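
For concreteness, the snippet below reads such a JSON Lines release record by record. The field names (task_id, prompt) are hypothetical placeholders; the authoritative schema is whatever the released files define.

```python
import json

def load_jsonl(path):
    """Yield one benchmark record per line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Field names here are illustrative; consult the released MBPP-Bangla
# files for the actual schema.
for record in load_jsonl("mbpp_bangla.jsonl"):
    print(record.get("task_id"), str(record.get("prompt", ""))[:60])
```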

The Bangla-Alpaca-Orca dataset (Zehady et al., 28 Oct 2024) extends the Alpaca and OpenOrca English instruction corpora via automatic translation (Google Translation API) followed by manual curation to correct translation inaccuracies and infuse Bengali cultural context. The resulting corpus contains 172,000 high-quality instructions spanning diverse task categories, including coding prompts expressed in Bangla.

2. Data Augmentation and Quality Control Strategies

Bangla code instruction datasets institute robust data augmentation and quality assurance mechanisms:

  • For translation-based augmentation, the Google Translation API and NLLB-200 models are used as baseline tools. Manual post-editing addresses semantic loss and cultural incongruity, integrating local idioms and contextual elements (e.g., references to Bangladeshi events).
  • In synthetic and self-instruct splits, code is validated for syntax using Python's ast.parse and for correctness via runtime execution in a sandboxed environment (see the sketch after this list). Diversity among instructions is maintained via BERTScore and cosine-similarity metrics, with thresholds constraining redundancy.
  • Translated instructions are vetted using CometKiwi QE and BERTScore; only instances meeting stringent thresholds (> 0.85 and > 0.95, respectively) are retained.
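
A minimal sketch of the two-stage code check is shown below, assuming plain subprocess isolation and the 10-second timeout cited earlier; the papers' full sandbox (the 16 GB memory cap, any network isolation) is not detailed beyond those parameters.

```python
import ast
import subprocess
import sys

def syntax_ok(code: str) -> bool:
    """Stage 1: reject code that does not parse as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def executes_ok(code: str, timeout: float = 10.0) -> bool:
    """Stage 2: run the snippet in a child interpreter and require a
    clean exit within the timeout. A production sandbox would also
    enforce the 16 GB memory cap and block network access."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def validate(code: str) -> bool:
    return syntax_ok(code) and executes_ok(code)
```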

Such measures yield corpora that cover the syntactic range of Bangla instructions while mitigating common pitfalls such as code-instruction drift, homogeneity, and translation artifacts. This practice directly addresses the quality requirements of low-resource settings, reducing reliance on massive but generic multilingual data.
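
The translation-vetting step can be approximated as follows. The BERTScore call uses the bert-score package; the CometKiwi quality-estimation scores are taken as precomputed inputs (obtainable with the unbabel-comet package), and because the papers do not state what the BERTScore comparison reference is, Bangla reference translations are an assumption here.

```python
from bert_score import score

def filter_translations(candidates, references, qe_scores,
                        qe_threshold=0.85, f1_threshold=0.95):
    """Keep candidate translations whose CometKiwi QE score exceeds
    0.85 and whose BERTScore F1 exceeds 0.95. `qe_scores` is assumed
    precomputed (e.g., via unbabel-comet); `references` (reference
    Bangla translations) are an assumption, since the papers do not
    specify the BERTScore reference."""
    _, _, f1 = score(candidates, references, lang="bn")
    return [
        (cand, ref)
        for cand, ref, qe, f in zip(candidates, references,
                                    qe_scores, f1.tolist())
        if qe > qe_threshold and f > f1_threshold
    ]
```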

3. Fine-Tuning Methodologies and Model Architectures

Instruction datasets are utilized for domain adaptation of LLMs through specialized fine-tuning protocols:

  • BongLLaMA models (Zehady et al., 28 Oct 2024) employ the Hugging Face LlamaForCausalLM architecture with AutoTokenizer. Fine-tuning is performed with Low-Rank Adaptation (LoRA), which injects task-specific signal efficiently by training small low-rank updates to the linear layers rather than retraining all parameters. Training uses the AdamW optimizer, a maximum sequence length of 4096 tokens, a cosine learning-rate scheduler, and gradient accumulation, with all computation in 16-bit floating point (bfloat16). A configuration sketch follows this list.
  • TigerCoder LLMs (Raihan et al., 11 Sep 2025), available at 1B and 9B parameter scales, are fine-tuned on the full 300,000-example corpus using similar protocols. Hyperparameter selection covers LoRA rank, alpha, and step scheduling, optimized empirically to maximize Pass@K.
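
The sketch below assembles a LoRA setup consistent with the reported choices (LlamaForCausalLM, AdamW, cosine schedule, 4096-token context, bfloat16, gradient accumulation). The base checkpoint, LoRA rank/alpha, target modules, learning rate, and accumulation steps are illustrative assumptions; the papers tune these empirically and do not fix them in the text.

```python
import torch
from transformers import (AutoTokenizer, LlamaForCausalLM,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder; not the papers' checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Illustrative LoRA hyperparameters; rank and alpha are tuned
# empirically in the papers rather than fixed in the text.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="bangla-code-lora",
    bf16=True,                      # bfloat16 computation, as reported
    optim="adamw_torch",            # AdamW optimizer
    lr_scheduler_type="cosine",     # cosine scheduler
    gradient_accumulation_steps=8,  # illustrative value
    learning_rate=2e-4,             # illustrative value
    num_train_epochs=3,
)

# `train_dataset` is assumed: a tokenized dataset of instruction-code
# pairs truncated to the 4096-token maximum; its construction is omitted.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```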

These methodological choices facilitate rapid and resource-efficient adaptation, permitting smaller models to approximate or surpass the accuracy of much larger, multilingual foundation models on Bangla code-centric tasks.

4. Evaluation Benchmarks and Performance Metrics

The principal metric for code generation evaluation is Pass@K, defined as:

$$\text{Pass@}K = 1 - \frac{\binom{n-m}{K}}{\binom{n}{K}}$$

where $n$ is the total number of programs sampled per problem, $m$ is the number of correct programs passing all test cases, and $K$ is the number of samples scored. This formulation enables robust quantification of code generation performance on the MBPP-Bangla benchmark.
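
The estimator translates directly into code. The sketch below uses Python's math.comb; when fewer than $K$ incorrect samples exist, every draw of $K$ samples contains a correct program, so the estimator returns 1 (math.comb already yields 0 in that case, but the guard makes the intent explicit).

```python
from math import comb

def pass_at_k(n: int, m: int, k: int) -> float:
    """Unbiased Pass@K: probability that at least one of k samples,
    drawn without replacement from n generations (m of them correct),
    passes all test cases."""
    if n - m < k:
        # Fewer than k incorrect samples exist, so any k-sample draw
        # must include a correct program.
        return 1.0
    return 1.0 - comb(n - m, k) / comb(n, k)

# Example: 20 samples per problem, 5 correct, scored at K = 1.
print(pass_at_k(20, 5, 1))  # 0.25
```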

Empirical results reveal that the TigerCoder LLMs achieve absolute improvements of 11–18% in Pass@1 compared to baseline Bangla and multilingual models; specifically, the 1B TigerCoder model outperforms models 27 times larger by 4–8 percentage points (Raihan et al., 11 Sep 2025). BongLLaMA, tuned on the Bangla-Alpaca-Orca dataset, demonstrates marked improvements on reasoning and code generation tasks expressed in Bangla (Zehady et al., 28 Oct 2024).

In related tasks, the TB-OLID dataset (Raihan et al., 2023) shows that English pre-trained transformer models (fBERT, HateBERT), when fine-tuned on transliterated and code-mixed Bangla, excel in offensive language identification (F1 ≈ 0.72) and fine-grained target classification (F1 = 0.68), outstripping monolingual Bangla models on Latin-script data. This suggests analogous transfer strategies may enhance Bangla code instruction modeling in environments with non-standard transliteration and code-mixing.

5. Addressing Low-Resource Language Challenges

Bangla, despite being the 5th most-spoken language globally, is classified as low-resource in the context of NLP and code generation tasks. The limitations chiefly stem from a lack of curated, domain-relevant instruction-code data.

A core insight substantiated by these works (Raihan et al., 11 Sep 2025; Zehady et al., 28 Oct 2024) is that careful curation, integrating human-authored, synthetic, and translation-based prompts with rigorous validation, enables smaller models, through domain-adaptive fine-tuning, to reach performance levels traditionally associated with much larger, English-centric architectures. Such datasets also help models generalize effectively across code-mixed and transliterated tasks, provided preprocessing and model architectures are tailored to handle spelling variability, OOV incidence, and semantic drift.

The open-sourcing of the Bangla-Code-Instruct datasets, MBPP-Bangla benchmark, and BongLLaMA resources significantly lowers the entry barrier for further research and strengthens transparency, reproducibility, and community engagement.

6. Implications and Applicability to Multilingual and Code-Mixed Settings

The methodology and findings in Bangla code instruction dataset construction hold direct relevance for adjacent tasks and languages:

  • The success of instruction tuning on translated and culturally adapted data demonstrates the efficacy of hybrid human-machine curation workflows, especially for low-resource languages with rich contextual and idiomatic content.
  • Performance gains achieved by fine-tuning English pre-trained transformer models (e.g., fBERT, HateBERT) on transliterated Bangla (Raihan et al., 2023) indicate that code-mixed corpora and models pre-trained on English remain valuable for tasks where significant English vocabulary permeates.
  • A plausible implication is that similar strategies—combining expert seeding, synthetic expansion, multilingual translation, and rigorous validation—will accelerate progress on code generation for other underrepresented languages, particularly when combined with comprehensive benchmarks akin to MBPP-Bangla.

7. Contextualization and Prospects for Future Research

Bangla code instruction datasets are pivotal in bridging the performance gap between English-dominant and Bangla code generation systems. They represent a robust framework for training, evaluating, and benchmarking LLMs, fostering systematic advancement in language technologies for low-resource contexts.

Continued efforts toward expanding corpus diversity, enhancing quality assurance protocols, and refining benchmark standards hold promise for future research in multilingual code LLMs, instruction following, and transliterated/code-mixed language processing. The demonstrated improvements in both performance metrics and practical code generation fidelity underscore the lasting significance of these datasets in advancing Bangla NLP and adjacent domains.
