TigerCoder: Bangla Code LLM Suite
- TigerCoder is a dedicated Bangla code generation suite that employs transformer-based, decoder-only LLMs with 1B and 9B parameters.
- It leverages a meticulously curated 300K Bangla instruction–code corpus with advanced filtering for syntactic and semantic quality.
- Empirical evaluations on MBPP-Bangla show superior Pass@1 performance, outperforming larger multilingual baselines.
The TigerCoder family comprises the first dedicated suite of LLMs for code generation in Bangla, addressing a crucial underrepresentation of Bangla in code-centric language modeling. This suite consists of two transformer-based, decoder-only models with parameter counts of approximately 1 billion (1B) and 9 billion (9B), both derived from Bangla-specialized TigerLLM checkpoints. TigerCoder emphasizes adaptation to the programming domain through carefully curated instruction–code datasets and is evaluated with MBPP-Bangla, a benchmark specifically constructed for Bangla code generation. Empirical results demonstrate substantial performance gains over existing multilingual and general-purpose LLMs, showcasing the impact of targeted data curation in resource-constrained linguistic domains.
1. Model Architecture and Parameterization
Each TigerCoder model employs a pre-norm decoder-only Transformer architecture, parameterized as follows:
- 1B variant: 24 layers ($L = 24$), hidden dimension $d_{\text{model}} = 1024$, 16 attention heads ($h = 16$), feed-forward inner dimension $d_{\text{ff}} = 4096$, yielding $\approx$1B parameters.
- 9B variant: 32 layers ($L = 32$), hidden dimension $d_{\text{model}} = 4096$, 32 attention heads ($h = 32$), feed-forward inner dimension $d_{\text{ff}} = 16384$, yielding $\approx$9B parameters.
The total parameter count is estimated, to leading order, by

$$N \approx L\left(4\,d_{\text{model}}^{2} + 2\,d_{\text{model}}\,d_{\text{ff}}\right) = 12\,L\,d_{\text{model}}^{2} \qquad (d_{\text{ff}} = 4\,d_{\text{model}})$$

The leading term captures the self-attention and feed-forward network parameters, reflecting the scaling logic used in modern Transformer-based LLMs.
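As a sanity check, this leading-order estimate can be applied directly to the configurations above. The sketch below is illustrative rather than the authors' code; it counts only attention and feed-forward weights, so embedding parameters (vocabulary size × $d_{\text{model}}$, which can be substantial for a multilingual vocabulary) must be added on top to approach the headline totals.

```python
# Leading-order Transformer parameter estimate: per layer, self-attention
# contributes 4 * d_model^2 (Q, K, V, and output projections) and the
# feed-forward block contributes 2 * d_model * d_ff. Embeddings, biases,
# and normalization parameters are deliberately excluded.

def estimate_params(num_layers: int, d_model: int, d_ff: int) -> int:
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return num_layers * (attention + feed_forward)

for name, (layers, d_model, d_ff) in {
    "TigerCoder 1B": (24, 1024, 4096),
    "TigerCoder 9B": (32, 4096, 16384),
}.items():
    total = estimate_params(layers, d_model, d_ff)
    print(f"{name}: ~{total / 1e9:.2f}B non-embedding parameters")
```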
Both variants are finetuned via a standard maximum-likelihood cross-entropy objective:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}\right)$$

where $x_{<t}$ denotes the preceding tokens of an instruction–code sequence.
Optimization employs AdamW, with variant-specific learning rates for the 1B and 9B models, weight decay of 0.02 and 0.04 respectively, and a cosine learning-rate schedule with 10–15% warm-up steps over three epochs.
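The following PyTorch sketch shows how such a setup is typically wired together. The model and loss here are stand-ins, and the peak learning rate is a placeholder (the source does not preserve the exact values); only the weight decay, warm-up fraction, cosine schedule, and epoch count are taken from the description above.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical settings; only weight_decay, the warm-up fraction, and the
# cosine schedule reflect the paper. peak_lr is a placeholder value.
peak_lr, weight_decay, warmup_frac, epochs = 2e-5, 0.02, 0.10, 3
steps_per_epoch = 1_000                    # depends on corpus and batch size
total_steps = epochs * steps_per_epoch
warmup_steps = int(warmup_frac * total_steps)

model = torch.nn.Linear(16, 16)            # stand-in for the decoder-only LM
optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

def lr_lambda(step: int) -> float:
    """Linear warm-up followed by cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # In the real setup, the loss is token-level cross-entropy over the
    # instruction-code pairs: -sum_t log p(x_t | x_<t).
    loss = model(torch.randn(4, 16)).pow(2).mean()   # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```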
2. Instruction–Code Corpus Construction
TigerCoder's performance is grounded in a 300,000-example Bangla instruction–code corpus, equally partitioned into three distinct 100K subsets:
- Self-Instruct (SI):
- Initiated from 5,000 Bangla prompts authored by experts, spanning algorithms, data structures, file I/O, string operations, mathematics, and basic OOP.
- The Self-Instruct pipeline, using GPT-4o, generated instruction–Python pairs, filtered through both syntactic analysis (via ast.parse) and runtime verification (Python 3.13 sandbox).
- Pairs whose sentence-level embedding cosine similarity exceeded a fixed threshold were deduplicated to enforce diversity (see the filtering sketch after this list).
- Synthetic (Syn):
- GPT-4o and Claude 3.5 were instructed in Bangla to produce novel instruction–code pairs.
- All synthetic code passed syntax checks; BERTScore thresholds (≥0.7) ensured minimal paraphrastic redundancy.
- Translated (TE):
- 100,000 high-quality English instruction–code pairs from Evol-Instruct were machine-translated into Bangla using NLLB-200, maintaining the original Python code.
- Three machine translations were generated per prompt; the best was selected using CometKiwi quality-estimation and BERTScore F1 thresholds.
Each subset underwent stringent filters for linguistic fidelity, semantic diversity, and code correctness, yielding a comprehensive resource that covers human, LLM-synthesized, and translation-derived code instructions.
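A condensed sketch of how these filters might be composed is shown below. The ast.parse check and sandboxed execution mirror the pipeline described for the Self-Instruct subset; the embedding function and the 0.9 similarity cutoff are placeholders, as the source does not preserve the exact threshold, and the model-based BERTScore/CometKiwi scorers are not reimplemented here.

```python
import ast
import os
import subprocess
import tempfile

import numpy as np

def passes_syntax(code: str) -> bool:
    """Syntactic filter: keep only code that parses cleanly."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def passes_runtime(code: str, timeout: float = 5.0) -> bool:
    """Runtime filter: run the snippet in a subprocess with a timeout,
    approximating the sandboxed verification described above."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python3", path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def deduplicate(pairs, embed, threshold: float = 0.9):
    """Greedy near-duplicate removal: drop a pair if its instruction
    embedding is too close to one already kept (the threshold is a
    placeholder, not the paper's value)."""
    kept, kept_vecs = [], []
    for pair in pairs:
        vec = np.asarray(embed(pair["instruction"]), dtype=float)
        vec /= np.linalg.norm(vec)
        if all(float(vec @ other) < threshold for other in kept_vecs):
            kept.append(pair)
            kept_vecs.append(vec)
    return kept

# Usage: syntax and runtime checks first, then similarity-based dedup.
# filtered = [p for p in pairs
#             if passes_syntax(p["code"]) and passes_runtime(p["code"])]
# final = deduplicate(filtered, embed=my_sentence_encoder)
```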
3. Benchmarking: MBPP-Bangla and Evaluation Protocol
The MBPP-Bangla benchmark comprises 974 programming problems drawn from beginner to intermediate levels. Each problem was translated into Bangla by two independent TOEFL-certified native speakers, with adjudication by a polyglot expert, and mapped to canonical reference solutions across Python, Java, JavaScript, Ruby, and C++.
Key covered topics include:
- String manipulation
- Mathematical computations
- Data structures
- Algorithms
- File I/O
Pass@K is used as the principal metric, defined as

$$\text{Pass@}K = \mathbb{E}\left[\,1 - \frac{\binom{n-c}{K}}{\binom{n}{K}}\,\right]$$

where $n$ is the number of sampled generations per problem, $c$ is the number of correct generations, and $K$ is the shortlist size. This metric captures single-shot correctness at $K = 1$, as well as shortlist-wide and upper-bound performance at larger $K$.
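For concreteness, the standard unbiased estimator of this quantity (popularized by the HumanEval evaluation) can be computed with the numerically stable product form below; this is the conventional formulation, assumed here rather than quoted from the source.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: 1 - C(n-c, k) / C(n, k), computed as a
    running product over the top c terms to avoid huge binomials."""
    if n - c < k:
        return 1.0  # every size-k shortlist must contain a correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 7 of which pass the unit tests.
print(pass_at_k(n=20, c=7, k=1))    # 0.35: expected single-shot success rate
print(pass_at_k(n=20, c=7, k=10))   # ~0.998: a 10-wide shortlist almost surely passes
```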
4. Empirical Results and Comparative Performance
TigerCoder models were benchmarked against several multilingual open-source (LLaMA-3.2, Gemma 3, Phi-4, Pangea) and proprietary (GPT-3.5, GPT-4o-mini, Gemini-2.5) baselines on mHumanEval-Bangla and MBPP-Bangla.
| Model | Params | mHumanEval-Bangla (Pass@1) | MBPP-Bangla (Pass@1) | Δ Pass@1 vs. Strongest Baseline |
|---|---|---|---|---|
| TigerCoder 1B | 1B | 0.69 | 0.74 | +0.04 to +0.08 |
| TigerCoder 9B | 9B | 0.75 | 0.82 | +0.11 to +0.18 |
The 1B model, despite its modest size, outperforms baselines up to 27 times larger by 4–8 percentage points on Pass@1. The 9B variant achieves Pass@1 improvements of +0.11 to +0.18 over the strongest prior models (Gemma 3 27B, TigerLLM 9B). These relative improvements are sustained at larger values of K.
5. Limitations and Known Constraints
- The instruction corpus is predominantly Python-centric, potentially limiting cross-language generalization.
- MBPP-Bangla, while covering five programming languages and topical areas, cannot comprehensively represent the full scope of real-world coding tasks.
- TigerCoder is restricted to the 1B and 9B parameter regimes; no larger or multimodal models are currently available within this family.
- Automated syntactic and semantic checks, despite their rigor, may not catch subtle semantic faults or domain-specific edge cases.
- The curation process, while thorough, remains partially reliant on automated filtering and scoring mechanisms.
A plausible implication is that further improvements could be realized via multi-language corpus expansion, increased human-in-the-loop validation, and experimentation with larger or multimodal architectures.
6. Practical Applications and Open-Source Impact
TigerCoder models enable a range of practical use cases for Bangla-speaking educators, learners, and developers:
- Localized coding assistants for Bangla-medium environments
- Automated template code generation in educational settings
- Bridging the digital literacy gap among Bangla-speaking software engineers
Datasets, benchmarks (MBPP-Bangla), and model weights are fully open-sourced under permissive licenses, supporting reproducibility and community scrutiny and facilitating analogous efforts for other low-resource languages.
7. Research Contributions and Broader Significance
The TigerCoder initiative advances the state of LLM-based code generation in low-resource languages by:
- Demonstrating that meticulously curated instruction–code datasets can compensate for smaller parameter counts, enabling 1B-scale models to surpass much larger multilingual LLMs in Bangla code generation.
- Establishing robust benchmarks (MBPP-Bangla) and screening methodologies for linguistic fidelity, diversity, and code correctness.
- Offering empirical evidence that targeted data curation represents a cost-effective strategy for high-performance language technology development in under-represented domains.
This supports the broader proposition that tailored pretraining and rigorous dataset design provide a promising path for elevating model competence in resource-scarce linguistic contexts.