TigerCoder: Bangla Code LLM Suite

Updated 10 November 2025
  • TigerCoder is a dedicated Bangla code generation suite that employs transformer-based, decoder-only LLMs with 1B and 9B parameters.
  • It leverages a meticulously curated 300K Bangla instruction–code corpus with advanced filtering for syntactic and semantic quality.
  • Empirical evaluations on MBPP-Bangla show superior Pass@1 performance, outperforming larger multilingual baselines.

The TigerCoder family comprises the first dedicated suite of LLMs for code generation in Bangla, addressing a crucial underrepresentation of Bangla in code-centric language modeling. This suite consists of two transformer-based, decoder-only models with parameter counts of approximately 1 billion (1B) and 9 billion (9B), both derived from Bangla-specialized TigerLLM checkpoints. TigerCoder emphasizes adaptation to the programming domain through carefully curated instruction–code datasets and is evaluated with MBPP-Bangla, a benchmark specifically constructed for Bangla code generation. Empirical results demonstrate substantial performance gains over existing multilingual and general-purpose LLMs, showcasing the impact of targeted data curation in resource-constrained linguistic domains.

1. Model Architecture and Parameterization

Each TigerCoder model employs a pre-norm decoder-only Transformer architecture, parameterized as follows:

  • 1B variant: 24 layers ($L=24$), hidden dimension 1024 ($d=1024$), 16 attention heads ($h=16$), feed-forward inner dimension 4096 ($f=4096$), yielding $\sim 1 \times 10^9$ parameters.
  • 9B variant: 32 layers ($L=32$), hidden dimension 4096 ($d=4096$), 32 attention heads ($h=32$), feed-forward inner dimension 16384 ($f=16384$), yielding $\sim 9 \times 10^9$ parameters.

The total parameter count $P$ is estimated by:

$$P \simeq 12 \cdot L \cdot d^2 + 4 \cdot L \cdot d \cdot f + \text{vocab\_size} \cdot d$$

The first two terms capture the self-attention and feed-forward network parameters, and the final term the embedding matrix, reflecting the scaling logic used in modern Transformer-based LLMs.
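As a rough illustration, the estimate can be evaluated directly for both configurations. The sketch below is not release code: the vocabulary size is not stated in the text (256,000 is a placeholder assumption), and realized counts depend on details such as gated feed-forward layers and embedding tying, so the output should be read as order-of-magnitude only.

```python
# Coarse parameter-count estimate using the formula above.
# vocab_size is a placeholder assumption (not given in the text), and the
# formula ignores implementation details (gating, tying, biases, norms),
# so treat the totals as order-of-magnitude estimates only.

def estimate_params(L: int, d: int, f: int, vocab_size: int = 256_000) -> float:
    """P ≈ 12·L·d² + 4·L·d·f + vocab_size·d"""
    return 12 * L * d**2 + 4 * L * d * f + vocab_size * d

configs = {
    "TigerCoder-1B": dict(L=24, d=1024, f=4096),
    "TigerCoder-9B": dict(L=32, d=4096, f=16384),
}

for name, cfg in configs.items():
    print(f"{name}: ~{estimate_params(**cfg) / 1e9:.1f}B parameters (rough estimate)")
```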

Both variants are finetuned via a standard maximum-likelihood cross-entropy objective:

$$L_{CE} = -\sum_{t=1}^{T} y_t \log p_{\theta}(y_t \mid y_{<t})$$

Optimization employs AdamW, with learning rates of $1 \times 10^{-5}$ (1B) and $1 \times 10^{-6}$ (9B), weight decay of 0.02 and 0.04 respectively, and a cosine learning-rate schedule with 10–15% warm-up steps over three epochs.
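A minimal training-loop sketch under these settings might look as follows, assuming PyTorch and Hugging Face `transformers` tooling; `model` and `train_loader` are placeholders rather than artifacts released with TigerCoder.

```python
# Minimal finetuning sketch for the objective and optimizer settings above.
# `model` is assumed to be a Hugging Face causal-LM and `train_loader` a
# DataLoader yielding tokenized batches with a `labels` key.
import torch
from transformers import get_cosine_schedule_with_warmup

def finetune(model, train_loader, epochs=3, lr=1e-5, weight_decay=0.02,
             warmup_frac=0.10, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    total_steps = epochs * len(train_loader)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * total_steps),
        num_training_steps=total_steps,
    )
    for _ in range(epochs):
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # Passing `labels` makes the model return the token-level
            # cross-entropy loss L_CE over the target sequence.
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```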

2. Instruction–Code Corpus Construction

TigerCoder's performance is grounded in a 300,000-example Bangla instruction–code corpus, equally partitioned into three distinct 100K subsets:

  1. Self-Instruct (SI):
    • Initiated from 5,000 Bangla prompts authored by experts, spanning algorithms, data structures, file I/O, string operations, mathematics, and basic OOP.
    • The Self-Instruct pipeline, using GPT-4o, generated instruction–Python pairs, filtered through both syntactic analysis (via ast.parse) and runtime verification (Python 3.13 sandbox).
    • Pairs with sentence-level embedding cosine similarity $> 0.95$ were deduplicated to enforce diversity (see the filtering sketch below).
  2. Synthetic (Syn):
    • GPT-4o and Claude 3.5 were instructed in Bangla to produce novel instruction–code pairs.
    • All synthetic code passed syntax checks; BERTScore thresholds (≥0.7) ensured minimal paraphrastic redundancy.
  3. Translated (TE):
    • 100,000 high-quality English instruction–code pairs from Evol-Instruct were machine-translated into Bangla using NLLB-200, maintaining the original Python code.
    • Three machine translations were generated per prompt; selection criteria included CometKiwi QE ($> 0.85$) and BERTScore F1 ($> 0.95$).

Each subset underwent stringent filters for linguistic fidelity, semantic diversity, and code correctness, yielding a comprehensive resource that covers human, LLM-synthesized, and translation-derived code instructions.
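The syntax and deduplication filters referenced above can be illustrated with a short sketch. The sandboxed runtime-verification step is omitted here, and the sentence encoder is an arbitrary stand-in, since the text does not name the embedding model that was used.

```python
# Illustrative sketch of two of the automated filters described above:
# an ast-based Python syntax check and embedding-based near-duplicate removal.
# The encoder below is an assumed stand-in, not the model used in the paper,
# and the O(n^2) dedup loop is for clarity rather than scale.
import ast
from sentence_transformers import SentenceTransformer, util

def passes_syntax_check(code: str) -> bool:
    """Keep only code that parses as valid Python (cf. the ast.parse filter)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def deduplicate(instructions: list[str], threshold: float = 0.95) -> list[int]:
    """Return indices of instructions kept after dropping any whose cosine
    similarity to an already-kept instruction exceeds the threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    embeddings = model.encode(instructions, convert_to_tensor=True,
                              normalize_embeddings=True)
    kept: list[int] = []
    for i in range(len(instructions)):
        if all(util.cos_sim(embeddings[i], embeddings[j]).item() <= threshold
               for j in kept):
            kept.append(i)
    return kept
```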

3. Benchmarking: MBPP-Bangla and Evaluation Protocol

The MBPP-Bangla benchmark comprises 974 programming problems drawn from beginner to intermediate levels. Each problem was translated into Bangla by two independent TOEFL-certified native speakers, with adjudication by a polyglot expert, and mapped to canonical reference solutions across Python, Java, JavaScript, Ruby, and C++.

Key covered topics include:

  • String manipulation
  • Mathematical computations
  • Data structures
  • Algorithms
  • File I/O

Pass@K is used as the principal metric, defined as

$$\text{Pass@}K = 1 - \frac{\binom{n-m}{K}}{\binom{n}{K}}$$

where $n$ is the number of sampled generations, $m$ is the number of correct generations, and $K$ is the shortlist size ($K \in \{1, 10, 100\}$). This metric captures single-shot correctness ($K=1$) as well as shortlist-level and upper-bound performance at larger $K$.
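The formula above is the standard unbiased Pass@K estimator from the code-generation evaluation literature; a direct implementation, with purely illustrative sample counts, is shown below.

```python
# Unbiased Pass@K estimator matching the formula above:
# n sampled generations per problem, m of them correct.
from math import comb

def pass_at_k(n: int, m: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - m < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - m, k) / comb(n, k)

# Illustrative values (not taken from the paper): 200 samples, 57 correct.
print(round(pass_at_k(n=200, m=57, k=1), 3))   # fraction correct for a single draw
print(round(pass_at_k(n=200, m=57, k=10), 3))  # shortlist of 10
```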

4. Empirical Results and Comparative Performance

TigerCoder models were benchmarked against several multilingual open-source (LLaMA-3.2, Gemma 3, Phi-4, Pangea) and proprietary (GPT-3.5, GPT-4o-mini, Gemini-2.5) baselines on mHumanEval-Bangla and MBPP-Bangla.

| Model | Params | mHumanEval-Bangla (Pass@1) | MBPP-Bangla (Pass@1) | Δ vs. Strongest Baseline |
|---|---|---|---|---|
| TigerCoder 1B | $\sim 1 \times 10^9$ | 0.69 | 0.74 | +0.04 to +0.08 |
| TigerCoder 9B | $\sim 9 \times 10^9$ | 0.75 | 0.82 | +0.11 to +0.18 |

The 1B model, despite its modest size, outperforms baselines up to 27 times larger by 4–8 percentage points on Pass@1. The 9B variant achieves Pass@1 improvements of $\Delta = 0.11\text{–}0.18$ over the strongest prior models (Gemma 3 27B, TigerLLM 9B). These improvements persist at $K=10$ and $K=100$.

5. Limitations and Known Constraints

  • The instruction corpus is predominantly Python-centric, potentially limiting cross-language generalization.
  • MBPP-Bangla, while covering five programming languages and topical areas, cannot comprehensively represent the full scope of real-world coding tasks.
  • TigerCoder is restricted to the 1B and 9B parameter regimes; no larger or multimodal models are currently available within this family.
  • Automated syntactic and semantic checks, despite their rigor, may not catch subtle semantic faults or domain-specific edge cases.
  • The curation process, while thorough, remains partially reliant on automated filtering and scoring mechanisms.

A plausible implication is that further improvements could be realized via multi-language corpus expansion, increased human-in-the-loop validation, and experimentation with larger or multimodal architectures.

6. Practical Applications and Open-Source Impact

TigerCoder models enable a range of practical use cases for Bangla-speaking educators, learners, and developers:

  • Localized coding assistants for Bangla-medium environments
  • Automated template code generation in educational settings
  • Bridging the digital literacy gap among Bangla-speaking software engineers

Datasets, benchmarks (MBPP-Bangla), and model weights are fully open-sourced under permissive licenses, supporting reproducibility, community scrutiny, and facilitating analogous efforts for other low-resource languages.

7. Research Contributions and Broader Significance

The TigerCoder initiative advances the state of LLM-based code generation in low-resource languages by:

  • Demonstrating that meticulously curated instruction–code datasets can compensate for smaller parameter counts, enabling 1B-scale models to surpass much larger multilingual LLMs in Bangla code generation.
  • Establishing robust benchmarks (MBPP-Bangla) and screening methodologies for linguistic fidelity, diversity, and code correctness.
  • Offering empirical evidence that targeted data curation represents a cost-effective strategy for high-performance language technology development in under-represented domains.

This supports the broader proposition that tailored pretraining and rigorous dataset design provide a promising path for elevating model competence in resource-scarce linguistic contexts.
