BongLLaMA – Bangla LLM Suite

Updated 10 November 2025
  • BongLLaMA is an open-source suite of large language models specifically tailored for Bangla, leveraging Meta’s LLaMA architectures with targeted pretraining and instruction tuning.
  • The models utilize augmented tokenization and extensive Bangla corpora, achieving enhanced performance on generative and reasoning tasks in culturally relevant contexts.
  • Empirical evaluations show significant gains on curated Bangla instruction-following and reasoning tasks while exposing challenges in coding, translation, and token efficiency, suggesting avenues for further refinement.

BongLLaMA refers to a suite of open-source LLMs developed by Zehady et al. and derived from Meta’s LLaMA-2 and LLaMA-3 architectures to serve Bangla (Bengali), a major low-resource language spoken by over 240 million native speakers worldwide. BongLLaMA models are specialized for Bangla through continued pretraining on large Bangla corpora and targeted instruction-tuning, addressing the shortcomings of multilingual or English-centric LLMs on Bangla language processing tasks. These models have been extensively benchmarked on Bangla-centric tasks and have set a baseline for instruction-following and generative capabilities in Bangla. The models are publicly available via Hugging Face and have subsequently been referenced as baselines for further Bangla LLM research, notably in comparative studies such as TituLLMs (Zehady et al., 28 Oct 2024, Nahin et al., 16 Feb 2025).

1. Model Architecture and Tokenization

BongLLaMA encompasses several model instantiations mapped to their LLaMA predecessors:

| Model Variant | Base Model | Parameters | Context Window (tokens) |
| --- | --- | --- | --- |
| BongLLaMA2-7B | LLaMA-2-7B | 7B | 4,096 |
| BongLLaMA3-8B | LLaMA-3-8B | 8B | 8,192 |
| BongLLaMA3.1-8B | LLaMA-3.1-8B | 8B | 128,000 |
| BongLLaMA3.2-1B | LLaMA-3.2-1B | 1B | 128,000 |
| BongLLaMA3.2-3B | LLaMA-3.2-3B | 3B | 128,000 |

The architectures preserve the characteristics of their LLaMA bases: pre-normalization, SwiGLU activations, rotary positional embeddings, and, in the LLaMA-3 family, grouped-query attention. For BongLLaMA2-7B, tokenization is augmented by extending the LLaMA-2 vocabulary with 18,000 additional Bangla-specific tokens (yielding 50,000 total), using a SentencePiece model trained on Bangla Wikipedia. The LLaMA-3-based BongLLaMA variants rely on the default multilingual vocabulary of roughly 128,000 tokens. However, subsequent studies, notably TituLLMs (Nahin et al., 16 Feb 2025), highlight that the lack of deeper Bangla-specific vocabulary expansion in BongLLaMA leads to a high tokens-per-word (TPW) ratio (up to 7.84 TPW for Bangla text), resulting in longer input sequences and sub-optimal word segmentation compared to models with custom Bangla tokenizers.
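Token fragmentation can be checked directly by comparing tokens-per-word across tokenizers. The sketch below is illustrative only: the Hugging Face model IDs and the sample sentence are assumptions, and the resulting TPW values depend entirely on the text used.

# Illustrative TPW measurement; model IDs and sentence are placeholders.
from transformers import AutoTokenizer

sentence = "বাংলাদেশের স্বাধীনতা দিবস প্রতি বছর ২৬ মার্চ পালিত হয়।"
words = sentence.split()

for model_id in ["meta-llama/Llama-3.2-3B", "zehady/BongLLaMA-3.2-3B"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tokenizer(sentence, add_special_tokens=False)["input_ids"])
    print(f"{model_id}: {n_tokens} tokens over {len(words)} words "
          f"= {n_tokens / len(words):.2f} TPW")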

2. Pretraining Corpora and Data Augmentation

The Bangla-specific pretraining corpus is drawn from the Bangla subset of CulturaX, which aggregates, cleans, and deduplicates Bangla web and newswire articles from sources such as mC4 and OSCAR. Preprocessing includes language filtering, HTML removal, normalization, and MinHash deduplication. The corpus comprises 12.4 million Bangla news articles, and a single epoch of continued pretraining is performed over it, covering on the order of 1–2 billion tokens.
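The authors rely on CulturaX's published preprocessing rather than re-implementing it, but the core MinHash deduplication step can be sketched as follows. This is a minimal illustration using the third-party datasketch library, with word-level shingles and a 0.8 similarity threshold chosen for the example, not taken from the paper.

# Minimal MinHash near-duplicate filter (illustrative, not the CulturaX pipeline).
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():              # word shingles; production pipelines often use n-grams
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a": "বাংলাদেশের অর্থনীতি গত এক দশকে উল্লেখযোগ্যভাবে দ্রুত বৃদ্ধি পেয়েছে",
    "b": "বাংলাদেশের অর্থনীতি গত এক দশকে উল্লেখযোগ্যভাবে দ্রুত বৃদ্ধি পেয়েছে ।",  # near-duplicate of "a"
    "c": "আজকের আবহাওয়ার পূর্বাভাস ও সংবাদ",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for doc_id, text in docs.items():
    m = minhash(text)
    if lsh.query(m):                        # a stored near-duplicate already exists
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print("kept:", kept)                        # "b" is very likely filtered as a duplicate of "a"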

Data augmentation leverages synthetic instruction datasets generated by translating English instruction-tuning corpora (Alpaca and a subset of OpenOrca) into Bangla using the Google Translation API, followed by manual post-editing. The result is a 172,000-sample “Bangla-Alpaca-Orca” instruction set covering task types such as coding, translation, open-domain QA, text generation, literature, and ethics.
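The exact prompt template used to serialize these instruction samples is not reproduced here; the sketch below shows one plausible Alpaca-style formatting with Bangla field headers, which should be treated as an assumption rather than the authors' format.

# Hypothetical Alpaca-style serialization of a translated instruction sample.
def format_instruction(example):
    instruction = example["instruction"]    # Bangla instruction (translated from English)
    context = example.get("input", "")      # optional input/context field
    output = example["output"]              # Bangla reference response
    if context:
        prompt = f"### নির্দেশনা:\n{instruction}\n\n### ইনপুট:\n{context}\n\n### উত্তর:\n"
    else:
        prompt = f"### নির্দেশনা:\n{instruction}\n\n### উত্তর:\n"
    return prompt + output

sample = {
    "instruction": "বাংলাদেশের জাতীয় ফুলের নাম কী?",
    "input": "",
    "output": "বাংলাদেশের জাতীয় ফুল শাপলা।",
}
print(format_instruction(sample))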

3. Fine-Tuning Regimens and Training Objectives

BongLLaMA’s continued pretraining employs the standard causal cross-entropy objective:

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i} \log p_\theta\left(x_{i,t} \mid x_{i,<t}\right)$$

where sequences are drawn from the Bangla news corpus. During both the pretraining and instruction-tuning phases, LoRA (Low-Rank Adaptation) modules are attached to all linear layers for parameter-efficient adaptation. With LoRA, each weight matrix $W$ is updated as $W_{eff} = W_0 + BA$, where $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$, and the rank satisfies $r \ll d$.
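Attaching LoRA adapters to all linear projections is straightforward with the Hugging Face peft library; the sketch below illustrates the setup described above, with placeholder values for the rank and scaling factor since the paper's exact settings are not restated here.

# LoRA attachment sketch with peft; rank/alpha values are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",              # illustrative base checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # low rank r << d of the update BA
    lora_alpha=32,                          # scaling applied to BA
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],   # all linear projection layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # only the low-rank A, B matrices are trainable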

Key hyperparameters for pretraining and instruction-tuning include:

  • Steps: 10,000 (pretraining), 50,000–100,000 (instruction-tuning)
  • Optimizer: AdamW (8-bit), initial LR = 1e-4 (LLaMA-2) or 2e-4 (LLaMA-3), cosine decay
  • Precision: bf16 (automatic mixed precision)
  • Batch size: micro-batch 8, gradient accumulation to 512–1,024 tokens/GPU, effective batch ~2M tokens
  • Memory and speed: gradient checkpointing, sample packing, FlashAttention, 4-bit weights

Most training was performed on NVIDIA A100 GPUs, requiring roughly 40 GPU-hours per epoch for the 7B/8B models and roughly 10 GPU-hours for the 1B-parameter models.
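A Hugging Face TrainingArguments configuration mirroring the reported choices (8-bit AdamW, cosine decay, bf16, gradient checkpointing) might look as follows; the batch and step values are illustrative, and sample packing and FlashAttention are enabled through the surrounding training stack rather than shown here.

# Sketch of a training configuration approximating the reported hyperparameters.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bongllama-ckpt",
    per_device_train_batch_size=8,          # micro-batch of 8
    gradient_accumulation_steps=64,         # placeholder; chosen to reach a large effective batch
    learning_rate=2e-4,                     # 1e-4 for the LLaMA-2-based variant
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",                 # 8-bit AdamW via bitsandbytes
    bf16=True,
    gradient_checkpointing=True,
    max_steps=10_000,                       # continued pretraining; 50k-100k for instruction-tuning
    logging_steps=50,
    save_steps=1_000,
)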

4. Evaluation Datasets and Protocols

BongLLaMA is evaluated on a suite of Bangla-centric instruction-following tasks using nine categories and 120 manually curated queries covering coding, translation, entertainment, generation, open-domain QA, factual QA, reasoning, ethics, and literature. For each query, three independent completions are sampled (temperature 0.6), and responses are scored by GPT-4o acting as an “omni-evaluator” on a scale of 1–100. Outlier assessments are cross-verified by human annotators for calibration.
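The scoring loop can be sketched as below; the grading prompt wording is an assumption (the evaluator prompt is not reproduced verbatim here), and model_generate is a hypothetical wrapper around local BongLLaMA inference.

# Sketch of the GPT-4o "omni-evaluator" protocol: 3 samples per query, scored 1-100.
from openai import OpenAI

client = OpenAI()

def score_response(query, response):
    grading_prompt = (
        "Rate the following Bangla response to the query on a scale of 1 to 100 "
        "for correctness, fluency, and helpfulness. Reply with only the number.\n\n"
        f"Query: {query}\nResponse: {response}"
    )
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,
    )
    return int(result.choices[0].message.content.strip())

def evaluate_query(model_generate, query, n_samples=3):
    # model_generate(query, temperature) is a hypothetical wrapper around the local model
    scores = [score_response(query, model_generate(query, temperature=0.6))
              for _ in range(n_samples)]
    return sum(scores) / len(scores)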

TituLLMs (Nahin et al., 16 Feb 2025) further evaluates BongLLaMA-3.2-1B and 3B on five standardized Bangla benchmarks:

| Task | Metric | Size (test/val) |
| --- | --- | --- |
| Bangla MMLU | Accuracy | 14,750 / 72,944 |
| BoolQ Bangla | Accuracy | 729 / 432 |
| CommonsenseQA | Accuracy | 9,741 / 1,221 |
| OpenBookQA | Accuracy | 4,947 / 500 |
| PIQA | Accuracy | 15,339 / 1,838 |

These datasets involve multi-choice world knowledge, reading comprehension, and commonsense reasoning, with question normalization, lowercase conversion, and error filtering.
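A common way to run such multiple-choice benchmarks zero-shot is to score each answer option by its length-normalized log-likelihood under the model and select the highest-scoring option. The sketch below illustrates this generic protocol; it is not the exact evaluation harness used in either paper, and it assumes the option tokenization is a suffix of the full-sequence tokenization.

# Generic zero-shot multiple-choice scoring via per-option log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zehady/BongLLaMA-3.2-3B"        # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

@torch.no_grad()
def option_logprob(question, option):
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[0, :-1]                 # position t predicts token t+1
    targets = full_ids[0, 1:]
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    token_lp = logprobs[torch.arange(targets.size(0), device=targets.device), targets]
    answer_lp = token_lp[prompt_ids.size(1) - 1:]           # keep only the option's tokens
    return answer_lp.mean().item()                          # length-normalized log-likelihood

def predict(question, options):
    return max(options, key=lambda opt: option_logprob(question, opt))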

5. Results and Empirical Performance

Instruction-Following and Bangla-Centric Tasks

On the manually curated, GPT-4o-scored benchmark, BongLLaMA3-8B achieves strong results on several Bangla-centric tasks:

| Task | BongLLaMA3-8B | Meta-LLaMA |
| --- | --- | --- |
| Coding | 32.50 | 68.33 |
| Translation | 57.92 | 37.50 |
| Entertainment | 43.96 | 33.38 |
| Generation | 66.46 | 57.92 |
| Open_QA | 66.96 | 43.40 |
| Factual_QA | 62.56 | 54.83 |
| Reasoning | 82.75 | 37.50 |
| Ethics | 16.58 | 32.25 |
| Literature | 41.85 | 36.33 |

These results indicate that BongLLaMA3-8B achieves substantial gains over Meta-LLaMA in Bangla reasoning (about +45 points), open-domain QA (+23 points), and literature (+5 points). Limitations are observed in coding and translation, where the original LLaMA models benefit from technical and bilingual pretraining data.

Standardized Bangla Benchmarks

A direct comparison of BongLLaMA and the base LLaMA-3.2, as well as TituLLMs, shows that:

  • BongLLaMA’s normalized accuracy in Bangla MMLU (0.30–0.33) does not exceed that of multilingual LLaMA-3.2-3B (0.33–0.34).
  • Across CSQA, OBQA, and PIQA, BongLLaMA matches the base model in zero/few-shot settings, with minimal distinctive gains.
  • Performance on BoolQ is consistent across all models (~0.53).

TituLLMs demonstrates that using a larger and more diverse Bangla corpus (37B tokens vs. BongLLaMA's ~1–2B) and a Bangla-extended tokenizer can yield further improvements.

6. Analysis: Adaptation, Limitations, and Usage

Quantitative and qualitative results highlight the following properties:

  • Tokenizer Limitation: The lack of Bangla-specific tokens in LLaMA-3.2-based BongLLaMA results in high token fragmentation (5–10 sub-tokens per word for common Bangla words), impacting both efficiency and accuracy.
  • Corpus Coverage: The sole use of CulturaX news data restricts exposure to colloquial, literary, and scientific language variants, impairing domain adaptation.
  • Instruction-Tuning: Reliance on Google Translate for instruction data introduces translation artifacts; no evidence of back-translation or synthetic LLM-generated instruction augmentation was presented.
  • Availability and Accessibility: Models are released under MIT-style licensing on Hugging Face and are compatible with standard Hugging Face Transformers inference pipelines.

Example code for inference:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# AutoTokenizer resolves the correct LLaMA-3 tokenizer class for this checkpoint;
# the SentencePiece-based LlamaTokenizer does not load LLaMA-3.x vocabularies.
tokenizer = AutoTokenizer.from_pretrained("zehady/BongLLaMA-3.2-3B")
model = AutoModelForCausalLM.from_pretrained("zehady/BongLLaMA-3.2-3B",
                                             torch_dtype=torch.float16,
                                             device_map="auto")

# Prompt: "Write a brief description of Bengal's Independence Day."
prompt = "বাংলার স্বাধীনতা দিবস সম্পর্কে সংক্ষিপ্ত বিবরণ লিখুন।"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Expected output: A concise history of Bangladesh's Independence Day in Bangla.

7. Significance, Impact, and Future Directions

BongLLaMA is the first open-source, instruction-tuned LLaMA variant tailored for Bangla, closing critical performance gaps on generative, reasoning, and cultural tasks in Bangla language processing. Nevertheless, limitations remain in coding, translation, and domain coverage, primarily due to constrained pretraining data sources and the absence of a dedicated Bangla tokenizer.

Potential directions for improvement include:

  • Expanding the Bangla tokenizer vocabulary (as seen in TituLLMs, up to 96K entries) to better capture morphological complexity.
  • Incorporating additional bilingual, literary, conversational, and technical corpora for broader domain coverage.
  • Exploring self-supervised augmentation strategies (back-translation, LLM-based paraphrasing) for instruction data.
  • Increasing context window size and introducing task-conditioned decoding.
  • Extending Bangla LLM capabilities to multi-modal domains (speech, OCR).

In comparative perspective (Nahin et al., 16 Feb 2025), language adaptation for low-resource languages such as Bangla is most effectively driven by large-scale, diverse data and fine-grained tokenizer customizations. BongLLaMA, by establishing an open-access, Bangla-centric baseline, has set a foundation for ongoing and future research in resource-lean language modeling.
