
BanglaBERT: Transformer for Bangla NLP

Updated 7 December 2025
  • BanglaBERT is a transformer model specialized for Bangla that leverages ELECTRA’s replaced token detection for efficient pretraining.
  • The architecture features a two-tower setup with 12 layers and approximately 110M parameters, addressing tasks like classification, sequence labeling, and question answering.
  • BanglaBERT outperforms multilingual baselines on the Bangla Language Understanding Benchmark (BLUB), showcasing improved sample efficiency and downstream transfer.

BanglaBERT is a pretrained transformer-based language model developed specifically for Bangla (Bengali), a major South Asian language that remains low-resource in mainstream natural language processing. Designed for broad-coverage Bangla language understanding, BanglaBERT combines recent architectural advances, a comprehensive web-scale corpus, and systematic downstream benchmarks to establish state-of-the-art performance across classification, sequence labeling, and span-based tasks.

1. Model Architecture

BanglaBERT’s architecture is grounded in the BERT-Base paradigm but incorporates significant design choices for improved pretraining efficiency and adaptability to the linguistic attributes of Bangla.

  • Core encoder: 12 transformer layers, each with a hidden size of 768, 12 self-attention heads, and an intermediate (feed-forward) size of 3072, totaling ≈110 million parameters for the discriminator.
  • Generator-discriminator “two-tower” setup: Adopts ELECTRA’s architecture, where a smaller generator (~43M parameters) is trained as a masked language model, while the larger discriminator (~110M) is optimized for replaced token detection via a binary classification head at each token position.
  • Tokenization: 32,000 subword WordPiece vocabulary, jointly trained over a 400-character alphabet that covers both native Bangla and romanized code-switch variants.
  • Sequence truncation: Max sequence length of 512; sequences are not permitted to cross document boundaries.

The primary distinction from vanilla BERT-Base lies in replacing the standard masked language modeling (MLM) and next sentence prediction (NSP) objectives with ELECTRA’s replaced token detection, yielding improved sample efficiency and reducing pretraining wall-clock time by approximately a quarter relative to RoBERTa (Bhattacharjee et al., 2021).
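
As a concrete illustration, the released discriminator can be loaded and queried for per-token replaced-token-detection scores with the Hugging Face `transformers` library. This is a minimal sketch: the checkpoint name, example sentence, and library calls are assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Checkpoint ID assumed to be the publicly released discriminator.
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = ElectraForPreTraining.from_pretrained("csebuetnlp/banglabert")

text = "আমি বাংলায় গান গাই।"  # example Bangla sentence
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    # One replaced-token-detection logit per token position
    # (in the library's convention, higher means "replaced").
    logits = model(**inputs).logits

print(logits.shape)  # (1, sequence_length)
```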

2. Pretraining Corpus and Data Preparation

BanglaBERT pretraining draws on “Bangla2B+,” the largest curated Bangla corpus to date:

  • Source: Raw web scrape of approximately 35 GB from 110 high-traffic Bangla domains, encompassing encyclopedias, news portals, blogs, e-books, story collections, and social-media forums.
  • Cleaning pipeline: Language filtering, document deduplication, and HTML markup removal, producing 27.5 GB of clean Bangla text across 5.25 million documents (mean length ≈306 words per document).
  • Tokenization: WordPiece vocabulary accommodates extensive code-switching and various scripts used in colloquial Bangla, crucial for both formal and user-generated text.
  • Sample creation: After preprocessing, the corpus yields 7.18 million sequences (each ≤512 tokens, packed as sketched below), totaling 2.18 billion training tokens (Bhattacharjee et al., 2021).
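
A minimal sketch of the packing step referenced above, assuming a WordPiece-style tokenizer object exposing `encode`, `cls_token_id`, and `sep_token_id`; the actual preprocessing code is not reproduced here.

```python
from typing import Iterable, List

MAX_LEN = 512  # target sequence length, including [CLS] and [SEP]

def make_sequences(documents: Iterable[str], tokenizer) -> List[List[int]]:
    """Chunk cleaned documents into <=512-token sequences that never
    cross document boundaries (hypothetical helper)."""
    sequences = []
    for doc in documents:
        ids = tokenizer.encode(doc, add_special_tokens=False)
        # Consecutive chunks of one document only; a chunk never mixes documents.
        for start in range(0, len(ids), MAX_LEN - 2):
            chunk = ids[start:start + MAX_LEN - 2]
            sequences.append([tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id])
    return sequences
```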

3. Pretraining Objectives and Training Regimen

Training Objectives

  • Generator: Standard masked language modeling (MLM), minimizing

$$\mathcal{L}_{\mathrm{MLM}} = -\,\mathbb{E}_{X} \sum_{t \in M} \log P_\theta\left(x_t \mid X_{\setminus M}\right)$$

where $M$ denotes the set of masked positions in sequence $X$.

  • Discriminator: Replaced Token Detection, training the model to distinguish “original” vs. “replaced” tokens according to

$$\mathcal{L}_{D} = -\,\mathbb{E}_{X} \sum_{i=1}^{L} \left[\, d_i \log D_\phi(x_i) + (1 - d_i) \log\left(1 - D_\phi(x_i)\right) \right]$$

where $d_i = 1$ if token $i$ is original and $d_i = 0$ if it was replaced by the generator, and $L$ is the sequence length.

  • Total loss:

$$\mathcal{L}_{\mathrm{ELECTRA}} = \mathcal{L}_{G} + \mathcal{L}_{D}$$

where $\mathcal{L}_{G}$ is the generator’s MLM loss ($\mathcal{L}_{\mathrm{MLM}}$ above).
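
The two objectives can be written compactly in PyTorch. The sketch below assumes precomputed generator logits over the masked positions and per-token discriminator logits; shapes and names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def electra_loss(gen_logits, masked_token_ids, disc_logits, is_original):
    """Combined ELECTRA-style loss: generator MLM + discriminator RTD."""
    # Generator: cross-entropy over the masked positions only.
    # gen_logits: (num_masked, vocab_size); masked_token_ids: (num_masked,)
    l_mlm = F.cross_entropy(gen_logits, masked_token_ids)

    # Discriminator: per-token binary classification, where sigmoid(logit)
    # plays the role of D_phi(x_i) and is_original corresponds to d_i.
    # disc_logits, is_original: (batch_size, seq_len)
    l_disc = F.binary_cross_entropy_with_logits(disc_logits, is_original.float())

    # The text above sums the two terms; the original ELECTRA recipe also
    # weights the discriminator term by a factor lambda (50 in Clark et al., 2020).
    return l_mlm + l_disc
```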

Training Hyperparameters

  • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$).
  • Peak learning rate: $2 \times 10^{-4}$, linearly warmed up over the first 10,000 steps.
  • Batch size: 256 sequences per step.
  • Total steps: 2.5 million.
  • Hardware: Google Cloud TPU v3-8.
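
A minimal sketch of this optimization setup using PyTorch and the `transformers` scheduler utility; the placeholder module and the post-warmup decay shape are assumptions of this sketch.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # placeholder for the generator + discriminator stack

# Adam with the stated betas/epsilon and a 2e-4 peak learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999), eps=1e-6)

# Linear warmup over the first 10,000 of 2.5M steps; the post-warmup
# linear decay is an assumption, not a detail from the paper.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=2_500_000
)

# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```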

The ELECTRA-style objective delivers improved sample efficiency and downstream transfer, especially in low-resource regimes (Bhattacharjee et al., 2021).

4. Downstream Evaluation and Benchmarks

To operationalize BanglaBERT’s capabilities, the researchers introduced the Bangla Language Understanding Benchmark (BLUB), covering four NLU task types:

| Task Type | Dataset | Train | Dev | Test | Metric |
|---|---|---|---|---|---|
| Sentiment Classification | SentNoB | 12,575 | 1,567 | 1,567 | Macro-F1 |
| Natural Language Inference | BNLI | 381,449 | 2,419 | 4,895 | Accuracy |
| Named Entity Recognition | MultiCoNER | 14,500 | 800 | — | Micro-F1 |
| Question Answering | BQA/TyDiQA-BN | 127,771 | 2,502 | 2,504 | EM / F1 (spans) |

  • SentNoB: Social media sentiment.
  • BNLI: Automatic translation + human validation from English NLI tasks.
  • MultiCoNER: Diverse-domain NER.
  • BQA/TyDiQA-Bangla: Combination of machine-translated SQuAD 2.0 and native Bangla QA.

BanglaBERT achieved BLUB scores of 77.09 (averaged across four tasks), outperforming comparably large multilingual and monolingual baselines: mBERT (70.29), XLM-R base (72.82), and sahajBERT (71.03). On individual tasks, BanglaBERT’s macro-F1, accuracy, and EM/F1 scores were consistently state-of-the-art (Bhattacharjee et al., 2021).

5. Applied Research and Task-Specific Fine-Tuning

Fake News Detection

  • Dataset construction: Synthesis from authentic (BanFakeNews) and machine-translated fake articles (TransFND), plus augmentation via back-translation, paraphrase generation, and MLM token replacement.
  • Model setup: 12-layer, 768-dimensional transformer with a binary softmax classifier over the [CLS] embedding, trained for 4 epochs with AdamW at a learning rate of 2 × 10⁻⁵ (a fine-tuning sketch follows this list).
  • Results: BanglaBERT achieved up to 96% accuracy on augmented datasets and 97% when combined with summarization, outperforming mBERT (86% on held-out generalization). Augmentation alone (without summarization) gave the largest single improvement, correcting class imbalance and strengthening feature representations (Chowdhury et al., 2023).
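
A minimal fine-tuning sketch matching the setup described above (binary classification over the [CLS] representation, AdamW at 2 × 10⁻⁵, 4 epochs); the checkpoint name and placeholder data are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=2  # authentic vs. fake
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["...নমুনা সংবাদ নিবন্ধ..."]   # placeholder article batch
labels = torch.tensor([1])             # placeholder labels (1 = fake)
batch = tokenizer(texts, return_tensors="pt", truncation=True,
                  padding=True, max_length=512)

model.train()
for _ in range(4):  # 4 epochs over the (placeholder) batch
    out = model(**batch, labels=labels)  # classification head over [CLS]
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```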

Sentiment Analysis

  • Hybrid system: Combination of rule-based Bangla Sentiment Polarity Score (BSPS) and BanglaBERT.
  • Nine-class sentiment: After lexicon-based fine-grained scoring, BSPS+BanglaBERT hybrid delivered 89% accuracy (vs. 79% for BanglaBERT only); also reduced confusion among adjacent classes, especially for nuanced polarities (Mahmud et al., 29 Nov 2024).

Reduplication vs. Repetition Disfluency

  • ASR transcript normalization: Fine-tuning on a purpose-built 20,000-row corpus allowed BanglaBERT to distinguish morphological reduplication (grammatical, ~66%) from repetition disfluency (~33%).
  • Performance: Accuracy 84.78%, F1 0.677, exceeding few-shot LLMs (Claude 4: 82.68% accuracy). The fine-tuned model demonstrates superior precision (0.901) in safeguarding genuine reduplication, reflecting the advantage of language-specific unsupervised pretraining (Arpa et al., 17 Nov 2025).

Hyperpartisan News Detection

  • Semi-supervised strategy: BanglaBERT fine-tuned on 3,200 annotated articles, then extended via high-confidence pseudo-labels on 47,987 unlabeled samples.
  • Results: Test accuracy 95.65%, F1 95.44%. Outperforms traditional ML baselines (best: Random Forest, F1=93.55%) and previous cross-lingual transformer scores. LIME-based local explanations highlight salient tokens driving model decisions (Hasan et al., 28 Jul 2025).
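
A minimal sketch of the confidence-thresholded pseudo-labeling step described above, assuming a fine-tuned sequence classifier and its tokenizer; the threshold value and helper name are hypothetical.

```python
import torch
import torch.nn.functional as F

CONF_THRESHOLD = 0.95  # assumed cut-off for "high-confidence" pseudo-labels

def pseudo_label(model, tokenizer, unlabeled_texts):
    """Keep only unlabeled articles the fine-tuned classifier labels confidently."""
    model.eval()
    kept_texts, kept_labels = [], []
    with torch.no_grad():
        for text in unlabeled_texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            probs = F.softmax(model(**batch).logits, dim=-1).squeeze(0)
            conf, label = probs.max(dim=-1)
            if conf.item() >= CONF_THRESHOLD:
                kept_texts.append(text)
                kept_labels.append(label.item())
    return kept_texts, kept_labels

# The retained (text, pseudo-label) pairs are merged with the 3,200 annotated
# articles and the classifier is fine-tuned again on the expanded set.
```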

Question Answering

  • Passage-based QA (NCTB textbooks): On a 3,000-example extractive QA task, BanglaBERT (F1 = 0.75, EM = 0.53) consistently outperformed BERT-Base and RoBERTa-Base by wide margins. Best results depended on small batch sizes (16 or 8), inclusion of stop words, and a relatively high learning rate (2 × 10⁻⁴) (Khondoker et al., 24 Dec 2024).
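
A minimal sketch of extractive span prediction with a question-answering head on top of the encoder; the checkpoint name and example strings are assumptions, and the span head must first be fine-tuned on the QA data before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
# The QA head added here is newly initialized and would be trained
# on the 3,000-example extractive dataset before use.
model = AutoModelForQuestionAnswering.from_pretrained("csebuetnlp/banglabert")

question = "প্রশ্ন?"                 # placeholder question
context = "পাঠ্যবইয়ের অনুচ্ছেদ।"     # placeholder passage
inputs = tokenizer(question, context, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    out = model(**inputs)  # start_logits / end_logits: one score per token

start = out.start_logits.argmax().item()
end = out.end_logits.argmax().item()
answer = tokenizer.decode(inputs["input_ids"][0, start:end + 1])
```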

6. Efficiency, Transfer, and Analysis

  • Sample efficiency: BanglaBERT outperformed XLM-R (large) by 2–9 points (macro-F1) on sentiment and 6–10 points (accuracy) on NLI under low-sample (≤1,000) training.
  • Compute/memory: Fine-tuning BanglaBERT required less time and memory (relative time/memory = 1.00) than mBERT (1.14–1.92× time), XLM-R base (1.29–1.81×), and XLM-R large (3.81–4.49×).
  • Zero-shot cross-lingual transfer: Cross-lingual models such as XLM-R large reached a BLUB score of at most 66.59 under zero-shot transfer, whereas BanglaBERT reached 77.09 with supervised Bangla data.
  • Errors and limitations: No major systematic failure modes were reported, but performance is sensitive to resource quality (e.g., machine-translation artifacts in fake-news synthesis, lexicon coverage for sentiment), and some tasks (e.g., non-extractive QA) remain challenging (Bhattacharjee et al., 2021, Chowdhury et al., 2023, Mahmud et al., 29 Nov 2024).

7. Release, Impact, and Prospects

BanglaBERT and its associated corpora, evaluation datasets, and leaderboards are publicly available, catalyzing research in Bangla NLP through shared benchmarks and reproducible pipelines (Bhattacharjee et al., 2021). Downstream studies have demonstrated the utility of BanglaBERT for broad NLU tasks ranging from text normalization in ASR to explainable hate speech and bias detection (Chowdhury et al., 2023, Hasan et al., 28 Jul 2025, Khondoker et al., 24 Dec 2024). Future research directions include:

  • Extending BLUB: Incorporating syntactic and semantic tasks (e.g., dependency parsing, NLG).
  • Cross-lingual and semi-supervised approaches: Leveraging transfer from related Indic languages and self-training to further mitigate low-resource constraints.
  • Explainable and robust NLP: Integrating model explainability (LIME, SHAP) and data augmentation to address generalization challenges.
  • Multimodal and domain-adaptive applications: Applying BanglaBERT in settings integrating multimodal signals or domain-specific language.

BanglaBERT thus represents a technically robust and extensible foundation for advancing Bangla language processing at both the model and ecosystem levels, with ongoing refinement of its architecture, objectives, and downstream evaluation modalities (Bhattacharjee et al., 2021, Chowdhury et al., 2023, Mahmud et al., 29 Nov 2024, Arpa et al., 17 Nov 2025, Hasan et al., 28 Jul 2025, Khondoker et al., 24 Dec 2024).
