Automatic Mixed Precision (AMP) Training
- AMP training is a method that combines lower-precision (e.g., FP16) and full-precision (FP32) operations to speed up model training while maintaining stability.
- It reduces memory footprint and computational costs, enabling efficient scaling of deep learning architectures without significant accuracy degradation.
- AMP has become a standard practice in modern deep learning pipelines, enhancing training speed and energy efficiency across various hardware platforms; a typical training step is sketched below.
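As a concrete illustration, the following is a minimal PyTorch sketch of one AMP training step using torch.cuda.amp; the model, data loader, and optimizer are generic placeholders rather than any specific training setup.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def train_epoch(model, loader, optimizer, device="cuda"):
    """One epoch of mixed-precision training with dynamic loss scaling."""
    scaler = GradScaler()        # keeps an FP32 loss scale to avoid FP16 gradient underflow
    model.to(device).train()
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        with autocast():         # eligible ops run in FP16; numerically sensitive ops stay FP32
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(logits, labels)
        scaler.scale(loss).backward()   # backward pass on the scaled loss
        scaler.step(optimizer)          # unscales gradients; skips the step if infs/NaNs appear
        scaler.update()                 # adapts the scale factor for the next iteration
```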
BanglaBERT is a family of monolingual transformer-based language models pre-trained specifically for Bengali (Bangla), a widely spoken yet previously low-resource language in NLP. Introduced by Bhattacharjee et al. (2021), BanglaBERT leverages large-scale, in-language corpora and the ELECTRA-style replaced token detection objective to achieve strong, efficient, and robust representation learning. Through subsequent adoption and extensive benchmarking, BanglaBERT has achieved state-of-the-art results across varied tasks such as sentiment analysis, question answering, named entity recognition, hate speech detection, and ASR transcript disfluency classification.
1. Model Architecture and Pre-training
BanglaBERT is primarily realized as a pair of ELECTRA-style transformer encoders: BanglaBERT-Base (12 layers, hidden size 768, 12 self-attention heads, ~110M parameters) and BanglaBERT-Large (24 layers, hidden size 1024, 16 heads) (Bhattacharjee et al., 2021; Chakma et al., 2023). Pre-training uses the ELECTRA objective: a small generator, trained with masked language modeling (MLM), proposes replacement tokens at masked positions, while the main model, a discriminator, predicts for each token whether it is original or replaced. The overall loss function is
$$\mathcal{L}(x;\, \theta_G, \theta_D) = \mathcal{L}_{\text{MLM}}(x;\, \theta_G) + \lambda\, \mathcal{L}_{\text{Disc}}(x;\, \theta_D),$$

where, for the masked positions $m$,

$$\mathcal{L}_{\text{MLM}}(x;\, \theta_G) = \mathbb{E}\Big[\sum_{i \in m} -\log p_G\big(x_i \mid x^{\text{masked}}\big)\Big],$$

and

$$\mathcal{L}_{\text{Disc}}(x;\, \theta_D) = \mathbb{E}\Big[\sum_{t=1}^{n} -\mathbb{1}\big(x_t^{\text{corrupt}} = x_t\big)\log D\big(x^{\text{corrupt}}, t\big) - \mathbb{1}\big(x_t^{\text{corrupt}} \neq x_t\big)\log\big(1 - D(x^{\text{corrupt}}, t)\big)\Big].$$
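Schematically, the objective combines a generator and a discriminator as in the sketch below. This is not the authors' training code: the generator and discriminator are stand-in callables assumed to return token-level logits, and the weight λ = 50 follows the original ELECTRA default rather than a documented BanglaBERT-specific value.

```python
import torch
import torch.nn.functional as F

def electra_rtd_loss(generator, discriminator, input_ids, masked_ids,
                     masked_positions, lambda_disc=50.0):
    """ELECTRA-style objective: MLM cross-entropy for the generator on masked
    positions, plus a binary replaced/original loss over all tokens for the
    discriminator. masked_positions is a boolean mask over (batch, seq)."""
    # Generator sees the masked sequence and predicts the original tokens.
    gen_logits = generator(masked_ids)                     # (batch, seq, vocab)
    mlm_loss = F.cross_entropy(gen_logits[masked_positions],
                               input_ids[masked_positions])

    # Replace the masked positions with tokens sampled from the generator.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = input_ids.clone()
    corrupted[masked_positions] = sampled[masked_positions]

    # Discriminator predicts, per position, whether the token was replaced.
    disc_logits = discriminator(corrupted).squeeze(-1)     # assumes (batch, seq, 1) -> (batch, seq)
    is_replaced = (corrupted != input_ids).float()
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    return mlm_loss + lambda_disc * disc_loss
```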
BanglaBERT uses a 32,000-token WordPiece vocabulary, supporting both native Bangla script and romanized or code-mixed inputs (Bhattacharjee et al., 2021). The maximum sequence length is set to 512 during pre-training, and positional embeddings are learned up to this limit (Khondoker et al., 24 Dec 2024).
The core dataset for pre-training, “Bangla2B+,” comprises 27.5GB of deduplicated text crawled from 110 popular Bengali websites, spanning news, encyclopedic, literary, and conversational genres (Bhattacharjee et al., 2021). The tokenization pipeline includes language filtering, subword vocabulary induction, and character normalization steps to ensure broad coverage of Bangla linguistic phenomena.
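For illustration, the released checkpoint and its WordPiece tokenizer can be loaded through Hugging Face transformers; the csebuetnlp/banglabert model id below is assumed to correspond to the public release, and the example sentence is arbitrary.

```python
from transformers import AutoModel, AutoTokenizer

# Assumed public checkpoint id; substitute a local path if using a private copy.
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
encoder = AutoModel.from_pretrained("csebuetnlp/banglabert")

# Tokenize a Bangla sentence, truncating to the 512-token pre-training limit.
encoded = tokenizer("আমি বাংলায় গান গাই", truncation=True, max_length=512,
                    return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
print(encoder(**encoded).last_hidden_state.shape)   # (1, seq_len, 768)
```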
2. Pre-training and Fine-tuning Methodologies
Pre-training employs the ELECTRA replaced token detection (RTD) variant for improved sample efficiency and parameter utilization relative to standard masked language modeling (Bhattacharjee et al., 2021, Faria et al., 28 Jul 2024). Pre-training hyperparameters include batch sizes of 256–512 sequences, Adam(W) optimizer with peak learning rates of 1–2×10⁻⁴, and 2.5M steps with a linear warmup and decay schedule. Training was conducted on Google TPU v3-8s.
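The warmup-then-linear-decay schedule is straightforward to reproduce; the sketch below assumes the transformers scheduler utility and uses the peak rate and step count quoted above, while the warmup length is a hypothetical value (the papers do not report one).

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, peak_lr=2e-4, total_steps=2_500_000, warmup_steps=10_000):
    """AdamW with linear warmup to peak_lr, then linear decay to zero.

    warmup_steps is illustrative only; it is not reported in the cited papers.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    return optimizer, scheduler
```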
For fine-tuning, BanglaBERT exposes the [CLS] token's final hidden state as a sequence summary for classification or regression heads. Across a wide range of Bengali NLP tasks, the following design choices and settings recur (a minimal fine-tuning sketch follows the lists below):
- Learning rates: 1–2×10⁻⁵ for most classification tasks, with a range of 1×10⁻⁵ to 1×10⁻² explored in extensive hyperparameter sweeps (Chowdhury et al., 14 Jan 2024, Chakma et al., 2023, Hossen et al., 15 Jul 2025)
- Batch sizes: typically 8–64, with 16 or 32 common for moderate-resource tasks (Khondoker et al., 24 Dec 2024, Hossen et al., 15 Jul 2025, Jafari et al., 2 Dec 2025)
- Epochs: 2–20, depending on task and early stopping on dev sets
- Optimizer: AdamW, often with weight decay of 0.01
- Dropout: 0.1 in classification heads and sub-layers for regularization
- Gradient clipping and dynamic learning-rate schedules, applied as needed
Fine-tuning incorporates a mix of:
- Task-specific linear heads for classification (single-layer, mapping CLS to logits)
- External lexicon-based preprocessing, e.g., via BSPS (Mahmud et al., 29 Nov 2024)
- Feature-extraction only (frozen transformer) for hybrid ensemble models (Hossen et al., 15 Jul 2025)
- Data augmentation and multi-stage adaptation via external corpora (Chakma et al., 2023)
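A minimal fine-tuning sketch consistent with these recurring settings is shown below; it assumes the csebuetnlp/banglabert checkpoint and a 3-class task, and omits data loading, scheduling, and evaluation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "csebuetnlp/banglabert"   # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=3, hidden_dropout_prob=0.1   # linear head on the [CLS] summary
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

def training_step(texts, labels, device="cuda"):
    """One supervised step: tokenize, forward with labels (cross-entropy), update."""
    model.to(device).train()
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt").to(device)
    out = model(**batch, labels=torch.tensor(labels, device=device))
    optimizer.zero_grad()
    out.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
    optimizer.step()
    return out.loss.item()
```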
3. Applications Across Bengali Language Understanding Tasks
BanglaBERT consistently sets or approaches state-of-the-art in several domains:
A. Sentiment Analysis
Used with cross-entropy loss for multi-class classification, BanglaBERT achieves macro-F₁ > 0.70 on SentNoB (BLUB) (Bhattacharjee et al., 2021), up to 0.718 on BLP-2023 with ensemble variants (Chakma et al., 2023), and binary F₁ = 0.8780 on political sentiment in the Motamot dataset (Faria et al., 28 Jul 2024). Hybrid models combining rule-based outputs (BSPS) with BanglaBERT raise binary sentiment accuracy to 89% (Mahmud et al., 29 Nov 2024), and deep transformer feature concatenation with PCA and voting classification (XMB-BERT) reaches 83.7% accuracy on social media texts (Hossen et al., 15 Jul 2025).
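The frozen-feature hybrid route can be approximated as follows. This is a generic sketch of frozen [CLS] feature extraction, PCA reduction, and soft voting; the base classifiers, PCA dimensionality, and checkpoint id are illustrative and not the exact XMB-BERT or BSPS configurations.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "csebuetnlp/banglabert"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def cls_features(texts, batch_size=16):
    """Frozen-encoder feature extraction: [CLS] hidden states, no gradient updates."""
    feats = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                          max_length=512, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state[:, 0, :]   # (batch, 768)
        feats.append(hidden.cpu().numpy())
    return np.concatenate(feats)

def fit_hybrid(train_texts, train_labels, n_components=50):
    """PCA-reduced frozen features fed to a soft-voting ensemble of shallow classifiers."""
    X = cls_features(train_texts)
    pca = PCA(n_components=n_components).fit(X)
    clf = VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("svm", SVC(probability=True)),
                    ("rf", RandomForestClassifier())],
        voting="soft",
    ).fit(pca.transform(X), train_labels)
    return pca, clf
```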
B. Hate Speech Detection
On the BLP 2025 Shared Task, BanglaBERT achieves micro-F₁ = 0.70 (Subtask 1A: hate type, 6 classes) and 0.68 (Subtask 1B: hate target, 5 classes), outperforming m-BERT and XLM-RoBERTa despite a smaller parameter count (Jafari et al., 2 Dec 2025).
C. Question Answering
When fine-tuned on the NCTB QA dataset (~3,000 passage–question–answer triples), BanglaBERT attains F₁ = 0.75 and EM = 0.53, surpassing both Bengali and multilingual baselines (Khondoker et al., 24 Dec 2024). The study empirically highlights its superior handling of long-range context and its retention of stop-words.
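Inference with such a span-extraction head follows the standard extractive QA pattern; the checkpoint path below is hypothetical and stands in for a model fine-tuned on the NCTB data.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL_ID = "path/to/banglabert-nctb-qa"   # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def answer(question, passage):
    """Extractive QA: pick the highest-scoring start/end positions and decode the span."""
    inputs = tokenizer(question, passage, truncation=True, max_length=512,
                       return_tensors="pt")
    out = model(**inputs)
    start = out.start_logits.argmax(dim=-1).item()
    end = out.end_logits.argmax(dim=-1).item()
    span = inputs["input_ids"][0][start:end + 1]
    return tokenizer.decode(span, skip_special_tokens=True)
```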
D. Depressive Text Classification
On the Bengali Social Media Depressive Dataset (BSMDD), BanglaBERT achieves F₁ = 0.8625 and accuracy of 0.8604 under its best configuration. The model's recall advantage is attributed to monolingual ELECTRA pre-training, though residual weaknesses remain in precision and in handling nuanced colloquial language (Chowdhury et al., 14 Jan 2024).
E. ASR Transcript Disfluency
BanglaBERT (fine-tuned) reaches 84.78% accuracy and F₁ = 0.677 in distinguishing repetition disfluency from morphological reduplication in noisy transcripts. It decisively outperforms mBERT/XLM-R (F₁ ≈ 0.566) and few-shot LLMs (Arpa et al., 17 Nov 2025).
4. Comparative Performance and Efficiency
Evaluation on the BLUB benchmark, which aggregates sentiment, NLI, NER, and QA tasks, shows BanglaBERT with a composite BLUB score of 77.09, outperforming XLM-R Large by 0.3 points despite the latter's fivefold larger parameter count (Bhattacharjee et al., 2021). BanglaBERT also demonstrates stronger sample efficiency: under low-resource (≤1,000 samples) regimes, it exceeds XLM-R Large by 2–9 percentage points macro-F₁ in sentiment and 6–10 points accuracy in NLI.
Resource-wise, BanglaBERT is computationally efficient: fine-tuning time and VRAM usage, measured against several multilingual baselines, consistently favor BanglaBERT, with requirements up to 4.5× lower than XLM-R Large's (Bhattacharjee et al., 2021).
Selected Task Performance Table:
| Task/Benchmark | Score | Evaluation Metric | Reference |
|---|---|---|---|
| BLUB (composite) | 77.09 | Composite (Macro F₁, EM) | (Bhattacharjee et al., 2021) |
| Sentiment (Motamot) | 0.8810 | Accuracy | (Faria et al., 28 Jul 2024) |
| Hate Speech (1A) | 0.70 | Micro-F₁ (6-class) | (Jafari et al., 2 Dec 2025) |
| NCTB QA | 0.75 | F₁ (Span-level) | (Khondoker et al., 24 Dec 2024) |
| Depressive Text | 0.8625 | F₁ (Binary) | (Chowdhury et al., 14 Jan 2024) |
| Reduplication/Disfl. | 0.677 | F₁ (3-way) | (Arpa et al., 17 Nov 2025) |
5. Analysis: Strengths, Limitations, and Design Choices
Strengths:
- Monolingual ELECTRA-based pre-training captures Bangla morphosyntax, idiomatic expressions, and domain-specific word sense more robustly than multilingual encoders (Faria et al., 28 Jul 2024, Hossen et al., 15 Jul 2025, Jafari et al., 2 Dec 2025).
- ELECTRA’s RTD objective delivers higher training efficiency by exposing the discriminator to all tokens, including synthetically replaced ones.
- Performance gains are evident even on small, fine-grained, and out-of-domain tasks, with especially strong recall in high class-imbalance and low-resource regimes (Arpa et al., 17 Nov 2025, Chowdhury et al., 14 Jan 2024).
- Downstream efficiency advantages: faster fine-tuning, lower VRAM requirements, and consistent results even with minimal hyperparameter sweeps (Bhattacharjee et al., 2021).
Limitations:
- Precision lags recall in some scenarios (e.g., depression, negative sentiment), reflecting a moderate tendency toward over-prediction in ambiguous cases (Chowdhury et al., 14 Jan 2024, Mahmud et al., 29 Nov 2024).
- Handling subtle distinctions, such as distinguishing abusive/profane hate speech versus other types, remains challenging (Jafari et al., 2 Dec 2025).
- Datasets such as NCTB QA (~3,000 examples) and Motamot are modest in scale, limiting generalization and requiring further augmentation (Khondoker et al., 24 Dec 2024, Faria et al., 28 Jul 2024).
- Spelling inconsistencies, code-mixing, and noncanonical orthography in inputs can degrade performance, motivating spelling-invariant or character-level representations (Khondoker et al., 24 Dec 2024).
- Results for the larger model (BanglaBERT-Large) are unreported on most downstream tasks, owing to computational budget constraints (Chakma et al., 2023).
6. Future Directions and Open Challenges
Proposed enhancements and research focuses include:
- Expanded pre-training on domain- and style-specific corpora (e.g., educational texts, social media, mixed-script corpora) (Khondoker et al., 24 Dec 2024, Chowdhury et al., 14 Jan 2024).
- Task-adaptive or domain-adaptive pre-training (TAPT/DAPT), though initial results suggest overfitting risks on small in-domain data (Chakma et al., 2023).
- Integration of spelling-invariant and byte-level subword representations (Khondoker et al., 24 Dec 2024).
- Refined classification heads for multi-task, low-resource, or fine-grained objectives (e.g., specialized question typology or multi-head emotion analysis).
- Robustness to code-mixed, colloquial, and rare linguistic forms through data augmentation strategies (LLM-generated synthetic examples, contrastive learning) (Arpa et al., 17 Nov 2025).
- Ensemble or hybrid strategies (e.g., lexicon + transformer fusion, frozen feature concatenation) to maximize coverage and precision (Mahmud et al., 29 Nov 2024, Hossen et al., 15 Jul 2025).
- Auditing for fairness, bias, and equitable performance across regional, gender, and sociolectal subgroups in Bangla (Jafari et al., 2 Dec 2025).
- Improved error analysis and the adoption of statistical significance tests for robust benchmarking.
BanglaBERT, through a combination of language-specific pre-training, efficient and expressive transformer architectures, and continual benchmarking, provides a canonical backbone for Bengali language understanding. Further advances in data coverage, hybrid learning strategies, and model scaling are likely to extend its applicability across domains in low-resource NLP.