BanglaT5-small Transformer Model
- BanglaT5-small is a Transformer-based language model designed for Bangla NLP, featuring 6 encoder and 6 decoder layers with relative positional embeddings.
- It employs a SentencePiece tokenizer with a 32K subword vocabulary to effectively manage Bangla’s agglutinative morphology and reduce out-of-vocabulary issues.
- Pretrained on 27.5 GB of curated Bangla text using a span corruption objective, it achieves state-of-the-art performance in translation, summarization, QA, dialogue, and grammatical error detection.
BanglaT5-small is a Transformer-based encoder–decoder language model developed for NLP tasks in Bangla (Bengali), addressing the language's low-resource setting and complex morphology. The model underlies several state-of-the-art results across generative and classification tasks and forms the backbone of the BanglaNLG benchmark. Its architecture, tokenizer configuration, pretraining data, and empirical performance have positioned it as a leading resource for conditional text generation and language understanding in Bangla (Bhattacharjee et al., 2022; Shahgir et al., 2023).
1. Model Architecture
BanglaT5-small follows the text-to-text transfer Transformer (T5) framework with modifications for model scale and domain adaptation. The “small” variant, as detailed in grammatical error detection work, comprises 6 encoder and 6 decoder layers, with a model dimension ($d_{\text{model}}$) of 512 and a feed-forward dimension ($d_{\text{ff}}$) of 2048. Each layer uses 8 attention heads and relative positional embeddings, and layer normalization is applied at the start of each block (pre-norm) (Shahgir et al., 2023).
The larger “base” variant described for the BanglaNLG benchmark uses a deeper 12-layer encoder and 12-layer decoder, with correspondingly larger $d_{\text{model}}$ and $d_{\text{ff}}$ and 12 attention heads per layer (Bhattacharjee et al., 2022). Gated linear units with GeLU activation (GeGLU) are used in the feed-forward sublayers, combined with pre-norm LayerNorm on every sublayer.
In both settings, the architecture encodes and decodes tokenized Bangla text, with the encoder mapping masked input sequences and the decoder predicting masked spans, sharing the paradigm standard in T5 models.
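To make these dimensions concrete, the following is a minimal sketch of how the reported small-variant hyperparameters map onto a standard T5 configuration in the Hugging Face `transformers` API. No specific checkpoint name is implied, and the GeGLU feed-forward setting is carried over from the base variant above as an assumption for the small model.

```python
# Minimal sketch: a T5 encoder-decoder with the "small" hyperparameters reported above.
# This mirrors the stated layer counts and dimensions only; it does not load an actual
# BanglaT5-small checkpoint.
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=32000,                # SentencePiece vocabulary (Section 2)
    d_model=512,                     # model dimension
    d_ff=2048,                       # feed-forward dimension
    num_layers=6,                    # encoder layers
    num_decoder_layers=6,            # decoder layers
    num_heads=8,                     # attention heads per layer
    feed_forward_proj="gated-gelu",  # GeGLU feed-forward (assumed, as in the base variant)
)

model = T5ForConditionalGeneration(config)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```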
2. Tokenization and Vocabulary
BanglaT5-small employs a SentencePiece unigram language model as its tokenizer, constructing a fixed vocabulary of 32,000 subword tokens (Bhattacharjee et al., 2022). The tokenizer is trained with 0.99995 character coverage over 27.5 GB of normalized Bangla text, enabling robust segmentation across compound words, inflections, and rare morphemes.
The model thus handles Bangla’s agglutinative morphology within the confines of subword representations, reducing the out-of-vocabulary (OOV) problem and supporting transfer across lexical and syntactic variants.
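A minimal sketch of training a comparable SentencePiece unigram tokenizer with the reported settings (32K vocabulary, 0.99995 character coverage); the corpus path and output prefix are placeholders, not artifacts from the actual release.

```python
# Illustrative SentencePiece unigram training with the reported hyperparameters.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="bangla_corpus.txt",       # hypothetical path to normalized Bangla text
    model_prefix="banglat5_spm",     # hypothetical output prefix
    vocab_size=32000,                # subword vocabulary size
    model_type="unigram",            # unigram LM segmentation
    character_coverage=0.99995,      # character coverage reported for the tokenizer
)

sp = spm.SentencePieceProcessor(model_file="banglat5_spm.model")
print(sp.encode("বাংলা ভাষা", out_type=str))  # subword pieces for a short Bangla phrase
```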
3. Pretraining Corpus and Objective
The pretraining corpus, designated “Bangla2B+,” aggregates 27.5 GB of raw Bangla text curated from vetted news domains, Wikipedia exports, and public-domain literature. Careful exclusion of noisy web sources (e.g., CCNet, mC4) is applied to minimize offensive or low-quality pretraining samples (Bhattacharjee et al., 2022).
The pretraining task uses “span corruption,” in which input sequences undergo stochastic masking of contiguous token spans, each replaced with a distinct sentinel token such as <extra_id_0>. The encoder receives the masked sequence, while the decoder predicts the concatenation of the masked spans in order, each prefixed by the corresponding sentinel. The loss is standard cross-entropy over the decoder’s output token sequence, formalized as

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, x_{\text{masked}}\right),$$

where $x_{\text{masked}}$ is the corrupted input sequence and $y_1, \dots, y_T$ is the target sequence of sentinel-prefixed masked spans.
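The simplified sketch below illustrates the span-corruption input/target construction at the word level; the real objective operates on subword tokens with span lengths sampled from a distribution, so the corruption rate and span sampling here are illustrative only.

```python
# Sketch of T5-style span corruption: mask random contiguous spans with sentinel tokens
# and build the target as the concatenation of the masked spans, each prefixed by its
# sentinel. Word-level and simplified for readability.
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = random.randrange(len(tokens))
        for i in range(start, min(start + mean_span_len, len(tokens))):
            masked.add(i)

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            tag = f"<extra_id_{sentinel}>"
            inputs.append(tag)
            targets.append(tag)
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src, tgt = span_corrupt("আমি বাংলায় গান গাই".split())
print(src)   # e.g. "আমি <extra_id_0> গাই"
print(tgt)   # e.g. "<extra_id_0> বাংলায় গান"
```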
Pretraining is performed for 3 million steps on TPU v3-8 hardware, with a batch size of 65,536 tokens per step, the Adam optimizer with linear warm-up, and inverse square-root learning rate decay (Bhattacharjee et al., 2022). The small variant (for GED tasks) was trained for 120 epochs with the AdamW optimizer, a batch size of 128, and the learning rate reported by Shahgir et al. (2023).
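A small sketch of the linear warm-up plus inverse square-root decay schedule; the peak learning rate and warm-up length are illustrative placeholders rather than the values used in the actual runs.

```python
# Illustrative learning-rate schedule: linear warm-up followed by inverse square-root decay.
import math

def inverse_sqrt_lr(step, peak_lr=1e-2, warmup_steps=10_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warm-up
    return peak_lr * math.sqrt(warmup_steps / step)   # inverse square-root decay

for s in (1_000, 10_000, 100_000, 3_000_000):
    print(s, round(inverse_sqrt_lr(s), 6))
```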
4. Task-Specific Fine-Tuning and Post-Processing
Fine-tuning protocols adapt BanglaT5-small to downstream tasks including machine translation, summarization, question answering, dialogue, headline generation, and grammatical error detection (GED). For GED, fine-tuning is conducted on a corpus in which errors are annotated by surrounding them with dollar signs (e.g., “$bhalo na$ boli”), using the standard cross-entropy objective (Shahgir et al., 2023).
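As an illustration of this annotation format, the sketch below builds (input, target) pairs in which erroneous spans of the target are wrapped in dollar signs; the sentence and span indices are invented examples rather than items from the GED corpus.

```python
# Illustrative construction of (input, target) pairs for seq2seq GED fine-tuning,
# where erroneous spans in the target are wrapped in dollar signs.
def make_ged_pair(sentence_tokens, error_spans):
    """error_spans: list of (start, end) token indices marking erroneous spans."""
    target = list(sentence_tokens)
    # Insert markers from right to left so earlier indices stay valid.
    for start, end in sorted(error_spans, reverse=True):
        target[start:end] = ["$"] + target[start:end] + ["$"]
    return " ".join(sentence_tokens), " ".join(target)

src, tgt = make_ged_pair(["ami", "bhalo", "na", "boli"], [(1, 3)])
print(src)  # "ami bhalo na boli"
print(tgt)  # "ami $ bhalo na $ boli"
```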
Predictions on such tasks require sophisticated post-processing to address the model’s tendency for fluent paraphrasing or token reordering. The GED pipeline applies:
- Character-level correction: aligns inputs and outputs at the character level, using a mapping table to recover input tokens for non-matching or bracketed outputs;
- Word-level correction: corrects entire words partially mismatched, with a lookup for common error forms;
- Regular expression rules: handles missing or complex error categories with handcrafted patterns;
- Training set lookup: ensures exact matches produce gold-standard annotated outputs.
Such correction increases the fidelity of error localization, with final performance measured by average Levenshtein Distance between predicted and gold-standard bracketed sentences.
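A self-contained sketch of this evaluation metric, computing the average character-level Levenshtein distance between predicted and gold bracketed sentences with standard dynamic programming; the example strings are illustrative.

```python
# Average character-level Levenshtein distance between predictions and references.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def average_ld(predictions, references):
    return sum(levenshtein(p, r) for p, r in zip(predictions, references)) / len(references)

print(average_ld(["ami $bhalo na$ boli"], ["ami bhalo na $boli$"]))
```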
5. Empirical Performance
On the BanglaNLG tasks (Bhattacharjee et al., 2022), the base-scale BanglaT5 model achieves the following reported metrics:
| Task | BanglaT5 Score | Strongest Baseline |
|---|---|---|
| Machine Translation | 31.3 / 17.4 SacreBLEU | mT5-Base: 30.1 / 17.2 |
| Abstractive Summarization | 13.7 ROUGE-2 | mBART-50: 10.4 |
| QA (EM / F1) | 68.5 / 74.8 | IndicBART unified: 59.6/65.6 |
| Multi-turn Dialogue (BLEU-1) | 19.0 | XLM-ProphetNet: 20.0 |
| Headline Generation (ROUGE-2) | 13.8 | mBART-50: 11.2 |
| Cross-lingual Summarization (ROUGE-2) | 6.4 / 4.0 | XLM-ProphetNet: 6.2 / 2.7 |
On GED, post-processing achieves a final Levenshtein Distance of 1.0394 on a 5,000-sentence test set, compared to the raw model output LD of 3.212. Precision/recall/F1 are unreported, but the method demonstrates strong alignment with gold error spans after correction (Shahgir et al., 2023).
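For reference, the sketch below computes SacreBLEU and ROUGE-2 with the widely used `sacrebleu` and `rouge_score` packages; the official BanglaNLG evaluation scripts may differ in tokenization and preprocessing, so this is not a reproduction of the reported numbers.

```python
# Illustrative metric computation for translation (SacreBLEU) and summarization (ROUGE-2).
import sacrebleu
from rouge_score import rouge_scorer

hyps = ["a translated sentence"]   # model outputs (placeholders)
refs = ["a reference sentence"]    # gold references (placeholders)

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print("SacreBLEU:", round(bleu.score, 1))

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=False)
print("ROUGE-2 F1:", scorer.score(refs[0], hyps[0])["rouge2"].fmeasure)
```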
6. Implementation, Reusability, and Release
BanglaT5-small and related resources are released under a non-commercial CC BY-NC-SA 4.0 license at https://github.com/csebuetnlp/BanglaNLG. The release includes model checkpoints, code, and datasets, as well as recipes for fine-tuning on multiple conditional text generation tasks (Bhattacharjee et al., 2022). The GED approach is readily adaptable: with annotated corpora, character/word-level lookup mappings, and adjusted regular expressions, the pipeline extends to other languages with minimal architectural change (Shahgir et al., 2023). Caution is advised regarding residual data bias and model hallucinations.
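A usage sketch for loading a released checkpoint through `transformers`; the model identifier `csebuetnlp/banglat5` refers to the publicly hosted base-scale checkpoint, and whether a separate small checkpoint is hosted under a similar name is not assumed here. Task-specific fine-tuned checkpoints can be loaded the same way.

```python
# Loading a released BanglaT5 checkpoint for seq2seq inference (base-scale model ID;
# a distinct small checkpoint name is not assumed).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "csebuetnlp/banglat5"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("আমি বাংলায় গান গাই", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```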
7. Significance and Context in Bangla NLP
BanglaT5-small sets a benchmark in Bangla low-resource NLG, outperforming or matching multilingually pretrained models (e.g., mT5, mBART-50) by up to 9 absolute points, corresponding to relative gains of up to 32% on specific tasks (Bhattacharjee et al., 2022). The model supports research in translation, summarization, QA, dialogue generation, and error detection. A plausible implication is that morphologically rich, low-resource languages benefit from large, clean monolingual corpora, span-masking objectives, and domain-adapted models rather than relying solely on massive multilingual pretraining. BanglaT5-small’s architecture, data pipeline, and collective benchmarks thus anchor ongoing efforts in resource construction and transfer learning for Bangla and typologically similar languages.