Bengali Text Classification Overview
- Bengali text classification is the process of assigning structured labels to unstructured Bengali texts, addressing challenges like resource scarcity, morphological richness, and code-mixing.
- It leverages diverse methods from classical TF-IDF and SVM approaches to deep learning and transformer-based models, achieving significant performance improvements.
- Research employs innovative annotation schemes, feature engineering techniques, and ensemble strategies to mitigate class imbalance and enhance model robustness.
Bengali Text Classification is the task of assigning structured labels (topic, sentiment, style, intent, emotion, etc.) to unstructured Bengali-language texts, including documents, sentences, or utterances, under resource and morphological challenges characteristic of the language. The field spans classical supervised learning, deep learning, and recent advances in self-supervised and generative modeling, as applied to newspaper categorization, opinion mining, code-mixed sentiment, intent detection, authorship attribution, and communal-violence identification.
1. Datasets and Annotation Paradigms
Bengali text classification research has established a range of datasets differing in size, source, domain, and label structure. Major contributions include:
- News Classification: Potrika (664,880 manually labeled articles, 8 classes) (Ahmad et al., 2022), BAN-ABSA headlines dataset (9,014 headlines, each with topic and sentiment labels) (Raquib et al., 23 Nov 2025), Kaggle Prothom Alo (437,948 articles, 9 classes analyzed) (Hoque et al., 17 Jan 2026), and the Seven-Class corpus (212,184 news/blog articles) (Rafi-Ur-Rashid et al., 2023).
- Sentiment and Emotion: User reviews from Daraz (15,194 manually reviewed entries, mapped to 2 or 9 sentiment classes) (Mahmud et al., 2024), UBMEC for six-way emotion (13,072 samples) (Sourav et al., 2022), five public Bangla sentiment datasets (SAIL, ABSA, BengFastText, YouTube, CogniSenti) (Hasan et al., 2020), code-mixed BnSentMix (20,000 annotated social/e-commerce texts, 4 sentiment classes) (Alam et al., 2024).
- Intent and Question Classification: BNIntent30 (4,433 utterances, 30 intent classes) (Hasan et al., 2023), two-stage question typing pipeline (Rahman et al., 2019).
- Document and Hate Speech: Large genre-diverse document collections (376k+ articles for document classification, 35k for hate speech) (Karim et al., 2020), and the communal-violence detection corpus of Tasnim et al. (12,791 curated instances) (Khondoker et al., 24 Jun 2025).
- Stylometry and Authorship: 450 balanced literary stories for three-way authorship (Chakraborty, 2012), 90 samples for fine-grained stylometry (Chakraborty et al., 2012).
Annotation strategies range from manual majority voting (Potrika, BAN-ABSA, Daraz) to machine-guided labeling (LDA topic induction (Ahmad et al., 2022)), code-mixed filtering with mBERT for code-switch detection (Alam et al., 2024), and expert-panel refinement for subjective or violent content (Khondoker et al., 24 Jun 2025). Also notable are the use of multi-label schemes to handle noisy labels and the prevalence of stratified splits to mitigate class imbalance.
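The stratified splitting these studies rely on can be sketched in plain Python; the toy labels and the 80/20 ratio below are illustrative, not drawn from any cited corpus:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_frac=0.2, seed=42):
    """Split (sample, label) pairs so each class contributes the same
    train/test proportion, keeping minority classes represented in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(round(len(xs) * test_frac))
        test.extend((x, y) for x in xs[:cut])
        train.extend((x, y) for x in xs[cut:])
    return train, test

texts = [f"doc{i}" for i in range(100)]
labels = ["sports"] * 80 + ["economy"] * 20   # deliberately imbalanced toy labels
train, test = stratified_split(texts, labels)
# Each class contributes ~20% of its own items to the test split.
```

In practice `sklearn.model_selection.train_test_split(..., stratify=labels)` does the same job; the sketch only makes the per-class bookkeeping explicit.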
2. Feature Engineering and Embedding Approaches
The progression from lexical to distributed semantic representations in Bengali classification mirrors state-of-the-art multilingual NLP:
- Bag of Words and TF–IDF: Classical frequency-based vectors and term weighting remain competitive; SVM on normalized TF–IDF yields up to 92.6% F1 on 12-class news categorization (Islam et al., 2017), with similar effectiveness in web document classification (Mandal et al., 2014).
- Distributional Embeddings: Word2Vec and GloVe support deeper learning models; doc2vec is effective for ML baselines (Ahmad et al., 2022). FastText—particularly BengFastText, trained on 250M Bengali articles—outperforms static embeddings on both classification and analogy tasks (Karim et al., 2020).
- Sequence and Subword Models: Trainable embedding layers (128- to 300-dimensional) are central for large corpora (Rafi-Ur-Rashid et al., 2023). Subword tokenization underlies transformer input, mitigating OOV effects and morphological sparsity (Alam et al., 2020), which is essential for code-mixed and morphologically complex content (Alam et al., 2024).
- Hybrid and Lexicon Features: Domain-adapted lexica, such as the 1,500-word sentiment polarity dictionary, produce rule-based pseudo-labels that, when fused with neural models, enhance performance on nuanced sentiment categories (Mahmud et al., 2024).
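As a concrete reference point for the TF–IDF weighting above, here is a minimal stdlib sketch (raw term frequency, smoothed IDF, L2 normalization); the exact weighting variants in the cited papers may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts.
    Uses raw term frequency and smoothed inverse document frequency."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                  # document frequency per term
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: f * idf[t] for t, f in tf.items()}
        # L2-normalize so linear models (e.g. SVM) see comparable magnitudes
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

# Toy pre-tokenized Bengali documents (sports vs. market vocabulary):
docs = [["খেলা", "দল", "গোল"], ["বাজার", "দাম", "দাম"]]
vecs = tfidf_vectors(docs)
```

A repeated term ("দাম") receives a proportionally higher weight than a singleton in the same document; `sklearn.feature_extraction.text.TfidfVectorizer` implements the same scheme at scale.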
3. Model Architectures and Optimization Strategies
A spectrum of model architectures has been systematically benchmarked and extended to Bengali:
- Classical Learners: SVMs, logistic regression, SGD classifiers, random forests, and k-nearest neighbors, with doc2vec or TF–IDF features, remain robust for high-dimensional structured text (Islam et al., 2017, Mandal et al., 2014).
- Deep Neural Networks: CNNs, (Bi)GRUs, and (Bi)LSTMs are dominant for sequential modeling, typically initialized with pre-trained embeddings (Ahmad et al., 2022, Hasan et al., 2020, Karim et al., 2020). Hybrid frameworks (e.g., CNN + BiLSTM + Attention) outperform single-branch baselines for headlines and sentiment (Raquib et al., 23 Nov 2025).
- Generative Models: Deep generative approaches (LSTM-VAE, AC-GAN, AAE) provide compressed, discriminative latent spaces for classification tasks, with AAE features yielding ≈98.4% F1 in seven-way news discrimination and approaching BERT-level performance at a fraction of the vector size (Rafi-Ur-Rashid et al., 2023).
- Transformers and Pretrained LLMs: mBERT, XLM-RoBERTa, BanglaBERT, and Qwen variants set new performance ceilings in all major subfields. Monolingual BanglaBERT and multilingual XLM-RoBERTa consistently surpass earlier models by 5–29% accuracy across tasks (Alam et al., 2020, Khondoker et al., 24 Jun 2025). LLMs with instruction-tuning, LoRA adapters, and 4-bit QLoRA quantization (e.g., Qwen 2.5-7B, LLaMA 3.x) have achieved up to 72% accuracy and F1≈74.2% on balanced nine-class news categorization (Hoque et al., 17 Jan 2026).
- Adversarial and Semi-supervised Learning: GAN-augmented BERT (GAN-BnBERT) introduces a generator-discriminator structure that modestly but consistently improves intent classification (+0.68 pp accuracy over standard BERT) and yields smoother convergence (Hasan et al., 2023).
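The attention pooling used in hybrids such as CNN + BiLSTM + Attention can be illustrated with a small NumPy sketch; the stand-in hidden states, dimensions, and the scoring form (tanh followed by a dot product with a learned vector) are assumptions for illustration, not the exact layer from any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(H, v):
    """Additive-style attention pooling: score each timestep of the
    encoder output H (T x d) against a context vector v (d,),
    softmax the scores, and return the weighted sum (d,)."""
    scores = np.tanh(H) @ v                  # (T,) relevance per timestep
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights @ H, weights              # pooled (d,), weights (T,)

T, d = 12, 8                  # toy sequence length and hidden size
H = rng.normal(size=(T, d))   # stand-in for BiLSTM hidden states
v = rng.normal(size=d)        # stand-in for a learned attention vector
pooled, weights = attention_pool(H, v)
```

Unlike last-state or mean pooling, the softmax weights let the classifier focus on the few tokens that carry the class signal, which is why such hybrids tend to beat single-branch baselines on short headlines.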
4. Task-Specific Adaptations and Multilingual Challenges
Bengali text classification confronts low-resource constraints, morphological complexity, code-mixing, and class imbalance:
- Code-mixed Sentiment: BnSentMix demonstrates that fine-tuned transformers (English BERT, mBERT, XLM-RoBERTa, BanglaBERT) perform on par (Acc/F1 ≈ 69.8%) for code-mixed Bengali-English sentiment, with the "mixed" label being the hardest to identify accurately (Alam et al., 2024).
- Fine-grained and Multi-label Tasks: Multi-label and hierarchical variants (LDA+KNN for topic, hybrid lexicon-BERT cascades for nine-class sentiment (Mahmud et al., 2024)) address nuanced opinions, aspect-based sentiment, and ambiguous utterances.
- Commonsense and Contextual Cues: Integrating explicit linguistic knowledge (e.g., Bengali WordNet-based sense definitions (Pal et al., 2015)) or detailed stylometric markers (76 features (Chakraborty et al., 2012)) improves disambiguation tasks.
- Mitigating Imbalanced and Noisy Data: Undersampling/oversampling, weighted loss, and paraphrastic augmentation reduce bias towards majority classes (Raquib et al., 23 Nov 2025, Khondoker et al., 24 Jun 2025), with ensemble voting further enhancing stability and generalizability.
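The weighted-loss idea above can be made concrete with inverse-frequency class weights; the label names and the 90/10 imbalance are hypothetical:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency so a weighted loss
    penalizes mistakes on minority classes more heavily.
    Normalized so the average weight over samples equals 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = ["non-violent"] * 90 + ["communal"] * 10   # toy 90/10 imbalance
weights = inverse_frequency_weights(labels)
# The minority class ends up with a 9x larger per-sample weight.
```

This is the same balancing rule as scikit-learn's `class_weight="balanced"`; frameworks such as PyTorch accept the resulting per-class weights directly in their cross-entropy loss.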
5. Performance Benchmarks and Comparative Analysis
Empirical results across major benchmarks consistently establish the superiority of transformer-based architectures, but also provide strong baselines from classical and hybrid models. Illustrative summary:
| Task/Corpus | Best Model | Accuracy (%) | Macro-F1 (%) | Reference |
|---|---|---|---|---|
| 9-class News (Kaggle Prothom Alo, balanced) | Qwen 2.5-7B + QLoRA | 72.0 | ~74.2 | (Hoque et al., 17 Jan 2026) |
| 8-class News (Potrika) | GRU + FastText | 91.83 | ~90 | (Ahmad et al., 2022) |
| 7-class News (Fang et al.) | AAE (32-dim), BERT (768-dim) | 98.4 (AAE) | 99.1 (BERT) | (Rafi-Ur-Rashid et al., 2023) |
| 5-class News (BengFastText) | BengFastText + MConv-LSTM | — | 87.1 | (Karim et al., 2020) |
| 30-class Intent (BNIntent30) | GAN-BnBERT | 96.73 | 96.7 | (Hasan et al., 2023) |
| 4-class Headline (BAN-ABSA, imbalanced) | BERT-CNN-BiLSTM | 81.37 | 81.54 | (Raquib et al., 23 Nov 2025) |
| 6-class Emotion (UBMEC) | mBERT | 61 | 71.03 | (Sourav et al., 2022) |
| Sentiment, 9-way (Daraz reviews) | BSPS→BanglaBERT Hybrid Pipeline | 89 | 89 | (Mahmud et al., 2024) |
| Code-mixed Sentiment (BnSentMix, 4-class) | BERT / XLM-RoBERTa | 69.8 | 69.1 | (Alam et al., 2024) |
| Communal Violence (Ensemble, 4-class) | BanglaBERT Ensemble | — | 63 | (Khondoker et al., 24 Jun 2025) |
Metrics are macro-averaged where reported and reflect held-out or test splits. Classical models (linear SVM, kNN on Doc2Vec) remain competitive (F1 ≈ 87–90%) on balanced news tasks (Islam et al., 2017, Mandal et al., 2014). Adversarial, generative, or hybrid approaches close the gap with resource-intensive models without proportional memory overhead.
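Macro-averaged F1, the headline metric in the table above, gives every class equal weight regardless of support; a minimal reference implementation (the toy labels are hypothetical):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1 from TP/FP/FN counts,
    then take the unweighted mean so minority classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the mean, a model that ignores a rare class (common under the imbalance discussed above) is penalized far more than its plain accuracy would suggest; `sklearn.metrics.f1_score(..., average="macro")` computes the same quantity.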
6. Interpretability, Error Analysis, and Future Directions
Interpretability work on Bengali text classification models increasingly employs representation probing (cosine similarity of word embeddings to diagnose class confusion), LIME for local token-weight explanations, and error cluster analysis. Common error types include:
- Confusion among semantically or temporally proximate labels (e.g., time vs. date vs. distance in intent) (Hasan et al., 2023)
- Overlap in embedding space between communal and non-communal lexicon, leading to high-confidence misclassifications (Khondoker et al., 24 Jun 2025)
- Borderline polarity (Slightly Positive/Neutral/Negative), especially in fine-grained sentiment (Mahmud et al., 2024)
- Underperformance on minority or mixed sentiment classes due to intrinsic imbalance (Alam et al., 2024)
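Representation probing via cosine similarity, as used to diagnose several of the confusions above, flags label pairs whose embeddings sit too close together; the class-centroid vectors below are hypothetical toy values, not measured from any cited model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 4-dim class-centroid embeddings for three intent labels:
centroids = {
    "time":    [0.9, 0.1, 0.0, 0.1],
    "date":    [0.8, 0.2, 0.1, 0.1],
    "weather": [0.0, 0.1, 0.9, 0.2],
}
# A high centroid similarity (time vs. date) predicts exactly the kind of
# high-confidence confusion reported for semantically proximate labels.
sim_time_date = cosine(centroids["time"], centroids["date"])
sim_time_weather = cosine(centroids["time"], centroids["weather"])
```

In a real diagnosis the centroids would be mean contextual embeddings per class; pairs above a similarity threshold are candidates for margin-based or contrastive separation.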
Research priorities include large-scale domain-adaptive pretraining for Bengali, smarter data augmentation (context-aware paraphrase, back-translation), margin-based and contrastive objectives for separation of confusable classes, hierarchical and multitask approaches, and further integration of lexicon-driven and pretrained semantic representations.
Ensemble meta-models, parameter-efficient fine-tuning (LoRA/QLoRA), and hybrid rule–neural pipelines are emerging as best practices for competitive accuracy under computational constraints. Benchmarks, datasets, and codebases are increasingly open-source, supporting reproducibility and rapid field advancement.
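Hard majority voting, the simplest of the ensemble strategies mentioned above, can be sketched as follows; the three model outputs are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble: for each sample, take the label most models
    agree on; ties break toward the earliest model holding a top vote."""
    ensembled = []
    for votes in zip(*predictions):          # one tuple of votes per sample
        counts = Counter(votes)
        top = counts.most_common(1)[0][1]    # size of the largest vote bloc
        ensembled.append(next(v for v in votes if counts[v] == top))
    return ensembled

# Hypothetical per-sample predictions from three fine-tuned models:
model_a = ["pos", "neg", "neu"]
model_b = ["pos", "pos", "neu"]
model_c = ["neg", "pos", "pos"]
combined = majority_vote([model_a, model_b, model_c])
```

Soft voting (averaging class probabilities) or a learned meta-model generally outperforms hard voting when calibrated scores are available, at the cost of needing those scores from every member.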