Bengali Text Classification Overview

Updated 24 January 2026
  • Bengali text classification is the process of assigning structured labels to unstructured Bengali texts, addressing challenges like resource scarcity, morphological richness, and code-mixing.
  • It leverages diverse methods from classical TF-IDF and SVM approaches to deep learning and transformer-based models, achieving significant performance improvements.
  • Research employs innovative annotation schemes, feature engineering techniques, and ensemble strategies to mitigate class imbalance and enhance model robustness.

Bengali Text Classification is the task of assigning structured labels (topic, sentiment, style, intent, emotion, etc.) to unstructured Bengali-language texts, including documents, sentences, or utterances, under resource and morphological challenges characteristic of the language. The field spans classical supervised learning, deep learning, and recent advances in self-supervised and generative modeling, as applied to newspaper categorization, opinion mining, code-mixed sentiment, intent detection, authorship attribution, and communal-violence identification.

1. Datasets and Annotation Paradigms

Bengali text classification research has established a range of datasets that differ in size, source, domain, and label structure.

Annotation strategies vary from manual majority voting (Potrika, BAN-ABSA, Daraz) to machine-guided labeling (LDA topic induction (Ahmad et al., 2022)), code-mixed filtering with mBERT for code-switch detection (Alam et al., 2024), and expert-panel refinement for subjective or violent content (Khondoker et al., 24 Jun 2025). Multi-label schemes are notably used to accommodate noisy labels, and stratified splits are widely adopted to mitigate class imbalance.
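The manual majority-voting step described above can be sketched minimally; the function name and the tie-handling policy (deferring ties to expert adjudication) are illustrative assumptions, not taken from any of the cited datasets:

```python
from collections import Counter

def majority_label(annotations):
    """Return the majority label among annotators, or None on a tie.

    Ties are returned as None so they can be routed to an expert panel,
    mirroring the expert-refinement stage used for contentious content.
    """
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

In practice a dataset pipeline would also log inter-annotator agreement (e.g., Cohen's or Fleiss' kappa) alongside the aggregated label.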

2. Feature Engineering and Embedding Approaches

The progression from lexical to distributed semantic representations in Bengali classification mirrors state-of-the-art multilingual NLP:

  • Bag of Words and TF–IDF: Classical frequency-based vectors and term weighting remain competitive; SVM on normalized TF–IDF yields up to 92.6% F1 on 12-class news categorization (Islam et al., 2017), with similar effectiveness in web document classification (Mandal et al., 2014).
  • Distributional Embeddings: Word2Vec and GloVe support deeper learning models; doc2vec is effective for ML baselines (Ahmad et al., 2022). FastText—particularly BengFastText, trained on 250M Bengali articles—outperforms static embeddings on both classification and analogy tasks (Karim et al., 2020).
  • Sequence and Subword Models: Trainable embedding layers (128- to 300-dimensional) are central for large corpora (Rafi-Ur-Rashid et al., 2023). Subword tokenization underlies transformer input, mitigating OOV effects and morphological sparsity (Alam et al., 2020), and is essential for code-mixed and morphologically complex content (Alam et al., 2024).
  • Hybrid and Lexicon Features: Domain-adapted lexica, such as the 1,500-word sentiment polarity dictionary, produce rule-based pseudo-labels that, when fused with neural models, enhance performance on nuanced sentiment categories (Mahmud et al., 2024).
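A minimal sketch of the classical TF–IDF + SVM approach from the first bullet, using scikit-learn; the toy headlines, labels, and character n-gram settings are illustrative assumptions, not the configurations of the cited papers:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy Bengali headlines (hypothetical); real news corpora are far larger.
texts = ["খেলা শেষ হয়েছে", "নির্বাচন আগামীকাল", "দল জিতেছে আজ", "সংসদ অধিবেশন শুরু"]
labels = ["sports", "politics", "sports", "politics"]

# Character n-grams sidestep word segmentation and soften the OOV problem
# that Bengali's rich morphology creates for word-level features.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), sublinear_tf=True),
    LinearSVC(C=1.0),
)
clf.fit(texts, labels)
```

With real corpora, word-level TF–IDF plus normalization (as in Islam et al., 2017) is the more common configuration; character n-grams are shown here only as one morphology-robust choice.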

3. Model Architectures and Optimization Strategies

A spectrum of model architectures has been systematically benchmarked and extended to Bengali:

  • Classical Learners: SVMs, logistic regression, SGD classifiers, random forests, and k-nearest neighbors, with doc2vec or TF–IDF features, remain robust for high-dimensional structured text (Islam et al., 2017, Mandal et al., 2014).
  • Deep Neural Networks: CNNs, (Bi)GRUs, and (Bi)LSTMs are dominant for sequential modeling, typically initialized with pre-trained embeddings (Ahmad et al., 2022, Hasan et al., 2020, Karim et al., 2020). Hybrid frameworks (e.g., CNN + BiLSTM + Attention) outperform single-branch baselines for headlines and sentiment (Raquib et al., 23 Nov 2025).
  • Generative Models: Deep generative approaches (LSTM-VAE, AC-GAN, AAE) provide compressed, discriminative latent spaces for classification tasks, with AAE features yielding ≈98.4% F1 in seven-way news discrimination and approaching BERT-level performance at a fraction of the vector size (Rafi-Ur-Rashid et al., 2023).
  • Transformers and Pretrained LLMs: mBERT, XLM-RoBERTa, BanglaBERT, and Qwen variants set new performance ceilings in all major subfields. Monolingual BanglaBERT and multilingual XLM-RoBERTa consistently surpass earlier models by 5–29% accuracy across tasks (Alam et al., 2020, Khondoker et al., 24 Jun 2025). LLMs with instruction-tuning, LoRA adapters, and 4-bit QLoRA quantization (e.g., Qwen 2.5-7B, LLaMA 3.x) have achieved up to 72% accuracy and F1≈74.2% on balanced nine-class news categorization (Hoque et al., 17 Jan 2026).
  • Adversarial and Semi-supervised Learning: GAN-augmented BERT (GAN-BnBERT) introduces a generator–discriminator structure that modestly but consistently improves intent classification accuracy (+0.68 pp over standard BERT) and yields smoother convergence (Hasan et al., 2023).
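The attention component of hybrid stacks such as CNN + BiLSTM + Attention can be illustrated in isolation. The sketch below assumes a simple learned-vector dot-product scoring over token states (the vector is random here for demonstration); it is not the exact mechanism of any cited model:

```python
import numpy as np

def attention_pool(H, w):
    """Pool a sequence of token states H (T x d) into one vector.

    scores = H @ w, alpha = softmax(scores), output = alpha @ H.
    In a trained model, w is learned and H comes from a BiLSTM/CNN encoder.
    """
    scores = H @ w
    scores = scores - scores.max()           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # 5 tokens, 8-dim encoder states (toy sizes)
w = rng.normal(size=8)        # attention vector (learned in practice)
ctx, alpha = attention_pool(H, w)
```

The pooled vector `ctx` would feed a softmax classification head; the weights `alpha` double as a lightweight interpretability signal over tokens.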

4. Task-Specific Adaptations and Multilingual Challenges

Bengali text classification confronts low-resource constraints, morphological complexity, code-mixing, and class imbalance:

  • Code-mixed Sentiment: BnSentMix demonstrates that fine-tuned transformers (English BERT, mBERT, XLM-RoBERTa, BanglaBERT) perform on par (Acc/F1 ≈ 69.8%) for code-mixed Bengali–English sentiment, with "mixed" labels being hardest to accurately identify (Alam et al., 2024).
  • Fine-grained and Multi-label Tasks: Multi-label and hierarchical variants (LDA+KNN for topic, hybrid lexicon-BERT cascades for nine-class sentiment (Mahmud et al., 2024)) address nuanced opinions, aspect-based sentiment, and ambiguous utterances.
  • Commonsense and Contextual Cues: Integrating explicit linguistic knowledge (e.g., Bengali WordNet-based sense definitions (Pal et al., 2015)) or detailed stylometric markers (76 stylometric features (Chakraborty et al., 2012)) improves performance on disambiguation tasks.
  • Mitigating Imbalanced and Noisy Data: Undersampling/oversampling, weighted loss, and paraphrastic augmentation reduce bias towards majority classes (Raquib et al., 23 Nov 2025, Khondoker et al., 24 Jun 2025), with ensemble voting further enhancing stability and generalizability.
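The weighted-loss strategy in the last bullet typically derives per-class weights from inverse label frequency. A minimal sketch, mirroring scikit-learn's "balanced" heuristic (the function name is hypothetical):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight class c by N / (K * n_c): total samples over
    (number of classes times class count), so rarer classes
    contribute more to the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}
```

These weights are then passed to the loss function (e.g., as per-class weights in cross-entropy) so the model is not dominated by majority classes.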

5. Performance Benchmarks and Comparative Analysis

Empirical results across major benchmarks consistently establish the superiority of transformer-based architectures, but also provide strong baselines from classical and hybrid models. Illustrative summary:

| Task/Corpus | Best Model | Accuracy (%) | Macro-F1 (%) | Reference |
|---|---|---|---|---|
| 9-class News (Kaggle Prothom Alo, balanced) | Qwen 2.5-7B + QLoRA | 72.0 | ~74.2 | (Hoque et al., 17 Jan 2026) |
| 8-class News (Potrika) | GRU + FastText | 91.83 | ~90 | (Ahmad et al., 2022) |
| 7-class News (Fang et al.) | AAE (32-dim), BERT (768-dim) | 98.4 (AAE) | 99.1 (BERT) | (Rafi-Ur-Rashid et al., 2023) |
| 5-class News (BengFastText) | BengFastText + MConv-LSTM | — | 87.1 | (Karim et al., 2020) |
| 30-class Intent (BNIntent30) | GAN-BnBERT | 96.73 | 96.7 | (Hasan et al., 2023) |
| 4-class Headline (BAN-ABSA, imbalanced) | BERT-CNN-BiLSTM | 81.37 | 81.54 | (Raquib et al., 23 Nov 2025) |
| 6-class Emotion (UBMEC) | mBERT | 61 | 71.03 | (Sourav et al., 2022) |
| Sentiment, 9-way (Daraz reviews) | BSPS→BanglaBERT Hybrid Pipeline | 89 | 89 | (Mahmud et al., 2024) |
| Code-mixed Sentiment (BnSentMix, 4-class) | BERT / XLM-RoBERTa | 69.8 | 69.1 | (Alam et al., 2024) |
| Communal Violence (Ensemble, 4-class) | BanglaBERT Ensemble | — | 63 | (Khondoker et al., 24 Jun 2025) |

Metrics are macro-averaged where reported and reflect held-out or test splits. Classical models (linear SVM, kNN on Doc2Vec) remain competitive (F1 ≈ 87–90%) on balanced news tasks (Islam et al., 2017, Mandal et al., 2014). Adversarial, generative, or hybrid approaches close the gap with resource-intensive models without proportional memory overhead.
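Since the table reports macro-averaged F1, a from-scratch sketch of the metric clarifies why it rewards minority-class performance: each class contributes equally regardless of its support (the function name is illustrative):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Unlike accuracy or micro-F1, every class counts equally,
    so poor minority-class performance is not hidden."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

This also explains entries like UBMEC in the table, where macro-F1 and raw accuracy can diverge substantially on imbalanced label distributions.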

6. Interpretability, Error Analysis, and Future Directions

Interpretation of Bengali text classification models increasingly employs representation probing (cosine similarity of word embeddings to diagnose class confusion), LIME for local token-weight explanations, and error cluster analysis. Common error types include:

  • Confusion among semantically or temporally proximate labels (e.g., time vs. date vs. distance in intent) (Hasan et al., 2023)
  • Overlap in embedding space between communal and non-communal lexicon, leading to high-confidence misclassifications (Khondoker et al., 24 Jun 2025)
  • Borderline polarity (Slightly Positive/Neutral/Negative), especially in fine-grained sentiment (Mahmud et al., 2024)
  • Underperformance on minority or mixed sentiment classes due to intrinsic imbalance (Alam et al., 2024)
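The embedding-overlap diagnosis mentioned above (cosine similarity of representations to flag likely class confusion) can be sketched as a pairwise comparison of class centroids; the function names and toy vectors are illustrative assumptions:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def class_centroid_similarity(embs, labels):
    """Cosine similarity between the mean embedding of each label pair.

    High similarity between two class centroids flags label pairs the
    classifier is likely to confuse (e.g., communal vs. non-communal)."""
    cents = {c: np.mean([e for e, l in zip(embs, labels) if l == c], axis=0)
             for c in set(labels)}
    cs = sorted(cents)
    return {(a, b): cosine(cents[a], cents[b])
            for i, a in enumerate(cs) for b in cs[i + 1:]}
```

In practice `embs` would be sentence or token embeddings from the trained model, and highly similar pairs would then be inspected with LIME-style local explanations.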

Research priorities include large-scale domain-adaptive pretraining for Bengali, smarter data augmentation (context-aware paraphrase, back-translation), margin-based and contrastive objectives for separation of confusable classes, hierarchical and multitask approaches, and further integration of lexicon-driven and pretrained semantic representations.

Ensemble meta-models, parameter-efficient fine-tuning (LoRA/QLoRA), and hybrid rule–neural pipelines are emerging as best practices for competitive accuracy under computational constraints. Benchmarks, datasets, and codebases are increasingly open-source, supporting reproducibility and rapid field advancement.
