Bangla Headline Classification
- Bangla news headline classification is the task of assigning structured thematic and affective labels to headlines, enabling analytics, sentiment analysis, and content routing.
- Datasets like Potrika, BAN-ABSA, and clickbait corpora use NLP augmentation and resampling to mitigate class imbalance and enrich model training.
- Advanced architectures, including GRU+FastText and transformer-based models, achieve state-of-the-art accuracy up to 93.4%, outperforming traditional methods.
Bangla news headline classification is the task of assigning structured thematic or affective labels to headline texts written in the Bangla (Bengali) language, enabling downstream applications such as topic analytics, content routing, bias detection, and real-time information filtering in a low-resource setting. Recent work has established state-of-the-art neural models and publicly available benchmark datasets, systematically addressing class imbalance, model capacity, and transfer learning with a research focus that now spans news categories, sentiment, emotion, and clickbait detection.
1. Datasets and Data Preparation
Large annotated Bangla headline corpora have made rigorous benchmarking possible. The Potrika dataset provides 664,880 news articles with single-label categorization across eight topics: National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology, with both raw (imbalanced) and balanced (via NLP augmentation, notably back-translation) versions available (Ahmad et al., 2022). Headlines extracted from Potrika were used by Ahmad et al. (2022) to develop manual and automatic labelling pipelines, yielding balanced splits for robust evaluation. The BAN-ABSA dataset offers 9,014 headlines annotated for both headline aspect (4 classes: Other, Politics, Religion, Sports) and sentiment (Negative, Positive, Neutral), with the majority class outnumbering the minority class by up to 1.9× for aspect and 2.8× for polarity (Raquib et al., 23 Nov 2025). Clickbait detection leverages a 15,056-headline corpus manually annotated by expert linguists, augmented with 65,406 unlabeled examples, both preprocessed via Unicode normalization, deduplication, and stop-word removal (Mahtab et al., 2023). For emotion analysis and affective bias, headline corpora exceeding 300,000 items have been analyzed via zero-shot inference for fine-grained emotion labels (Ameen et al., 20 Oct 2025).
Data preprocessing standards typically include Unicode (NFC) normalization, punctuation and symbol removal, stop-word filtering (using curated Bangla lists), stemming via rule-based or subword n-gram stemmers, and whitespace tokenization. For deep learning (DL), headlines are commonly clipped to 20–64 tokens for sequence models; transformer pipelines use subword tokenization (e.g., byte-pair encoding or WordPiece) with pretrained vocabularies up to 250k tokens (Alam et al., 2020, Ahmad et al., 2022).
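The preprocessing steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline of any cited paper: the function name `preprocess`, the tiny `STOP_WORDS` set, and the default clip length of 64 are assumptions, and stemming is omitted.

```python
import unicodedata

# Hypothetical, tiny stop-word set for illustration only;
# real pipelines use curated Bangla stop-word lists.
STOP_WORDS = {"ও", "এবং", "থেকে", "এই", "যে"}

def preprocess(headline: str, max_tokens: int = 64) -> list[str]:
    # 1. Unicode NFC normalization so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", headline)
    # 2. Replace punctuation (category P*) and symbols (category S*) with spaces.
    #    Bangla vowel signs and the virama are marks (M*), so they are kept.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in {"P", "S"} else ch
        for ch in text
    )
    # 3. Whitespace tokenization.
    tokens = cleaned.split()
    # 4. Stop-word filtering.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Clip to a fixed maximum length for sequence models.
    return tokens[:max_tokens]

print(preprocess("আজ বৃষ্টি ও তাপমাত্রা, কমবে!"))
```

Filtering by Unicode category rather than by an ASCII punctuation list also removes the Bangla danda (।), which ASCII-only rules would miss.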
2. Feature Engineering and Embedding Techniques
Classical approaches rely on Bag-of-Words (BoW) and TF–IDF, utilizing vocabulary sizes up to 3,000 unigrams/bigrams (Ahmad et al., 2022, Ahmad et al., 2022). Embedding-based representations leverage distributed models:
- Word2Vec, GloVe, FastText: Pretrained on full news corpora; headline embeddings by average pooling of token vectors (dimension=300). FastText offers robust OOV handling via subword n-grams, critical for morphologically rich Bangla and noisy headline input.
- Doc2Vec: Distributed memory, 300 dimensions, provides a dense vector for each headline.
- Transformer embeddings: Pretrained BERT-based or XLM-RoBERTa-based encoders, with the [CLS] token's final hidden state as the aggregate headline representation (Alam et al., 2020, Raquib et al., 23 Nov 2025).
For fine-tuning, input sequences are tokenized and padded (≤64 for news classification, ≤300 for joint aspect/sentiment pipelines), with embedded representations feeding into classification architectures or fusion modules.
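Two of the ideas above, subword n-grams for OOV handling and average pooling of token vectors, can be illustrated without any embedding library. The sketch below assumes FastText-style boundary markers and the commonly used n-gram range of 3 to 6; the function names are hypothetical, and unlike FastText the sketch does not add the whole word as an extra unit or hash n-grams into buckets.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams over Unicode code points, with < and > marking word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

def average_pool(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equal-length token vectors; [] for an empty headline."""
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# An inflected form shares n-grams with its base form, so an unseen word
# can still receive a vector built from known subword units.
shared = set(char_ngrams("খেলা")) & set(char_ngrams("খেলার"))
print(sorted(shared))
print(average_pool([[1.0, 2.0], [3.0, 4.0]]))
```

Note that slicing a Python string works on code points, so a Bangla consonant and its vowel sign count as separate positions within an n-gram.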
3. Classification Architectures and Training Paradigms
A spectrum of supervised and semi-supervised architectures has been established:
- Classical ML: Logistic Regression (LR), Linear SVM, SGDClassifier, Random Forest (RF), K-Nearest Neighbor (KNN). These are built atop BoW, TF-IDF, or Doc2Vec features (Ahmad et al., 2022, Ahmad et al., 2022). Regularization, Laplace smoothing, and linear kernels are standard choices.
- Neural DL: LSTM, BiLSTM, GRU, CNN. Architectures typically use pretrained static embeddings as non-trainable input layers, followed by task-specific layers (e.g., stacked GRU with dropout, CNN with multi-window convolution and global max-pooling) (Ahmad et al., 2022). GRU+FastText achieves the highest single-label accuracy (91.8%) on balanced headline sets (Ahmad et al., 2022).
- Hybrid and Transformer Models: The BERT-CNN-BiLSTM framework applies a Bangla-BERT encoder (hidden size 768) whose output is processed in parallel by a CNN (three 1-D filter banks, window sizes 2–4, 128 filters each) and a BiLSTM (hidden size 128 per direction); the two branch outputs are concatenated and passed to a final classification layer. After batch-norm, dropout (0.35), and dense projection, softmax yields class probabilities (Raquib et al., 23 Nov 2025). This model attains state-of-the-art results for both headline and joint sentiment-aspect classification.
- Transformer Fine-tuning: Multilingual BERT (mBERT), XLM-RoBERTa-base, and XLM-RoBERTa-large (up to 550M parameters) yield up to 93.4% accuracy and 93.4% weighted-F1 on six-category Bangla headline classification, exceeding the best prior result (FT-WC, 64.8% accuracy) by 28.6 percentage points (Alam et al., 2020).
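The multi-window convolution with global max-pooling mentioned in the list above can be shown in plain Python on a toy input. This is a sketch under stated assumptions, not the cited authors' implementation: weights are random rather than trained, the activation function is omitted, and the names `conv1d_global_max` and `multi_window_features` are hypothetical.

```python
import random

def conv1d_global_max(seq, filters):
    """seq: T x d token vectors. filters: list of (k x d) weight matrices sharing window size k.
    Returns one value per filter: its maximum response over all valid positions."""
    k = len(filters[0])  # window size of this filter bank
    d = len(seq[0])      # embedding dimension
    if len(seq) < k:
        raise ValueError("sequence shorter than window; pad it first")
    pooled = []
    for w in filters:
        responses = [
            sum(w[j][c] * seq[t + j][c] for j in range(k) for c in range(d))
            for t in range(len(seq) - k + 1)
        ]
        pooled.append(max(responses))  # global max-pooling over time
    return pooled

def multi_window_features(seq, window_sizes=(2, 3, 4), n_filters=128, seed=0):
    """Concatenate pooled outputs of one filter bank per window size (random weights)."""
    rng = random.Random(seed)
    d = len(seq[0])
    features = []
    for k in window_sizes:
        bank = [[[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k)]
                for _ in range(n_filters)]
        features.extend(conv1d_global_max(seq, bank))
    return features

toy_seq = [[0.1 * (t + c) for c in range(4)] for t in range(5)]  # 5 tokens, dimension 4
print(len(multi_window_features(toy_seq, n_filters=8)))
```

With 128 filters per bank, the three banks give a 384-dimensional CNN feature vector. If the BiLSTM branch is summarized by its two final hidden states (an assumption; the pooling used in the hybrid model is not specified here), it contributes 2 × 128 = 256 dimensions, for a fused vector of 640.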
Semi-supervised adversarial approaches such as GAN-BanglaBERT employ a transformer-based encoder as a discriminator, augmented by a two-layer generator trained to produce realistic feature representations, enhancing clickbait detection F1 to 75.1% (vs. 71.7% for supervised BanglaBERT, 76.8% for human annotators) (Mahtab et al., 2023).
4. Handling Imbalance, Augmentation, and Labelling Strategies
Highly imbalanced datasets necessitate explicit resampling or augmentation:
- Oversampling/Undersampling: BAN-ABSA experiments with both resample→split and split→resample approaches. Oversampling before data split improves balanced test F1 (79.4% for headline task), but the highest raw accuracy (81.4%) is achieved by training directly on original imbalanced data (Raquib et al., 23 Nov 2025).
- Data Augmentation: Potrika balanced corpus is built by back-translation (Bangla→English→Bangla), repeated until minority class quotas are filled (Ahmad et al., 2022). Synonym replacement, insertion, swap, and deletion operations (Easy Data Augmentation) further diversify training data. For deep neural models, class balancing is essential for high accuracy, with back-translation reducing bias.
- Automatic Labelling with LDA: Latent Dirichlet Allocation (LDA) is used to auto-assign single- or multi-labels to unlabelled headlines by topic mixture, with performance up to 57.2% (single-label) and 75% (multi-label) accuracy depending on threshold (Ahmad et al., 2022). However, manual labelling and balancing consistently yield superior results.
- Label smoothing and weighted cross-entropy are employed to stabilize training on class-imbalanced data (Raquib et al., 23 Nov 2025, Alam et al., 2020).
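The split→resample ordering discussed above can be sketched as follows. The example uses random duplication in place of back-translation, and the function names, the 20% test fraction, and the seeds are assumptions for illustration. Resampling only the training portion keeps duplicated examples out of the test set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, labels, test_frac=0.2, seed=0):
    """Split per class so train and test keep the original label proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(items, labels):
        by_class[y].append(x)
    train, test = [], []
    for y, xs in by_class.items():
        xs = xs[:]
        rng.shuffle(xs)
        n_test = max(1, round(len(xs) * test_frac))
        test += [(x, y) for x in xs[:n_test]]
        train += [(x, y) for x in xs[n_test:]]
    return train, test

def oversample(pairs, seed=0):
    """Randomly duplicate minority-class examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in pairs:
        by_class[y].append((x, y))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced += group + [rng.choice(group) for _ in range(target - len(group))]
    rng.shuffle(balanced)
    return balanced

items = list(range(100))
labels = ["majority"] * 80 + ["minority"] * 20
train, test = stratified_split(items, labels)
balanced_train = oversample(train)
print(Counter(y for _, y in balanced_train), Counter(y for _, y in test))
```

The test set keeps the original imbalance, so metrics computed on it reflect the real label distribution.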
5. Evaluation Metrics and Empirical Performance
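Standard evaluation metrics are systematically employed: per-class precision, recall, and F1-score, as well as macro-/weighted-F1 and overall accuracy, defined by
$$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2 P_c R_c}{P_c + R_c},$$
$$\text{Macro-F1} = \frac{1}{C} \sum_{c=1}^{C} F1_c, \qquad \text{Weighted-F1} = \sum_{c=1}^{C} \frac{n_c}{N} F1_c, \qquad \text{Accuracy} = \frac{1}{N} \sum_{c=1}^{C} TP_c,$$
where $C$ is the number of classes, $TP_c$, $FP_c$, and $FN_c$ are the true positives, false positives, and false negatives for class $c$, $n_c$ is the number of true instances of class $c$, and $N$ is the total number of headlines. For multi-label setups, subset accuracy and Hamming Loss complement standard metrics (Ahmad et al., 2022).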
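The metrics named above can be computed directly from lists of gold and predicted labels. The following sketch uses the standard definitions; the function names are hypothetical, and a class with no predictions or no gold instances is given a score of 0 by convention.

```python
from collections import Counter

def classification_report(y_true, y_pred):
    """Per-class precision/recall/F1 plus accuracy, macro-F1 and weighted-F1 (single-label)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = (prec, rec, f1)
    n = len(y_true)
    return {
        "per_class": per_class,
        "accuracy": sum(tp.values()) / n,
        "macro_f1": sum(f1 for _, _, f1 in per_class.values()) / len(classes),
        "weighted_f1": sum(support[c] * per_class[c][2] for c in classes) / n,
    }

def hamming_loss(Y_true, Y_pred):
    """Multi-label: fraction of label slots that differ; rows are equal-length 0/1 lists."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(a != b for rt, rp in zip(Y_true, Y_pred) for a, b in zip(rt, rp))
    return wrong / total

def subset_accuracy(Y_true, Y_pred):
    """Multi-label: fraction of examples whose full label set is predicted exactly."""
    return sum(rt == rp for rt, rp in zip(Y_true, Y_pred)) / len(Y_true)

report = classification_report(["sports", "sports", "sports", "politics"],
                               ["sports", "sports", "politics", "politics"])
print(report["accuracy"], report["macro_f1"], report["weighted_f1"])
```

On skewed label sets, macro-F1 and weighted-F1 can differ noticeably because macro-F1 gives every class equal weight regardless of its frequency.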
Benchmark results include:
| Model | Dataset/Context | Accuracy (%) | F1 (%) |
|---|---|---|---|
| GRU+FastText | Potrika, manual labels (Ahmad et al., 2022) | 91.8 | – |
| BERT-CNN-BiLSTM | BAN-ABSA, imbalanced (Raquib et al., 23 Nov 2025) | 81.4 | 81.5 |
| XLM-R-large | 6 cat news (Alam et al., 2020) | 93.4 | 93.4 |
| GAN-BanglaBERT | Clickbait detection (Mahtab et al., 2023) | 82.6 | 75.1 |
Classical baselines (Logistic Regression, SVM, KNN with Doc2Vec/TF–IDF) range from 54% to 87% accuracy depending on corpus balance and feature richness (Ahmad et al., 2022). Classical machine learning models perform poorly on short headlines unless dense semantic embeddings are supplied.
6. Advances in Zero-Shot and Affective News Classification
Zero-shot LLMs, such as Gemma-3 4B, have demonstrated effective headline classification in both topic and emotion domains (Ameen et al., 20 Oct 2025). For emotion analysis, 300,000 Bengali headlines were annotated in a single Gemma-3 4B inference run using JSON-constrained prompts across 28 emotions, revealing a strong negativity bias: over 50% of headlines are classified as anger, sadness, disappointment, or fear. The zero-shot pipeline can be redirected from emotion to headline topic by swapping the label set in the prompt or incorporating few-shot examples for calibration. Cosine similarity between pooled LLM embeddings (e.g., via SimCSE) is a viable alternative. The absence of gold-standard emotion annotations limits metric-based evaluation; distributional and temporal analyses are applied instead.
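Swapping the label set in a JSON-constrained prompt can be sketched as below. The prompt wording, the function names, and the response schema are assumptions for illustration and are not the prompt used in the cited work; the model call itself is left out, so only prompt construction and reply validation are shown. The topic labels are the eight Potrika categories.

```python
import json

TOPIC_LABELS = ["National", "Sports", "International", "Entertainment",
                "Economy", "Education", "Politics", "Science & Technology"]

def build_prompt(headline: str, labels: list[str]) -> str:
    """Build a zero-shot prompt; pass a different label list to switch tasks."""
    return (
        "Classify the following Bangla news headline.\n"
        f"Allowed labels: {json.dumps(labels, ensure_ascii=False)}\n"
        'Respond with JSON only, in the form {"label": "<one allowed label>"}.\n'
        f"Headline: {headline}"
    )

def parse_response(raw: str, labels: list[str]):
    """Return the predicted label, or None if the reply is not valid JSON with an allowed label."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    label = obj.get("label") if isinstance(obj, dict) else None
    return label if label in labels else None

prompt = build_prompt("আজ বৃষ্টি", TOPIC_LABELS)
print(parse_response('{"label": "Sports"}', TOPIC_LABELS))
```

Rejecting replies outside the allowed label set keeps malformed or hallucinated labels out of downstream distributional analyses; such headlines can be retried or counted separately.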
A plausible implication is that zero-shot and embedding-based strategies are immediately transferable: minimal adaptation (prompt engineering, adapters/LoRA fine-tuning) yields competitive headline classification models along both the emotion and classic news-topic axes, especially as LLMs trained on increasingly diverse Bengali web corpora proliferate.
7. Best Practices, Limitations, and Deployment Considerations
Empirical findings from multiple research groups converge on best practices:
- Fine-tune transformer models (preferably XLM-R-large or Bangla-BERT) at low learning rates; use batch sizes of 16–32, sequence lengths up to 64 for headlines, and dropout (0.1–0.5) to regularize.
- In low-resource and imbalanced settings, oversample minority classes using back-translation.
- Avoid data undersampling, which discards contextual samples and degrades performance.
- For short-text headlines, use pretrained embeddings (FastText or transformer-based) to counteract vocabulary sparsity. GRU and BiLSTM are computationally efficient and robust.
- For multi-task objectives (topic + sentiment/emotion), joint BERT-based fusion models outperform isolated pipelines (Raquib et al., 23 Nov 2025).
- For practical deployment, serve headline classifiers as REST APIs with JSON in/out, integrate with live RSS feeds, and use dashboards for monitoring class and sentiment drift.
- Attention to OOV handling, Unicode normalization, and class balancing remains critical for robust downstream performance (Ahmad et al., 2022, Ahmad et al., 2022).
- Semi-supervised and adversarial frameworks (GAN-BanglaBERT) approach human annotator performance on clickbait detection (75.1% vs. 76.8% F1), suggesting viability for other nuanced classification tasks given sufficient unlabeled data (Mahtab et al., 2023).
- Always stratify evaluation splits and report macro-F1 for fair comparison, especially with highly skewed label sets.
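The REST deployment pattern with JSON in and out, mentioned in the list above, can be sketched with the standard library. Everything here is illustrative: `classify` is a placeholder that returns a fixed label instead of calling a trained model, and the request schema (`{"headline": ...}`), the port, and the function names are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify(headline: str) -> dict:
    """Placeholder standing in for a trained model; always returns a fixed label."""
    return {"label": "National", "score": 0.0}

def handle_json(body: bytes):
    """Pure request logic: returns (status_code, response_bytes)."""
    try:
        payload = json.loads(body.decode("utf-8"))
        headline = payload["headline"]
        if not isinstance(headline, str) or not headline.strip():
            raise ValueError("empty headline")
    except (ValueError, KeyError, TypeError):
        error = {"error": 'expected {"headline": <non-empty string>}'}
        return 400, json.dumps(error).encode("utf-8")
    result = {"headline": headline, **classify(headline)}
    return 200, json.dumps(result, ensure_ascii=False).encode("utf-8")

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, out = handle_json(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

def make_server(port: int = 8000) -> HTTPServer:
    """Create (but do not start) a local server; call .serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), Handler)

status, out = handle_json('{"headline": "আজ বৃষ্টি"}'.encode("utf-8"))
print(status, json.loads(out))
```

Keeping the request logic in a pure function separate from the HTTP handler makes it testable without opening a socket, and the same function can sit behind a production framework later.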
A potential limitation is the dependency on large labeled corpora for fine-tuning; unsupervised or label-efficient methods (zero-shot, prompt-based, or label-propagation strategies) are increasing in relevance as newer LLMs are adapted for Bangla (Ameen et al., 20 Oct 2025).
References
- (Raquib et al., 23 Nov 2025) "A Unified BERT-CNN-BiLSTM Framework for Simultaneous Headline Classification and Sentiment Analysis of Bangla News"
- (Alam et al., 2020) "Bangla Text Classification using Transformers"
- (Ahmad et al., 2022) "Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes"
- (Ahmad et al., 2022) "Machine and Deep Learning Methods with Manual and Automatic Labelling for News Classification in Bangla Language"
- (Mahtab et al., 2023) "BanglaBait: Semi-Supervised Adversarial Approach for Clickbait Detection on Bangla Clickbait Dataset"
- (Ameen et al., 20 Oct 2025) "How News Feels: Understanding Affective Bias in Multilingual Headlines for Human-Centered Media Design"