
BanglishBERT: Code-Mixed NLP Transformer

Updated 11 December 2025
  • BanglishBERT is a Transformer-based model optimized for Bangla-English code-mixed texts, leveraging the ELECTRA framework for replaced token detection.
  • It features a unified tokenization and normalization pipeline that processes Bangla, English, and Romanized Bangla to achieve state-of-the-art sentiment classification.
  • The model integrates multimodal fusion with visual features from ResNet50, demonstrating competitive performance in both text-only and image-text sentiment tasks.

BanglishBERT is a Transformer-based language model designed for NLP tasks involving Bangla-English code-mixed text and Romanized Bangla (termed “Banglish”). Under the ELECTRA training paradigm, BanglishBERT functions as a discriminator for Replaced Token Detection (RTD), learning to distinguish original input tokens from generator-produced corruptions. Its architecture, pretraining regimen, tokenization, and fine-tuning procedures have yielded state-of-the-art results for sentiment analysis and multimodal classification in low-resource Bangla and code-mixed domains, as evidenced by deployments on e-commerce reviews and meme sentiment tasks (Elahi et al., 2023, Shamael et al., 17 Dec 2024).

1. Model Architecture and Training

BanglishBERT is based on the BERT family of Transformer encoder models, adopting the ELECTRA framework for pretraining. Its specification typically follows:

  • Transformer configuration: 12 encoder layers, hidden size of 768, 12 attention heads, feed-forward dimension of 3072. The parameter count is approximately 100 million for the base model.
  • Training paradigm: ELECTRA-style pretraining leverages a small generator network to corrupt tokens, with BanglishBERT acting as the discriminator and learning to classify each token as either “real” (original) or “fake” (replaced).
  • RTD loss function:

$$
\mathcal{L}_{\mathrm{RTD}} = - \sum_{i=1}^{n}\Big[ \mathbb{1}\{x_i = \tilde x_i\}\,\log P_{\theta}(d_i = \text{real}\mid \mathbf{h}_i) + \mathbb{1}\{x_i \neq \tilde x_i\}\,\log P_{\theta}(d_i = \text{fake}\mid \mathbf{h}_i) \Big]
$$

Masked Language Modeling (MLM) and Next Sentence Prediction objectives are not used during discriminator pretraining (Elahi et al., 2023, Shamael et al., 17 Dec 2024).
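
A minimal sketch of this objective using Hugging Face's ElectraForPreTraining, which computes a token-level real/fake binary cross-entropy of this form. The configuration mirrors the base specification above; the corrupted inputs and real/fake labels are placeholder tensors standing in for generator output.

```python
import torch
from transformers import ElectraConfig, ElectraForPreTraining

# Base configuration mirroring the specification above (vocab size ~50k is assumed).
config = ElectraConfig(
    vocab_size=50000,
    embedding_size=768,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
discriminator = ElectraForPreTraining(config)

# Toy corrupted batch: `input_ids` plays the role of the generator output x~,
# `labels` marks each token as 0 = real (x_i == x~_i) or 1 = fake (x_i != x~_i).
input_ids = torch.randint(0, config.vocab_size, (2, 16))
labels = torch.randint(0, 2, (2, 16))

# The model applies a sigmoid head over each h_i and returns the token-level
# binary cross-entropy, i.e. the RTD loss written above.
outputs = discriminator(input_ids=input_ids, labels=labels)
print(outputs.loss)          # L_RTD
print(outputs.logits.shape)  # (batch, seq_len) per-token real/fake logits
```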

2. Data Sources and Preprocessing

Pretraining data for BanglishBERT consists of extensive Bangla and English corpora, as well as code-mixed and transliterated text (“Banglish”), though precise corpus size and domains are not always reported. For the BanglishRev project, 1.747 million e-commerce text reviews with diverse linguistic scripts and code-mixing proportions were used (Shamael et al., 17 Dec 2024).

Tokenization and normalization pipeline:

  • Unified WordPiece tokenizer with a typical vocabulary of ~50,000 subwords and maximum input length of 256 tokens.
  • For each review, preprocessing steps include:
    • Removal of emojis and punctuation,
    • Script recognition (Bangla, English, Banglish) via regex and NLTK,
    • Conversion of Romanized Bangla to phonetic Bangla via the Avro remapper,
    • Whitespace normalization.
  • This pipeline ensures mixed-script and “Banglish” tokens are mapped to a robust subword vocabulary (Shamael et al., 17 Dec 2024).
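
A compact sketch of such a pipeline. The regular expressions, the script-detection heuristic, and the romanized_to_bangla placeholder (standing in for the Avro remapper) are illustrative assumptions, not the exact implementation of the cited work.

```python
import re

BANGLA_RE = re.compile(r"[\u0980-\u09FF]")        # Bangla Unicode block
LATIN_RE = re.compile(r"[A-Za-z]")
PUNCT_EMOJI_RE = re.compile(r"[^\w\s\u0980-\u09FF]")  # anything not word char, space, or Bangla

def detect_script(text: str) -> str:
    """Rough script tag; a real system would also separate English from Romanized Bangla,
    e.g. with an English wordlist (NLTK) as described above."""
    has_bangla = bool(BANGLA_RE.search(text))
    has_latin = bool(LATIN_RE.search(text))
    if has_bangla and not has_latin:
        return "bangla"
    if has_latin and not has_bangla:
        return "english_or_banglish"
    return "banglish"

def romanized_to_bangla(text: str) -> str:
    """Placeholder for the Avro phonetic remapper; returns the input unchanged here."""
    return text

def preprocess(review: str) -> str:
    text = PUNCT_EMOJI_RE.sub(" ", review)         # drop emojis and punctuation
    if detect_script(text) != "bangla":
        text = romanized_to_bangla(text)           # map Romanized Bangla to Bangla script
    return re.sub(r"\s+", " ", text).strip()       # whitespace normalization

print(preprocess("Product ta khub bhalo! 😍"))
```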

3. Fine-Tuning Procedures

Sentiment classification is the main downstream task. Inputs consist of tokenized review text or meme captions, bounded by [CLS] and [SEP]. The final [CLS] hidden state is passed through a linear classifier:

  • Sentiment classification:
    • For multiclass (e.g. memes): a 3-way softmax layer, cross-entropy loss.
    • For binary sentiment (e-commerce): sigmoid/softmax for two labels, cross-entropy loss.
  • Example fine-tuning hyperparameters:
    • Batch size: 32 (meme sentiment), 128 (BanglishRev reviews),
    • Optimizer: Adam/AdamW, learning rate typically $1 \times 10^{-5}$ to $5 \times 10^{-5}$,
    • Epochs: 3–5 (with slight overfitting observed beyond 2–3 epochs for large, noisy datasets),
    • Hardware for scale: NVIDIA A100 GPU, 40 GB VRAM (Elahi et al., 2023, Shamael et al., 17 Dec 2024).
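
A minimal fine-tuning sketch with the Hugging Face transformers API, using the hyperparameter ranges above. The checkpoint name csebuetnlp/banglishbert and the toy examples are assumptions, and real training would iterate over batched DataLoader epochs.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "csebuetnlp/banglishbert"   # assumed hub checkpoint; substitute as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

optimizer = AdamW(model.parameters(), lr=2e-5)   # within the 1e-5 to 5e-5 range above

texts = ["daam onujayi product ta khubi bhalo", "delivery was late and the item was broken"]
labels = torch.tensor([1, 0])                     # 1 = positive, 0 = negative (toy labels)

# [CLS] ... [SEP] framing, truncation to the 256-token maximum input length
batch = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")

model.train()
outputs = model(**batch, labels=labels)           # cross-entropy over the [CLS] classifier
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```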

4. Multimodal Fusion and Classification

BanglishBERT has been integrated for multimodal classification, particularly in meme sentiment analysis:

  • Architecture: Textual features ($\mathbf{h}_{\text{CLS}}$) from BanglishBERT are paired with visual features ($\mathbf{h}_{\text{image}}$) from ResNet50 (pretrained on ImageNet).
  • Fusion pipeline:
    • Each modality is projected via a 1-layer MLP (20 units, ReLU),
    • The projected embeddings are concatenated into a 40-dimensional input to the final classifier,
    • Cross-entropy loss for final predictions (Elahi et al., 2023).
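
A sketch of this fusion head, assuming the 768-dimensional [CLS] vector and a 2048-dimensional ResNet50 pooled feature are already extracted. The projection sizes follow the 20-unit/40-dimensional description above; the image feature dimension and the lack of dropout are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Project each modality to 20 units, concatenate to 40-d, then classify (3-way)."""

    def __init__(self, text_dim: int = 768, image_dim: int = 2048, num_classes: int = 3):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, 20), nn.ReLU())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, 20), nn.ReLU())
        self.classifier = nn.Linear(40, num_classes)

    def forward(self, h_cls: torch.Tensor, h_image: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.text_proj(h_cls), self.image_proj(h_image)], dim=-1)
        return self.classifier(fused)   # logits; train with nn.CrossEntropyLoss

# Toy forward pass with placeholder features
head = LateFusionClassifier()
logits = head(torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)   # (4, 3)
```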

Performance metrics on held-out meme sentiment test set:

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| BanglishBERT (text) | 0.66 | 0.66 | 0.66 | 0.66 |
| ResNet50 (image) | 0.72 | 0.67 | 0.72 | 0.69 |
| BanglishBERT + ResNet50 | 0.74 | 0.69 | 0.74 | 0.71 |

Neutral sentiment remains challenging: the multimodal and image-only models failed to classify any neutral examples, while text-only BanglishBERT correctly labeled only 6 of 58 (Elahi et al., 2023).

5. Large-Scale Sentiment Classification Using BanglishRev

BanglishBERT, fine-tuned on BanglishRev’s 1.747 million reviews, achieved state-of-the-art results for binary sentiment analysis:

  • Labeling: Ratings ≥4 classified as positive, ≤3 as negative.
  • Evaluation: Tested against an out-of-domain, manually annotated dataset (78,000 reviews; Rashid et al. 2024).
  • Performance:
    • Direct fine-tuning: 93% accuracy, F1=0.93 (negative-class F1=0.78).
    • BanglishRev pretraining: Up to 95% accuracy, F1=0.94 (negative-class F1=0.79).
    • Robustness to review length and code-mixing across scripts (Shamael et al., 17 Dec 2024).
  • Performance peaked after one epoch; additional epochs led to slight overfitting.

This suggests the RTD paradigm and large-scale noisy supervision confer strong generalization to out-of-domain reviews and diverse code-mixed compositions.
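
The rating-to-label rule above amounts to a one-line binarization. A minimal sketch, assuming a BanglishRev-style table with rating and review_text columns (the column names are assumptions):

```python
import pandas as pd

# Assumed schema: a `rating` column (1-5 stars) and a `review_text` column.
reviews = pd.DataFrame({
    "review_text": ["khub bhalo product", "pocha jinish, taka nosto"],
    "rating": [5, 2],
})

# Binarization rule from above: rating >= 4 -> positive (1), rating <= 3 -> negative (0)
reviews["label"] = (reviews["rating"] >= 4).astype(int)
print(reviews[["review_text", "label"]])
```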

6. Comparison to Multilingual and Distilled Baselines

BanglishBERT has been benchmarked against mBERT, XLM-R, and smaller distilled models such as Tri-Distil-BERT and Mixed-Distil-BERT (Raihan et al., 2023).

Table: Weighted F1-score on synthetic Banglish test corpus

| Task | mBERT | XLM-R | BanglishBERT | Tri-Distil-BERT | Mixed-Distil-BERT |
|---|---|---|---|---|---|
| Emotion detection | 0.49 | 0.51 | 0.47 | 0.48 | 0.50 |
| Sentiment analysis | 0.74 | 0.77 | 0.72 | 0.69 | 0.70 |
| Offensive detection | 0.88 | 0.88 | 0.86 | 0.86 | 0.87 |

Mixed-Distil-BERT, with only 6 layers, closes more than 90% of the performance gap to XLM-R, outperforming BanglishBERT on specific emotion and offensive-language tasks while being substantially more compact. BanglishBERT exhibits competitive sentiment analysis accuracy despite its larger architecture. A plausible implication is that synthetic code-mixed pretraining benefits smaller models most, though larger ELECTRA-based models maintain an edge on authentic human-written code-mixed data.
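
The weighted F1 values in the table can be computed from raw predictions with scikit-learn, where per-class F1 scores are averaged with class-support weights; the label arrays below are placeholders.

```python
from sklearn.metrics import classification_report, f1_score

# Placeholder gold labels and predictions for a 3-class task (e.g. emotion detection)
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0, 2, 0, 0]

# Weighted F1: per-class F1 averaged with support weights, as reported in the table above
print(f1_score(y_true, y_pred, average="weighted"))
print(classification_report(y_true, y_pred, digits=2))
```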

7. Explainability and Analysis Techniques

Explainable AI methods such as LIME have been applied to interpret BanglishBERT decisions:

  • LIME visualizations highlight evidence supporting or opposing classification outputs for both image and text models.
    • Image: green/red segmentation patches,
    • Text: orange, green, blue token spans for positive, negative, neutral sentiment.
  • BanglishBERT accurately tags sentiment-bearing Bangla and English tokens.
  • In case studies, the visual model often focuses on faces, neglecting textual sentiment cues, motivating the multimodal approach (Elahi et al., 2023).
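
A hedged sketch of producing a LIME text explanation for a BanglishBERT-based sentiment classifier. The checkpoint name, class ordering, and example input are assumptions, and the classification head would need to be fine-tuned before the token weights are meaningful.

```python
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "csebuetnlp/banglishbert"   # assumed checkpoint; use a fine-tuned classifier in practice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
model.eval()

def predict_proba(texts):
    """LIME expects a function mapping a list of raw strings to an (n, num_classes) array."""
    batch = tokenizer(list(texts), padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["negative", "neutral", "positive"])
explanation = explainer.explain_instance(
    "product ta bhalo but delivery khub slow",   # placeholder code-mixed input
    predict_proba, num_features=6, labels=(2,),
)
print(explanation.as_list(label=2))   # (token, weight) pairs for the "positive" class
```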

8. Limitations and Future Directions

Key limitations observed include:

  • Class imbalance (majority-positive data in BanglishRev) and difficulty with short reviews.
  • Inferior neutral-class classification in meme tasks, even under multimodal fusion.
  • Tokenization: robustness to script and spelling variation relies entirely on the normalization and subword pipeline rather than a dedicated code-mixed tokenizer.
  • Overfitting: Increasing epochs on large, noisy data can degrade generalization.

Prospective avenues include spam detection via time-stamped meta-data, leveraging image content in reviews, and pivoting to fine-grained emotion labeling as annotated corpora develop (Shamael et al., 17 Dec 2024).

BanglishBERT represents a scalable ELECTRA-based solution for sentiment and emotion analysis across code-mixed Bangla-English domains, performing competitively with or better than baseline models in both unimodal and multimodal scenarios, and providing rich interpretability for practical deployment.
