BanglishBERT: Deep Code-Mixed Language Models

Updated 7 December 2025
  • BanglishBERT is a deep Transformer-based language model designed to process and understand Bangla-English code-mixed text through multilingual pretraining and subword tokenization.
  • It leverages both BERT and ELECTRA architectures, employing normalization and transliteration techniques to handle native-script and romanized forms effectively.
  • The model achieves state-of-the-art sentiment analysis on social media and e-commerce benchmarks, with multimodal fusion further improving classification performance.

BanglishBERT is a family of deep Transformer-based language models designed for the processing and understanding of Bangla–English code-mixed (“Banglish”) text, covering both native-script and romanized forms. Models in this lineage are chiefly based on the BERT and ELECTRA frameworks, incorporating multilingual corpora, subword tokenization, and specialized pretraining schemes to robustly encode code-mixed text for downstream natural language processing tasks. BanglishBERT has demonstrated state-of-the-art performance for sentiment analysis in complex social media and e-commerce scenarios, especially those dominated by idiosyncratic code-switching and diverse user-generated orthographies (Elahi et al., 2023, Shamael et al., 17 Dec 2024, Raihan et al., 2023).

1. Model Architecture and Pretraining

BanglishBERT models adopt Transformer encoder backbones, predominantly following the “base” configuration of 12 layers, 768-dimensional hidden states, and 12 attention heads, as reported for both BERT-Base and ELECTRA-Discriminator variants used in foundational work (Elahi et al., 2023, Shamael et al., 17 Dec 2024). The tokenization is performed via a WordPiece algorithm with a vocabulary size of approximately 30,000–50,000 subwords, trained on mixed Bangla and English corpora. This facilitates granular representation of diverse script and phonetically romanized forms prevalent in Banglish corpora.
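
The following minimal sketch illustrates how such a subword tokenizer splits code-mixed input; the Hugging Face checkpoint identifier and the example sentence are illustrative assumptions, not the authors' released artifacts.

```python
# Minimal sketch: subword tokenization of code-mixed ("Banglish") text.
# The checkpoint identifier below is an assumption for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglishbert")

# A romanized, code-mixed review-style sentence (illustrative).
text = "product ta khub bhalo, delivery was fast"

encoded = tokenizer(text, truncation=True, max_length=256)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Romanized Bangla words absent from the vocabulary are typically split
# into multiple WordPiece subwords.
```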

The pretraining objective for the ELECTRA-based versions is Replaced Token Detection (RTD):

$$\mathcal{L}_{\mathrm{RTD}} = -\sum_{i=1}^{N} \log P\bigl(y_i \mid \tilde{x}\bigr)$$

where $\tilde{x}$ denotes the input with token-level replacements generated by a small generator network, and the discriminator (“BanglishBERT”) classifies each token as original or replaced.
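
A minimal PyTorch sketch of the RTD objective follows, assuming per-token discriminator logits and a binary mask marking generator replacements; it illustrates the loss form above rather than the authors' implementation.

```python
# Sketch of the Replaced Token Detection (RTD) objective: the discriminator
# scores each token as original (0) or replaced (1).
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, replaced_mask, attention_mask):
    """disc_logits: (batch, seq_len) per-token logits from the discriminator.
    replaced_mask: (batch, seq_len) 1.0 where the generator replaced the token.
    attention_mask: (batch, seq_len) 1.0 for real tokens, 0.0 for padding."""
    per_token = F.binary_cross_entropy_with_logits(
        disc_logits, replaced_mask, reduction="none"
    )
    # Average only over non-padding positions.
    return (per_token * attention_mask).sum() / attention_mask.sum()

# Example with random tensors:
logits = torch.randn(2, 8)
labels = torch.randint(0, 2, (2, 8)).float()
mask = torch.ones(2, 8)
print(rtd_loss(logits, labels, mask))
```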

For distilled variants, as explored in "Mixed-Distil-BERT" (Raihan et al., 2023), the loss incorporates masked language modeling (MLM), layer-wise distillation, and hidden-state cosine alignment:

$$L_{\mathrm{total}} = \alpha \cdot L_{\mathrm{MLM}} + \beta \cdot L_{\mathrm{distil}} + \gamma \cdot L_{\mathrm{cos}}$$

No architectural modifications (such as code-switch gates) are introduced; handling of code-mixed text relies on normalization and pretrained weights on mixed-script corpora.
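
The combined distillation objective can be sketched as below; the loss weights, temperature, and tensor shapes are illustrative assumptions, and a KL-based soft-target term stands in for the layer-wise distillation component.

```python
# Sketch of a combined distillation objective of the form
# L_total = alpha*L_MLM + beta*L_distil + gamma*L_cos; weights are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, mlm_loss,
                      student_hidden, teacher_hidden,
                      alpha=0.5, beta=0.3, gamma=0.2, temperature=2.0):
    # Soft-target KL between teacher and student vocabulary distributions.
    l_distil = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Cosine alignment of hidden states (target = 1 means "same direction").
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(s.size(0), device=s.device)
    l_cos = F.cosine_embedding_loss(s, t, target)
    return alpha * mlm_loss + beta * l_distil + gamma * l_cos

# Illustrative call with random tensors (batch=2, seq=8, vocab=100, hidden=768):
s_logits, t_logits = torch.randn(2, 8, 100), torch.randn(2, 8, 100)
s_hid, t_hid = torch.randn(2, 8, 768), torch.randn(2, 8, 768)
print(distillation_loss(s_logits, t_logits, torch.tensor(2.3), s_hid, t_hid))
```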

2. Data Preprocessing and Corpus Construction

Data ingestion pipelines employ Unicode normalization to standardize variant punctuation and repeated-character forms, following approaches such as the Hasan et al. (EMNLP 2020) script normalizer (Elahi et al., 2023). Language identification is routinely performed using regular expression matching (for native Bangla characters) in conjunction with dictionary-based heuristics (as in the BanglishRev pipeline). Tokens recognized as neither Bangla nor English are classified as Banglish and mapped into phonetic Bangla using transliteration systems (e.g., Avro) (Shamael et al., 17 Dec 2024).
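
A simplified sketch of this token-level language-identification heuristic follows; the dictionary contents and the handling of the "banglish" bucket are assumptions, not the exact BanglishRev pipeline.

```python
# Illustrative token-level language ID: regex for native Bangla script plus a
# dictionary lookup; unmatched tokens are routed to transliteration.
import re

BANGLA_CHARS = re.compile(r"[\u0980-\u09FF]")           # Bengali Unicode block
ENGLISH_WORDS = {"good", "bad", "delivery", "product"}  # stand-in dictionary

def classify_token(token: str) -> str:
    if BANGLA_CHARS.search(token):
        return "bangla"            # native-script Bangla
    if token.lower() in ENGLISH_WORDS:
        return "english"
    return "banglish"              # romanized Bangla -> transliterate (e.g., Avro)

print([classify_token(t) for t in "product ta khub bhalo".split()])
# ['english', 'banglish', 'banglish', 'banglish']
```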

BanglishBERT leverages large-scale mixed-text corpora: for example, the BanglishRev experiment is trained on 1.74 million product reviews, with explicit language distribution statistics provided for Bangla (25%), English (31%), code-mixed native script (5.9%), and romanized Banglish (37.7%) (Shamael et al., 17 Dec 2024). Inputs are truncated or zero-padded to a maximum of 256 tokens, and no code-mix-specific special tokens are introduced beyond standard subword boundaries.

3. Fine-tuning Strategies and Task Configurations

Downstream utilization of BanglishBERT predominantly targets text classification tasks, including sentiment and emotion analysis. Fine-tuning typically adds a single-layer (linear) classification head atop the [CLS] token embedding, with choices of loss functions and metrics determined by task formulation.

  • Binary sentiment (BanglishRev): Reviews with rating $> 3$ are labeled positive ($y = 1$), and those with rating $\leq 3$ as negative ($y = 0$). The model is trained with binary cross-entropy:

$$\mathcal{L}_{\mathrm{CE}} = -\bigl[\, y \log p + (1 - y)\log(1 - p) \,\bigr]$$

  • Multiclass sentiment (MemoSen): For the three-class task (positive/negative/neutral), standard cross-entropy is used:

$$\mathcal{L}_{\mathrm{ce}} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{3} y_{i,c} \log p_{i,c}$$

Hyperparameters vary by experiment; representative configurations include Adam or AdamW optimizers with learning rates between $1\times10^{-5}$ and $5\times10^{-5}$, batch sizes up to 128, and 3 fine-tuning epochs (as in BanglishRev). Regularization (dropout of 0.1) is inherited from the base Transformer architecture. No explicit data augmentation or class-weighted losses are reported for meme or review sentiment tasks (Elahi et al., 2023, Shamael et al., 17 Dec 2024).
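
A minimal fine-tuning sketch consistent with these reported settings (AdamW, a learning rate in the $1\times10^{-5}$–$5\times10^{-5}$ range, 3 epochs, a linear head over [CLS]); the checkpoint identifier and the toy dataset are assumptions.

```python
# Illustrative fine-tuning loop for binary review sentiment (not the authors' code).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglishbert", num_labels=2   # 2 for binary sentiment, 3 for MemoSen
)
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglishbert")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Hypothetical (text, label) pairs: label 1 if the review rating is > 3, else 0.
train_examples = [("product ta khub bhalo", 1), ("delivery onek late chilo", 0)]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), truncation=True, max_length=256,
                    padding=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_examples, batch_size=32, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss      # cross-entropy from the classification head
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```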

4. Multimodal and Code-mixed Modeling

BanglishBERT has been integrated into multimodal architectures for meme sentiment analysis (Elahi et al., 2023), where textual and image representations are learned jointly. Text embeddings from BanglishBERT and image features (e.g., from ResNet50) are projected into intermediate representations ($h_t$, $h_i$), concatenated, and followed by a joint classification layer:

$$h_t = W_t e_t + b_t, \quad h_i = W_i e_i + b_i$$

$$h = [h_t; h_i], \quad \mathrm{logits} = W_o h + b_o, \quad p = \mathrm{softmax}(\mathrm{logits})$$

End-to-end joint fine-tuning is performed across both pathways, and ablation demonstrates additive F1 improvement for fused models.
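
A compact sketch of this late-fusion head, assuming 768-dimensional BanglishBERT embeddings, 2048-dimensional ResNet50 features, and an illustrative 512-dimensional projection size.

```python
# Sketch of the projection-concatenation-classification fusion head.
import torch
import torch.nn as nn

class TextImageFusion(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, proj_dim=512, num_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)    # h_t = W_t e_t + b_t
        self.image_proj = nn.Linear(image_dim, proj_dim)  # h_i = W_i e_i + b_i
        self.classifier = nn.Linear(2 * proj_dim, num_classes)  # logits = W_o h + b_o

    def forward(self, text_emb, image_emb):
        h = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1)
        return self.classifier(h)   # softmax is applied inside the loss / at inference

fusion = TextImageFusion()
logits = fusion(torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3])
```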

Code-mixed modeling approaches, as surveyed in "Mixed-Distil-BERT" (Raihan et al., 2023), emphasize the importance of synthetic and real code-mixed data, as well as tailored vocabularies to capture transliteration variants. Comparative evaluations on sentiment and emotion classification benchmarks (SAIL_BD, OLID, SMED) indicate that models pretrained with genuine code-mixed corpora systematically outperform monolingual and multilingual models not exposed to such material, consistent with findings in BanglishBERT evaluations.

5. Performance Benchmarks and Ablations

Sentiment Analysis (BanglishRev):

  • Trained and evaluated against a gold-labeled set of 78,130 reviews (Bangla, English, code-mixed), BanglishBERT achieves:
    • Overall accuracy: 0.94
    • Weighted F1: 0.94
    • Positive class F1: 0.97
    • Negative class F1: 0.78

Most errors are concentrated among negative-class examples (lower recall), attributable to imbalance (78% of reviews are 5-star) and label noise from threshold-based sentiment binarization (Shamael et al., 17 Dec 2024).

Meme Sentiment (MemoSen):

  • For code-mixed meme caption sentiment, BanglishBERT attains:
    • Text-only (BanglishBERT): Accuracy = 0.66, Weighted F1 = 0.66
    • Image-only (ResNet50): Accuracy = 0.72, Weighted F1 = 0.69
    • Multimodal (BanglishBERT + ResNet50): Accuracy = 0.74, Weighted F1 = 0.71

Ablation confirms a +0.05 absolute weighted-F1 gain from text–image fusion over the text-only model (Elahi et al., 2023).

Code-mixed Benchmarks (from Mixed-Distil-BERT):

  • Mixed-Distil-BERT outperforms DistilBERT across all three tasks and the earlier BanglishBERT on emotion and offensive-language weighted F1, while remaining competitive with the larger mBERT and XLM-R models on the three-language synthetic emotion, sentiment, and offensive benchmarks (Raihan et al., 2023); per-task scores are listed in the table below.
Model                  Emotion F1    Sentiment F1    Offensive F1
DistilBERT             0.40          0.66            0.80
mBERT                  0.49          0.74            0.88
XLM-R                  0.51          0.77            0.88
BanglishBERT (prev)    0.47          0.72            0.86
Tri-Distil-BERT        0.48          0.69            0.86
Mixed-Distil-BERT      0.50          0.70            0.87

6. Explainability and Analysis

Explainable AI (XAI) methods such as LIME have been employed to interpret BanglishBERT outputs. For meme sentiment, token-level attributions reveal that code-mixed tokens (including loanwords or Banglish morphs like “business,” “baka”) dominate sentiment cues, whereas image-only models focus on faces and miss caption signal (Elahi et al., 2023). LIME analysis also demonstrates that neutral class labels are rarely predicted correctly; neutral captions share substantial lexical overlap with polarized classes, and are underrepresented in available datasets.
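
A hedged sketch of how such LIME attributions can be produced for a fine-tuned BanglishBERT sentiment classifier; the checkpoint identifier, class names, and example caption are placeholders, and a model already fine-tuned for the three-class task is assumed.

```python
# Illustrative LIME token attribution for a sentiment classifier (requires `lime`).
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglishbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglishbert", num_labels=3)
model.eval()

def predict_proba(texts):
    enc = tokenizer(list(texts), truncation=True, max_length=256,
                    padding=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    return probs.numpy()

explainer = LimeTextExplainer(class_names=["negative", "neutral", "positive"])
exp = explainer.explain_instance("business ta baka hoye gelo",
                                 predict_proba, num_features=6)
print(exp.as_list())   # (token, weight) attributions
```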

In e-commerce reviews, analysis of misclassification patterns indicates that misspelling and inconsistent romanization introduce errors in detecting Banglish, highlighting the importance of robust text normalization and tokenization (Shamael et al., 17 Dec 2024).

7. Challenges, Limitations, and Future Directions

The primary challenges in BanglishBERT modeling are:

  • Orthographic and transliteration variability: Out-of-vocabulary handling for diverse spelling variants remains a bottleneck. The potential for vocabulary augmentation with transliteration-specific subwords is highlighted in (Raihan et al., 2023).
  • Class imbalance and label noise: Unbalanced datasets (as in BanglishRev and MemoSen) reduce recall for minority classes, particularly negatives and neutrals.
  • Evaluation on broader tasks: While sentiment analysis is well studied, directions such as aspect extraction, recommendation, and adversarial or spam detection are only beginning to be explored with code-mixed models (Shamael et al., 17 Dec 2024).
  • Multimodal extension: Integration of textual and visual inputs shows promising gains, motivating further research into joint representations, particularly for user-generated social and e-commerce content (Elahi et al., 2023).

A plausible implication is that expanding pretraining with in-domain code-mixed corpora—alongside auxiliary objectives such as language-tag prediction and contrastive alignment of monolingual vs. code-mixed encodings—could further enhance BanglishBERT’s utility and generalizability (Raihan et al., 2023).
