MahaBERT: Marathi NLP Transformer
- MahaBERT is a BERT-style transformer model tailored for Marathi, a morphologically rich and low-resource language.
- It leverages aggressive layer pruning and specialized pretraining on large Marathi corpora to enhance performance in tasks like text classification, sentiment analysis, and NER.
- Empirical results show that both full and pruned variants outperform multilingual baselines, establishing MahaBERT as a leading solution in efficient low-resource NLP.
MahaBERT is a family of BERT-style pretrained transformer models tailored for Marathi, a morphologically rich, low-resource Indo-Aryan language. Designed by the L3Cube Pune research group, MahaBERT and its derivatives represent the foundation for high-accuracy Marathi NLP applications, offering strong empirical gains over multilingual BERT models for text classification, sentiment analysis, named entity recognition (NER), sentence embeddings, and hate speech detection. The MahaBERT line, including MahaBERT-v2, MahaBERT-Small, and MahaBERT-Smaller, is central to research on efficient model adaptation via aggressive layer pruning in resource-constrained settings (Shirke et al., 1 Jan 2025), and has catalyzed the emergence of Marathi-specific NLP datasets, pipelines, and evaluation standards (Joshi, 2022, Patil et al., 2022, Mittal et al., 2024, Pingle et al., 2023).
1. Architecture and Pretraining Regimen
The MahaBERT architecture strictly follows the BERT-Base configuration:
- 12 Transformer encoder layers ($L = 12$)
- Hidden state dimensionality of 768
- 12 self-attention heads per layer
- Intermediate feed-forward size = 3072
- Total parameter count: ~110 million
- Tokenization: WordPiece vocabulary of approximately 30,000 subwords, cased, constructed from large-scale monolingual Marathi corpora
- Embeddings: token, segment, position (all 768-dim); input = sum of their respective vectors per token
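The input representation described above can be sketched in a few lines. This is a minimal illustration, not MahaBERT's actual weights: the three embedding tables are randomly initialized stand-ins, and the token/segment ids are arbitrary.

```python
import numpy as np

# BERT-style input representation: each token's vector is the sum of its
# token, segment, and position embeddings, all 768-dimensional.
HIDDEN, VOCAB, MAX_POS = 768, 30000, 512
rng = np.random.default_rng(0)
token_table = rng.normal(size=(VOCAB, HIDDEN))
segment_table = rng.normal(size=(2, HIDDEN))
position_table = rng.normal(size=(MAX_POS, HIDDEN))

def input_embeddings(token_ids, segment_ids):
    """Return one 768-dim vector per token: token + segment + position."""
    positions = np.arange(len(token_ids))
    return (token_table[np.asarray(token_ids)]
            + segment_table[np.asarray(segment_ids)]
            + position_table[positions])

emb = input_embeddings([101, 2054, 102], [0, 0, 0])  # shape (3, 768)
```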
Pretraining is exclusively Marathi: corpus sources include the L3Cube-MahaCorpus (scraped web/news/documents, 212M news + 76.4M non-news tokens), Wikipedia archives, OSCAR/CC-100, Netshika, and the IIT Bombay parallel Marathi corpus, for an aggregate of 752 million tokens (Joshi, 2022). The primary pretraining objective is Masked Language Modeling (MLM); Next Sentence Prediction (NSP) is used in some variants (notably MahaBERT-v2 and "MarathiBERT V2" (Chavan et al., 2022)), but omitted in earlier and RoBERTa-style variants (Joshi, 2022, Patil et al., 2022).
The MLM training schedule follows standard BERT practice: AdamW optimizer (weight decay $0.01$) with a standard learning-rate schedule, batch size 64–256, maximum sequence length 512, and typically up to 2 epochs over the full corpus (Joshi, 2022). Downstream tasks adopt application-specific heads (a single-layer softmax classifier for classification, a token-classification head for NER), trained with cross-entropy loss.
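The MLM objective corrupts the input before prediction. The sketch below implements the standard BERT corruption scheme (select ~15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% unchanged); the `MASK_ID` value and token ids are illustrative, not MahaBERT's actual vocabulary.

```python
import random

MASK_ID = 103        # hypothetical [MASK] token id
VOCAB_SIZE = 30000

def mlm_mask(token_ids, mask_prob=0.15, seed=0):
    """Standard BERT MLM corruption: returns (corrupted inputs, labels)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)   # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                      # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)    # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```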
| Model | Layers | Hidden | Heads | Parameters | Corpus |
|---|---|---|---|---|---|
| MahaBERT | 12 | 768 | 12 | ~110M | ~750M tokens |
| MahaBERT-v2 | 12 | 768 | 12 | ~110M | ≥1B tokens |
| MahaBERT-Small | 6 | 768 | 12 | ~55M | scratch-trained |
| MahaBERT-Smaller | 2 | 768 | 12 | ~18M | scratch-trained |
2. Layer Pruning and Model Compression
MahaBERT models are a principal testbed for structured layer pruning, targeting practical deployment for low-resource languages. The formalism: for an $L$-layer model, prune $k$ layers to produce an $(L-k)$-layer model. Three positional strategies are used:
- Top-pruning: remove the final $k$ layers (layers $L-k+1$ to $L$)
- Bottom-pruning: remove the initial $k$ layers (layers 1 to $k$)
- Middle-pruning: remove a contiguous block of $k$ layers centered within the stack
Experiments typically prune $k=6$ or $k=10$ layers from MahaBERT-v2 ($L=12$), yielding 6-layer and 2-layer models. This approach yields significant resource savings: $k=6$ reduces model size and compute by 50%, $k=10$ by 83% (Shirke et al., 1 Jan 2025, Shelke et al., 2024). Inference speedup is empirically measured at 1.8× for the 6-layer model, with a correspondingly larger speedup for the 2-layer model and similar proportional reductions in FLOPs.
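The three positional strategies reduce to list slicing over the encoder stack. A minimal sketch, with layers represented by their indices rather than actual transformer modules:

```python
def prune_layers(layers, k, strategy):
    """Drop k of the L encoder layers by position; return the kept layers."""
    L = len(layers)
    if strategy == "top":        # remove the final k layers
        return layers[:L - k]
    if strategy == "bottom":     # remove the initial k layers
        return layers[k:]
    if strategy == "middle":     # remove a contiguous centered block
        start = (L - k) // 2
        return layers[:start] + layers[start + k:]
    raise ValueError(f"unknown strategy: {strategy}")

stack = list(range(12))                   # stand-in for a 12-layer model
top6 = prune_layers(stack, 6, "top")      # [0, 1, 2, 3, 4, 5]
bottom10 = prune_layers(stack, 10, "bottom")  # [10, 11]
middle6 = prune_layers(stack, 6, "middle")    # [0, 1, 2, 9, 10, 11]
```

The kept layers are then fine-tuned directly on the downstream task, so the pruned model inherits its pretrained weights rather than training from scratch.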
Performance remains competitive post-pruning: pruned MahaBERT-v2 models (6/2 layers) maintain accuracy and embedding quality close to the full 12-layer baseline, and consistently outperform scratch-trained MahaBERT-Small/Smaller at identical size (Shirke et al., 1 Jan 2025, Shelke et al., 2024). Middle-pruning, while effective, is generally matched or outperformed by top-pruning for sentence encoding (Shelke et al., 2024), but optimal strategy is task-dependent.
3. Downstream Tasks and Empirical Results
MahaBERT and its pruned variants are evaluated on a wide array of Marathi NLP tasks. Key benchmarks include:
- Topic Classification: L3Cube-MahaNews (Mittal et al., 2024), SHC (short headlines), LPC (paragraphs), LDC (long documents)—all 12-way multi-class tasks
- Sentiment Analysis: L3Cube-MahaSent and MahaSent-MD (multi-domain) (Pingle et al., 2023)
- Hate Speech Detection: L3Cube-MahaHate (Velankar et al., 2022), HASOC (Chavan et al., 2022)
- Named Entity Recognition: L3Cube-MahaNER (Patil et al., 2022)
- Sentence Embeddings: MahaSBERT/STS (Joshi et al., 2022, Shelke et al., 2024)
Performance consistently shows monolingual MahaBERT variants surpassing multilingual baselines (mBERT, IndicBERT, MuRIL, XLM-R) across all tasks, often by 1–3 points in accuracy or macro-F1, with larger margins on NER and classification (Joshi, 2022, Patil et al., 2022, Shirke et al., 1 Jan 2025). For hate/offensive tasks, MahaBERT is marginally outperformed only by tweet-domain-specialized MahaTweetBERT, but consistently bests MuRIL on general data (Chavan et al., 2022).
Layer pruning has limited adverse effect on end-task accuracy. For example, a MahaBERT-v2 model pruned to 6 layers via top/middle/bottom strategies achieves absolute accuracy up to 92% (SHC), 91% (LPC), and 90% (LDC), matching or exceeding full-size baselines on specific datasets (Shirke et al., 1 Jan 2025).
| Model | SHC | LPC | LDC |
|---|---|---|---|
| MahaBERT-v2 (12L) | 91.41 | 88.75 | 94.78 |
| MahaBERT-v2 Top6 (6L) | 92.18 | 90.80 | 89.35 |
| MahaBERT-Small (6L) | 88.81 | 89.46 | 85.04 |
In sentence embedding evaluation (intrinsic: STS Spearman ρ; extrinsic: downstream kNN) pruned MahaBERT-v2 (6L, 2L) with SBERT-style finetuning achieves up to 90–95% of full-model performance (ρ=0.7878 for 6L, vs. 0.8320 for 12L after two-phase fine-tuning) (Shelke et al., 2024, Joshi et al., 2022).
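The intrinsic STS metric above is Spearman's rank correlation between predicted cosine similarities and gold similarity scores. A self-contained implementation (with average ranks for ties), included here only to make the metric concrete:

```python
def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0       # mean rank of the tied block
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```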
4. Sentence Embedding and MahaSBERT
MahaBERT forms the backbone of Marathi sentence-BERT (SBERT) adaptations, supporting strong semantic similarity and retrieval (Joshi et al., 2022, Shelke et al., 2024). Training uses a two-phase approach:
- Stage 1: NLI fine-tuning (MultipleNegativesRankingLoss)
- Stage 2: STS fine-tuning (CosineSimilarityLoss)
For sentence embedding extraction, mean/CLS/max pooling strategies are compared; mean pooling on the final layer generally yields the best results for monolingual models. Two-step NLI→STS training with MahaBERT consistently achieves higher STS correlation (ρ=0.83) than one-step or multilingual alternatives (Joshi et al., 2022). Pruned models (6L/2L) post-fine-tuning retain nearly all downstream utility (e.g., ρ=0.7878 for 6L) (Shelke et al., 2024).
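Mean pooling with an attention mask is the workhorse of this pipeline. A minimal sketch, using a random stand-in array in place of real final-layer MahaBERT output:

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average final-layer token vectors, ignoring padded positions."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # (seq_len, 1)
    return (token_embs * mask).sum(axis=0) / mask.sum()

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy (seq_len=3, hidden=2) "final layer"; third position is padding
embs = np.array([[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]])
vec = mean_pool(embs, [1, 1, 0])   # -> [2.0, 4.0]; padding is excluded
```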
Layer pruning yields efficient, deployable SBERT models suitable for Marathi-language search and clustering, with pruned MahaBERT-based SBERT outperforming both small scratch-trained and multilingual embedding models at similar parameter budgets (Shelke et al., 2024).
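The extrinsic kNN evaluation mentioned above classifies a query by the majority label among its nearest neighbors in embedding space. A sketch over frozen sentence vectors, using cosine similarity (the arrays and labels are toy data, not a real benchmark):

```python
import numpy as np

def knn_label(query_vec, train_vecs, train_labels, k=3):
    """Majority label among the k most cosine-similar training vectors."""
    tv = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = tv @ q                       # cosine similarity to every example
    top = np.argsort(-sims)[:k]         # indices of the k nearest neighbors
    votes = [train_labels[i] for i in top]
    return max(set(votes), key=votes.count)

train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["pos", "pos", "neg", "neg"]
pred = knn_label(np.array([1.0, 0.05]), train, labels, k=3)  # -> "pos"
```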
5. Embedding Efficiency: Contextual and Non-Contextual Features
MahaBERT supports both contextual BERT embeddings (full layer-stack inference) and non-contextual embeddings (the embedding-layer output: the sum of token, segment, and position vectors), the latter obviating transformer computation. Non-contextual MahaBERT embeddings, when averaged over tokens, outperform FastText and other non-contextual baselines across sentiment and topic classification, approaching the accuracy of contextual models at much lower runtime cost (Shanbhag et al., 2024). However, contextual (full-stack) representations are optimal when compute is not a constraint.
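In the non-contextual setup, a frozen embedding table replaces the full transformer: a sentence vector is simply the average of its tokens' embedding-layer vectors. A sketch with a random stand-in table (not real MahaBERT weights):

```python
import numpy as np

# hypothetical frozen 30k x 768 embedding table standing in for
# MahaBERT's embedding layer
rng = np.random.default_rng(1)
embedding_table = rng.normal(size=(30000, 768))

def noncontextual_sentence_vec(token_ids):
    """Average embedding-layer vectors; no transformer forward pass."""
    return embedding_table[np.asarray(token_ids)].mean(axis=0)

vec = noncontextual_sentence_vec([5, 17, 42])  # shape (768,)
```

The resulting vectors can feed any lightweight classifier, trading a few points of accuracy (see the table below) for a large reduction in inference cost.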
| Embedding Type | MahaSent | SHC | LDC |
|---|---|---|---|
| Contextual MahaBERT | 82.27 | 89.83 | 93.87 |
| Non-contextual MahaBERT | 77.56 | 86.45 | 91.69 |
| FastText MahaFT | 78.62 | 85.89 | 92.62 |
6. Comparative Analysis and Impact in Low-Resource Settings
The repeated empirical finding is that monolingual, corpus-specialized BERTs (MahaBERT family) provide significant and consistent accuracy gains in downstream tasks relative to their multilingual counterparts (Joshi, 2022, Velankar et al., 2022, Shirke et al., 1 Jan 2025). This is attributed to a larger, higher-quality vocabulary adapted to Marathi morphology, improved subword/compound representations, and an absence of cross-lingual negative transfer.
Layer-pruning methodology provides a principled means to realize these gains with reduced compute and storage requirements, a key concern for Marathi and similar languages. Pruned MahaBERT variants outperform equivalently sized scratch-trained models in both classification and sentence encoding tasks, showing that pretraining depth and capacity are effectively leveraged even after model reduction (Shirke et al., 1 Jan 2025, Shelke et al., 2024).
7. Limitations, Future Directions, and Public Resources
While MahaBERT consistently sets state-of-the-art baselines for Marathi topic, sentiment, and hate-speech classification, known limitations include limited generalizability of frozen sentence embeddings to noisy, out-of-domain (e.g., code-mixed) data, and gaps in handling Roman-script or mixed-language sequences (Velankar et al., 2022). The hate-speech datasets are artificially balanced, differing from organic social data (Velankar et al., 2022). Current work omits advanced universal sentence embedding pretraining (e.g., contrastive objectives or cross-lingual alignment); ongoing research targets robustness to such domains, curriculum and adapter-based fine-tuning, and expansion to new Indo-Aryan scripts.
All MahaBERT models, training scripts, and major Marathi NLP datasets (MahaCorpus, MahaSent, MahaHate, MahaNER, MahaNews, etc.) are open-source and actively maintained by L3Cube Pune [https://github.com/l3cube-pune/MarathiNLP, (Joshi, 2022, Patil et al., 2022, Mittal et al., 2024)]. This has positioned MahaBERT as a de facto reference for research and production deployment in Marathi language technology, and as an exemplar for resource creation in other low-resource linguistic contexts.