MahaBERT: Marathi NLP Transformer
- MahaBERT is a BERT-style transformer model tailored for Marathi, a morphologically rich and low-resource language.
- It leverages aggressive layer pruning and specialized pretraining on large Marathi corpora to enhance performance in tasks like text classification, sentiment analysis, and NER.
- Empirical results show that both full and pruned variants outperform multilingual baselines, establishing MahaBERT as a leading solution in efficient low-resource NLP.
MahaBERT is a family of BERT-style pretrained transformer models tailored for Marathi, a morphologically rich, low-resource Indo-Aryan language. Designed by the L3Cube Pune research group, MahaBERT and its derivatives represent the foundation for high-accuracy Marathi NLP applications, offering strong empirical gains over multilingual BERT models for text classification, sentiment analysis, named entity recognition (NER), sentence embeddings, and hate speech detection. The MahaBERT line, including MahaBERT-v2, MahaBERT-Small, and MahaBERT-Smaller, is central to research on efficient model adaptation via aggressive layer pruning in resource-constrained settings (Shirke et al., 1 Jan 2025), and has catalyzed the emergence of Marathi-specific NLP datasets, pipelines, and evaluation standards (Joshi, 2022, Patil et al., 2022, Mittal et al., 2024, Pingle et al., 2023).
1. Architecture and Pretraining Regimen
The MahaBERT architecture strictly follows the BERT-Base configuration:
- 12 Transformer encoder layers ($L = 12$)
- Hidden state dimensionality of 768
- 12 self-attention heads per layer
- Intermediate feed-forward size = 3072
- Total parameter count: ~110 million
- Tokenization: WordPiece vocabulary of approximately 30,000 subwords, cased, constructed from large-scale monolingual Marathi corpora
- Embeddings: token, segment, position (all 768-dim); input = sum of their respective vectors per token
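The input representation described above can be sketched in a few lines. This is a minimal illustration, not MahaBERT's actual weights: the three embedding tables are randomly initialized stand-ins, and the token/segment ids are arbitrary.

```python
import numpy as np

# BERT-style input representation: each token's vector is the sum of its
# token, segment, and position embeddings, all 768-dimensional.
HIDDEN, VOCAB, MAX_POS = 768, 30000, 512
rng = np.random.default_rng(0)
token_table = rng.normal(size=(VOCAB, HIDDEN))
segment_table = rng.normal(size=(2, HIDDEN))
position_table = rng.normal(size=(MAX_POS, HIDDEN))

def input_embeddings(token_ids, segment_ids):
    """Return one 768-dim vector per token: token + segment + position."""
    positions = np.arange(len(token_ids))
    return (token_table[np.asarray(token_ids)]
            + segment_table[np.asarray(segment_ids)]
            + position_table[positions])

emb = input_embeddings([101, 2054, 102], [0, 0, 0])  # shape (3, 768)
```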
Pretraining is exclusively Marathi: corpus sources include the L3Cube-MahaCorpus (scraped web/news/documents, 212M news + 76.4M non-news tokens), Wikipedia archives, OSCAR/CC-100, Netshika, and the IIT Bombay parallel Marathi corpus, for an aggregate of 752 million tokens (Joshi, 2022). The primary pretraining objective is Masked Language Modeling (MLM); Next Sentence Prediction (NSP) is used in some variants (notably MahaBERT-v2 and "MarathiBERT V2" (Chavan et al., 2022)), but omitted in earlier and RoBERTa-style variants (Joshi, 2022, Patil et al., 2022).
The MLM training schedule follows standard BERT practice: AdamW optimizer (weight decay $0.01$) with a standard learning-rate schedule, batch size 64–256, maximum sequence length 512, and typically up to 2 epochs over the full corpus (Joshi, 2022). Downstream tasks adopt application-specific heads (a single-layer softmax classifier for classification, a token-classification head for NER), trained with cross-entropy loss.
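The MLM objective corrupts the input before prediction. The sketch below implements the standard BERT corruption scheme (select ~15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% unchanged); the `MASK_ID` value and token ids are illustrative, not MahaBERT's actual vocabulary.

```python
import random

MASK_ID = 103        # hypothetical [MASK] token id
VOCAB_SIZE = 30000

def mlm_mask(token_ids, mask_prob=0.15, seed=0):
    """Standard BERT MLM corruption: returns (corrupted inputs, labels)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)   # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                      # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)    # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```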
| Model | Layers | Hidden | Heads | Parameters | Corpus |
|---|---|---|---|---|---|
| MahaBERT | 12 | 768 | 12 | ~110M | ~750M tokens |
| MahaBERT-v2 | 12 | 768 | 12 | ~110M | ≥1B tokens |
| MahaBERT-Small | 6 | 768 | 12 | ~55M | scratch-trained |
| MahaBERT-Smaller | 2 | 768 | 12 | ~18M | scratch-trained |
2. Layer Pruning and Model Compression
MahaBERT models are a principal testbed for structured layer pruning, targeting practical deployment for low-resource languages. The formalism: for an $L$-layer model, prune $k$ layers to produce an $(L-k)$-layer model. Three positional strategies are used:
- Top-pruning: remove the final $k$ layers (layers $L-k+1$ to $L$)
- Bottom-pruning: remove the initial $k$ layers (layers 1 to $k$)
- Middle-pruning: remove a contiguous block of $k$ layers centered within the stack
Experiments typically prune $k=6$ or $k=10$ layers from MahaBERT-v2 ($L=12$), yielding 6-layer and 2-layer models. This approach yields significant resource savings: $k=6$ reduces model size and compute by 50%, $k=10$ by 83% (Shirke et al., 1 Jan 2025, Shelke et al., 2024). Inference speedup is empirically measured at 1.8× for the 6-layer model, with a correspondingly larger speedup for the 2-layer model and similar proportional reductions in FLOPs.
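The three positional strategies reduce to list slicing over the encoder stack. A minimal sketch, with layers represented by their indices rather than actual transformer modules:

```python
def prune_layers(layers, k, strategy):
    """Drop k of the L encoder layers by position; return the kept layers."""
    L = len(layers)
    if strategy == "top":        # remove the final k layers
        return layers[:L - k]
    if strategy == "bottom":     # remove the initial k layers
        return layers[k:]
    if strategy == "middle":     # remove a contiguous centered block
        start = (L - k) // 2
        return layers[:start] + layers[start + k:]
    raise ValueError(f"unknown strategy: {strategy}")

stack = list(range(12))                   # stand-in for a 12-layer model
top6 = prune_layers(stack, 6, "top")      # [0, 1, 2, 3, 4, 5]
bottom10 = prune_layers(stack, 10, "bottom")  # [10, 11]
middle6 = prune_layers(stack, 6, "middle")    # [0, 1, 2, 9, 10, 11]
```

The kept layers are then fine-tuned directly on the downstream task, so the pruned model inherits its pretrained weights rather than training from scratch.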
Performance remains competitive post-pruning: pruned MahaBERT-v2 models (6/2 layers) maintain accuracy and embedding quality close to the full 12-layer baseline, and consistently outperform scratch-trained MahaBERT-Small/Smaller at identical size (Shirke et al., 1 Jan 2025, Shelke et al., 2024). Middle-pruning, while effective, is generally matched or outperformed by top-pruning for sentence encoding (Shelke et al., 2024), but optimal strategy is task-dependent.
3. Downstream Tasks and Empirical Results
MahaBERT and its pruned variants are evaluated on a wide array of Marathi NLP tasks. Key benchmarks include:
- Topic Classification: L3Cube-MahaNews (Mittal et al., 2024), SHC (short headlines), LPC (paragraphs), LDC (long documents)—all 12-way multi-class tasks
- Sentiment Analysis: L3Cube-MahaSent and MahaSent-MD (multi-domain) (Pingle et al., 2023)
- Hate Speech Detection: L3Cube-MahaHate (Velankar et al., 2022), HASOC (Chavan et al., 2022)
- Named Entity Recognition: L3Cube-MahaNER (Patil et al., 2022)
- Sentence Embeddings: MahaSBERT/STS (Joshi et al., 2022, Shelke et al., 2024)
Performance consistently shows monolingual MahaBERT variants surpassing multilingual baselines (mBERT, IndicBERT, MuRIL, XLM-R) across all tasks, often by 1–3 points in accuracy or macro-F1, with larger margins on NER and classification (Joshi, 2022, Patil et al., 2022, Shirke et al., 1 Jan 2025). For hate/offensive tasks, MahaBERT is marginally outperformed only by tweet-domain-specialized MahaTweetBERT, but consistently bests MuRIL on general data (Chavan et al., 2022).
Layer pruning has limited adverse effect on end-task accuracy. For example, a MahaBERT-v2 model pruned to 6 layers via top/middle/bottom strategies achieves absolute accuracy up to 92% (SHC), 91% (LPC), and 90% (LDC), matching or exceeding full-size baselines on specific datasets (Shirke et al., 1 Jan 2025).
| Model | SHC | LPC | LDC |
|---|---|---|---|
| MahaBERT-v2 (12L) | 91.41 | 88.75 | 94.78 |
| MahaBERT-v2 Top6 (6L) | 92.18 | 90.80 | 89.35 |
| MahaBERT-Small (6L) | 88.81 | 89.46 | 85.04 |
In sentence embedding evaluation (intrinsic: STS Spearman ρ; extrinsic: downstream kNN) pruned MahaBERT-v2 (6L, 2L) with SBERT-style finetuning achieves up to 90–95% of full-model performance (ρ=0.7878 for 6L, vs. 0.8320 for 12L after two-phase fine-tuning) (Shelke et al., 2024, Joshi et al., 2022).
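The intrinsic STS metric above is Spearman's rank correlation between predicted cosine similarities and gold similarity scores. A self-contained implementation (with average ranks for ties), included here only to make the metric concrete:

```python
def _ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0       # mean rank of the tied block
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```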
4. Sentence Embedding and MahaSBERT
MahaBERT forms the backbone of Marathi sentence-BERT (SBERT) adaptations, supporting strong semantic similarity and retrieval (Joshi et al., 2022, Shelke et al., 2024). Training uses a two-phase approach:
- Stage 1: NLI fine-tuning (MultipleNegativesRankingLoss)
- Stage 2: STS fine-tuning (CosineSimilarityLoss)
For sentence embedding extraction, mean/CLS/max pooling strategies are compared; mean pooling on the final layer generally yields the best results for monolingual models. Two-step NLI→STS training with MahaBERT consistently achieves higher STS correlation (ρ=0.83) than one-step or multilingual alternatives (Joshi et al., 2022). Pruned models (6L/2L) post-fine-tuning retain nearly all downstream utility (e.g., ρ=0.7878 for 6L) (Shelke et al., 2024).
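Mean pooling with an attention mask is the workhorse of this pipeline. A minimal sketch, using a random stand-in array in place of real final-layer MahaBERT output:

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Average final-layer token vectors, ignoring padded positions."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # (seq_len, 1)
    return (token_embs * mask).sum(axis=0) / mask.sum()

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy (seq_len=3, hidden=2) "final layer"; third position is padding
embs = np.array([[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]])
vec = mean_pool(embs, [1, 1, 0])   # -> [2.0, 4.0]; padding is excluded
```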
Layer pruning yields efficient, deployable SBERT models suitable for Marathi-language search and clustering, with pruned MahaBERT-based SBERT outperforming both small scratch-trained and multilingual embedding models at similar parameter budgets (Shelke et al., 2024).
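The extrinsic kNN evaluation mentioned above classifies a query by the majority label among its nearest neighbors in embedding space. A sketch over frozen sentence vectors, using cosine similarity (the arrays and labels are toy data, not a real benchmark):

```python
import numpy as np

def knn_label(query_vec, train_vecs, train_labels, k=3):
    """Majority label among the k most cosine-similar training vectors."""
    tv = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = tv @ q                       # cosine similarity to every example
    top = np.argsort(-sims)[:k]         # indices of the k nearest neighbors
    votes = [train_labels[i] for i in top]
    return max(set(votes), key=votes.count)

train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["pos", "pos", "neg", "neg"]
pred = knn_label(np.array([1.0, 0.05]), train, labels, k=3)  # -> "pos"
```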
5. Embedding Efficiency: Contextual and Non-Contextual Features
MahaBERT supports both contextual BERT embeddings (full layer-stack inference) and non-contextual embeddings (the embedding-layer output: the sum of token, segment, and position vectors), the latter obviating transformer computation. Non-contextual MahaBERT embeddings, when averaged over tokens, outperform FastText and other non-contextual baselines across sentiment and topic classification, approaching the accuracy of contextual models at much lower runtime cost (Shanbhag et al., 2024). However, contextual (full-stack) representations are optimal when compute is not a constraint.
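In the non-contextual setup, a frozen embedding table replaces the full transformer: a sentence vector is simply the average of its tokens' embedding-layer vectors. A sketch with a random stand-in table (not real MahaBERT weights):

```python
import numpy as np

# hypothetical frozen 30k x 768 embedding table standing in for
# MahaBERT's embedding layer
rng = np.random.default_rng(1)
embedding_table = rng.normal(size=(30000, 768))

def noncontextual_sentence_vec(token_ids):
    """Average embedding-layer vectors; no transformer forward pass."""
    return embedding_table[np.asarray(token_ids)].mean(axis=0)

vec = noncontextual_sentence_vec([5, 17, 42])  # shape (768,)
```

The resulting vectors can feed any lightweight classifier, trading a few points of accuracy (see the table below) for a large reduction in inference cost.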
| Embedding Type | MahaSent | SHC | LDC |
|---|---|---|---|
| Contextual MahaBERT | 82.27 | 89.83 | 93.87 |
| Non-contextual MahaBERT | 77.56 | 86.45 | 91.69 |
| FastText MahaFT | 78.62 | 85.89 | 92.62 |
6. Comparative Analysis and Impact in Low-Resource Settings
The repeated empirical finding is that monolingual, corpus-specialized BERTs (MahaBERT family) provide significant and consistent accuracy gains in downstream tasks relative to their multilingual counterparts (Joshi, 2022, Velankar et al., 2022, Shirke et al., 1 Jan 2025). This is attributed to a larger, higher-quality vocabulary adapted to Marathi morphology, improved subword/compound representations, and an absence of cross-lingual negative transfer.
Layer-pruning methodology provides a principled means to realize these gains with reduced compute and storage requirements, a key concern for Marathi and similar languages. Pruned MahaBERT variants outperform equivalently sized scratch-trained models in both classification and sentence encoding tasks, showing that pretraining depth and capacity are effectively leveraged even after model reduction (Shirke et al., 1 Jan 2025, Shelke et al., 2024).
7. Limitations, Future Directions, and Public Resources
While MahaBERT consistently sets state-of-the-art baselines for Marathi topic, sentiment, and hate-speech classification, known limitations include limited generalizability of frozen sentence embeddings to noisy, out-of-domain (e.g., code-mixed) data, and gaps in handling Roman-script or mixed-language sequences (Velankar et al., 2022). The hate-speech datasets are artificially balanced, differing from organic social data (Velankar et al., 2022). Current work omits advanced universal sentence embedding pretraining (e.g., contrastive objectives or cross-lingual alignment); ongoing research targets robustness to such domains, curriculum and adapter-based fine-tuning, and expansion to new Indo-Aryan scripts.
All MahaBERT models, training scripts, and major Marathi NLP datasets (MahaCorpus, MahaSent, MahaHate, MahaNER, MahaNews, etc.) are open-source and actively maintained by L3Cube Pune [https://github.com/l3cube-pune/MarathiNLP, (Joshi, 2022, Patil et al., 2022, Mittal et al., 2024)]. This has positioned MahaBERT as a de facto reference for research and production deployment in Marathi language technology, and as an exemplar for resource creation in other low-resource linguistic contexts.