MahaCorpus: Marathi NLP Resource
- MahaCorpus is a comprehensive monolingual Marathi text dataset comprising 24.8 million sentences (289 million tokens), rising to 752 million tokens when combined with existing Marathi resources.
- It is constructed using systematic web scraping and cleaning methods, including HTML stripping and Unicode normalization, to ensure high data quality.
- The corpus is divided into news and non-news sub-corpora, featuring a broad vocabulary that supports robust transformer-based models and a variety of NLP tasks.
MahaCorpus is a large-scale, open-access monolingual Marathi text corpus developed by the L3Cube group for language modeling and downstream NLP tasks. Encompassing 24.8 million sentences and approximately 289 million tokens (expanded to 752 million tokens when combined with pre-existing Marathi resources), it constitutes the principal input for a new generation of transformer-based models and lexical resources for Marathi. The collection is designed to be domain-diverse and is openly released via the L3Cube-MahaNLP repository, supporting both foundational and applied research in low-resource Indian language processing (Joshi, 2022; Chavan et al., 2024).
1. Corpus Composition and Structure
MahaCorpus aggregates monolingual Marathi data across two principal sources: news (predominantly Maharashtra Times) and non-news web/literature (notably netshika.com “sangrah” pages). The corpus comprises 17.6 million news sentences (212 million tokens) and 7.2 million non-news sentences (76.4 million tokens), totaling 24.8 million sentences (289 million tokens). When combined with other public Marathi text, the aggregate rises to 57.2 million sentences (752 million tokens) (Joshi, 2022).
The corpus is distributed as two independent sub-corpora, news and non-news, to facilitate domain-specific modeling. The empirical vocabulary spans on the order of 200,000–300,000 unique Devanagari word-forms, and the rank–frequency curve conforms to Zipf’s law, indicating a broad lexical base and a natural-language-like statistical profile.
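The Zipfian claim can be checked with a simple rank–frequency fit. The sketch below is a pure-stdlib illustration on a synthetic sample; on MahaCorpus the tokens would come from the released sentence files, and a Zipf-like distribution yields a log–log slope near -1.

```python
# Estimate the slope of the log-log rank-frequency curve; Zipf-like
# text gives a slope near -1. Synthetic data stands in for the corpus.
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic Zipfian sample: word i appears roughly 200/i times.
sample = [f"w{i}" for i in range(1, 200) for _ in range(200 // i)]
slope = zipf_slope(sample)
```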
| Source Type | Sentences | Tokens | Token share |
|---|---|---|---|
| News | 17.6 million | 212 million | ~73.4% |
| Non-news | 7.2 million | 76.4 million | ~26.6% |
The corpus includes no linguistic annotation besides the basic news/non-news classification. No part-of-speech tags, named entities, or document-level metadata are encoded in the raw corpus (Joshi, 2022).
2. Data Acquisition and Preprocessing
MahaCorpus is sourced through web scraping, utilizing BeautifulSoup for static HTML and Selenium for JavaScript-heavy sites (Joshi, 2022). Typical scraping targets are organized under two buckets (news and non-news), but the paper does not enumerate the complete list of domains or specify crawling schedules (Joshi, 2022).
Cleaning uses a combination of regular expressions and Unicode normalization (NFC), with processes for HTML stripping, Devanagari character filtering, and the removal of URLs, emails, user-mentions, and emojis. Sentence boundary detection is performed using Devanagari-specific punctuation (।, ॥) alongside ., ?, !, ensuring each sentence is stored on a single line (Joshi, 2022). Tokenization for corpus statistics relies on whitespace splitting, but downstream pretraining employs byte-pair encoding (BPE) or SentencePiece, yielding vocabularies of 30,000–50,000 subwords.
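The cleaning and sentence-splitting steps above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline; the regex patterns are assumptions matching the described behavior (URL/email/mention removal, NFC normalization, splitting on danda and Western terminators).

```python
# Sketch of the described cleaning pipeline: NFC normalization,
# URL/email/mention removal, and sentence splitting on Devanagari
# danda (।, ॥) plus ., ?, !. Patterns are illustrative.
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
MENTION_RE = re.compile(r"@\w+")
SENT_BOUNDARY = re.compile(r"(?<=[।॥.?!])\s+")

def clean(text):
    """Normalize to NFC, drop URLs/emails/mentions, squeeze whitespace."""
    text = unicodedata.normalize("NFC", text)
    for pat in (URL_RE, EMAIL_RE, MENTION_RE):
        text = pat.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """One sentence per list entry, i.e. one sentence per output line."""
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]

raw = "ही पहिली ओळ आहे। दुसरी ओळ https://example.com येथे आहे."
sents = split_sentences(clean(raw))
print(len(sents))  # → 2
```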
Some preprocessing steps such as duplicate removal, language-ID filtering, and orthographic normalization are implied but not exhaustively detailed (Joshi, 2022). For TF–IDF workflows in stopword curation, the full corpus is partitioned into 20 equal-sized chunks (each ~1.24 million sentences) for tractability (Chavan et al., 2024).
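The per-chunk TF–IDF scoring can be sketched as below, treating each sentence as a document; this is a toy illustration of the idea, not the authors' implementation (the paper uses 20 chunks of ~1.24 million sentences, with far larger candidate lists).

```python
# Toy sketch of per-chunk TF-IDF scoring for stopword candidates:
# words present in nearly every sentence get idf near 0 and therefore
# surface at the bottom of the summed TF-IDF ranking.
import math
from collections import Counter

def lowest_tfidf_terms(chunk_sentences, n_lowest):
    """Return the n lowest-scoring words by summed TF-IDF in a chunk."""
    term_freq = Counter()
    doc_freq = Counter()
    n_docs = len(chunk_sentences)
    for sent in chunk_sentences:
        words = sent.split()
        term_freq.update(words)        # total occurrences
        doc_freq.update(set(words))    # sentences containing the word
    scores = {w: term_freq[w] * math.log(n_docs / doc_freq[w])
              for w in term_freq}
    return sorted(scores, key=scores.get)[:n_lowest]

# The copula "आहे" appears in every toy sentence, so its idf is zero.
chunk = ["हा मजकूर आहे", "ते पुस्तक आहे", "ती शाळा आहे"]
print(lowest_tfidf_terms(chunk, 1))  # → ['आहे']
```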
3. Licensing, Format, and Access
MahaCorpus is distributed via the L3Cube-MahaNLP GitHub repository (https://github.com/l3cube-pune/MarathiNLP) in plain-text Devanagari UTF-8 format, typically compressed as .txt.gz files, with directories separating news and non-news content (Joshi, 2022). All data are published under the Creative Commons BY-SA 4.0 license (Joshi, 2022). No explicit train/dev/test splits are provided; the released structure allows arbitrary downstream splits, with an example 80%/10%/10% partition suggested but not mandated in the primary documentation.
Download and loading procedures are standard for large corpora, supporting both shell-based cloning and line-by-line loading in Python. Pre-built HuggingFace `datasets` interfaces are provided for direct integration with neural pipelines.
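Line-by-line loading of the gzipped plain-text release can be sketched as follows; the file created here is a throwaway stand-in for a real corpus shard, and file names are illustrative.

```python
# Sketch: streaming a released .txt.gz sentence file line by line.
import gzip
import os
import tempfile

def stream_sentences(path):
    """Yield one sentence per line from a gzipped UTF-8 text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# Demo with a temporary file standing in for a news/non-news shard.
tmp = tempfile.NamedTemporaryFile(suffix=".txt.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt", encoding="utf-8") as f:
    f.write("पहिले वाक्य\nदुसरे वाक्य\n")
sents = list(stream_sentences(tmp.name))
os.unlink(tmp.name)
print(len(sents))  # → 2
```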
4. Benchmarking, Language Modeling, and Applications
MahaCorpus serves as the foundational data source for all L3Cube Marathi LLMs, including MahaBERT, MahaRoBERTa, MahaAlBERT (BERT variants), MahaGPT (GPT-2 style causal LM), and MahaFT (FastText embeddings) (Joshi, 2022).
Model architecture profiles:
- MahaBERT: 12-layer transformer, 768 hidden size, 12 attention heads, ~32k vocab, ~110M parameters.
- MahaGPT: small/medium GPT-2 configuration (12–24 layers), custom BPE vocabulary.
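The MahaBERT profile above can be written as a configuration sketch. The feed-forward width and maximum sequence length below are assumed standard BERT-base values, not confirmed by the source; the rough parameter count recovers the stated ~110M.

```python
# Configuration sketch mirroring the MahaBERT profile listed above.
# intermediate_size and max_position_embeddings are assumed
# BERT-base defaults, not values confirmed by the papers.
mahabert_config = {
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "vocab_size": 32_000,       # ~32k subword vocabulary
    "intermediate_size": 3072,  # assumed BERT-base feed-forward width
    "max_position_embeddings": 512,
}

def approx_params(cfg):
    """Rough count: embeddings plus attention/feed-forward weights."""
    h = cfg["hidden_size"]
    emb = (cfg["vocab_size"] + cfg["max_position_embeddings"]) * h
    per_layer = 4 * h * h + 2 * h * cfg["intermediate_size"]
    return emb + cfg["num_hidden_layers"] * per_layer

print(f"~{approx_params(mahabert_config) / 1e6:.0f}M parameters")  # → ~110M parameters
```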
Training hyperparameters follow standard regimes for BERT-base models: batch size 256–512, a linearly decaying learning rate with 10% warmup, 500k steps, and multi-GPU training on V100/A100 hardware. Reported perplexity and full training details are not documented in the source papers.
Downstream tasks benchmarked include:
- News-article classification (sports/entertainment/lifestyle)
- Headline classification (entertainment/sports/state)
- Tweet sentiment analysis (positive/negative/neutral)
- Named Entity Recognition (PER/LOC/ORG)
Performance gains over multilingual baselines (mBERT, XLM-R, IndicBERT) are observed, with MahaBERT, MahaRoBERTa, and MahaAlBERT achieving up to +1.8% accuracy on tweet sentiment and up to +2.02 points in NER macro-F1 over baselines (Joshi, 2022). FastText embeddings trained on MahaCorpus (MahaFT) outperform both INLP-FT and Facebook FastText on classification.
Stopword curation using TF–IDF on MahaCorpus yielded a validated list of 400 Marathi stopwords, the largest to date. Stopword removal produces negligible loss in classification accuracy (<0.5%) and mildly increases sentiment classification performance for MahaBERT (Chavan et al., 2024).
5. Statistical and Linguistic Profile
With 24.8 million sentences and an estimated 200,000–300,000 unique word-forms, the surface vocabulary of MahaCorpus offers wide coverage of everyday and formal Marathi. Frequency analysis closely follows the expected Zipfian profile, and sentence lengths range from short headlines (3–5 tokens) to extended expository sentences (>50 tokens) (Chavan et al., 2024). For TF–IDF stopword ranking, intersecting the lowest-scoring 5,000 terms from each chunk identified 2,297 uniformly frequent, low-content words; subsequent manual vetting distilled these to a final 400-word stoplist.
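The intersection step reduces to a set operation over the per-chunk candidate lists, as sketched below with toy lists (the real lists hold 5,000 terms per chunk).

```python
# Sketch of the candidate-intersection step: keep only words that
# rank as low-content in every chunk. Toy Marathi lists stand in for
# the per-chunk bottom-5,000 TF-IDF rankings.
def stopword_candidates(per_chunk_lowest):
    """Intersect the per-chunk low-TF-IDF term lists."""
    candidates = set(per_chunk_lowest[0])
    for terms in per_chunk_lowest[1:]:
        candidates &= set(terms)
    return candidates

chunks = [
    ["आहे", "आणि", "मुंबई"],    # chunk 1: function words + a place name
    ["आहे", "आणि", "क्रिकेट"],  # chunk 2
    ["आहे", "आणि", "सरकार"],    # chunk 3
]
print(sorted(stopword_candidates(chunks)))  # → ['आणि', 'आहे']
```

Only the words common to every chunk survive, mirroring how the 2,297 uniformly frequent candidates were obtained before manual vetting.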
Corpus structure and size enable domain-adaptive modeling and holistic evaluation, critical for robust low-resource language NLP.
6. Limitations and Future Directions
Several steps in corpus construction, such as explicit duplicate removal, advanced noise filtering (e.g., code-mixed segment exclusion), and finer-grained metadata collection (such as document IDs, timestamps, authorship), are not fully documented (Joshi, 2022). The raw corpus encodes only basic domain metadata (news vs. non-news), with no linguistic annotation.
Potential directions for expansion and refinement include:
- Integrating social-media text (e.g., Twitter, Facebook) to better encompass colloquial and code-switched Marathi
- Adding domain-specific subcorpora (medical, legal, academic)
- Implementing continuous crawling pipelines for temporal coverage
- Applying advanced cleaning (parallel language-ID classifiers, normalization, iterative human-in-the-loop validation)
These enhancements could further improve the resource for specialized downstream tasks and adaptive language modeling.
7. Access and Community Adoption
MahaCorpus and all associated models, benchmarks, and stopword lists are freely available at https://github.com/l3cube-pune/MarathiNLP. The resource has become the de facto standard for Marathi monolingual pretraining and evaluation. Integration with HuggingFace datasets and models streamlines experimentation. The L3Cube-MahaNLP initiative’s open-source approach has established a replicable framework for other low-resource Indian languages (Joshi, 2022; Chavan et al., 2024).