Indic Language Models: Overview & Advances
- Indic Language Models (ILMs) are transformer-based neural models tailored for the diverse linguistic, script, and morphological features of South Asian languages, supporting both monolingual and multilingual setups.
- They employ advanced data curation, tokenization, and training paradigms to mitigate challenges like severe data scarcity, orthographic complexity, and cross-script generalization.
- Progress in ILMs is supported by robust evaluation benchmarks, bias mitigation strategies, and resource-efficient model architectures, which together improve natural language understanding and generation across tasks.
Indic Language Models (ILMs) are neural language models, predominantly Transformer-based, designed for the diverse linguistic, script, and morphological landscape of South Asian "Indic" languages. ILMs cover both monolingual and multilingual setups, address over twenty official Indian languages across numerous scripts, and span foundational models for natural language understanding (NLU), generation (NLG), and multitask instruction following. Their development and evaluation raise challenges rarely faced by high-resource languages, such as severe data scarcity in tail languages, orthographic complexity, and the need for cross-script and cross-family generalization.
1. Pretraining Corpora and Data Engineering
ILM performance fundamentally depends on the scale, diversity, and cleanliness of pretraining corpora. Leading efforts have targeted corpus curation for large-scale, script-diverse, and culturally varied data sources.
- IndicCorp v2 (Doddapaneni et al., 2022): A monolingual corpus encompassing 24 languages across four language families, totaling 20.9B tokens (1.1B sentences). Sources include news crawls, Wikipedia, and the OSCAR corpus, with script-level filtering, language identification (LID), toxic content removal, and document-level deduplication.
- Krutrim LLM (Kumar et al., 2024): Petabyte-scale ingestion of web and text data (Common Crawl, digital news, books, Wikipedia), filtered via percentile-based heuristics on token and sentence lengths, perplexity scoring with small LMs, and duplicate detection by 5-gram shingling with MinHash LSH (approximate Jaccard similarity threshold θ = 0.7; a minimal deduplication sketch follows this list). For example, the Hindi subset shrank from 108M tokens to 51.2M after deduplication and full filtering.
- Token/word-level distribution in public sources is highly imbalanced (e.g., Hindi yields 0.197% of Common Crawl tokens vs. <0.05% for languages like Assamese, Odia) (Vaidya et al., 23 Jan 2025). Thus, for low-resource languages, synthetic data generation, targeted web crawls, and careful upsampling or balancing strategies are critical.
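The shingling-plus-MinHash step above can be sketched as follows. This is a minimal illustration, assuming the `datasketch` package (not named in the cited work), word-level 5-gram shingles, and 128 permutations; it is not the Krutrim pipeline itself.

```python
# Near-duplicate removal via 5-gram shingling + MinHash LSH (illustrative sketch).
from datasketch import MinHash, MinHashLSH

def shingles(text, n=5):
    """Word-level n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for sh in shingles(text):
        m.update(sh.encode("utf-8"))
    return m

# LSH index with approximate Jaccard similarity threshold θ = 0.7, as in the text.
lsh = MinHashLSH(threshold=0.7, num_perm=128)

def deduplicate(docs):
    """Keep only documents whose MinHash has no near-duplicate already in the index."""
    kept = []
    for i, doc in enumerate(docs):
        m = minhash(doc)
        if not lsh.query(m):          # no indexed near-duplicate found
            lsh.insert(f"doc-{i}", m)
            kept.append(doc)
    return kept
```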
A typical data curation pipeline includes:
- Multi-source text extraction (web, news, books, Wikipedia).
- Language-ID at paragraph or page level (e.g., FastText or cld3).
- Script/Unicode block enforcement.
- Heuristic and perplexity-based filtering.
- Duplicate removal at document and line level.
- Tokenizer-driven normalization (Unicode NFKC, script mapping).

This systematic approach yields corpora robust both to intra-family code-switching and to diverse orthographies.
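A minimal sketch of the LID, normalization, and script-enforcement steps is given below. The fastText LID model file, the confidence threshold, and the Devanagari-ratio heuristic are illustrative assumptions rather than the exact filters used by the corpora above.

```python
# Paragraph-level filter: NFKC normalization, fastText language ID,
# and Devanagari script-ratio enforcement. Thresholds are illustrative.
import unicodedata
import fasttext  # assumes the off-the-shelf lid.176.bin language-ID model

lid_model = fasttext.load_model("lid.176.bin")

def normalize(text: str) -> str:
    return unicodedata.normalize("NFKC", text.replace("\n", " ").strip())

def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari block (U+0900–U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in chars) / len(chars)

def keep_paragraph(text: str, lang: str = "hi", min_ratio: float = 0.8) -> bool:
    text = normalize(text)
    (label,), (prob,) = lid_model.predict(text)
    return label == f"__label__{lang}" and prob > 0.9 and devanagari_ratio(text) >= min_ratio
```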
2. Tokenization Strategies for Indic Languages
Tokenizer efficacy is a critical factor in ILMs, given the orthographic and morphological richness of Indic scripts.
- IndicSuperTokenizer (IST) (Rana et al., 5 Nov 2025): Proposes a two-stage curriculum—Stage 1: standard whitespace-constrained BPE (≈90% vocab), preserving inflectional/morphological subwords; Stage 2: unconstrained multi-word “superword” merges, with sentence-boundary constraints to avoid inter-sentential merges. Regular-expression-based pretokenization (adopting LLaMA-4 regex) and Unicode NFKC normalization are shown to significantly reduce fragmentation (Stage 1 fertility drops from 4.29 [GPT-2 rules] to 2.03).
- Krutrim LLM (Kumar et al., 2024): Trains SentencePiece-BPE tokenizers with up to 100k vocabulary entries per language, achieves coverage >99.7%, and demonstrates token-to-word (fertility) ratios 14–29% lower than OpenAI's Tiktoken for most Indic languages (a minimal training sketch follows this list).
- Paramanu: mBharat tokenizer (Niyogi et al., 2024): Mixes BPE and Unigram subword units, script-run segmentation, and byte-level fallbacks for low-frequency scripts, maintaining fertility ∼1.1 tokens/char and supporting Roman-script code-mixing.
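As a concrete reference point for the SentencePiece-BPE setup in the Krutrim bullet above, a minimal training call might look as follows. The corpus path, vocabulary size, and character-coverage value are assumptions for illustration, not the published configuration.

```python
# Per-language SentencePiece-BPE training sketch (settings are illustrative).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="hi_corpus.txt",          # one sentence per line, NFKC-normalized Hindi text
    model_prefix="hi_bpe_100k",
    model_type="bpe",
    vocab_size=100_000,             # "up to 100k vocabulary per language"
    character_coverage=0.9995,      # near-complete script coverage
    normalization_rule_name="nfkc",
)

sp = spm.SentencePieceProcessor(model_file="hi_bpe_100k.model")
print(sp.encode("भारत एक विविध भाषाई देश है", out_type=str))
```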
Empirically, IST achieves an average fertility of F ≈ 1.83 (vs. 3.03 for LLaMA-4), translating to a 44% increase in token throughput (OTPS) on LLaMA-3.2 1B models. Ablation studies show that performance plateaus beyond 10 GB of tokenizer-training data and that vocabularies larger than 200k tokens yield diminishing returns.
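Fertility in these comparisons is the average number of subword tokens per whitespace-delimited word, so a simple measurement sketch (with any tokenizer callable, e.g. the SentencePiece processor from the previous sketch) is:

```python
# Fertility = subword tokens per whitespace word, averaged over a held-out corpus.
def fertility(sentences, tokenize):
    """`tokenize` is any callable mapping a string to a list of subword tokens."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / max(n_words, 1)

# Example usage with the SentencePiece model trained above:
# f = fertility(hindi_dev_sentences, lambda s: sp.encode(s, out_type=str))
# Lower is better: IST reports F ≈ 1.83 vs. 3.03 for the LLaMA-4 tokenizer.
```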
3. Model Architectures and Training Paradigms
Indic LMs span a range of encoder-only, decoder-only, and encoder-decoder architectures:
- Encoder-only models: BERT, mBERT, IndicBERT v2 (Doddapaneni et al., 2022), MuRIL, XLM-RoBERTa (Jain et al., 2020).
- Decoder-only models: BLOOM (Vaidya et al., 23 Jan 2025), Paramanu (13M–367.5M parameters, context size 1024, supports monolingual/multilingual/bilingual variants) (Niyogi et al., 2024).
- Instruction-tuned and multitask LLMs: Trained with ∼23k per-language instruction examples (human-authored, back-translation, self-instruct) (Niyogi et al., 2024).
- Multilingual typology-aware pretraining: Paramanu groups languages by script and typology to mitigate “curse of multilinguality” and improve transfer, e.g., grouping Hindi/Konkani/Maithili (Devanagari) vs. Odia, Bengali, Assamese.
Training typically uses standard AdamW optimization, batch sizes scaled to available GPU memory, and temperature-based upsampling of tail languages (Doddapaneni et al., 2022). Masked LM (MLM), translation LM (TLM), and cross-lingual contrastive (XLCO) losses are commonly employed (MuRIL, InfoXLM, XLM-R, IndicBERT v2).
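Temperature-based upsampling flattens the natural language distribution so that tail languages are sampled more often during pretraining. A minimal sketch, assuming per-language token counts and an illustrative temperature T = 3, is:

```python
# Temperature-based language sampling: p_i ∝ (n_i / Σ_j n_j)^(1/T).
# T = 1 keeps the natural distribution; larger T upsamples tail languages.
def sampling_probs(token_counts: dict, T: float = 3.0) -> dict:
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** (1.0 / T) for lang, n in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Example: Hindi dwarfs Assamese in raw tokens, but T = 3 narrows the sampling gap.
print(sampling_probs({"hi": 5_000_000_000, "bn": 1_000_000_000, "as": 50_000_000}, T=3.0))
```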
4. Evaluation Benchmarks and Linguistic Probing
Comprehensive evaluation relies on both multitask NLU/NLG benchmarks and fine-grained linguistic property probes.
- Natural language tasks: Classification (IndicSentiment, IndicXParaphrase), reasoning (IndicCOPA), extraction (NER: Naamapadam, slot fill), QA (IndicQA), retrieval (FLORES) (Doddapaneni et al., 2022). Metrics: accuracy, F1, BLEU, ROUGE, perplexity.
- IndicXTREME (Doddapaneni et al., 2022): 105 evaluation sets over 20 languages, with 52 newly built test sets.
- IndicMMLU-Pro (KJ et al., 27 Jan 2025): Extends MMLU-Pro to nine Indic languages across comprehension, reasoning, and generation (≈11.4k questions/language), employing IndicTrans2 MT for high-fidelity translations (e.g., Hindi chrF++ = 78.06, BLEU = 0.59; a scoring sketch follows the table below).
- IndicSentEval (Aravapalli et al., 2024): 47k sentences, 6 languages, 8 probing tasks (surface, syntactic, semantic), 13 input perturbations. Indic-specialized models (IndicBERT, MuRIL) outperform universal models on syntactic/semantic encoding for Indic languages, but universal models (mT5, mGPT, XGLM) show greater robustness to input perturbations (cf. robustness score R > 0.9).
Table: Sample performance on IndicMMLU-Pro (Accuracy, excerpt) (KJ et al., 27 Jan 2025):
| Language | GPT-4o | Llama-3.1 | IndicBART | IndicBERT |
|---|---|---|---|---|
| Hindi | 44.8 | 18.61 | 11.21 | 10.78 |
| Bengali | 44.38 | – | 12.52 | 10.39 |
| Tamil | 38.46 | – | 11.70 | 10.96 |
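Translation quality control of the kind reported for IndicMMLU-Pro (chrF++ and BLEU against human references) can be scripted with `sacrebleu`; the snippet below is a generic scoring sketch, not the benchmark's actual validation harness.

```python
# Corpus-level BLEU and chrF++ (chrF with word n-grams) via sacrebleu.
import sacrebleu

hypotheses = ["भारत एक विविध देश है"]           # machine-translated outputs
references = [["भारत एक विविधतापूर्ण देश है"]]   # one reference list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf_pp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 → chrF++
print(f"BLEU = {bleu.score:.2f}, chrF++ = {chrf_pp.score:.2f}")
```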
Comprehensive analyses reveal a pronounced head–tail effect: top-five high-resource languages (Hindi, Bengali, Marathi, Telugu, Tamil) yield high NLU/NLG scores, whereas medium- and low-resource languages consistently underperform unless upsampled or grouped typologically (Vaidya et al., 23 Jan 2025, Doddapaneni et al., 2022). Synthetic-data generation and translation quality control are critical in these scenarios.
5. Adaptation to Low-Resource Indic Languages
Several methods have been proposed to extend ILM capabilities to low web-resource Indic languages (LRLs):
- RelateLM (Khemchandani et al., 2021): Exploits phylogenetic/script relatedness for adaptation by (a) transliterating LRL text into the script of a Related Prominent Language (RPL, e.g. Hindi/Devanagari), (b) constructing pseudo-parallel corpora from bilingual lexicons via word-by-word pseudo-translation (see the sketch after the results below), and (c) combining MLM and alignment losses for joint training.
Key results (NER F1, Hi-BERT→LRL adaptation):
| Language | Baseline | RelateLM |
|----------|----------|----------|
| Punjabi  | 28.2     | 66.9     |
| Gujarati | 14.8     | 39.7     |
| Bengali  | 34.0     | 57.3     |
Transliteration alone yields substantial gains over an English pivot or direct adaptation; augmenting with the alignment loss and dictionary-based pseudo-translation boosts performance by a further 2–10 F1 points. Data-scaling experiments show that RelateLM trained on 20k LRL documents can outperform EBERT trained on 4× more data.
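The dictionary-based pseudo-translation step can be sketched as a word-by-word substitution over transliterated text. The lexicon, fallback behaviour, and data layout below are illustrative assumptions, not the exact RelateLM implementation.

```python
# Word-by-word pseudo-translation for pseudo-parallel corpus construction.
# `bilingual_lexicon` maps LRL words (already transliterated into Devanagari)
# to Related Prominent Language (RPL, e.g. Hindi) words; OOV words pass through unchanged.
def pseudo_translate(lrl_sentence: str, bilingual_lexicon: dict) -> str:
    return " ".join(bilingual_lexicon.get(w, w) for w in lrl_sentence.split())

def build_pseudo_parallel(lrl_sentences, bilingual_lexicon):
    """Pair each transliterated LRL sentence with its word-by-word pseudo-translation,
    yielding the pseudo-parallel data used by the alignment loss."""
    return [(s, pseudo_translate(s, bilingual_lexicon)) for s in lrl_sentences]
```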
6. Societal Bias and Fairness in ILMs
Gender and occupation biases are a significant concern in pre-trained ILMs, particularly for grammatically gendered languages such as Hindi.
- Efficient Gender Debiasing (Kirtane et al., 2022): Defines an occupation-gender bias (OGB) metric using sentence templates with masked person/profession tokens, and demonstrates that parameter-efficient fine-tuning of MuRIL (updating only the layer-norm parameters, optionally also the positional/word embeddings, while freezing the rest) reduces bias by up to 70–99% for masculine/feminine professions with minimal compute (a minimal freezing sketch follows the table below). The methodology is readily extensible to other Indic languages via language-specific template design and profession lists.
OGB example (mean score over neutral/feminine/masculine occupation nouns, baseline vs. LN-only fine-tuning):

| Category  | Baseline | LN-only (reduction %) |
|-----------|----------|-----------------------|
| Neutral   | –2.575   | –0.788 (–69%)         |
| Feminine  | –4.173   | –1.239 (–70%)         |
| Masculine | –1.382   | +0.007 (–99%)         |
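A minimal sketch of the parameter-efficient setup (updating only the LayerNorm parameters of MuRIL while freezing everything else) is shown below, assuming the Hugging Face `transformers` API; the checkpoint identifier and the substring match on parameter names are assumptions about the setup, not taken from the paper.

```python
# Unfreeze only LayerNorm parameters of MuRIL for lightweight debiasing fine-tuning.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "google/muril-base-cased"   # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)  # used to encode the bias templates
model = AutoModelForMaskedLM.from_pretrained(model_name)

for name, param in model.named_parameters():
    # Extend the condition with "embeddings" to also tune positional/word embeddings.
    param.requires_grad = "LayerNorm" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
# Fine-tune on the gendered sentence templates with a standard MLM loss.
```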
7. Open Challenges and Future Directions
ILMs continue to face several major research challenges:
- Data scarcity and imbalance: Beyond the top-tier languages, both data quality and data volume are lacking; community-driven corpus creation and better LID for minor dialects remain urgent priorities (Vaidya et al., 23 Jan 2025, Doddapaneni et al., 2022).
- Tokenizer design: Subword vocabularies should respect Indic morphological boundaries, mitigate fertility/fragmentation for agglutinative scripts, and support code-mixing, as established by IndicSuperTokenizer and mBharat (Rana et al., 5 Nov 2025, Niyogi et al., 2024).
- Cross-script and typological transfer: Techniques such as transliteration, script-unification (e.g. IndicBERT-SS, all-to-Devanagari) (Doddapaneni et al., 2022), and typology-driven multilingual pretraining (Paramanu) show promise for tail-language generalization.
- Benchmarking and evaluation: Large-scale, human-verified, culturally representative benchmarks (IndicMMLU-Pro, IndicXTREME, IndicSentEval) are central to measuring real progress (KJ et al., 27 Jan 2025, Doddapaneni et al., 2022, Aravapalli et al., 2024).
- Safety, fairness, and cultural nuance: Representative safety benchmarks, toxic content measurement, and mitigation of societal biases must be developed specifically for the Indic context (Vaidya et al., 23 Jan 2025, Kirtane et al., 2022).
- Resource-efficient language models: Paramanu demonstrates that, with optimized tokenization and script/typology-aware modeling, sub-400M-parameter language models can exceed much larger cross-lingual models on generative tasks, enabling democratized access (Niyogi et al., 2024).
Consensus recommendations in the literature call for concerted, open-source initiatives in corpus building, tokenizer design, and benchmark creation, as well as for cross-disciplinary collaboration to ensure that future ILMs reflect the full spectrum of India’s linguistic and cultural complexity.