IndicTrans2: Multilingual Neural MT for Indian Languages
- IndicTrans2 is a family of multilingual neural machine translation models designed for Indian languages using extensive parallel corpora and unified transliteration techniques.
- It employs large-scale Transformer models, tagged back-translation and curriculum learning to balance synthetic and natural data, and parameter-efficient low-rank adaptation for targeted fine-tuning, achieving state-of-the-art results on robust benchmarks.
- Open access and MIT-licensed, IndicTrans2 establishes strong baselines across high-resource and extremely low-resource settings, enhancing cross-lingual translation reliability.
IndicTrans2 is a family of large-scale multilingual neural machine translation (MNMT) models developed principally by AI4Bharat for translation among the 22 scheduled Indian languages and English. Engineered to address the lexical, orthographic, and data sparsity challenges posed by Indic languages, IndicTrans2 combines extensive corpus construction, rigorous filtering, direct and pivot-based translation, and architectural innovations. These models are open access, MIT-licensed, and serve as state-of-the-art baselines for both high-resource and extremely low-resource Indian languages, achieving robust performance exceeding or matching strong commercial and open models across diverse benchmarks (Gala et al., 2023).
1. Data Resources and Benchmark Construction
The enabling resource for IndicTrans2 is the Bharat Parallel Corpus Collection (BPCC), the largest publicly available dataset for Indian language MT, aggregating 230 million bitext pairs. BPCC integrates legacy resources (Samanantar, NLLB, ILCI), automated web-mined parallel pairs, and unique, high-quality manually translated subsets (BPCC-Human: 644K Wikipedia and 139K “daily-dialogue” sentence pairs). Extensive cleaning is performed, including LaBSE-based semantic similarity filtering (cosine ≥ 0.80) and margin scoring for comparable document mining. Back-translation from monolingual Indic and English sources augments corpus diversity.
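A minimal sketch of the kind of LaBSE-based similarity filtering described above, assuming the sentence-transformers LaBSE checkpoint; the 0.80 threshold follows the text, while the helper name and example pair are illustrative:

```python
# Sketch: filter bitext pairs by LaBSE cosine similarity (threshold 0.80 as in BPCC).
from sentence_transformers import SentenceTransformer
import numpy as np

labse = SentenceTransformer("sentence-transformers/LaBSE")

def filter_bitext(src_sents, tgt_sents, threshold=0.80):
    """Keep only pairs whose LaBSE cosine similarity meets the threshold."""
    src_emb = labse.encode(src_sents, normalize_embeddings=True)
    tgt_emb = labse.encode(tgt_sents, normalize_embeddings=True)
    sims = np.sum(src_emb * tgt_emb, axis=1)  # dot product of unit vectors = cosine
    return [(s, t, float(c))
            for s, t, c in zip(src_sents, tgt_sents, sims) if c >= threshold]

# Example with an illustrative English-Hindi pair:
kept = filter_bitext(["The weather is pleasant today."], ["आज मौसम सुहावना है।"])
```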
Benchmarking is grounded in the IN22 suite, the first n-way parallel benchmark covering all 22 scheduled languages, designed to emphasize source-original English sentences, India-centric content, and conversational domains. Test sets include the IN22-Gen (1,024 sentences) and IN22-Conv (1,503 multi-turn dialogue turns) subsets, alongside FLORES-200 for direct comparability with existing MT research. Evaluation relies on chrF++, BLEU, and human pairwise adequacy annotations (XSTS), with COMET-22 (DA) scores as a supporting metric for the languages it covers (Gala et al., 2023).
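For reference, a short sketch of corpus-level BLEU and chrF++ scoring with sacreBLEU, the standard toolkit for these metrics; the hypotheses and references below are placeholders:

```python
# Sketch: corpus-level BLEU and chrF++ scoring with sacreBLEU (placeholder data).
import sacrebleu

hyps = ["This is a sample system translation."]
refs = [["This is a sample reference translation."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrfpp = sacrebleu.corpus_chrf(hyps, refs, word_order=2)  # word_order=2 yields chrF++

print(f"BLEU = {bleu.score:.2f}, chrF++ = {chrfpp.score:.2f}")
```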
2. Model Architectures and Training Paradigms
IndicTrans2 comprises several model variants that vary by scale, parameterization, and translation directionality:
- Large Core Models: 18-layer encoder and 18-layer decoder Transformers (d_model=1024, d_ff=8192, 16 attention heads), distinct SentencePiece vocabularies for Indic (128K) and English (32K), ∼1.1B parameters per model (Gala et al., 2023).
- Distilled Variants: Direct and many-to-many (M2M) models are distilled to ∼211M parameters (deep and thin: 18+18, d_model≈512), achieving comparable accuracy with significant inference efficiency gains (Gala et al., 2023, Betala et al., 10 Nov 2025).
- Parameter-efficient Tuning: For targeted adaptation, the 200M distilled model is fine-tuned using Low-Rank Adaptation (LoRA), which adds low-rank trainable adapters to the attention projections, W′ = W₀ + BA with B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and rank r ≪ min(d, k), keeping the pretrained weights W₀ frozen. With LoRA dropout 0.1, total trainable parameters are ≈0.8M (0.4% of 200M), optimized with standard cross-entropy and weight decay (Betala et al., 10 Nov 2025); see the configuration sketch after this list.
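A minimal configuration sketch of this kind of LoRA adaptation, using the PEFT library on the publicly released distilled checkpoint; the rank, scaling factor, and target module names are assumptions for illustration, not the exact settings of the cited work:

```python
# Sketch: LoRA adapters on the attention projections of a distilled IndicTrans2
# checkpoint via PEFT. Rank, alpha, and module names are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=8,                                   # assumed rank
    lora_alpha=16,                         # assumed scaling factor
    lora_dropout=0.1,                      # as stated above
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],  # assumed projection names
    task_type="SEQ_2_SEQ_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # expected: well under 1% of total parameters
```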
Core training consists of two stages: (1) pretraining on the full BPCC (or union with NLLB and Samanantar), (2) fine-tuning on manually validated seed data. Tagged back-translation (“bt”, “ft”) and curriculum learning are used to balance synthetic and natural data. The M2M model is constructed by composing pretrained monolingual models, then fine-tuned jointly on synthetic n-way parallel data (Gala et al., 2023, Wei et al., 2024).
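A minimal illustration of source-side tagging of synthetic data, assuming a simple tag-token convention; the exact tag strings and tagging mechanics used by IndicTrans2 may differ:

```python
# Sketch: source-side tagging of synthetic pairs (the "__bt__" token is an assumption).
def tag_source(src, tgt, synthetic=False):
    """Prefix back-translated sources with a tag so the model can discount their noise."""
    return ("__bt__ " + src if synthetic else src), tgt

corpus = [
    tag_source("yah ek udaharan vakya hai", "This is an example sentence.", synthetic=True),
    tag_source("मौसम अच्छा है", "The weather is nice.", synthetic=False),
]
```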
3. Multilingual and Transfer Learning Approaches
IndicTrans2 supports all translation directions among its 22 Indic languages and English. For Indic-to-Indic translation, both direct many-to-many (M2M) and pivot-based (via English) routes are deployed, as sketched below. Language grouping strategies are selected empirically: Western Indo-Aryan (WI) languages benefit from related-language training; Eastern Indo-Aryan (EI) languages do not, owing to smaller corpora; for Dravidian languages, effects are mixed (Das et al., 2023).
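A conceptual sketch of pivoting through English; `translate_fn` stands in for any IndicTrans2 inference wrapper (see the inference sketch in Section 5), and the FLORES-style language codes are those used by IndicTrans2:

```python
# Sketch: pivoting Indic-to-Indic translation through English. `translate_fn`
# is a stand-in for an IndicTrans2 inference wrapper.
def pivot_translate(text, src_lang, tgt_lang, translate_fn, pivot="eng_Latn"):
    """Translate src_lang -> pivot -> tgt_lang instead of using the direct M2M path."""
    english = translate_fn(text, src_lang, pivot)
    return translate_fn(english, pivot, tgt_lang)

# Example wiring with a dummy translator (Punjabi -> Hindi via English):
dummy = lambda text, src, tgt: text
print(pivot_translate("ਉਦਾਹਰਨ ਵਾਕ", "pan_Guru", "hin_Deva", dummy))
```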
Transfer learning further extends IndicTrans2 to extremely low-resource languages such as Assamese, Manipuri, Khasi, and Mizo. Here, fine-tuning from a pretrained IndicTrans2 backbone is coupled with data augmentation, including back-translation, forward-translation, and semantic data diversification, yielding BLEU improvements of up to 47.9 for MN→EN and 31.8 for EN→MN, among others. R-Drop regularization and conservative learning-rate scheduling are critical for stability when adapting to minimal bitext (Wei et al., 2024).
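A sketch of the R-Drop consistency loss for seq2seq fine-tuning, assuming a PyTorch model in training mode (active dropout) with Hugging-Face-style outputs (`.logits`); the weighting coefficient `alpha` is an illustrative assumption, not the value used in the cited work:

```python
# Sketch: R-Drop loss = average cross-entropy over two dropout passes
# plus a symmetric KL consistency term between their output distributions.
import torch.nn.functional as F

def rdrop_loss(model, batch, labels, alpha=5.0):
    """Cross-entropy on two stochastic forward passes plus symmetric KL."""
    logits1 = model(**batch).logits           # first forward pass
    logits2 = model(**batch).logits           # second pass -> different dropout masks
    ce = 0.5 * (
        F.cross_entropy(logits1.transpose(1, 2), labels, ignore_index=-100)
        + F.cross_entropy(logits2.transpose(1, 2), labels, ignore_index=-100)
    )
    logp, logq = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp, logq, log_target=True, reduction="batchmean")
        + F.kl_div(logq, logp, log_target=True, reduction="batchmean")
    )
    return ce + alpha * kl
```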
4. Script Unification and Transliteration
Given extreme script diversity across Indic languages (Devanagari, Bengali–Assamese, Gurmukhi, Odia, the Dravidian scripts, and Roman), IndicTrans2 implements a unified ASCII transliteration based on modified ITRANS (for MNMT) or ISO 15919 (as recommended by empirical studies). All canonical forms are mapped to unique Latin-letter sequences, unifying subword distributions and vocabulary across scripts. Empirical evaluation demonstrates that transliteration significantly improves BLEU, especially in low-resource settings; typical gains are +1 to +7 BLEU, with exceptions in Malayalam and Tamil due to tokenization artifacts (Das et al., 2023, Moosa et al., 2022).
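The idea can be illustrated with the generic indic_transliteration package, which implements standard ITRANS rather than the modified ITRANS / ISO 15919 schemes of the cited studies:

```python
# Sketch: script unification via romanization with indic_transliteration.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

deva = "भारत"   # "Bharat" in Devanagari
telu = "భారత"   # the same word in Telugu script

# Both map to the same Latin-letter sequence, so subword vocabularies and
# embeddings can be shared across scripts.
print(transliterate(deva, sanscript.DEVANAGARI, sanscript.ITRANS))
print(transliterate(telu, sanscript.TELUGU, sanscript.ITRANS))
```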
Transliteration consistently boosts cross-lingual alignment, as quantitatively measured by Centered Kernel Alignment (CKA) of hidden representations and statistically verified via Mann–Whitney U tests. High-resource languages remain unaffected, while low-resource languages see robust accuracy and F1-score increases (Moosa et al., 2022).
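A minimal NumPy sketch of linear CKA between two representation matrices (rows are examples, columns are hidden features), the similarity measure referenced above:

```python
# Sketch: linear Centered Kernel Alignment (CKA) between two representation matrices.
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity in [0, 1] between column-centered X and Y."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Example with random features standing in for two languages' hidden states:
rng = np.random.default_rng(0)
print(linear_cka(rng.normal(size=(128, 64)), rng.normal(size=(128, 64))))
```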
5. Specialized Pipelines and Downstream Integration
For real-world integration, IndicTrans2 serves as the backbone of complex pipelines such as the vision-augmented judge-corrector system for multimodal MT (Betala et al., 10 Nov 2025). Translation quality is assessed by an LLM-based “judge,” which routes low-quality translations (73.3% of detected errors) to IndicTrans2 for retranslation; corrections requiring image context are handled by large multimodal models (LMMs). IndicTrans2 is accessed via standard seq2seq inference with explicit language codes (see the sketch below) and can be parameter-efficiently adapted on auto-corrected data using LoRA, yielding consistent BLEU gains (e.g., +1.30 on the EN→BN evaluation set and +0.70 on the EN→BN challenge set) (Betala et al., 10 Nov 2025).
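A sketch of this kind of seq2seq inference with a released IndicTrans2 checkpoint via Hugging Face transformers; the checkpoint name, IndicProcessor import path, and generation settings follow the public toolkit and may vary across versions:

```python
# Sketch: seq2seq inference with a released IndicTrans2 checkpoint (EN -> BN).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor  # import path varies by version

ckpt = "ai4bharat/indictrans2-en-indic-dist-200M"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt, trust_remote_code=True).eval()
ip = IndicProcessor(inference=True)

sents = ["The weather in Kolkata is pleasant today."]
batch = ip.preprocess_batch(sents, src_lang="eng_Latn", tgt_lang="ben_Beng")
inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, num_beams=5, max_length=256)

decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(ip.postprocess_batch(decoded, lang="ben_Beng"))
```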
6. Empirical Performance and Analysis
IndicTrans2 establishes new state-of-the-art results across a variety of benchmarks and tasks:
- On IN22-Gen and FLORES-200, IndicTrans2 exceeds open and commercial MT baselines by 4–8 BLEU/chrF++ in EN→Indic and 1–5 in Indic→EN (Gala et al., 2023).
- For Indic↔Indic, direct M2M and pivoting via English achieve up to 24.29 BLEU (PA→HI on the FLORES-200 devtest). WI languages (HI, PA, GU, MR) systematically outperform EI and Dravidian groups due to larger data (Das et al., 2023).
- Seed-data fine-tuning produces the most significant gains on low-resource languages, with average increases of +2.1 chrF++ (EN→Indic) and +1.0 (Indic→EN) (Gala et al., 2023).
- Distilled and LoRA-tuned variants demonstrate that 200M–211M parameter models preserve nearly all accuracy of billion-parameter models with substantial efficiency improvements (Betala et al., 10 Nov 2025).
- Transfer learning on extremely low-resource languages outpaces training from scratch and comparable generic approaches (Wei et al., 2024).
7. Accessibility, Licensing, and Future Directions
IndicTrans2 models, corpora, benchmarks, and training code are released under MIT (models/code), CC BY 4.0 (human-translated data), and CC0 (mined data) licenses, with no restrictions on commercial usage. Resources are available at https://github.com/AI4Bharat/IndicTrans2 (Gala et al., 2023). Future work focuses on improving zero-shot transfer to non-scheduled and under-represented languages, mitigating domain drift in synthetic data, scaling model architectures, and closing the gap to supervised settings for Sino-Tibetan and Austroasiatic families. Empirical evidence and CKA analyses suggest that transliteration, curriculum learning, and targeted fine-tuning will remain central to further gains (Moosa et al., 2022, Das et al., 2023, Gala et al., 2023).