Krutrim LLM: India's Multilingual AI
- Krutrim LLM is a multilingual foundation model trained on over 2 trillion tokens to address data scarcity in Indian languages.
- It employs advanced techniques such as ALiBi, Grouped Query Attention, and DPO-based alignment to enhance efficiency and ethical use.
- The model shows competitive performance on both Indic and English benchmarks, and real-time retrieval-augmented generation is used to ground answers in retrieved evidence.
Krutrim LLM is a multilingual foundational LLM developed for broad linguistic coverage in the Indian context, incorporating over 2 trillion tokens with the largest known Indic language dataset to date. It is designed for scalability across hundreds of dialects, addressing the data scarcity and linguistic bias characteristic of global LLMs trained predominantly on English. Krutrim employs architectural and alignment techniques specifically tailored to the diverse morphologies and sociolinguistic features of Indic scripts, while maintaining competitive performance on English and multilingual benchmarks. Its deployment emphasizes ethical data curation, cultural sensitivity, and factual reliability through real-time retrieval-augmented generation.
1. Architectural Specifications and Training Objective
Krutrim LLM is implemented as a decoder-only transformer with approximately 7 billion parameters over 32 layers, each with a hidden dimension size of 4,608. Attention is distributed over 48 heads, with 8 KV heads, and a context window of 4,096 tokens. Positional encoding utilizes ALiBi, facilitating extended context length with minimal parameter cost. The attention mechanism is based on Grouped Query Attention (GQA), which reduces the KV-cache footprint and enhances inference efficiency. Activation functions are ReLU, with QKV-matrix clipping applied to promote numerical stability during training.
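As an illustration of how ALiBi replaces learned positional embeddings, the sketch below computes per-head slopes and the resulting causal attention bias. It follows the geometric slope scheme commonly used in ALiBi reference code, including the usual fallback for head counts that are not a power of two (such as the 48 heads stated above). Whether Krutrim uses exactly this slope scheme is an assumption; the function names are hypothetical.

```python
import math

def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes using the geometric scheme common in ALiBi code."""
    def power_of_two_slopes(n: int) -> list[float]:
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if n_heads & (n_heads - 1) == 0:  # head count is a power of two
        return power_of_two_slopes(n_heads)
    # Otherwise use the nearest lower power of two, then borrow every other
    # slope from the next power of two to cover the remaining heads.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias for one head: -slope * distance, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(48)        # one slope per attention head
bias = alibi_bias(slopes[0], 4)  # tiny 4-token example for the first head
```

The bias is added to the attention logits before the softmax, so no positional parameters are learned and the penalty extends naturally to longer sequences.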
Key Model Parameters
| Attribute | Value |
|---|---|
| Layers | 32 |
| Hidden dimension | 4,608 |
| Attention heads | 48 |
| KV heads | 8 |
| Sequence length (tokens) | 4,096 |
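To make the KV-cache saving from GQA concrete, the following back-of-the-envelope calculation uses the values in the table. It assumes a head dimension of hidden size divided by attention heads and 2 bytes per cached value (fp16/bf16); both are assumptions, not figures stated in the source.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

HIDDEN, HEADS, KV_HEADS, LAYERS, SEQ = 4608, 48, 8, 32, 4096
head_dim = HIDDEN // HEADS  # assumed: hidden size split evenly across heads

mha = kv_cache_bytes(LAYERS, HEADS, head_dim, SEQ)     # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, head_dim, SEQ)  # 8 shared KV heads

print(f"head_dim={head_dim}")
print(f"MHA cache: {mha / 2**20:.0f} MiB, GQA cache: {gqa / 2**20:.0f} MiB")
print(f"reduction: {mha // gqa}x")
```

Under these assumptions, a full 4,096-token sequence needs 384 MiB of cache with 8 KV heads instead of 2,304 MiB with 48, a 6x reduction per sequence.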
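The training objective is next-token prediction via the cross-entropy loss:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
where $x_1, \dots, x_T$ is the token sequence and $p_\theta$ is the model's predictive distribution.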
For alignment, Direct Preference Optimization (DPO) replaces PPO-based RLHF approaches.
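For readers unfamiliar with DPO, the sketch below evaluates the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The value of `beta` and all argument names are illustrative assumptions; the source does not report Krutrim's DPO hyperparameters.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen answer's log-probability lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Unlike PPO-based RLHF, this objective needs no separate reward model or on-policy sampling; it is an ordinary supervised loss over preference pairs.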
2. Training Data Construction and Preprocessing
Krutrim’s training corpus consists of 2 trillion tokens, with hundreds of billions from major Indic languages (Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Marathi, Kannada, Sanskrit), including substantial code-mixed data; the remainder comprises English and other global languages. Data aggregation involved web scraping and the integration of open datasets such as RedPajama, Books, PubMed, Wiki, StackFast, and NDL.
Data curation follows a multi-step pipeline: removal of duplicates, filtering of low-quality and extremely short passages, and data cleaning inspired by Dolma and IndicLLMSuite. Tokenization is performed by a custom SentencePiece BPE model trained jointly on Indic and English text, designed to minimize token-to-word ratios in highly inflected Indic languages.
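The first two curation steps can be sketched as follows. This is a minimal illustration that assumes exact-match deduplication on whitespace-normalized text and a word-count threshold; the actual pipeline's heuristics, thresholds, and any fuzzy deduplication are not given in the source, and `curate` and `min_words` are hypothetical names.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies match.
    return re.sub(r"\s+", " ", text).strip().lower()

def curate(passages, min_words=5):
    """Drop very short passages and exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for passage in passages:
        norm = normalize(passage)
        if len(norm.split()) < min_words:
            continue  # too short to be useful training text
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an earlier passage
        seen.add(digest)
        kept.append(passage)
    return kept

sample = [
    "a b c d e f",
    "A  b c d e f",   # duplicate once normalized
    "too short",
    "नमस्ते दुनिया यह एक परीक्षण वाक्य है",
]
cleaned = curate(sample)
```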
Data sparsity is mitigated by up-sampling low-resource Indic languages, down-sampling English, and employing curriculum learning that mixes sequence lengths and languages, with the aim of robustness across structurally different languages.
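One common way to realize this kind of rebalancing is exponent-smoothed (temperature) sampling, where each language's sampling probability is proportional to its token count raised to a power below one. The source does not say which scheme Krutrim uses, so the method, the exponent `alpha`, and the token counts below are all illustrative assumptions.

```python
def sampling_weights(token_counts: dict[str, float], alpha: float = 0.3) -> dict[str, float]:
    """Sampling probabilities proportional to count ** alpha (alpha < 1 flattens)."""
    scaled = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}

# Hypothetical relative corpus sizes, not Krutrim's actual figures.
counts = {"en": 1000.0, "hi": 100.0, "sa": 1.0}
weights = sampling_weights(counts)

natural = {lang: c / sum(counts.values()) for lang, c in counts.items()}
for lang in counts:
    print(f"{lang}: natural={natural[lang]:.3f} sampled={weights[lang]:.3f}")
```

With `alpha` below one, the dominant language is sampled less often than its raw share and the smallest language more often, which matches the up-sampling and down-sampling described above.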
3. Training Process and Efficiency Optimizations
Training utilized a compute budget of approximately FLOPs on NVIDIA H100 GPUs. Sequence lengths are standardized at 4,096 tokens, with batch sizes adjusted for optimal hardware utilization. Optimization uses AdamW with weight decay, a linear warm-up over ~10,000 steps, and peak learning rate of ≈, followed by cosine decay (learning rates inferred from prevailing best practices in LLM pre-training).
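The warm-up and cosine-decay schedule described above can be sketched as a multiplier on the peak learning rate. Because the peak value is not recoverable from this text, the function returns only the scale factor; `total_steps` and `min_scale` are hypothetical values chosen for illustration.

```python
import math

def lr_scale(step, warmup_steps=10_000, total_steps=100_000, min_scale=0.1):
    """Fraction of the peak learning rate to use at a given step."""
    if step < warmup_steps:
        return step / warmup_steps  # linear warm-up from 0 to 1
    # Cosine decay from 1 down to min_scale over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_scale + (1.0 - min_scale) * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_scale(s) for s in (0, 5_000, 10_000, 55_000, 100_000)]
```

The actual learning rate at each step would be this factor multiplied by the chosen peak learning rate.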
Efficiency is achieved through:
- ALiBi for scalable context extension.
- GQA for memory-efficient attention.
- QKV clipping for stable numerical computation.
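QKV clipping is typically implemented by clamping the output of the fused query/key/value projection to a fixed range before attention is computed. The sketch below shows that operation on a plain list of values; the threshold of 8.0 is a hypothetical value, and how Krutrim applies the clip internally is an assumption.

```python
def clip_qkv(values, clip=8.0):
    """Clamp projected Q/K/V activations to [-clip, clip]."""
    return [max(-clip, min(clip, v)) for v in values]

projected = [-20.0, 0.5, 9.0]   # toy activations from the QKV projection
clipped = clip_qkv(projected)   # outliers are pulled back into range
```

Bounding these activations limits the magnitude of attention logits, which is one way to reduce loss spikes during large-scale training.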
Continual Pre-training (CPT) incorporates a 25:75 ratio of original to new domain/language data, resuming the pre-training learning rate schedule. Empirically, CPT enhances downstream SFT performance, as charted in Figure 1 of the source paper (Kallappa et al., 10 Feb 2025).
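A 25:75 replay mix can be sketched as a batch sampler that draws a quarter of each batch from the original corpus and the rest from the new data. Applying the ratio per batch is an assumption (the source states only the overall ratio), and `mixed_batch` is a hypothetical helper; both pools must be at least as large as the number of items drawn from them.

```python
import random

def mixed_batch(original, new, batch_size, original_fraction=0.25, seed=0):
    """Build one batch with a fixed share of original-corpus examples."""
    rng = random.Random(seed)
    n_original = round(batch_size * original_fraction)
    batch = rng.sample(original, n_original) + rng.sample(new, batch_size - n_original)
    rng.shuffle(batch)  # avoid a fixed ordering of old and new examples
    return batch

original_pool = [("original", i) for i in range(10)]
new_pool = [("new", i) for i in range(10)]
batch = mixed_batch(original_pool, new_pool, batch_size=8)
```

Replaying some original data in this way is a common guard against forgetting earlier capabilities while the model adapts to the new domain or language.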
4. Evaluation: Indic and English Benchmarks
Krutrim is evaluated on AI4Bharat’s IndicXtreme suite and 17 English tasks, using BERTScore for generative tasks and accuracy for classification tasks as the primary metrics.
Indic Benchmarks Example: IndicCOPA (BERTScore, three-shot)
| Model | bn | gu | hi | kn | ml | mr | ta | te |
|---|---|---|---|---|---|---|---|---|
| Krutrim LLM | 0.89 | 0.83 | 0.86 | 0.88 | 0.88 | 0.87 | 0.89 | 0.89 |
| GPT-3.5 | 0.77 | 0.73 | 0.77 | 0.74 | 0.75 | 0.70 | 0.72 | 0.75 |
English Benchmarks (accuracy on selected tasks, plus the average over the full English suite, compared with LLaMA-2 Chat SFT):
| Task | LLaMA-2 Chat SFT | Krutrim LLM |
|---|---|---|
| ARC | 0.517 | 0.587 |
| BoolQ | 0.803 | 0.854 |
| COPA | 0.780 | 0.860 |
| Winogrande | 0.681 | 0.702 |
| Average | 0.552 | 0.569 |
Krutrim matches or exceeds LLaMA-2 across 10 of 16 tasks with an average score of 0.569 (vs. 0.552). Qualitative human evaluation using Mean Opinion Score (MOS) after CPT demonstrates subjective improvements in output quality.
5. Retrieval-Augmented Generation and Factual Grounding
Krutrim’s conversational interface integrates a real-time search client (WebRAG) for retrieval-augmented generation. After supervised fine-tuning, the model is explicitly instructed to:
- Restrict answers to the contents of retrieved documents.
- Report ambiguity or inconsistency, avoiding hallucination.
- Refuse to answer if no supporting evidence is found.
Retrieval-Augmented Pseudocode
```
docs ← WebSearch(query)
prompt ← FormatContext(query, docs)
response ← Krutrim.generate(prompt)
if not ResponseSupportedBy(response, docs):
    response ← "I'm sorry, I could not find evidence to answer that."
```
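A runnable version of this flow might look like the sketch below. The search client and the model are passed in as plain callables because the WebRAG and Krutrim APIs are not described in the source; `format_context`, `supported_by`, and the lexical-overlap threshold are hypothetical stand-ins, and a production system would use a far stronger support check than word overlap.

```python
from typing import Callable

REFUSAL = "I'm sorry, I could not find evidence to answer that."

def _words(text: str) -> set[str]:
    return {w.strip(".,!?").lower() for w in text.split()} - {""}

def format_context(query: str, docs: list[str]) -> str:
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return f"Answer using only these documents.\n{context}\n\nQuestion: {query}"

def supported_by(response: str, docs: list[str], threshold: float = 0.5) -> bool:
    """Crude check: share of response words that also occur in the documents."""
    response_words = _words(response)
    if not response_words or not docs:
        return False
    doc_words = set().union(*(_words(doc) for doc in docs))
    return len(response_words & doc_words) / len(response_words) >= threshold

def answer(query: str,
           search: Callable[[str], list[str]],
           generate: Callable[[str], str]) -> str:
    docs = search(query)
    response = generate(format_context(query, docs))
    # Fall back to a refusal when the answer is not backed by retrieved text.
    return response if supported_by(response, docs) else REFUSAL

# Stub components for demonstration only.
demo_search = lambda q: ["Delhi is the capital of India."]
demo_generate = lambda prompt: "The capital of India is Delhi."
reply = answer("What is the capital of India?", demo_search, demo_generate)
```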
6. Ethical, Linguistic, and Cultural Considerations
Krutrim addresses Indic underrepresentation by tailoring its tokenizer to morphologically rich languages and scaling up low-resource dialects and code-mixed corpora. Data balancing is enforced through up-sampling and down-sampling protocols. Training data spans socio-economic and cultural diversity, with DPO-based alignment emphasizing safety and sensitivity to contextually delicate topics.
Limitations include diminished performance on extremely low-resource languages (e.g., Sanskrit) and incompleteness in oral-tradition capture. Future directions focus on expanding to 22+ Indic languages through CPT, deeper oral tradition modeling, broader code-mixed context understanding, and domain adaptation for fields such as law and medicine.
7. Conclusions and Applications
Krutrim LLM stands as India’s premier large-scale multilingual foundation model, capable of serving a linguistically and culturally diverse population exceeding a billion individuals. Architectural innovations (ALiBi, GQA), tokenizer design, and retrieval-augmented alignment establish a template for globally inclusive LLM development. Use cases span conversational assistance, education, translation, recommendation systems, and specialized domain advising, accessible via a publicly available chat interface (https://chat.olakrutrim.com) (Kallappa et al., 10 Feb 2025).