Krutrim LLM: India's Multilingual AI

Updated 16 May 2026

Krutrim LLM is a multilingual foundation model trained on over 2 trillion tokens to address data scarcity in Indian languages.
It employs ALiBi, Grouped Query Attention, and DPO-based alignment to improve efficiency and alignment.
The model shows competitive performance on both Indic and English benchmarks, and uses real-time retrieval-augmented generation to improve factual accuracy.

Krutrim LLM is a multilingual foundational LLM developed for broad linguistic coverage in the Indian context. It is trained on over 2 trillion tokens, including the largest known Indic language dataset to date. It is designed to scale across hundreds of dialects, addressing the data scarcity and linguistic bias characteristic of global LLMs trained predominantly on English. Krutrim employs architectural and alignment techniques tailored to the diverse morphologies and sociolinguistic features of Indic languages, while maintaining competitive performance on English and multilingual benchmarks. Its deployment emphasizes ethical data curation, cultural sensitivity, and factual reliability through real-time retrieval-augmented generation.

1. Architectural Specifications and Training Objective

Krutrim LLM is implemented as a decoder-only transformer with approximately 7 billion parameters across 32 layers and a hidden dimension of 4,608. Attention uses 48 query heads sharing 8 KV heads, with a context window of 4,096 tokens. Positional information is supplied by ALiBi, which adds a fixed, distance-proportional bias to attention scores instead of learned position embeddings, so it adds no parameters and supports extrapolation to longer contexts. The attention mechanism is Grouped Query Attention (GQA), in which several query heads share each key/value head, reducing the KV-cache footprint and improving inference efficiency. Activation functions are ReLU, and QKV-matrix clipping is applied to promote numerical stability during training.

Key Model Parameters

Attribute	Value
Layers	32
Hidden dimension	4,608
Attention heads	48
KV heads	8
Sequence length (tokens)	4,096
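The figures in the table are enough to work out the per-head dimension and to estimate how much GQA shrinks the KV cache. The following is a minimal sketch of that arithmetic. The 2-byte (fp16/bf16) cache precision is an assumption made for illustration and is not stated in the source.

```python
# Values from the parameter table above.
LAYERS = 32
HIDDEN = 4608
Q_HEADS = 48
KV_HEADS = 8
SEQ_LEN = 4096
BYTES_PER_VALUE = 2  # assumption: fp16/bf16 cache, not stated in the source

head_dim = HIDDEN // Q_HEADS          # dimension of each attention head
queries_per_kv = Q_HEADS // KV_HEADS  # query heads sharing one KV head


def kv_cache_bytes(num_kv_heads: int, seq_len: int = SEQ_LEN) -> int:
    """Bytes needed to cache keys and values for one full-length sequence."""
    # Factor 2 covers the separate key and value tensors in every layer.
    per_token = 2 * LAYERS * num_kv_heads * head_dim * BYTES_PER_VALUE
    return per_token * seq_len


gqa = kv_cache_bytes(KV_HEADS)   # cache with 8 shared KV heads
mha = kv_cache_bytes(Q_HEADS)    # hypothetical cache if every head had its own KV
print(f"head_dim={head_dim}, query heads per KV head={queries_per_kv}")
print(f"GQA cache: {gqa / 2**20:.0f} MiB, full multi-head cache: {mha / 2**20:.0f} MiB")
```

Under these assumptions the cache scales linearly with the number of KV heads, so sharing 8 KV heads among 48 query heads cuts the per-sequence cache by a factor of six relative to standard multi-head attention.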

The training objective is next-token prediction via the cross-entropy loss:

$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \log P(w_{t}\mid w_{<t}; \theta)$
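Here $w_t$ is the token at position $t$, $T$ is the sequence length, and $\theta$ denotes the model parameters. As a small worked illustration of the formula, the sketch below sums the negative log-probabilities for a three-token sequence. The probability values are made up for the example.

```python
import math


def cross_entropy(token_probs):
    """Sum of negative log-probabilities the model assigned to the observed tokens."""
    return -sum(math.log(p) for p in token_probs)


# Made-up values of P(w_t | w_<t) for a three-token sequence.
probs = [0.5, 0.25, 0.125]
loss = cross_entropy(probs)
print(f"cross-entropy: {loss:.4f} nats")
```

Because the log-probabilities add, the loss here equals log(2 · 4 · 8) = log 64; a model that assigns higher probability to each observed token yields a lower loss.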

For alignment, Direct Preference Optimization (DPO) replaces PPO-based RLHF approaches.
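For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair, using summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. It shows the general technique only; the value of `beta` and the function name are illustrative assumptions, not details reported for Krutrim.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair; beta is an assumed value."""
    # Implicit rewards: scaled log-ratio of policy to reference for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference signal yet, loss is log 2.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Unlike PPO-based RLHF, this objective needs no separate reward model or sampling loop during training; it is an ordinary supervised loss over preference pairs.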

2. Training Data Construction and Preprocessing

Krutrim’s training corpus consists of 2 trillion tokens, with hundreds of billions from major Indic languages (Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Marathi, Kannada, Sanskrit), including substantial code-mixed data; the remainder comprises English and other global languages. Data aggregation involved web scraping and the integration of open datasets such as RedPajama, Books, PubMed, Wiki, StackFast, and NDL.

Data curation follows a multi-step pipeline: removal of duplicates, filtering of low-quality and extremely short passages, and data cleaning inspired by Dolma and IndicLLMSuite. Tokenization is performed by a custom SentencePiece BPE model trained jointly on Indic and English text, with the specific aim of minimizing token-to-word ratios in highly inflected Indic languages.
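The first two curation steps can be sketched as follows. This is a simplified illustration using exact-match deduplication on normalized text and a word-count threshold. The threshold value, the normalization rules, and the function names are assumptions; the actual pipeline (and whether it uses fuzzy deduplication) is not described in this level of detail.

```python
import hashlib
import unicodedata

MIN_WORDS = 5  # hypothetical threshold for "extremely short" passages


def normalize(text):
    """Unicode-normalize, collapse whitespace, and lowercase for comparison."""
    return " ".join(unicodedata.normalize("NFC", text).split()).lower()


def curate(passages, min_words=MIN_WORDS):
    """Drop very short passages and exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for passage in passages:
        norm = normalize(passage)
        if len(norm.split()) < min_words:
            continue  # too short to be useful training text
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a passage already kept
        seen.add(digest)
        kept.append(passage)
    return kept


sample = [
    "यह एक छोटा सा परीक्षण वाक्य है",
    "यह  एक छोटा सा परीक्षण वाक्य है",  # same text with extra whitespace
    "too short",
]
print(curate(sample))
```

Hashing normalized text keeps memory use proportional to the number of unique passages rather than their total length, which matters at web scale.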

Data sparsity is mitigated by up-sampling low-resource Indic languages, down-sampling English, and employing curriculum learning that mixes sequence lengths and languages, with the aim of making the model robust across languages and sequence structures.
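One common way to realize this kind of rebalancing is exponent-scaled (temperature) sampling, where each language is drawn with probability proportional to its corpus size raised to a power below one. The sketch below illustrates that idea only; the source does not state which formula Krutrim uses, and the counts and the `alpha` value are made up.

```python
def sampling_weights(token_counts, alpha=0.5):
    """Sampling probability per language, proportional to count ** alpha.

    alpha = 1 reproduces the natural distribution; alpha < 1 up-samples
    small languages and down-samples large ones. alpha is an assumed value.
    """
    scaled = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Made-up token counts in arbitrary units (English, Hindi, Sanskrit).
counts = {"en": 900, "hi": 90, "sa": 10}
weights = sampling_weights(counts)
for lang, weight in weights.items():
    natural = counts[lang] / sum(counts.values())
    print(f"{lang}: natural share {natural:.3f} -> sampling weight {weight:.3f}")
```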

3. Training Process and Efficiency Optimizations

Training used a compute budget of approximately $10^{23}$ FLOPs on NVIDIA H100 GPUs. Sequence lengths are standardized at 4,096 tokens, with batch sizes adjusted for hardware utilization. Optimization uses AdamW with weight decay and a linear warm-up over roughly 10,000 steps to a peak learning rate of about $1 \times 10^{-4}$, followed by cosine decay (these learning-rate values are inferred from prevailing practice in LLM pre-training rather than reported directly).
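The schedule described above can be written as a single function of the step count. This is a minimal sketch using the warm-up length and peak rate quoted in the text; the total number of steps and the final learning rate are hypothetical, since neither is given.

```python
import math

PEAK_LR = 1e-4          # approximate peak rate from the text
WARMUP_STEPS = 10_000   # approximate warm-up length from the text
TOTAL_STEPS = 500_000   # hypothetical: total training steps are not stated
MIN_LR = 0.0            # hypothetical: final learning rate is not stated


def learning_rate(step):
    """Linear warm-up to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 255_000, 500_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```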

Efficiency is achieved through:

ALiBi for scalable context extension.
GQA for memory-efficient attention.
QKV clipping for stable numerical computation.
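To make the first item concrete, the sketch below builds ALiBi's per-head slopes and the resulting causal bias matrix, following the geometric slope sequence from the original ALiBi formulation. It handles power-of-two head counts only; a 48-head model such as Krutrim would need the non-power-of-two extension, which is omitted here, so the 8-head example is purely illustrative.

```python
def alibi_slopes(num_heads):
    """Geometric ALiBi slopes; this sketch supports power-of-two head counts only."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch handles power-of-two head counts only")
    return [2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias matrix: entry [i][j] is -slope * (i - j) for j <= i, else -inf."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)        # illustrative head count, not Krutrim's 48
bias = alibi_bias(slopes[0], 4)
print(slopes)
for row in bias:
    print(row)
```

The bias is added to the query-key scores before the softmax. Because it depends only on the distance between positions and has no learned parameters, the same rule applies unchanged to sequences longer than those seen in training.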

Continual Pre-training (CPT) incorporates a 25:75 ratio of original to new domain/language data, resuming the pre-training learning rate schedule. Empirically, CPT enhances downstream SFT performance, as charted in Figure 1 of the source paper (Kallappa et al., 10 Feb 2025).
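The 25:75 replay ratio can be illustrated with a small sampler that draws a fixed share of examples from the original corpus and the rest from the new data. This is a sketch under assumptions: it mixes at the document level with replacement, whereas the source does not say whether the ratio is measured in documents or tokens, and the function name is hypothetical.

```python
import random


def build_cpt_mix(original_docs, new_docs, total, original_share=0.25, seed=0):
    """Sample `total` documents with the given share drawn from the original corpus."""
    rng = random.Random(seed)
    n_original = round(total * original_share)
    n_new = total - n_original
    # Sampling with replacement keeps the ratio exact even for small pools.
    mix = rng.choices(original_docs, k=n_original) + rng.choices(new_docs, k=n_new)
    rng.shuffle(mix)  # interleave old and new data within the stream
    return mix


mix = build_cpt_mix(["orig-a", "orig-b"], ["new-a", "new-b", "new-c"], total=100)
n_orig = sum(doc.startswith("orig") for doc in mix)
print(f"{n_orig} original / {len(mix) - n_orig} new")
```

Replaying a portion of the original data alongside the new data is a common way to limit forgetting of previously learned languages and domains during continual pre-training.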

4. Evaluation: Indic and English Benchmarks

Krutrim is evaluated on AI4Bharat’s IndicXtreme suite and 16 English tasks, using BERTScore for generative tasks and accuracy for classification tasks as the primary metrics.

Indic Benchmarks Example: IndicCOPA (BERTScore, three-shot)

Model	bn	gu	hi	kn	ml	mr	ta	te
Krutrim LLM	0.89	0.83	0.86	0.88	0.88	0.87	0.89	0.89
GPT-3.5	0.77	0.73	0.77	0.74	0.75	0.70	0.72	0.75

English Benchmarks (accuracy on selected tasks and the overall average, compared with LLaMA-2 Chat SFT):

Task	LLaMA-2 Chat SFT	Krutrim LLM
ARC	0.517	0.587
BoolQ	0.803	0.854
COPA	0.780	0.860
Winogrande	0.681	0.702
Average	0.552	0.569

Krutrim matches or exceeds LLaMA-2 across 10 of 16 tasks with an average score of 0.569 (vs. 0.552). Qualitative human evaluation using Mean Opinion Score (MOS) after CPT demonstrates subjective improvements in output quality.

5. Retrieval-Augmented Generation and Factual Grounding

Krutrim’s conversational interface integrates a real-time search client (WebRAG) for retrieval-augmented generation. After supervised fine-tuning, the model is explicitly instructed to:

Restrict answers to the contents of retrieved documents.
Report ambiguity or inconsistency, avoiding hallucination.
Refuse to answer if no supporting evidence is found.

Retrieval-Augmented Pseudocode

docs ← WebSearch(query)
prompt ← FormatContext(query, docs)
response ← Krutrim.generate(prompt)
if not ResponseSupportedBy(docs):
    response ← “I’m sorry, I could not find evidence to answer that.”
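The pseudocode above can be turned into a small runnable sketch. The search client, the generation call, and the support check are passed in as plain functions because their real interfaces are not described in the source; every name below is hypothetical, and the stubs in the usage example stand in for the actual components.

```python
from typing import Callable, List

REFUSAL = "I'm sorry, I could not find evidence to answer that."


def format_context(query: str, docs: List[str]) -> str:
    """Build a prompt that restricts the answer to the retrieved documents."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Answer using only the documents below. "
        "If they do not contain the answer, say so.\n\n"
        f"{numbered}\n\nQuestion: {query}\nAnswer:"
    )


def answer_with_retrieval(
    query: str,
    web_search: Callable[[str], List[str]],         # hypothetical search client
    generate: Callable[[str], str],                 # hypothetical model call
    is_supported: Callable[[str, List[str]], bool], # hypothetical grounding check
) -> str:
    docs = web_search(query)
    if not docs:
        return REFUSAL  # nothing retrieved, so nothing to ground an answer on
    response = generate(format_context(query, docs))
    return response if is_supported(response, docs) else REFUSAL


# Usage with stand-in stubs for the three components.
stub_search = lambda q: ["Krutrim uses ALiBi positional biases."]
stub_generate = lambda prompt: "ALiBi"
stub_check = lambda response, docs: any(response in doc for doc in docs)
print(answer_with_retrieval("Which positional scheme?", stub_search, stub_generate, stub_check))
```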

Deployment metrics indicate that factual accuracy increased from 68.7% at launch to 79.1% following targeted SFT, a gain of 10.4 percentage points (Kallappa et al., 10 Feb 2025).

6. Ethical, Linguistic, and Cultural Considerations

Krutrim addresses Indic underrepresentation by tailoring its tokenizer to morphologically rich languages and by increasing the share of low-resource dialects and code-mixed corpora in training. Data balancing is enforced through up-sampling and down-sampling protocols. Training data spans socio-economic and cultural diversity, with DPO-based alignment emphasizing safety and sensitivity to contextually delicate topics.

Limitations include diminished performance on extremely low-resource languages (e.g., Sanskrit) and incompleteness in oral-tradition capture. Future directions focus on expanding to 22+ Indic languages through CPT, deeper oral tradition modeling, broader code-mixed context understanding, and domain adaptation for fields such as law and medicine.

7. Conclusions and Applications

Krutrim LLM is presented as a large-scale multilingual foundation model for India, intended to serve a linguistically and culturally diverse population of more than a billion people. Its architectural choices (ALiBi, GQA), tokenizer design, and retrieval-augmented alignment offer a template for inclusive LLM development in other multilingual settings. Use cases span conversational assistance, education, translation, recommendation systems, and specialized domain advising, accessible via a publicly available chat interface (https://chat.olakrutrim.com) (Kallappa et al., 10 Feb 2025).

References

Kallappa et al. (10 Feb 2025). Krutrim LLM: Multilingual Foundational Model for over a Billion People.