Qalb Model: Advanced Urdu NLP
- Qalb Model is a state-of-the-art Urdu large language model addressing the underrepresentation of Urdu in NLP with focused pre-training and fine-tuning.
- It employs a two-stage training pipeline—continued pre-training on a mixed Urdu-English corpus followed by supervised fine-tuning using LoRA for parameter efficiency.
- The model outperforms multilingual counterparts in tasks like generation, translation, and sentiment analysis by robustly handling Urdu’s complex morphology and Nastaliq script.
Qalb Model
Qalb is a state-of-the-art Urdu LLM explicitly designed to address the chronic underrepresentation of Urdu in contemporary NLP systems, despite its use by over 230 million speakers. Existing multilingual models, such as LLaMA-3.1 8B-Instruct, demonstrate poor performance on Urdu-specific tasks due to challenges in handling Urdu's complex inflectional morphology, right-to-left Nastaliq script, and its rich literary and domain-specific registers. Qalb combines systematic, large-scale Urdu-focused continued pre-training with targeted instruction fine-tuning, achieving a new state-of-the-art across core Urdu NLP benchmarks (Hassan et al., 13 Jan 2026).
1. Model Architecture and Training Pipeline
Qalb follows a two-stage adaptation pipeline based on LLaMA-3.1 8B, an open-source LLM:
- Stage 1: Continued Pre-training is conducted on a mixed Urdu–English corpus. This stage endows the model with deep knowledge of Urdu morphology, script, and various registers, while retaining foundational English capabilities via inclusion of English data as a replay buffer.
- Stage 2: Supervised Fine-Tuning transforms the continued pre-trained model into an Urdu instruction-following assistant using the Alif Urdu-instruct dataset.
Parameter-efficient fine-tuning is achieved via Low-Rank Adaptation (LoRA), allowing adaptation of ∼1.18B parameters (14.7% of the base) using a single NVIDIA A100 80 GB GPU.
Model Training Details
| Stage | Corpus / Dataset | Method | Main Hyperparameters |
|---|---|---|---|
| Continued Pre-training | 1.97B tokens (Urdu+English) | LoRA (rank 128) | LR=2×10⁻⁵ (emb: 2×10⁻⁶), bfloat16, batch=128, 7,500 steps |
| Supervised FT | Alif Urdu-instruct | LoRA (rank 128) | LR=5×10⁻⁵, 2 epochs, bfloat16, batch=64 |
The above configuration leverages the general reasoning/generation capabilities and parameter-efficient adaptation of the underlying LLaMA backbone (Hassan et al., 13 Jan 2026).
2. Pre-training Corpus Curation and Statistics
The pre-training corpus is constructed to maximize Urdu language coverage across formality, genre, and domain:
- Urdu Text (1.84B tokens):
- News archives: BBC Urdu, Jang, Dunya News, UrduPoint (~61M words)
- Literary corpora: Rekhta, Makhzan, Islamic books
- Specialized domains: sports, entertainment, health
- Colloquial: government documents, social media
- English Text (140M tokens): Wikipedia, used to prevent catastrophic forgetting of English during adaptation.
The processed corpus resulted in 5.04M documents (~9.09GB) after multi-stage cleaning (removal of boilerplate, short texts, duplicates, junk). Urdu word-purity is 95.31%, indicating minimal cross-language contamination.
3. Parameter-Efficient Training Strategy
LoRA-adapted pre-training and fine-tuning inject rank-128 adapters in all linear and embedding layers:
- Continued Pre-training: AdamW (8-bit), cosine decay (5% warmup), sequence length 2,048, bfloat16, effective batch size 128, 7,500 steps.
- Loss descended from 1.07 to 0.77; perplexity 2.35 to 2.20, demonstrating learning on Urdu data.
- Supervised Fine-Tuning: AdamW-8bit (0.01 weight decay), linear schedule (10-step warmup), batch size 64, epochs=2, bfloat16. LLaMA-3 chat format prompt with loss masking on user turns is used for instruction adaptation.
These choices reflect a balance between adaptation scale and hardware efficiency, suitable for accessible single-GPU setups.
4. Urdu NLP Benchmark Evaluation
Qalb is tested across seven Urdu-centric tasks using the Alif evaluation methodology, where GPT-4o automatically scores outputs against references in relevance, correctness, clarity, and formatting (0–10 scale). Human validation on a subset confirmed >85% judgment agreement.
| Task | Qalb Score (out of 100) |
|---|---|
| Generation | 85.97 |
| Translation | 94.41 |
| Ethics | 90.83 |
| Reasoning | 88.59 |
| Classification | 96.38 |
| Sentiment | 95.79 |
| QA | 80.40 |
Weighted average: 90.34—exceeding Alif-1.0-Instruct (87.1) by 3.24 and LLaMA-3.1 8B-Instruct (45.7) by 44.64 points. The overall score is calculated as: (where are evaluation weights with ).
5. Morphological and Script Robustness
Qalb demonstrates substantially improved capabilities for Urdu-specific phenomena:
- Morphological Coverage: Effective acquisition of case endings, compound-verb structures, and inflectional patterns prevalent in Urdu.
- Script Normalization: Robust handling of right-to-left Nastaliq script and avoidance of spurious Latin signatures found in outputs of previous baseline models.
- Versatility: Superior handling of colloquial expressions, formal documents, and literary text, surpassing generic multilingual LLMs in fluency and fidelity.
Qualitative analysis indicates that, though Alif-1.0-Instruct slightly outperforms Qalb on raw Generation metrics, Qalb generates more concise, directly relevant, and instruction-adherent text. Manual side-by-side comparisons reveal reduced repetition, clearer logical structure, and improved alignment with user prompts.
6. Design Choices and Comparative Analysis
Qalb's outperforming prior models is attributed to:
- Systematic Corpus Engineering: Balanced domain and register inclusion and high cross-lingual word-purity.
- Replay Buffer to Prevent Catastrophic Forgetting: The inclusion of English Wikipedia tokens ensures the model retains English capacity, not regressing in non-Urdu capabilities.
- Parameter-efficient Fine-tuning: The LoRA approach allows large-scale adaptation on sub-100GB hardware.
Unlike previous methods, Qalb successfully addresses the full range of Urdu language generation and understanding challenges, providing a principled, scalable blueprint for adapting foundation models to other low-resource languages.
7. Conclusion and Implications
Qalb establishes a new standard in Urdu NLP by combining large-scale, Urdu-focused continued pre-training with instruction fine-tuning in a parameter-efficient manner. Its strong benchmark performance (weighted average 90.34), robust handling of morphology and script, and qualitative alignment with user intent demonstrate that systematic adaptation of foundation models is both practical and effective for low-resource languages. This suggests broader applicability to other linguistically complex and underrepresented languages through strategic corpus curation and efficient adaptation methods (Hassan et al., 13 Jan 2026).