UrduBERT: Transformer Model for Urdu NLP
- UrduBERT is a transformer-based pretrained language model designed for Urdu and Roman-Urdu texts, offering enhanced contextual embeddings and robust handling of code-mixed inputs.
- It utilizes 12 encoder layers, a 768-dimensional hidden space, and a vocabulary of up to 60,000 tokens to effectively process native and mixed-script data.
- Empirical evaluations show UrduBERT’s superior performance in ASR-related NLU and hope-speech detection, outperforming multilingual baselines under various noise conditions.
UrduBERT is a transformer-based pretrained language representation model tailored for the Urdu language and Roman-Urdu code-mixed text, built on the standard BERT-Base architecture. It is designed to address the performance limitations of multilingual pretrained models on low-resource Urdu NLP tasks, offering enhanced contextual embeddings for both native-script and mixed-script utterances. Its applications span robust natural language understanding (NLU), especially in Automatic Speech Recognition (ASR)-related domains, and classification tasks such as hope-speech detection in social media. Empirical studies demonstrate its superior accuracy and robustness to linguistic variability and noise, establishing UrduBERT as a primary resource in Urdu NLP research (Khan et al., 2024, Abdullah et al., 27 Dec 2025).
1. Architecture and Pretraining Configuration
UrduBERT employs the standard BERT-Base transformer architecture (see the configuration sketch after the list), comprising:
- Encoder layers: 12 transformer blocks
- Embedding dimensionality: 768 (hidden size)
- Self-attention heads: 12 per layer
- Feed-forward inner-layer size: 3,072
- Vocabulary: ~60,000 WordPiece tokens for native Urdu, Roman-Urdu, and frequent English subwords (Khan et al., 2024); alternatives include ~30,000 tokens in certain code-mixed variants (Abdullah et al., 27 Dec 2025)
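The following is a minimal sketch of how a model with these hyperparameters could be instantiated with the Hugging Face transformers library; the values mirror the description above, while the checkpoint-free instantiation and the 512-token position limit are assumptions rather than released training code.

```python
from transformers import BertConfig, BertForPreTraining

# Hypothetical configuration mirroring the reported UrduBERT hyperparameters
# (BERT-Base layout with an enlarged ~60k WordPiece vocabulary).
config = BertConfig(
    vocab_size=60_000,            # ~60k WordPiece tokens (Khan et al., 2024)
    hidden_size=768,              # embedding / hidden dimensionality
    num_hidden_layers=12,         # encoder blocks
    num_attention_heads=12,       # self-attention heads per layer
    intermediate_size=3072,       # feed-forward inner-layer size
    max_position_embeddings=512,  # assumed; not stated in the papers
)
model = BertForPreTraining(config)  # MLM + NSP heads, as in original BERT
print(f"{model.num_parameters():,} parameters")
```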
The pretraining corpus integrates approximately 50 million sentences from sources such as Urdu Wikipedia, national newswire corpora, Common Crawl (Urdu subset), and wide-coverage web-crawled Urdu text (Khan et al., 2024). Code-mixed versions involve Roman-Urdu social media and newswire (~5–10 GB), English Wikipedia, and Common Crawl snippets (~20 GB), yielding ~100–150 million tokens (Abdullah et al., 27 Dec 2025). Pretraining objectives follow the original BERT protocol: masked language modeling (MLM) with 15% random token masking, and next-sentence prediction (NSP).
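Since the objectives follow the original BERT protocol, the MLM masking step can be illustrated with the standard transformers data collator, as sketched below (NSP pair construction is omitted); the tokenizer path is a placeholder, not a released checkpoint.

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

# Placeholder path; the actual UrduBERT vocabulary/checkpoint is not reproduced here.
tokenizer = BertTokenizerFast.from_pretrained("path/to/urdubert")

# Standard BERT-style MLM: 15% of tokens are selected for masking/prediction.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
examples = [tokenizer("یہ ایک مثال ہے", truncation=True, max_length=128)]
batch = collator(examples)
print(batch["labels"])  # -100 everywhere except positions chosen for prediction
```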
2. Tokenization and Text Normalization
UrduBERT utilizes a WordPiece tokenizer trained on the combined multilingual corpora to segment Urdu, Roman-Urdu, and English text. The tokenizer handles out-of-vocabulary words and code-mixed sentences robustly by breaking them into subword units instead of discarding them. Text normalization for Urdu and Roman-Urdu (sketched after the list below) involves:
- Removal of diacritics (zabar, zer, pesh)
- Standardization of punctuation to ASCII
- Lowercasing Roman-Urdu
- Collapsing character repetitions (e.g., “claasss” → “class”)
- Sequences padded or truncated to a maximum of 128 subword tokens (Abdullah et al., 27 Dec 2025)
No explicit language identification tags are used during code-mixing; the unified vocabulary encodes both scripts.
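A minimal sketch of these normalization steps is given below; the exact rules the authors used are not published, so the regular expressions and the repetition-collapsing threshold are illustrative assumptions.

```python
import re

URDU_DIACRITICS = re.compile(r"[\u064B-\u0652]")  # zabar, zer, pesh and other harakat
CHAR_REPEATS = re.compile(r"(.)\1+")              # runs of a repeated character

def normalize(text: str, roman_urdu: bool = False) -> str:
    """Illustrative normalization; thresholds and rules are assumptions."""
    text = URDU_DIACRITICS.sub("", text)            # strip diacritics
    text = (text.replace("۔", ".")                  # Urdu full stop  -> ASCII
                .replace("،", ",")                  # Urdu comma      -> ASCII
                .replace("؟", "?"))                 # Urdu question mark -> ASCII
    if roman_urdu:
        text = text.lower()                         # lowercase Roman-Urdu only
    text = CHAR_REPEATS.sub(r"\1", text)            # collapse elongated character runs
    return text

print(normalize("claasss", roman_urdu=True))  # -> "clas" (collapsing rule is illustrative)
```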
3. Fine-Tuning Methodologies
ASR-Robust NLU Adaptation
For transfer learning in smart-home command understanding (Khan et al., 2024):
- Data Augmentation: The Audiomentations library applies pitch shifts (±2 semitones), time stretching (0.9×–1.1×), and multiple noise profiles (white Gaussian, babble, environmental) at SNRs from 20 dB down to 0 dB.
- ASR Error Simulation: An empirically derived confusion matrix injects character- and word-level errors at WER levels of 10%, 20%, and 30%.
- Freeze-Fine-Tune Strategy: The bottom N=8 transformer layers are frozen while the top 4 layers and a randomly initialized classification head are fine-tuned on domain data; all layers are then unfrozen and training continues with a decayed learning rate (see the sketch after this list).
- Cross-Validation: 5-fold stratified cross-validation for stable generalization statistics
- Domain Adaptation: Tested on restaurant-booking data without further fine-tuning to assess transferability
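The freeze-fine-tune schedule can be sketched as follows with transformers and PyTorch; the checkpoint path, intent count, and learning rates are placeholders or assumptions, not the authors' released settings.

```python
import torch
from transformers import BertForSequenceClassification

# Placeholder checkpoint; 20 labels correspond to the smart-home intent taxonomy.
model = BertForSequenceClassification.from_pretrained("path/to/urdubert", num_labels=20)

# Phase 1: freeze embeddings and the bottom N=8 encoder layers; train the top 4
# layers plus the randomly initialized classification head.
for p in model.bert.embeddings.parameters():
    p.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for p in layer.parameters():
        p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-5  # lr is an assumption
)
# ... fine-tune on the smart-home intent data ...

# Phase 2: unfreeze all layers and continue with a decayed learning rate.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # decayed lr, illustrative
```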
Hope Speech Detection
For hope-speech classification, UrduBERT is integrated with the GHaLIB framework (Abdullah et al., 27 Dec 2025):
- Classification Head: Dropout (rate 0.1–0.3) applied to the 768-dimensional [CLS] embedding, followed by a linear projection to C output classes (C=2 for binary, C=4 for multi-class) and softmax activation (see the sketch after this list)
- Weighted Cross-Entropy Loss: Positive (“hope”) class receives 1.5× weight to mitigate class imbalance
- Optimization: AdamW optimizer with learning rates 5×10⁻⁶–5×10⁻⁵ and batch sizes 4–16, tuned via Optuna across 30 trials
- Regularization: Dropout and weight decay in [0.0, 0.1]; early stopping on validation F₁ within 3–5 epochs
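A hedged PyTorch sketch of such a classification head is shown below for the binary case; the encoder path and dropout value are placeholders, and nn.CrossEntropyLoss applies the softmax internally rather than as a separate layer.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class HopeSpeechClassifier(nn.Module):
    """Dropout over the [CLS] embedding, linear projection to C classes,
    and class-weighted cross-entropy (1.5x weight on the "hope" class)."""

    def __init__(self, encoder_name="path/to/urdubert", num_classes=2, dropout=0.2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(dropout)  # 0.1-0.3 per the description; 0.2 is illustrative
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)
        self.loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.5]))

    def forward(self, input_ids, attention_mask, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # 768-d [CLS] token embedding
        logits = self.classifier(self.dropout(cls))
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return loss, logits
```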
4. Training Objectives and Regularization
For intent classification and robust NLU (Khan et al., 2024):
- Cross-Entropy Loss: $\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$, where $C$ is the number of classes, $y_c$ the ground-truth one-hot label, and $\hat{y}_c$ the softmax output for class $c$
- Consistency Loss: $\mathcal{L}_{\mathrm{cons}} = \lVert h_{\mathrm{clean}} - h_{\mathrm{noisy}} \rVert_2^2$, where $h$ denotes the penultimate-layer output; this term aligns representations of clean and noise-augmented inputs
- Total Loss: $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{cons}}$, with $\lambda$ as the regularization weight (a PyTorch sketch follows)
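A minimal PyTorch sketch of this combined objective is given below; the squared-L2 form of the consistency term, the choice to apply cross-entropy to the noisy branch, and the value of λ are assumptions consistent with the description above rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_noisy, h_clean, h_noisy, labels, lam=0.1):
    """L_total = L_CE + lam * L_cons (sketch; lam and the pairing are assumptions)."""
    ce = F.cross_entropy(logits_noisy, labels)    # standard softmax cross-entropy
    cons = F.mse_loss(h_noisy, h_clean.detach())  # align penultimate-layer representations
    return ce + lam * cons
```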
For hope-speech classification (Abdullah et al., 27 Dec 2025), standard weighted cross-entropy is employed, and evaluation metrics are macro-averaged accuracy, precision, recall, and F₁-score.
5. Core Datasets and Task Coverage
Smart-Home ASR NLU
- Data Source: "Home Automation System in Urdu Language" (Kaggle)
- Size: ~8,000 audio-transcript pairs
- Intent Taxonomy: 20 distinct smart-home intents
- Entities: device_type, location, numeric_value, mode
- Speaker Diversity: 50+ speakers, with broad urban-rural and regional accent variability
- Noise Conditions: ambient (fan, traffic); WER assessment: 15% (clean), 30% (medium noise), 50% (heavy noise)
Hope Speech Detection
- Source: PolyHope-M 2025 shared task
- Size: ≈8,000 labeled samples
- Labels: Generalized Hope, Realistic Hope, Unrealistic Hope, Not Hope
- Class Distributions: Not Hope (~50%), Generalized Hope (~30%), Realistic Hope (~12%), Unrealistic Hope (~8%)
- Splits: stratified train/val/test split of 70/15/15% (see the split sketch below)
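A hedged sketch of such a 70/15/15 stratified split with scikit-learn follows; the placeholder data stands in for the PolyHope-M Urdu samples and their four-way labels.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for ~8,000 PolyHope-M Urdu samples (4 classes).
texts = [f"sample {i}" for i in range(8000)]
labels = [i % 4 for i in range(8000)]

# 70% train, then split the remaining 30% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 5600 1200 1200
```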
6. Evaluation Protocols and Empirical Performance
Intent-classification accuracy under simulated ASR noise (Khan et al., 2024):
| Condition | UrduBERT Acc (%) | mBERT baseline Acc (%) | ΔAcc vs. clean (UrduBERT) |
|---|---|---|---|
| Clean | 94.7 | 91.2 | – |
| Simulated ASR WER=10% | 92.1 | 88.5 | –2.6 |
| Medium noise (SNR=10dB, ~30% WER) | 86.3 | 82.9 | –8.4 |
| Heavy noise (SNR=0dB, ~50% WER) | 76.4 | 68.5 | –18.3 |
Key metrics (Khan et al., 2024):
- Robustness: accuracy drop under noise (ΔAcc), a robustness ratio, and the WER–ΔAcc correlation (see the sketch after this list)
- Entity Extraction (Slot Filling): F₁ declines from 93% (clean) to 78% (heavy noise)
- Latency: Mean inference time ~45 ms (95th percentile ~80 ms), remaining below 50 ms on average across noise regimes
- User Satisfaction: ≥4.2/5 on Likert scale
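The robustness quantities can be computed directly from the accuracy table above, as sketched below; using the noisy-to-clean accuracy ratio as the robustness ratio is an assumption, since the paper's exact formula is not reproduced here.

```python
# Accuracies (%) taken from the evaluation table above (Khan et al., 2024).
acc = {"clean": 94.7, "wer10": 92.1, "medium_noise": 86.3, "heavy_noise": 76.4}

delta_acc = {k: round(v - acc["clean"], 1) for k, v in acc.items() if k != "clean"}
robustness = {k: round(v / acc["clean"], 3) for k, v in acc.items() if k != "clean"}

print(delta_acc)   # {'wer10': -2.6, 'medium_noise': -8.4, 'heavy_noise': -18.3}
print(robustness)  # {'wer10': 0.973, 'medium_noise': 0.911, 'heavy_noise': 0.807}
```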
Hope-speech task metrics (Abdullah et al., 27 Dec 2025):
- Binary classification (UrduBERT + XLM-R backbone):
Accuracy: 95% | Precision: 95% | Recall: 95% | F₁ (macro): 95%
- Multi-class classification:
RoBERTa-Urdu + GHaLIB: F₁ (macro): 65.2% | UrduBERT + classical ensemble: F₁ (macro): ~62.3%
7. Limitations, Transferability, and Future Work
UrduBERT demonstrates high robustness and transferability in both ASR-NLU and social-text settings:
- ASR-NLU: Accuracy degrades gracefully with noise, losing only ~8 percentage points at realistic medium-noise levels and outperforming mBERT by 3–8 points of absolute accuracy (Khan et al., 2024).
- Hope-speech: Excels in binary hope-speech classification (95% F₁), but multi-class performance is impacted by class imbalance and scarce annotated data (Abdullah et al., 27 Dec 2025).
Limitations include:
- Data Scarcity: Urdu corpora and labeled tasks remain limited for fully modeling underrepresented classes and pragmatic contexts.
- Class Imbalance and Ambiguity: Weighted losses partially mitigate the imbalance, but rare hope-speech types (Realistic and Unrealistic Hope) remain challenging.
- Morphological and Code-Mixed Complexity: Rich inflectional morphology and English–Urdu code-mixing require robust subword modeling.
- Domain Mismatch: Pretraining on general text leaves gaps when adapting to specific pragmatic/discourse genres.
- Comparative Trade-Offs: RoBERTa-Urdu outperforms pure UrduBERT for multi-class hope speech, at the cost of monolingual model maintenance.
Future directions highlighted include larger and more diverse corpus development, extending to other regional languages, evaluation of adversarial domain adaptation methods, and unified end-to-end ASR+NLU pipelines (Khan et al., 2024).
UrduBERT establishes a strong foundation for Urdu NLP in both high-noise spoken command domains and low-resource, pragmatic text classification. Its technical design, empirical validation, and nuanced handling of Urdu and Roman-Urdu make it an essential tool for research groups addressing the challenges of under-resourced language technologies (Khan et al., 2024, Abdullah et al., 27 Dec 2025).