UrduBERT: Transformer Model for Urdu NLP
- UrduBERT is a transformer-based pretrained language model designed for Urdu and Roman-Urdu texts, offering enhanced contextual embeddings and robust handling of code-mixed inputs.
- It utilizes 12 encoder layers, a 768-dimensional hidden space, and a vocabulary of up to 60,000 tokens to effectively process native and mixed-script data.
- Empirical evaluations show UrduBERT’s superior performance in ASR-related NLU and hope-speech detection, outperforming multilingual baselines under various noise conditions.
UrduBERT is a transformer-based pretrained language representation model tailored for the Urdu language and Roman-Urdu code-mixed text, built on the standard BERT-Base architecture. It is designed to address the performance limitations of multilingual pretrained models on low-resource Urdu NLP tasks, offering enhanced contextual embeddings for both native-script and mixed-script utterances. Its applications span robust natural language understanding (NLU), especially in Automatic Speech Recognition (ASR)-related domains, and classification tasks such as hope-speech detection in social media. Empirical studies demonstrate its superior accuracy and robustness to linguistic variability and noise, establishing UrduBERT as a primary resource in Urdu NLP research (Khan et al., 2024, Abdullah et al., 27 Dec 2025).
1. Architecture and Pretraining Configuration
UrduBERT employs the standard BERT-Base transformer architecture (see the configuration sketch after the list), comprising:
- Encoder layers: 12 transformer blocks
- Embedding dimensionality: 768 (hidden size)
- Self-attention heads: 12 per layer
- Feed-forward inner-layer size: 3,072
- Vocabulary: ~60,000 WordPiece tokens for native Urdu, Roman-Urdu, and frequent English subwords (Khan et al., 2024); alternatives include ~30,000 tokens in certain code-mixed variants (Abdullah et al., 27 Dec 2025)
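The following is a minimal sketch of how a model with these hyperparameters could be instantiated with the Hugging Face transformers library; the values mirror the description above, while the checkpoint-free instantiation and the 512-token position limit are assumptions rather than released training code.

```python
from transformers import BertConfig, BertForPreTraining

# Hypothetical configuration mirroring the reported UrduBERT hyperparameters
# (BERT-Base layout with an enlarged ~60k WordPiece vocabulary).
config = BertConfig(
    vocab_size=60_000,            # ~60k WordPiece tokens (Khan et al., 2024)
    hidden_size=768,              # embedding / hidden dimensionality
    num_hidden_layers=12,         # encoder blocks
    num_attention_heads=12,       # self-attention heads per layer
    intermediate_size=3072,       # feed-forward inner-layer size
    max_position_embeddings=512,  # assumed; not stated in the papers
)
model = BertForPreTraining(config)  # MLM + NSP heads, as in original BERT
print(f"{model.num_parameters():,} parameters")
```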
The pretraining corpus integrates approximately 50 million sentences from sources such as Urdu Wikipedia, national newswire corpora, Common Crawl (Urdu subset), and wide-coverage web-crawled Urdu text (Khan et al., 2024). Code-mixed versions involve Roman-Urdu social media and newswire (~5–10 GB), English Wikipedia, and Common Crawl snippets (~20 GB), yielding ~100–150 million tokens (Abdullah et al., 27 Dec 2025). Pretraining objectives follow the original BERT protocol: masked language modeling (MLM) with 15% random token masking, and next-sentence prediction (NSP).
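Since the objectives follow the original BERT protocol, the MLM masking step can be illustrated with the standard transformers data collator, as sketched below (NSP pair construction is omitted); the tokenizer path is a placeholder, not a released checkpoint.

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

# Placeholder path; the actual UrduBERT vocabulary/checkpoint is not reproduced here.
tokenizer = BertTokenizerFast.from_pretrained("path/to/urdubert")

# Standard BERT-style MLM: 15% of tokens are selected for masking/prediction.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
examples = [tokenizer("یہ ایک مثال ہے", truncation=True, max_length=128)]
batch = collator(examples)
print(batch["labels"])  # -100 everywhere except positions chosen for prediction
```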
2. Tokenization and Text Normalization
UrduBERT utilizes a WordPiece tokenizer trained on the combined multilingual corpora to segment Urdu, Roman-Urdu, and English text. The tokenizer handles out-of-vocabulary words and code-mixed sentences robustly by breaking them into subword units instead of discarding them. Text normalization for Urdu and Roman-Urdu (sketched after the list below) involves:
- Removal of diacritics (zabar, zer, pesh)
- Standardization of punctuation to ASCII
- Lowercasing Roman-Urdu
- Collapsing character repetitions (e.g., “claasss” → “class”)
- Sequences padded or truncated to a maximum of 128 subword tokens (Abdullah et al., 27 Dec 2025)
No explicit language identification tags are used during code-mixing; the unified vocabulary encodes both scripts.
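A minimal sketch of these normalization steps is given below; the exact rules the authors used are not published, so the regular expressions and the repetition-collapsing threshold are illustrative assumptions.

```python
import re

URDU_DIACRITICS = re.compile(r"[\u064B-\u0652]")  # zabar, zer, pesh and other harakat
CHAR_REPEATS = re.compile(r"(.)\1+")              # runs of a repeated character

def normalize(text: str, roman_urdu: bool = False) -> str:
    """Illustrative normalization; thresholds and rules are assumptions."""
    text = URDU_DIACRITICS.sub("", text)            # strip diacritics
    text = (text.replace("۔", ".")                  # Urdu full stop  -> ASCII
                .replace("،", ",")                  # Urdu comma      -> ASCII
                .replace("؟", "?"))                 # Urdu question mark -> ASCII
    if roman_urdu:
        text = text.lower()                         # lowercase Roman-Urdu only
    text = CHAR_REPEATS.sub(r"\1", text)            # collapse elongated character runs
    return text

print(normalize("claasss", roman_urdu=True))  # -> "clas" (collapsing rule is illustrative)
```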
3. Fine-Tuning Methodologies
ASR-Robust NLU Adaptation
For transfer learning in smart-home command understanding (Khan et al., 2024):
- Data Augmentation: The Audiomentations library applies pitch shifts (±2 semitones), time stretching (0.9×–1.1×), and multiple noise profiles (white Gaussian, babble, environmental) at SNRs from 20 dB down to 0 dB.
- ASR Error Simulation: An empirically derived confusion matrix injects character- and word-level errors at WER levels of 10%, 20%, and 30%.
- Freeze-Fine-Tune Strategy: The bottom N=8 transformer layers are frozen while the top 4 layers and a randomly initialized classification head are fine-tuned on domain data; all layers are then unfrozen and training continues with a decayed learning rate (see the sketch after this list).
- Cross-Validation: 5-fold stratified cross-validation for stable generalization statistics
- Domain Adaptation: Tested on restaurant-booking data without further fine-tuning to assess transferability
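The freeze-fine-tune schedule can be sketched as follows with transformers and PyTorch; the checkpoint path, intent count, and learning rates are placeholders or assumptions, not the authors' released settings.

```python
import torch
from transformers import BertForSequenceClassification

# Placeholder checkpoint; 20 labels correspond to the smart-home intent taxonomy.
model = BertForSequenceClassification.from_pretrained("path/to/urdubert", num_labels=20)

# Phase 1: freeze embeddings and the bottom N=8 encoder layers; train the top 4
# layers plus the randomly initialized classification head.
for p in model.bert.embeddings.parameters():
    p.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for p in layer.parameters():
        p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-5  # lr is an assumption
)
# ... fine-tune on the smart-home intent data ...

# Phase 2: unfreeze all layers and continue with a decayed learning rate.
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # decayed lr, illustrative
```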
Hope Speech Detection
For hope-speech classification, UrduBERT is integrated with the GHaLIB framework (Abdullah et al., 27 Dec 2025):
- Classification Head: Dropout (rate 0.1–0.3) applied to the 768-dimensional [CLS] embedding, followed by a linear projection to C output classes (C=2 for binary, C=4 for multi-class) and softmax activation (see the sketch after this list)
- Weighted Cross-Entropy Loss: Positive (“hope”) class receives 1.5× weight to mitigate class imbalance
- Optimization: AdamW optimizer with learning rates 5×10⁻⁶–5×10⁻⁵ and batch sizes 4–16, tuned via Optuna across 30 trials
- Regularization: Dropout and weight decay in [0.0, 0.1]; early stopping on validation F₁ within 3–5 epochs
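A hedged PyTorch sketch of such a classification head is shown below for the binary case; the encoder path and dropout value are placeholders, and nn.CrossEntropyLoss applies the softmax internally rather than as a separate layer.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class HopeSpeechClassifier(nn.Module):
    """Dropout over the [CLS] embedding, linear projection to C classes,
    and class-weighted cross-entropy (1.5x weight on the "hope" class)."""

    def __init__(self, encoder_name="path/to/urdubert", num_classes=2, dropout=0.2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(dropout)  # 0.1-0.3 per the description; 0.2 is illustrative
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)
        self.loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.5]))

    def forward(self, input_ids, attention_mask, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # 768-d [CLS] token embedding
        logits = self.classifier(self.dropout(cls))
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return loss, logits
```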
4. Training Objectives and Regularization
For intent classification and robust NLU (Khan et al., 2024):
- Cross-Entropy Loss: $\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$, where $C$ is the number of classes, $y_c$ the ground-truth one-hot label, and $\hat{y}_c$ the softmax output for class $c$
- Consistency Loss: $\mathcal{L}_{\mathrm{cons}} = \lVert h_{\mathrm{clean}} - h_{\mathrm{noisy}} \rVert_2^2$, where $h$ denotes the penultimate-layer output; this term aligns representations of clean and noise-augmented inputs
- Total Loss: $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{cons}}$, with $\lambda$ as the regularization weight (a PyTorch sketch follows)
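A minimal PyTorch sketch of this combined objective is given below; the squared-L2 form of the consistency term, the choice to apply cross-entropy to the noisy branch, and the value of λ are assumptions consistent with the description above rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_noisy, h_clean, h_noisy, labels, lam=0.1):
    """L_total = L_CE + lam * L_cons (sketch; lam and the pairing are assumptions)."""
    ce = F.cross_entropy(logits_noisy, labels)    # standard softmax cross-entropy
    cons = F.mse_loss(h_noisy, h_clean.detach())  # align penultimate-layer representations
    return ce + lam * cons
```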
For hope-speech classification (Abdullah et al., 27 Dec 2025), standard weighted cross-entropy is employed, and evaluation metrics are macro-averaged accuracy, precision, recall, and F₁-score.
5. Core Datasets and Task Coverage
Smart-Home ASR NLU
- Data Source: "Home Automation System in Urdu Language" (Kaggle)
- Size: ~8,000 audio-transcript pairs
- Intent Taxonomy: 20 distinct smart-home intents
- Entities: device_type, location, numeric_value, mode
- Speaker Diversity: 50+ speakers, with broad urban-rural and regional accent variability
- Noise Conditions: ambient (fan, traffic); WER assessment: 15% (clean), 30% (medium noise), 50% (heavy noise)
Hope Speech Detection
- Source: PolyHope-M 2025 shared task
- Size: ≈8,000 labeled samples
- Labels: Generalized Hope, Realistic Hope, Unrealistic Hope, Not Hope
- Class Distributions: Not Hope (~50%), Generalized Hope (~30%), Realistic Hope (~12%), Unrealistic Hope (~8%)
- Splits: stratified train/val/test split of 70/15/15% (see the split sketch below)
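A hedged sketch of such a 70/15/15 stratified split with scikit-learn follows; the placeholder data stands in for the PolyHope-M Urdu samples and their four-way labels.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for ~8,000 PolyHope-M Urdu samples (4 classes).
texts = [f"sample {i}" for i in range(8000)]
labels = [i % 4 for i in range(8000)]

# 70% train, then split the remaining 30% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 5600 1200 1200
```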
6. Evaluation Protocols and Empirical Performance
Intent-classification accuracy under simulated ASR noise (Khan et al., 2024):
| Condition | UrduBERT Acc (%) | mBERT baseline Acc (%) | ΔAcc vs. clean (UrduBERT) |
|---|---|---|---|
| Clean | 94.7 | 91.2 | – |
| Simulated ASR WER=10% | 92.1 | 88.5 | –2.6 |
| Medium noise (SNR=10dB, ~30% WER) | 86.3 | 82.9 | –8.4 |
| Heavy noise (SNR=0dB, ~50% WER) | 76.4 | 68.5 | –18.3 |
Key metrics (Khan et al., 2024):
- Robustness: accuracy drop under noise (ΔAcc), a robustness ratio, and the WER–ΔAcc correlation (see the sketch after this list)
- Entity Extraction (Slot Filling): F₁ declines from 93% (clean) to 78% (heavy noise)
- Latency: Mean inference time ~45 ms (95th percentile ~80 ms), remaining below 50 ms on average across noise regimes
- User Satisfaction: ≥4.2/5 on Likert scale
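The robustness quantities can be computed directly from the accuracy table above, as sketched below; using the noisy-to-clean accuracy ratio as the robustness ratio is an assumption, since the paper's exact formula is not reproduced here.

```python
# Accuracies (%) taken from the evaluation table above (Khan et al., 2024).
acc = {"clean": 94.7, "wer10": 92.1, "medium_noise": 86.3, "heavy_noise": 76.4}

delta_acc = {k: round(v - acc["clean"], 1) for k, v in acc.items() if k != "clean"}
robustness = {k: round(v / acc["clean"], 3) for k, v in acc.items() if k != "clean"}

print(delta_acc)   # {'wer10': -2.6, 'medium_noise': -8.4, 'heavy_noise': -18.3}
print(robustness)  # {'wer10': 0.973, 'medium_noise': 0.911, 'heavy_noise': 0.807}
```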
Hope-speech task metrics (Abdullah et al., 27 Dec 2025):
- Binary classification (UrduBERT + XLM-R backbone):
Accuracy: 95% | Precision: 95% | Recall: 95% | F₁ (macro): 95%
- Multi-class classification:
RoBERTa-Urdu + GHaLIB: F₁ (macro): 65.2% | UrduBERT + classical ensemble: F₁ (macro): ~62.3%
7. Limitations, Transferability, and Future Work
UrduBERT demonstrates high robustness and transferability in both ASR-NLU and social-text settings:
- ASR-NLU: Accuracy degrades gracefully with noise, losing only ~8 percentage points at realistic medium-noise levels and outperforming mBERT by 3–8 points of absolute accuracy (Khan et al., 2024).
- Hope-speech: Excels in binary hope-speech classification (95% F₁), but multi-class performance is impacted by class imbalance and scarce annotated data (Abdullah et al., 27 Dec 2025).
Limitations include:
- Data Scarcity: Urdu corpora and labeled tasks remain limited for fully modeling underrepresented classes and pragmatic contexts.
- Class Imbalance and Ambiguity: Weighted losses partially mitigate the imbalance, but rare hope-speech types (Realistic and Unrealistic Hope) remain challenging.
- Morphological and Code-Mixed Complexity: Rich inflectional morphology and English–Urdu code-mixing require robust subword modeling.
- Domain Mismatch: Pretraining on general text leaves gaps when adapting to specific pragmatic/discourse genres.
- Comparative Trade-Offs: RoBERTa-Urdu outperforms pure UrduBERT for multi-class hope speech, at the cost of monolingual model maintenance.
Future directions highlighted include larger and more diverse corpus development, extending to other regional languages, evaluation of adversarial domain adaptation methods, and unified end-to-end ASR+NLU pipelines (Khan et al., 2024).
UrduBERT establishes a strong foundation for Urdu NLP in both high-noise spoken command domains and low-resource, pragmatic text classification. Its technical design, empirical validation, and nuanced handling of Urdu and Roman-Urdu make it an essential tool for research groups addressing the challenges of under-resourced language technologies (Khan et al., 2024, Abdullah et al., 27 Dec 2025).