ASRU MADASR 2.0 Challenge: Indian ASR Benchmark
- ASRU MADASR 2.0 Challenge is an international competition advancing ASR for Indian languages, featuring 8 languages and 33 dialects under strict data constraints.
- It emphasizes innovative architectures such as multi-decoder systems with phonemic label sets to improve transcription accuracy and dialect discrimination.
- The challenge employs rigorous evaluation on blind test splits, while participant systems leverage data augmentation and external language models to improve ASR performance.
The ASRU MADASR 2.0 Challenge is an international research competition designed to advance automatic speech recognition (ASR) for Indian languages and dialects under low-resource conditions and stringent data constraints. It establishes a rigorous evaluation protocol for both transcription accuracy and robust identification of spoken language and dialect, leveraging a diverse speech corpus spanning 8 languages and 33 dialects. The challenge pushes development of scalable, multilingual, and dialect-aware ASR architectures, ranging from systems trained from scratch to those leveraging state-of-the-art pretraining and adaptation approaches.
1. Challenge Structure and Evaluation Protocol
The ASRU MADASR 2.0 Challenge provides 1,200 hours of organizer-curated speech data, sampled at 16 kHz, for eight Indian languages: Bhojpuri, Bengali, Chhattisgarhi, Kannada, Magahi, Marathi, Maithili, and Telugu. Each language is subdivided into numerous dialects, and the corpus is balanced between read and spontaneous speech modalities. Four tracks control the permitted training data:
- Track 1: Only 30 hours per language (no external data allowed).
- Track 2: 120 hours per language (full in-domain, no external data).
- Tracks 3/4: Track 1/2 conditions with additional permission to use open-source external data and pretrained models.
Evaluation comprises blind test splits, with the following metrics:
- Word Error Rate (WER) and Character Error Rate (CER) for transcription fidelity
- Language-ID Accuracy (LID-acc)
- Dialect-ID Accuracy (DID-acc)
This rigorously tests both ASR and discriminative classification under linguistic and dialectal diversity (Gangwar et al., 19 Nov 2025).
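For concreteness, WER and CER are both Levenshtein-distance metrics normalized by reference length (over words and characters, respectively). The following minimal Python sketch is illustrative, not the challenge's official scoring script:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # deletions
    for j in range(n + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)

def cer(ref, hyp):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

# One substituted word out of four reference words -> WER = 0.25
print(wer("मेरा नाम राम है", "मेरा नाम श्याम है"))
```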
2. Multilingual and Dialect-Aware ASR Architectures
Recent participant systems report significant advances, most notably the introduction of a "Multi-Decoder" architecture with a phonemic Common Label Set (CLS) serving as a shared intermediate representation across languages and dialects. This framework consists of:
- ASR Sub-network: Conformer encoder and Transformer decoder producing CLS token sequences
- MT Sub-network: Lightweight Transformer using ASR decoder hidden states to transliterate CLS outputs into native grapheme scripts
Diagrammatic Description (as in Gangwar et al., 19 Nov 2025; a code sketch of this data flow follows the table):
| Sub-module | Main Function | Input/Output |
|---|---|---|
| Conformer Encoder | Acoustic modeling | 80-dim log-Mel features X → encoder states H |
| ASR Decoder | CLS sequence modeling | H → CLS logits (decoder hidden states S) |
| MT Encoder/Decoder | Transliteration | (S, H) → native-script graphemes |
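A minimal PyTorch sketch of this data flow follows. A vanilla Transformer encoder stands in for the Conformer, all dimensions and layer counts are illustrative rather than the challenge system's configuration, and positional encodings and attention masks are omitted for brevity:

```python
import torch
import torch.nn as nn

class MultiDecoderASR(nn.Module):
    """Structural sketch: encoder -> ASR decoder (CLS) -> MT decoder (graphemes)."""

    def __init__(self, n_cls_tokens=800, n_graphemes=3200, d_model=256):
        super().__init__()
        self.frontend = nn.Linear(80, d_model)  # 80-dim log-Mel frames -> model dim
        self.encoder = nn.TransformerEncoder(   # stand-in for the Conformer encoder
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.asr_embed = nn.Embedding(n_cls_tokens, d_model)
        self.asr_decoder = nn.TransformerDecoder(  # H -> CLS hidden states S
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.cls_head = nn.Linear(d_model, n_cls_tokens)
        self.mt_embed = nn.Embedding(n_graphemes, d_model)
        self.mt_decoder = nn.TransformerDecoder(   # attends over S, not raw audio
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.grapheme_head = nn.Linear(d_model, n_graphemes)

    def forward(self, feats, cls_tokens, grapheme_tokens):
        h = self.encoder(self.frontend(feats))               # acoustic states H
        s = self.asr_decoder(self.asr_embed(cls_tokens), h)  # ASR decoder hidden states S
        cls_logits = self.cls_head(s)                        # phonemic CLS prediction
        g = self.mt_decoder(self.mt_embed(grapheme_tokens), s)
        return cls_logits, self.grapheme_head(g)             # native-script prediction

model = MultiDecoderASR()
feats = torch.randn(2, 120, 80)             # (batch, frames, mel bins)
cls_in = torch.randint(0, 800, (2, 20))     # teacher-forced CLS tokens
graph_in = torch.randint(0, 3200, (2, 24))  # teacher-forced grapheme tokens
cls_logits, grapheme_logits = model(feats, cls_in, graph_in)
```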
The CLS reduces vocabulary size from grapheme level (>3,000 tokens) to phoneme level (500–1,000 tokens). A deterministic mapping $\mathcal{G} \rightarrow \mathcal{C}$, where $\mathcal{G}$ is the union of all graphemes across languages and $\mathcal{C}$ is the CLS inventory, converts training transcripts to CLS; the inverse transliteration back to native scripts is learned by the MT sub-network. Joint training uses a hybrid CTC-attention loss for ASR and cross-entropy for MT, weighted to prioritize transcription convergence (Gangwar et al., 19 Nov 2025).
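A toy Python sketch of these two ingredients; the mapping entries and loss weights below are hypothetical placeholders, not the published inventory or values:

```python
# (1) Deterministic grapheme -> CLS mapping shared across scripts.
GRAPHEME_TO_CLS = {
    "क": "ka", "ক": "ka",   # Devanagari and Bengali 'ka' collapse to one CLS label
    "म": "ma", "ম": "ma",
}

def to_cls(text):
    """Map native-script characters to CLS labels; unknown symbols pass through."""
    return [GRAPHEME_TO_CLS.get(ch, ch) for ch in text]

# (2) Joint objective: hybrid CTC-attention loss for ASR plus MT cross-entropy.
# Weights are illustrative; mt_weight < 1 prioritizes transcription convergence.
def joint_loss(l_ctc, l_att, l_mt, ctc_weight=0.3, mt_weight=0.3):
    l_asr = ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att
    return l_asr + mt_weight * l_mt
```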
3. Training Protocols and Data Augmentation
Systems rely on a combination of data regularization and curriculum learning to overcome data scarcity and dialectal imbalance:
- SpecAugment (time-frequency masking; see the sketch after this list)
- Dropout
- LayerNorm and label smoothing
- Curriculum learning: Initial pretraining on CLS-encoded ASR before joint optimization
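A minimal NumPy sketch of SpecAugment-style time-frequency masking, with illustrative mask counts and widths rather than the systems' actual settings:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, max_f=15, n_time_masks=2, max_t=40, seed=None):
    """spec: (frames, mel_bins) log-Mel array; returns a masked copy."""
    rng = np.random.default_rng(seed)
    out = spec.copy()
    frames, bins = out.shape
    for _ in range(n_freq_masks):                  # frequency masking
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, max(bins - f, 1)))
        out[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):                  # time masking
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(frames - t, 1)))
        out[t0:t0 + t, :] = 0.0
    return out

augmented = spec_augment(np.random.randn(300, 80))  # ~3 s of 80-dim log-Mel at 10 ms hop
```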
CLS tokenization (character/phoneme or BPE) ensures robust modeling of cross-language and cross-dialect regularities. Post-processing via rule-based corrections (e.g., finite-state rules for schwa deletion in Devanagari-script languages) offers additional CER improvements (~0.5% absolute for read speech) (Gangwar et al., 19 Nov 2025).
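A hedged sketch of such rule-based post-processing; the single rule shown (stripping a spurious word-final halant) is hypothetical, as the actual finite-state rules are not reproduced here:

```python
import re

# Each rule is (pattern, replacement); the rule below is a hypothetical example.
RULES = [
    (re.compile(r"्(?=\s|$)"), ""),   # drop a spurious word-final halant (virama)
]

def postprocess(hypothesis):
    """Apply rule-based corrections to a decoded native-script hypothesis."""
    for pattern, repl in RULES:
        hypothesis = pattern.sub(repl, hypothesis)
    return hypothesis

print(postprocess("राम्"))  # hypothetical correction: "राम्" -> "राम"
```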
4. Integration of External Language Models and Decoding Strategies
External language models (LMs), particularly BPE-based KenLM models, are leveraged when permitted by track rules. The KenLM model is trained on in-corpus and dialect-rich text and fused during beam search decoding using a composite score:

$$\text{score}(y \mid x) = \lambda \log p_{\text{ctc}}(y \mid x) + (1 - \lambda) \log p_{\text{att}}(y \mid x) + \beta \log p_{\text{lm}}(y)$$

where $p_{\text{att}}$ is the attention probability, $p_{\text{ctc}}$ is the Connectionist Temporal Classification (CTC) probability, $p_{\text{lm}}$ is the LM probability, and the weights $\lambda$ and $\beta$ are set empirically on development data. Modified beam search accommodates external LM integration, including modifications to Whisper's decoding routines to support LM scoring at each partial token sequence (Li et al., 2023).
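A minimal Python sketch of this shallow-fusion scoring inside one beam-extension step; the scorer callable is a placeholder for the actual attention, CTC-prefix, and KenLM scorers, and the weights are illustrative:

```python
def fused_score(log_p_att, log_p_ctc, log_p_lm, lam=0.3, beta=0.5):
    """Composite score: lam*log p_ctc + (1 - lam)*log p_att + beta*log p_lm."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att + beta * log_p_lm

def extend_beams(beams, score_fn, k=5):
    """One beam-search step with shallow fusion.

    beams: list of (token_list, cumulative_score).
    score_fn(prefix): dict token -> (log_p_att, log_p_ctc, log_p_lm).
    """
    scored = []
    for tokens, score in beams:
        for tok, (a, c, l) in score_fn(tokens).items():
            scored.append((tokens + [tok], score + fused_score(a, c, l)))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]

# Toy usage with a constant scorer standing in for the real model and KenLM:
toy = lambda prefix: {1: (-0.1, -0.2, -0.3), 2: (-1.0, -0.9, -0.8)}
print(extend_beams([([0], 0.0)], toy, k=2))
```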
5. System Performance and Analysis
Track 2 test results illustrate the performance of advanced systems:
| Model / Speech Type | Avg CER (%) | Avg WER (%) | LID Acc. | DID Acc. |
|---|---|---|---|---|
| Baseline w/ dialect token (char), read | 4.30 | 16.92 | 96.03% | – |
| Multi-Decoder (char), read | 4.40 | 17.06 | 97.39% | 75.36% |
| Baseline, spontaneous | 25.11 | 59.01 | 76.89% | – |
| Multi-Decoder, spontaneous | 26.99 | 62.32 | 77.61% | 33.33% |
CLS-based models achieve the highest LID and DID accuracy (up to 77.61% and 75.36%, respectively) among all Track 2 participants, with modest CER/WER increases on spontaneous speech but substantial gains in dialect discrimination where the baseline cannot disambiguate (Gangwar et al., 19 Nov 2025).
Language-wise improvements were most pronounced for script-sharing dialects (e.g., Bhojpuri, Magahi); consolidation in CLS space reduces cross-dialect confusions. Dravidian languages also experienced gains, though limited by script divergence.
6. Limitations and Future Directions
Converting CLS outputs to native grapheme scripts introduces residual transcription errors, especially for schwa deletion in Indo-Aryan languages; improved G2P mappings or weighted edit distance decoders may mitigate this. Dialect identification for spontaneous speech remains a challenge (DID ≈ 33%), indicating potential for explicit DID sub-decoder modules and further loss regularization.
Incorporating external pretrained multilingual embeddings (as permitted in Track 4) presents a promising frontier. The challenge results emphasize the effectiveness of joint ASR-transliteration architectures and phonemic label sets, but also highlight the need for further work in dialect and prosody generalization, alignment stabilization, and efficient adaptation to additional Indian languages (Gangwar et al., 19 Nov 2025, Li et al., 2023).
7. Broader Impact and Significance
The ASRU MADASR 2.0 Challenge formalizes a reproducible, multilingual, and multi-dialect ASR benchmark under low-resource constraints, catalyzing advances in cross-lingual modeling, dialect discrimination, and integrated transcription–transliteration pipelines. Its protocol and outcomes provide a reference standard for further research in scalable ASR for rich and diverse linguistic landscapes, with significant implications for speech technology deployment in India and globally.