
ASRU MADASR 2.0 Challenge: Indian ASR Benchmark

Updated 26 November 2025
  • ASRU MADASR 2.0 Challenge is an international competition advancing ASR for Indian languages, featuring 8 languages and 33 dialects under strict data constraints.
  • It emphasizes innovative architectures such as multi-decoder systems with phonemic label sets to improve transcription accuracy and dialect discrimination.
  • The challenge employs rigorous evaluation protocols using blind test splits, data augmentation, and external language models to benchmark ASR performance.

The ASRU MADASR 2.0 Challenge is an international research competition designed to advance automatic speech recognition (ASR) for Indian languages and dialects under low-resource conditions and stringent data constraints. It establishes a rigorous evaluation protocol for both transcription accuracy and robust identification of spoken language and dialect, leveraging a diverse speech corpus spanning 8 languages and 33 dialects. The challenge drives the development of scalable, multilingual, and dialect-aware ASR architectures, ranging from systems trained from scratch to those leveraging state-of-the-art pretraining and adaptation approaches.

1. Challenge Structure and Evaluation Protocol

The ASRU MADASR 2.0 Challenge provides 1,200 hours of organizer-curated speech data, sampled at 16 kHz, covering eight Indian languages: Bhojpuri, Bengali, Chhattisgarhi, Kannada, Magahi, Marathi, Maithili, and Telugu. Each language is subdivided into multiple dialects, and the corpus is balanced between read and spontaneous speech modalities. Four tracks control the permitted training data:

  • Track 1: Only 30 hours per language (no external data allowed).
  • Track 2: 120 hours per language (full in-domain, no external data).
  • Tracks 3/4: Track 1/2 conditions with additional permission to use open-source external data and pretrained models.

Evaluation comprises blind test splits, with the following metrics:

  • Word Error Rate (WER) and Character Error Rate (CER) for transcription fidelity
  • Language-ID Accuracy (LID-acc)
  • Dialect-ID Accuracy (DID-acc)

This rigorously tests both ASR and discriminative classification under linguistic and dialectal diversity (Gangwar et al., 19 Nov 2025).
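For concreteness, the two transcription metrics reduce to edit distance over word and character sequences; a minimal sketch follows, with hypothetical example strings rather than challenge data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edits / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character Error Rate: character-level edits / reference character count."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# Hypothetical strings, not challenge data:
print(wer("the cat sat", "the cat sit"))  # 0.333...
print(cer("the cat sat", "the cat sit"))  # 0.0909...
```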

2. Multilingual and Dialect-Aware ASR Architectures

Recent participant systems report significant advances, most notably the introduction of a "Multi-Decoder" architecture with a phonemic Common Label Set (CLS) serving as a shared intermediate representation across languages and dialects. This framework consists of:

  • ASR Sub-network: Conformer encoder and Transformer decoder producing CLS token sequences
  • MT Sub-network: Lightweight Transformer using ASR decoder hidden states to transliterate CLS outputs into native grapheme scripts

Diagrammatic description (as in Gangwar et al., 19 Nov 2025):

| Sub-module | Main Function | Input/Output |
|---|---|---|
| Conformer Encoder | Acoustic modeling | 80-dim log-Mel X → H |
| ASR Decoder | CLS sequence modeling | H → CLS logits |
| MT Encoder/Decoder | Transliteration | (S, H) → native script |
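A minimal PyTorch sketch of this multi-decoder layout follows. It substitutes a generic Transformer encoder for the Conformer and omits attention masks and positional encodings for brevity; all layer sizes are illustrative assumptions, not the participants' configuration.

```python
import torch
import torch.nn as nn

class MultiDecoderASR(nn.Module):
    """Sketch: the ASR sub-network emits phonemic CLS tokens; the MT
    sub-network consumes the ASR decoder's hidden states S to produce
    native-script graphemes. A plain Transformer encoder stands in for
    the Conformer; masks and positional encodings are omitted."""

    def __init__(self, n_mels=80, d_model=256, cls_vocab=800, grapheme_vocab=3200):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)  # 80-dim log-Mel -> d_model
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=6)            # X -> H
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.asr_decoder = nn.TransformerDecoder(dec, num_layers=4)        # H -> S
        self.cls_embed = nn.Embedding(cls_vocab, d_model)
        self.cls_out = nn.Linear(d_model, cls_vocab)   # S -> CLS logits
        self.ctc_out = nn.Linear(d_model, cls_vocab)   # CTC branch over H
        mt = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.mt_decoder = nn.TransformerDecoder(mt, num_layers=2)  # lightweight MT
        self.graph_embed = nn.Embedding(grapheme_vocab, d_model)
        self.graph_out = nn.Linear(d_model, grapheme_vocab)

    def forward(self, feats, cls_tokens, graph_tokens):
        h = self.encoder(self.frontend(feats))               # acoustic states H
        ctc_logits = self.ctc_out(h)
        s = self.asr_decoder(self.cls_embed(cls_tokens), h)  # ASR decoder states S
        cls_logits = self.cls_out(s)
        g = self.mt_decoder(self.graph_embed(graph_tokens), s)  # MT attends to S
        return ctc_logits, cls_logits, self.graph_out(g)
```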

The CLS reduces the vocabulary size from grapheme level (>3,000 units) to phoneme level (500–1,000 units). A deterministic mapping $f_{g \rightarrow c}: \mathbb{G} \rightarrow C$ and a learned inverse $f_{c \rightarrow g}: C \rightarrow \mathbb{G}$ (where $\mathbb{G}$ is the union of all graphemes and $C$ is the common phonemic label set) are applied. Joint training uses a hybrid CTC-Attention loss for ASR and cross-entropy for MT, weighted to prioritize transcription convergence ($\lambda_{ASR} = 0.8$) (Gangwar et al., 19 Nov 2025).
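Expressed as code, the joint objective might be combined as below; $\lambda_{ASR} = 0.8$ comes from the paper, while the CTC/attention interpolation weight inside the ASR term is an assumed value typical of hybrid CTC-Attention training.

```python
import torch.nn.functional as F

def joint_loss(ctc_logits, cls_logits, graph_logits,
               cls_targets, graph_targets, input_lens, target_lens,
               lam_asr=0.8, ctc_weight=0.3):
    """lam_asr = 0.8 follows the paper; ctc_weight = 0.3 is an assumption."""
    # CTC branch expects (T, B, V) log-probabilities.
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)
    l_ctc = F.ctc_loss(log_probs, cls_targets, input_lens, target_lens)
    # Attention and MT branches: cross-entropy over (B, V, L) logits.
    l_att = F.cross_entropy(cls_logits.transpose(1, 2), cls_targets)
    l_mt = F.cross_entropy(graph_logits.transpose(1, 2), graph_targets)
    return lam_asr * (ctc_weight * l_ctc + (1 - ctc_weight) * l_att) \
           + (1 - lam_asr) * l_mt
```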

3. Training Protocols and Data Augmentation

Systems rely on a combination of data regularization and curriculum learning to overcome data scarcity and dialectal imbalance:

  • SpecAugment time-frequency masking (see the sketch after this list)
  • Dropout ($p = 0.1$)
  • LayerNorm and label smoothing ($\epsilon = 0.1$)
  • Curriculum learning: Initial pretraining on CLS-encoded ASR before joint optimization
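A minimal sketch of SpecAugment-style masking using torchaudio's built-in transforms; the mask widths and feature shape are illustrative assumptions, not the participants' policy.

```python
import torch
import torchaudio.transforms as T

# Illustrative mask widths; the challenge systems' actual policy is not given here.
freq_mask = T.FrequencyMasking(freq_mask_param=15)  # mask up to 15 Mel bins
time_mask = T.TimeMasking(time_mask_param=35)       # mask up to 35 frames

def spec_augment(log_mel: torch.Tensor) -> torch.Tensor:
    """Apply one frequency mask and one time mask to a (channel, n_mels, frames) tensor."""
    return time_mask(freq_mask(log_mel))

# Hypothetical 80-dim log-Mel features, ~3 s at a 10 ms hop:
augmented = spec_augment(torch.randn(1, 80, 300))
```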

CLS tokenization (character/phoneme-level or BPE; a BPE sketch follows) supports robust modeling of cross-language and cross-dialect regularities. Post-processing via rule-based corrections (e.g., finite-state rules for schwa deletion in Devanagari-script languages) offers additional CER improvements (~0.5% absolute for read speech) (Gangwar et al., 19 Nov 2025).
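A sentencepiece-based BPE sketch over CLS-encoded text is one possible realization; the file names, vocabulary size, and example token string are hypothetical.

```python
import sentencepiece as spm

# Hypothetical corpus of CLS-encoded transcripts, one utterance per line.
spm.SentencePieceTrainer.train(
    input="cls_transcripts.txt",
    model_prefix="cls_bpe",
    vocab_size=800,          # within the 500-1,000 phoneme-level range cited above
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="cls_bpe.model")
ids = sp.encode("n a m a s t e", out_type=int)  # hypothetical CLS token string
print(sp.decode(ids))
```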

4. Integration of LLMs and Decoding Strategies

External language models, particularly BPE-based KenLM n-gram models, are leveraged when permitted by track rules. The KenLM model is trained on in-corpus and dialect-rich text and fused during beam search decoding using a composite score:

$$\mathrm{Score}(y) = \log p_{AM}(y \mid x) + \lambda \log p_{CTC}(y \mid x) + \beta \log p_{LM}(y) + \gamma |y|$$

where $p_{AM}$ is the attention-decoder probability, $p_{CTC}$ the Connectionist Temporal Classification probability, $p_{LM}$ the language model probability, and $\lambda, \beta, \gamma$ are set empirically on development data. A modified beam search accommodates external LM integration, including changes to Whisper's decoding routines to support LM scoring at each partial token sequence (Li et al., 2023).
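A minimal sketch of the composite score applied to n-best rescoring with KenLM (a full beam search applies the same rule at each partial hypothesis); the model path and weights are hypothetical, and KenLM's log10 scores are converted to natural logs.

```python
import math
import kenlm

lm = kenlm.Model("dialect_lm.arpa")  # hypothetical path to the trained KenLM model

def composite_score(log_p_am, log_p_ctc, hyp, lam=0.3, beta=0.5, gamma=0.5):
    """Score(y) = log p_AM + lam*log p_CTC + beta*log p_LM + gamma*|y|.
    lam, beta, gamma are hypothetical; the systems tune them on dev data."""
    log_p_lm = lm.score(hyp, bos=True, eos=True) * math.log(10)  # log10 -> ln
    return log_p_am + lam * log_p_ctc + beta * log_p_lm + gamma * len(hyp.split())

# Rescore an n-best list of (hypothesis, log p_AM, log p_CTC) triples:
nbest = [("namaste duniya", -12.3, -14.1), ("namaste dunia", -12.9, -13.8)]
print(max(nbest, key=lambda h: composite_score(h[1], h[2], h[0]))[0])
```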

5. System Performance and Analysis

Track 2 test results illustrate the performance of advanced systems:

| Model | Speech Type | Avg CER (%) | Avg WER (%) | LID acc | DID acc |
|---|---|---|---|---|---|
| Baseline w/ dialect token (Char) | Read | 4.30 | 16.92 | 96.03% | – |
| Multi-Decoder (Char) | Read | 4.40 | 17.06 | 97.39% | 75.36% |
| Baseline | Spontaneous | 25.11 | 59.01 | 76.89% | – |
| Multi-Decoder | Spontaneous | 26.99 | 62.32 | 77.61% | 33.33% |

CLS-based models achieve the highest LID and DID accuracy (up to 77.61% and 75.36%, respectively) among all Track 2 participants, with modest CER/WER increases on spontaneous speech but substantial gains in dialect discrimination where the baseline cannot disambiguate (Gangwar et al., 19 Nov 2025).

Language-wise improvements were most pronounced for script-sharing dialects (e.g., Bhojpuri, Magahi); consolidation in CLS space reduces cross-dialect confusions. Dravidian languages also experienced gains, though limited by script divergence.

6. Limitations and Future Directions

Converting CLS outputs to native grapheme scripts introduces residual transcription errors, especially for schwa deletion in Indo-Aryan languages; improved G2P mappings or weighted edit distance decoders may mitigate this. Dialect identification for spontaneous speech remains a challenge (DID ≈ 33%), indicating potential for explicit DID sub-decoder modules and further loss regularization.
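As a sketch of the weighted-edit-distance idea, the dynamic program below draws substitution costs from a table so that confusable symbol pairs (e.g., schwa-bearing vs. schwa-deleted variants) are penalized less; the cost values and pairs are hypothetical illustrations, not values from the paper.

```python
# Hypothetical per-pair substitution costs; confusable symbols get lower cost.
SUB_COST = {("a", "e"): 0.3, ("i", "ii"): 0.2}

def weighted_edit_distance(ref, hyp, ins=1.0, dele=1.0, default_sub=1.0):
    """Levenshtein DP with per-pair substitution weights from SUB_COST."""
    d = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, len(hyp) + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            a, b = ref[i - 1], hyp[j - 1]
            sub = 0.0 if a == b else SUB_COST.get((a, b), default_sub)
            d[i][j] = min(d[i - 1][j] + dele,
                          d[i][j - 1] + ins,
                          d[i - 1][j - 1] + sub)
    return d[-1][-1]

print(weighted_edit_distance(["a", "t"], ["e", "t"]))  # 0.3 via the cheap (a, e) pair
```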

Incorporating external pretrained multilingual embeddings (as permitted in Track 4) presents a promising frontier. The challenge results emphasize the effectiveness of joint ASR-transliteration architectures and phonemic label sets, but also highlight the need for further work in dialect and prosody generalization, alignment stabilization, and efficient adaptation to additional Indian languages (Gangwar et al., 19 Nov 2025, Li et al., 2023).

7. Broader Impact and Significance

The ASRU MADASR 2.0 Challenge formalizes a reproducible, multilingual, and multi-dialect ASR benchmark under low-resource constraints, catalyzing advances in cross-lingual modeling, dialect discrimination, and integrated transcription–transliteration pipelines. Its protocol and outcomes provide a reference standard for further research in scalable ASR for rich and diverse linguistic landscapes, with significant implications for speech technology deployment in India and globally.
