ICASSP 2026 URGENT Challenge

Updated 27 January 2026
  • The ICASSP 2026 URGENT Challenge is a benchmarking effort that tests universal speech enhancement and MOS prediction across diverse acoustic conditions.
  • The challenge incentivizes developing deep learning-based SE models that handle noise, reverberation, codec artifacts, and variable sampling rates.
  • Evaluation protocols combine objective metrics and human perceptual tests using curated multilingual datasets to mirror real-world applications.

The ICASSP 2026 URGENT Challenge is a major international benchmarking effort advancing research on universal speech enhancement (SE) and perceptual quality assessment for SE-processed speech. It comprehensively targets universality, robustness, and generalizability in SE systems, requiring models to handle a diversity of distortions, domains, speakers, languages, and conditions, in line with recent developments in deep learning-based SE research (Li et al., 20 Jan 2026).

1. Scope, Motivation, and Objectives

The central goal of the URGENT 2026 Challenge is to foster the development of “universal” SE systems. These are defined as systems capable of generalizing across diverse acoustic degradations (including additive noise, reverberation, clipping, codec artifacts, packet loss, wind noise, and bandwidth limitation), a broad array of domains (languages, speaking styles, emotional content), and variable recording conditions (sampling rates ranging from 8 to 48 kHz). The challenge also introduces perceptual quality assessment of enhanced speech as a principal research axis, acknowledging the critical gap in automatic MOS prediction for SE outputs (Li et al., 20 Jan 2026).
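To make the distortion taxonomy concrete, the snippet below simulates two of these degradations, peak clipping and bandwidth limitation, with NumPy/SciPy. The function names and parameter choices are illustrative assumptions; the challenge's actual simulation pipeline is more elaborate.

```python
# Illustrative sketch of two URGENT distortion types: peak clipping and
# bandwidth limitation. Not the challenge's official simulation code.
import numpy as np
from scipy.signal import resample_poly

def clip_speech(x: np.ndarray, clip_level: float = 0.25) -> np.ndarray:
    """Hard-clip the waveform at a fraction of its peak amplitude."""
    threshold = clip_level * np.max(np.abs(x))
    return np.clip(x, -threshold, threshold)

def bandlimit(x: np.ndarray, fs: int, target_fs: int = 8000) -> np.ndarray:
    """Simulate bandwidth limitation: downsample, then restore the rate."""
    down = resample_poly(x, target_fs, fs)     # discards content above target_fs / 2
    return resample_poly(down, fs, target_fs)  # back to the original sampling rate

fs = 16000
clean = np.random.randn(fs)                    # stand-in for a 1 s utterance
degraded = bandlimit(clip_speech(clean), fs)   # clipped, then bandlimited
```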

Key objectives are the construction and evaluation of:

  • SE systems that operate agnostically with respect to input distortions and conditions.
  • Assessment modules predicting Mean Opinion Score (MOS) for enhanced speech.
  • Data curation pipelines selecting high-quality, representative training examples.
  • Evaluation protocols that combine objective, subjective, intrusive, non-intrusive, and downstream-relevant metrics.

The practical motivation stems from limitations of current neural SE models in real-world applications, such as telephony, conferencing, hearing aids, and voice assistants. These systems often fail with previously unseen distortions, underscoring the need for robust, generalizable SE and reliable perceptual quality predictors (Li et al., 20 Jan 2026, Zhang et al., 2024).

2. Challenge Structure and Task Definitions

The 2026 URGENT Challenge comprises two distinct but complementary tracks:

Track 1: Universal Speech Enhancement

Participants must develop a single SE model to process any input—regardless of distortion type or sampling rate—and output an enhanced waveform that closely matches a clean reference or maximizes perceptual quality on unpaired real recordings. The model must operate independently of prior knowledge about the specific noise or degradation; allowed input formats include raw waveform or STFT representations.
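As a minimal sketch of how a single model can accept arbitrary sampling rates, the front-end below scales its STFT window with the input rate so the network always sees a comparable time-frequency resolution. This mirrors the spirit of sampling-rate-independent designs such as BSRNN, but the code is our own illustration, not a reference implementation.

```python
# Sampling-rate-agnostic STFT front-end: the window length tracks the
# input rate (e.g., 32 ms), so frequency resolution stays comparable.
import torch

def adaptive_stft(wav: torch.Tensor, fs: int, win_ms: float = 32.0) -> torch.Tensor:
    """STFT whose window length scales with the sampling rate."""
    win_length = int(fs * win_ms / 1000)
    return torch.stft(wav, n_fft=win_length, hop_length=win_length // 2,
                      win_length=win_length,
                      window=torch.hann_window(win_length),
                      return_complex=True)

# The same call handles 8 kHz and 48 kHz inputs without changing the front-end.
spec_8k = adaptive_stft(torch.randn(8000), fs=8000)
spec_48k = adaptive_stft(torch.randn(48000), fs=48000)
```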

Track 2: Speech Quality Assessment for SE

Given speech already processed by SE systems, entrants must predict a scalar MOS (roughly in the 1–5 range) for each utterance or frame, matching human judgments gathered via ITU-T P.808 procedures. Both waveform and feature inputs (e.g., log-Mel spectrograms) are permitted.
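A minimal sketch of a Track 2 input pipeline appears below: log-Mel features feed a small recurrent regressor whose output is squashed into the 1–5 MOS range. The architecture and all hyperparameters are illustrative assumptions, not any baseline's design.

```python
# Toy MOS predictor: log-Mel front-end, GRU, utterance-level pooling,
# output squashed into [1, 5]. Purely illustrative.
import torch
import torchaudio

class TinyMOSPredictor(torch.nn.Module):
    def __init__(self, n_mels: int = 80, fs: int = 16000):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=fs, n_fft=512, hop_length=160, n_mels=n_mels)
        self.gru = torch.nn.GRU(n_mels, 128, batch_first=True)
        self.head = torch.nn.Linear(128, 1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        feats = torch.log(self.melspec(wav) + 1e-8)    # (B, n_mels, T)
        out, _ = self.gru(feats.transpose(1, 2))       # (B, T, 128)
        raw = self.head(out.mean(dim=1)).squeeze(-1)   # pool over time
        return 1.0 + 4.0 * torch.sigmoid(raw)          # scalar MOS in [1, 5]

mos = TinyMOSPredictor()(torch.randn(2, 16000))        # two 1 s utterances
```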

This bifurcated structure explicitly drives research in both universal SE algorithm design and the automatic evaluation of their outputs, supporting system development and deployment at scale (Li et al., 20 Jan 2026).

3. Dataset Curation and Composition

Data diversity and quality are central to URGENT. All datasets are open or publicly licensed: no proprietary or private data is permitted.

Track 1

  • Training/validation: ~700 hours curated from a pool of 2,500 hours, including corpora such as LibriVox, LibriTTS, VCTK, CommonVoice (across five languages), NNCES (children), EARS (studio), SeniorTalk (elderly Mandarin), VocalSet (singing), Emotional Speech Database, and others.
  • Noise sources: AudioSet, WHAM!, FSD50K, Free Music Archive.
  • Room Impulse Responses: Simulated using the DNS5 engine.
  • Curation protocol: a quality-scoring model (Li et al., “Less is More”) selects high-quality speech; preprocessing includes bandwidth estimation, silence trimming (py-webrtcvad), and DNSMOS-based filtering. A filter sketch follows this list.
  • Blind test: 360 simulated mixtures (unseen speakers/noises/RIRs) and 480 real-world recordings drawing on five additional unseen languages.
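A minimal sketch of such a curation filter, assuming py-webrtcvad for speech-activity estimation and a hypothetical dnsmos_score() stand-in for the quality model:

```python
# VAD-based speech-ratio check plus a quality threshold, in the spirit of
# the curation protocol above. dnsmos_score() is a hypothetical stand-in.
import numpy as np
import webrtcvad

def speech_ratio(pcm16: np.ndarray, fs: int = 16000, frame_ms: int = 30) -> float:
    """Fraction of 30 ms frames that webrtcvad labels as speech (int16 PCM)."""
    vad = webrtcvad.Vad(2)                    # aggressiveness 0-3
    frame_len = fs * frame_ms // 1000
    n_frames = len(pcm16) // frame_len
    flags = [vad.is_speech(pcm16[i * frame_len:(i + 1) * frame_len].tobytes(), fs)
             for i in range(n_frames)]
    return sum(flags) / max(n_frames, 1)

def keep_utterance(pcm16: np.ndarray, fs: int = 16000) -> bool:
    # dnsmos_score() is assumed; any non-intrusive quality model would do.
    return speech_ratio(pcm16, fs) > 0.5 and dnsmos_score(pcm16, fs) > 3.0
```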

Track 2

  • Training: MOS-annotated sources include synthesized (BC19, SOMOS, TTSDS2), voice-converted (BVCC), telephony/VoIP (PSTN, TCD-VoIP), noise-induced (TMHINT-QI), and SE outputs (Tencent, URGENT2024-SQA, URGENT2025-SQA).
  • Blind test set: 8,000 utterances across 16 SE systems, each rated with ITU-T P.808 MOS (Li et al., 20 Jan 2026).

Dataset Curation Summary

| Dataset | Speech diversity | Noises/RIRs | Annotation |
|---|---|---|---|
| Track 1 Train | 700 h / 2,500 h | Extensive | Clean/noisy pairs |
| Track 1 Blind | 360 + 480 utt. | Diverse | Hidden references |
| Track 2 Train | MOS-labeled | Varied | ITU-T P.808 MOS |
| Track 2 Blind | 8,000 utt. | Broad SE | Human MOS ratings |

4. Baseline Systems and Model Designs

SE Baselines (Track 1)

  • Discriminative: BSRNN [Yu et al. 2023], a sampling-rate-independent model combining an adaptive STFT front-end with a dual-path recurrent spectral-mapping network that minimizes MSE in the complex STFT domain.
  • Generative: FlowSE [Lee et al. 2025], a conditional flow-matching approach that samples enhanced waveforms from a learned distribution and is trained with a flow-matching loss.
  • Hybrid (observed in top teams): coarse restoration by FlowSE, followed by fine spectral refinement with BSRNN; a minimal sketch follows this list.
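A minimal sketch of the hybrid recipe, where flow_se and bsrnn are placeholders for the teams' trained (non-public) modules:

```python
# Hybrid enhancement: a generative stage proposes a coarse clean estimate,
# then a discriminative stage refines it. Module internals are placeholders.
import torch

@torch.no_grad()
def enhance_hybrid(noisy: torch.Tensor, flow_se: torch.nn.Module,
                   bsrnn: torch.nn.Module) -> torch.Tensor:
    coarse = flow_se(noisy)    # generative stage: sample a clean estimate
    refined = bsrnn(coarse)    # discriminative stage: spectral refinement
    return refined
```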

SQA Baselines (Track 2)

  • Uni-VERSA-Ext: extended multi-metric supervision combining MSE on MOS with ranking and correlation losses (see the loss sketch after this list).
  • URGENT-PK: (out of competition) Pairwise ranking model optimizing for alignment with human MOS (Li et al., 20 Jan 2026).
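A minimal sketch of such multi-objective supervision, with assumed loss forms and weights:

```python
# Multi-metric SQA loss: MSE on MOS plus pairwise ranking and correlation
# terms. The exact loss forms and weights are our assumptions.
import torch

def sqa_loss(pred: torch.Tensor, mos: torch.Tensor,
             w_rank: float = 0.5, w_corr: float = 0.5) -> torch.Tensor:
    mse = torch.mean((pred - mos) ** 2)

    # Pairwise ranking: penalize pairs whose predicted order contradicts MOS.
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)
    dm = mos.unsqueeze(0) - mos.unsqueeze(1)
    rank = torch.relu(-dp * torch.sign(dm)).mean()

    # Correlation: 1 - Pearson between predictions and labels in the batch.
    corr = 1.0 - torch.corrcoef(torch.stack([pred, mos]))[0, 1]

    return mse + w_rank * rank + w_corr * corr
```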

Baseline experiments from the URGENT framework also include classic and recent discriminative and generative models such as OM-LSA, Conv-TasNet, TF-GridNet, and VoiceFixer (Zhang et al., 2024).

5. Evaluation Protocols, Metrics, and Procedures

Track 1 (Universal SE)

  • Stage 1 (Objective): Friedman-test-based ranking, aggregating a wide range of objective metrics:
    • Non-intrusive: DNSMOS, NISQA, UTMOS, SCOREQ
    • Intrusive: PESQ (ITU-T P.862), ESTOI, POLQA (ITU-T P.863)
    • Downstream-independent: SpeechBERTScore, log-power spectrum (LPS) distance
    • Downstream-dependent: speaker verification EER, emotion classification accuracy, language ID accuracy, ASR character error rate
  • Stage 2 (Subjective): Human listening evaluation (ITU-T P.808 ACR and CCR, significance testing per ETSI TS 126 077).
  • Key Equations (a direct implementation follows this list):
    • Signal-to-Distortion Ratio (SDR): $\mathrm{SDR} = 10\log_{10} \frac{\| s_\text{target} \|^2}{\| e_\text{interf} + e_\text{noise} + e_\text{artif} \|^2}$
    • Scale-Invariant SDR (SI-SDR): $\alpha = \frac{\langle \hat{s}, s \rangle}{\| s \|^2}, \quad \mathrm{SI\text{-}SDR} = 10\log_{10} \frac{\| \alpha s \|^2}{\| \alpha s - \hat{s} \|^2}$
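The SI-SDR definition translates directly into NumPy; the zero-mean normalization below is a common convention we assume:

```python
# Direct implementation of the SI-SDR formula above
# (s: clean reference, s_hat: enhanced estimate).
import numpy as np

def si_sdr(s_hat: np.ndarray, s: np.ndarray) -> float:
    s = s - s.mean()
    s_hat = s_hat - s_hat.mean()
    alpha = np.dot(s_hat, s) / np.dot(s, s)   # optimal scaling factor
    target = alpha * s
    error = target - s_hat
    return 10 * np.log10(np.dot(target, target) / np.dot(error, error))
```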

Track 2 (SQA)

  • Metrics at utterance and system level (see the helper after this list):
    • MSE: $E[(\hat{y} - y)^2]$
    • Pearson linear correlation (PLCC), Spearman's rank correlation (SRCC), and Kendall's $\tau$
  • Final rankings are computed by dense ranking over averaged error- and correlation-based scores.
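These metrics map directly onto scipy.stats, as in this short helper:

```python
# Track 2 utterance-level metrics via SciPy: MSE plus the three
# correlation statistics used in the evaluation.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def sqa_metrics(pred: np.ndarray, mos: np.ndarray) -> dict:
    return {
        "MSE":  float(np.mean((pred - mos) ** 2)),
        "PLCC": pearsonr(pred, mos)[0],
        "SRCC": spearmanr(pred, mos)[0],
        "KTAU": kendalltau(pred, mos)[0],
    }
```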

The general challenge framework mandates a multi-metric approach, combining twelve reference-based (intrusive), non-intrusive, downstream-independent, and downstream-dependent metrics (Zhang et al., 2024).

6. Results and Analysis

Track 1

  • Participation: 23 valid systems; top 6 advanced to subjective evaluation.
  • Best systems: Hybrid generative–discriminative pipelines (FlowSE followed by BSRNN) achieved superior generalization, especially when paired with MOS-based data curation and augmentation.
  • Relative improvements: SI-SDR increased by +2–3 dB, PESQ by +0.2–0.3, and DNSMOS by +0.15 over competitive baselines.
  • Key lesson: High-quality, curated data yields greater generalization than increasing dataset size indiscriminately.
| System | SI-SDR (dB) | PESQ | DNSMOS |
|---|---|---|---|
| Baseline (FlowSE) | 12.1 | 2.85 | 3.10 |
| Top Team A | 14.8 | 3.12 | 3.32 |
| Top Team B | 14.2 | 3.05 | 3.29 |

Track 2

  • Participation: 6 valid systems.
  • Best system-level correlations: PLCC ≈ 0.93, SRCC ≈ 0.91, Kendall's τ ≈ 0.78 (Uni-VERSA-Ext variants).
  • Best (out of competition): URGENT-PK for MSE and PLCC.
For context, baseline SE comparisons from the broader URGENT framework (Zhang et al., 2024), rather than Track 2 entries, showed:

  • BSRNN/TF-GridNet: high PESQ (2.66/2.76), ESTOI (83.3/84.1), and SDR (14.9/15.4 dB); low MCD/LSD.
  • Conv-TasNet: modest PESQ (2.42) and SDR (14.4 dB); weak on BWE/declipping.
  • VoiceFixer: top DNSMOS (2.93) and NISQA (3.65); poor SI-SDR and speaker/phoneme similarity.
  • OM-LSA: moderate SDR (10.9 dB); weak on BWE/declipping.

7. Insights, Lessons Learned, and Open Challenges

Insights

  • Curated data quality is more impactful than brute-force dataset scaling.
  • Hybrid models exploiting generative model robustness and discriminative model precision set the new performance standard.
  • Objective metrics, both intrusive and non-intrusive, align imperfectly with human perception; subjective listening tests therefore remain pivotal.
  • Multi-metric and ranking-based supervision notably improve MOS prediction.

Outstanding Challenges

  • Achieving real-time, low-latency universal SE for low-resource edge computing.
  • Extending robustness to novel distortion types (e.g., biomedical sensor artifacts).
  • Architectures unified for SE and downstream tasks (ASR, diarization).

Directions for the Future

  • Integration of multi-modal cues (e.g., video, lip reading) to further aid speech enhancement.
  • Expanding perceptual assessment to complex audio scenes (multi-talker, spatial audio).
  • Adopting additional perceptual metrics (e.g., P.563 non-intrusive, POLQA-EL).
  • Emphasis on self-supervised and unsupervised pre-training to minimize label dependence.

Together, these findings suggest a systematic trend toward unified, data-efficient, and perceptually validated SE systems, evaluated on both technical and user-centric grounds (Li et al., 20 Jan 2026, Zhang et al., 2024).
