- The paper introduces a Deep Noise Suppression (DNS) challenge built on large open-source datasets and an online subjective testing framework based on ITU-T P.808.
- Winning submissions demonstrated significant improvements in subjective speech quality across both real-time and non-real-time tracks.
- The study emphasizes realistic noise synthesis for robust speech enhancement and informs future research directions.
Overview of the INTERSPEECH 2020 Deep Noise Suppression Challenge
The INTERSPEECH 2020 Deep Noise Suppression (DNS) Challenge provides a structured pathway for researchers to advance real-time single-channel Speech Enhancement (SE). Centered on improving subjective speech quality, the challenge tackles two persistent problems in evaluating noise suppressors: objective metrics such as PESQ often correlate poorly with subjective assessments, and methods tuned on synthetic test sets tend to degrade on real-world recordings.
Datasets and Methodology
The challenge introduced large-scale, open-source datasets and an online subjective testing framework adhering to the ITU-T P.808 standard. It provided a substantial corpus of clean speech and diverse noise recordings for training robust SE models, emulating realistic acoustic conditions to narrow the gap between synthetic and real-world performance.
Key components of the dataset include:
- Clean Speech: Derived from LibriVox audiobooks; 500 hours of high-quality recordings retained after stringent filtering of clips by Mean Opinion Score (MOS).
- Noise Dataset: Drawn from AudioSet and Freesound, with sampling balanced to cover an extensive set of 150 audio classes.
- Noisy Speech: Synthesized by mixing clean speech with noise at varying Signal-to-Noise Ratios (SNRs), so that the training data reflects realistic recording conditions; a minimal mixing sketch follows this list.
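The mixing recipe itself is straightforward: scale the noise so that the clean-to-noise power ratio hits a sampled target SNR, then add the two signals. Below is a minimal NumPy sketch of that recipe, not the challenge's official synthesis script (which lives in the DNS-Challenge repository); the function name `mix_at_snr`, the placeholder signals, and the 0-40 dB sampling range are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a clean utterance with noise at a target SNR in dB.

    Hypothetical helper for illustration; the challenge's actual
    synthesis scripts live in the official DNS-Challenge repository.
    """
    # Loop or trim the noise so it matches the clean signal's length.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    # Scale the noise so that 20*log10(rms_clean / rms_noise) == snr_db.
    clean_rms = np.sqrt(np.mean(clean ** 2)) + 1e-12
    noise_rms = np.sqrt(np.mean(noise ** 2)) + 1e-12
    target_noise_rms = clean_rms / (10 ** (snr_db / 20))
    return clean + noise * (target_noise_rms / noise_rms)

# Example: mix at an SNR drawn uniformly from 0-40 dB, a commonly used range.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)  # 1 s placeholder "speech"
noise = rng.standard_normal(8000).astype(np.float32)   # 0.5 s placeholder noise
noisy = mix_at_snr(clean, noise, snr_db=rng.uniform(0, 40))
```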
Evaluation Framework
The challenge evaluated SE models with ITU-T P.808, a robust subjective evaluation method, rather than relying on objective metrics. This framework used Amazon Mechanical Turk (MTurk) to crowdsource assessments, maintaining accuracy and reliability through control mechanisms such as qualification tests for raters.
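At the aggregation end, each clip's Absolute Category Rating (ACR) scores from 1 to 5 are averaged into a Mean Opinion Score, typically reported with a confidence interval. Here is a minimal sketch of that aggregation step, assuming the ratings have already passed P.808's qualification and screening controls; the helper name `mos_with_ci` is hypothetical.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Average per-clip ACR ratings (1-5) into a MOS with a t-based CI.

    Illustrative only: ITU-T P.808 additionally specifies rater
    qualification and screening steps that happen before aggregation.
    """
    ratings = np.asarray(ratings, dtype=float)
    mos = ratings.mean()
    half_width = stats.sem(ratings) * stats.t.ppf((1 + confidence) / 2,
                                                  len(ratings) - 1)
    return mos, (mos - half_width, mos + half_width)

print(mos_with_ci([4, 3, 4, 5, 3, 4, 4, 2, 4, 3]))
```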
Competition Structure
The challenge comprised two tracks:
- Real-Time (RT): Restricted to low-complexity models that must process each audio frame within a strict per-frame time budget on standard consumer hardware (a timing sketch follows this list).
- Non-Real-Time (NRT): Placed no constraint on computational complexity, encouraging larger models aimed at the best achievable speech quality.
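A simple way to sanity-check the RT constraint is to time per-frame inference against the budget. In the sketch below, the constants (16 kHz audio, 20 ms frames, a 20 ms budget) are placeholders rather than the official rule's exact numbers, and `enhance_frame` is a hypothetical stand-in for a real model:

```python
import time
import numpy as np

SAMPLE_RATE = 16000
FRAME_MS = 20    # placeholder frame size; see the official rules for exact limits
BUDGET_MS = 20   # placeholder per-frame compute budget on the reference CPU
frame_len = SAMPLE_RATE * FRAME_MS // 1000

def enhance_frame(frame):
    # Hypothetical stand-in for a real model's per-frame inference.
    return frame

# Average wall-clock compute per frame over many frames.
frames = [np.zeros(frame_len, dtype=np.float32) for _ in range(500)]
start = time.perf_counter()
for f in frames:
    enhance_frame(f)
per_frame_ms = (time.perf_counter() - start) * 1000 / len(frames)
print(f"avg per-frame compute: {per_frame_ms:.3f} ms (budget: {BUDGET_MS} ms)")
```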
Results and Findings
The challenge received 28 submissions from 19 teams, showcasing a diverse range of model architectures and training strategies:
- Dataset Utilization: Participants leveraged the open-source datasets extensively, with some augmenting their training data for enhanced performance.
- Challenge Outcomes: Strong models exhibited significant improvements in subjective speech quality, verified through a two-stage testing process that statistically validated the results (a sketch of one such validation follows this list).
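The paper's exact statistical procedure is not reproduced here; the sketch below illustrates one standard way to validate a MOS gain over the noisy baseline, a paired t-test on per-clip scores, using synthetic placeholder data:

```python
import numpy as np
from scipy import stats

# Synthetic placeholder per-clip MOS for the same test clips, rated once
# for the noisy input and once for a submitted model's output.
rng = np.random.default_rng(1)
mos_noisy = rng.normal(3.0, 0.4, size=300)
mos_model = mos_noisy + rng.normal(0.3, 0.3, size=300)

# Paired t-test: is the per-clip MOS gain significantly different from zero?
t_stat, p_value = stats.ttest_rel(mos_model, mos_noisy)
gain = np.mean(mos_model - mos_noisy)
print(f"mean MOS gain = {gain:.2f}, t = {t_stat:.2f}, p = {p_value:.2e}")
```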
Implications and Future Directions
This paper underscores the need for large, representative datasets in SE research. By providing a standardized testing framework, the challenge enables fair comparative analysis across SE methods. Future avenues include speaker-specific noise suppression and no-reference MOS predictors that could streamline model evaluation.
In conclusion, the INTERSPEECH 2020 DNS Challenge marks a methodical step forward for SE, laying groundwork that future academic and industrial research can build on for further improvements in noise suppression.