DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors (2110.01763v4)

Published 5 Oct 2021 in eess.AS and cs.SD

Abstract: Human subjective evaluation is the gold standard to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. We have recently developed a non-intrusive speech quality metric called Deep Noise Suppression Mean Opinion Score (DNSMOS) using the scores from ITU-T Rec. P.808 subjective evaluation. The P.808 scores reflect the overall quality of the audio clip. ITU-T Rec. P.835 subjective evaluation framework gives the standalone quality scores of speech and background noise in addition to the overall quality. In this work, we train an objective metric based on P.835 human ratings that outputs 3 scores: i) speech quality (SIG), ii) background noise quality (BAK), and iii) the overall quality (OVRL) of the audio. The developed metric is highly correlated with human ratings, with a Pearson's Correlation Coefficient (PCC)=0.94 for SIG and PCC=0.98 for BAK and OVRL. This is the first non-intrusive P.835 predictor we are aware of. DNSMOS P.835 is made publicly available as an Azure service.

Summary

The paper introduces DNSMOS P.835, a novel non-intrusive speech quality metric trained on P.835 subjective data to evaluate noise suppressors without reference audio.
DNSMOS P.835 demonstrates high correlation with human ratings (PCC of 0.94 for SIG, 0.98 for BAK/OVRL), validating its accuracy in predicting perceptual speech quality.
This metric provides a scalable alternative to subjective testing for evaluating noise suppressors, with potential real-world application as an Azure service.

DNSMOS P.835: Advancements in Non-Intrusive Speech Quality Assessment

The paper "DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors" by Chandan K A Reddy, Vishak Gopal, and Ross Cutler introduces DNSMOS P.835, a non-intrusive speech quality metric for evaluating noise suppressors in speech enhancement (SE). Unlike intrusive metrics, it needs no clean reference signal, and it is trained to predict the ratings that human listeners give under the ITU-T Rec. P.835 protocol.

Overview and Methodology

The DNSMOS P.835 model is trained on a dataset derived from the Deep Noise Suppression (DNS) Challenge and rated under the ITU-T Rec. P.835 subjective evaluation framework. The framework yields separate ratings for speech quality (SIG), background noise quality (BAK), and overall audio quality (OVRL), and these ratings serve as the training labels for the non-intrusive metric. The model is a Convolutional Neural Network (CNN) that takes the log power spectrogram of each audio clip as its input feature and predicts the three scores.
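To make the input feature concrete, the following is a minimal sketch of a log power spectrogram computed with NumPy. The 16 kHz sampling rate, 320-sample frame, 160-sample hop, Hann window, and the function name `log_power_spectrogram` are illustrative assumptions, not settings confirmed by this summary.

```python
import numpy as np


def log_power_spectrogram(signal, frame_len=320, hop=160, eps=1e-10):
    """Return a (num_frames, frame_len // 2 + 1) log power spectrogram in dB.

    frame_len and hop are assumed values chosen for illustration only.
    """
    signal = np.asarray(signal, dtype=np.float64)
    # Zero-pad very short clips so that at least one full frame exists.
    if signal.size < frame_len:
        signal = np.pad(signal, (0, frame_len - signal.size))
    num_frames = 1 + (signal.size - frame_len) // hop
    # Build an index matrix that slices the clip into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    # Power of the one-sided spectrum of every frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # eps keeps the logarithm finite for silent frames.
    return 10.0 * np.log10(power + eps)


# Synthetic example: one second of white noise at an assumed 16 kHz rate.
rng = np.random.default_rng(0)
clip = rng.standard_normal(16000)
features = log_power_spectrogram(clip)
print(features.shape)  # (99, 161)
```

A CNN would consume the resulting time-frequency matrix much like a single-channel image. The network itself is omitted here because this summary does not describe its layers.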

Significant Results

At the model level, DNSMOS P.835 correlates strongly with human ratings: the Pearson's Correlation Coefficient (PCC) is 0.94 for SIG and 0.98 for both BAK and OVRL. Model-level correlation compares scores averaged over the clips processed by each noise suppressor, so these figures indicate how well the metric ranks noise suppression algorithms, not how accurately it scores individual clips. The authors also state that this is the first non-intrusive P.835 predictor they are aware of, and they have made DNSMOS P.835 publicly available as an Azure service.
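The difference between model-level and clip-level correlation can be sketched as follows. The helper names `pearson` and `model_level_pcc` and all ratings are made up for illustration; none of the numbers come from the paper.

```python
import numpy as np


def pearson(x, y):
    """Pearson's Correlation Coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def model_level_pcc(model_ids, human, predicted):
    """Average both score sets per noise suppressor, then correlate the means."""
    model_ids = np.asarray(model_ids)
    human = np.asarray(human, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    models = np.unique(model_ids)
    human_means = [human[model_ids == m].mean() for m in models]
    predicted_means = [predicted[model_ids == m].mean() for m in models]
    return pearson(human_means, predicted_means)


# Hypothetical ratings: three noise suppressors, two clips each.
model_ids = ["a", "a", "b", "b", "c", "c"]
human = [2.0, 3.0, 3.0, 4.0, 4.0, 5.0]
predicted = [3.0, 2.0, 4.0, 3.0, 5.0, 4.0]

clip_pcc = pearson(human, predicted)
model_pcc = model_level_pcc(model_ids, human, predicted)
print(f"clip-level PCC:  {clip_pcc:.2f}")   # 0.45
print(f"model-level PCC: {model_pcc:.2f}")  # 1.00
```

In this toy data the per-clip predictions are noisy, but the errors cancel once scores are averaged per suppressor. This is why a metric can rank models reliably while remaining less accurate on single clips.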

Implications and Future Directions

The implications of this research are twofold. Practically, DNSMOS P.835 offers a scalable way to evaluate noise suppressors, which can speed up the development and optimization of SE systems by reducing the need for extensive subjective testing. Methodologically, it shows that a deep learning model can predict the separate P.835 dimensions (SIG and BAK) alongside the overall score, which supports more fine-grained predictive models in speech quality assessment.

Future developments, as suggested by the authors, may include expanding the model's complexity or enlarging the dataset with more diverse noise scenarios to further improve accuracy, especially at the clip level. Additionally, exploring different neural network architectures could enhance the fidelity of speech quality predictions.

Conclusion

DNSMOS P.835 represents a substantial advancement in the non-intrusive evaluation of speech quality under noisy conditions, aligning closely with human perceptual judgments. The research presents a credible alternative to traditional intrusive metrics, proposing a reliable tool for the continual improvement of noise suppression technologies. As machine learning and AI continue to evolve, such metrics will be crucial in developing more nuanced and user-centric audio processing solutions.