- The paper introduces a CNN-self-attention model that predicts overall MOS and key speech quality dimensions like noisiness and coloration.
- It employs a four-stage pipeline of Mel-spectrogram segmentation, CNN feature extraction, self-attention for temporal dependencies, and attention pooling for aggregation.
- The model outperforms traditional single-ended models such as P.563 on a large corpus that includes live VoIP recordings, highlighting its practical value for real-time speech quality assessment.
NISQA: A CNN-Self-Attention Model for Speech Quality Prediction
The paper introduces an updated version of the NISQA model, a deep learning approach for non-intrusive (single-ended) prediction of speech quality across several perceptual dimensions. The work addresses the difficulty of assessing speech quality in communication networks, particularly under modern VoIP conditions that existing models such as ITU-T P.563 cover inadequately.
Model Architecture and Development
NISQA's architecture consists of four stages: Mel-spectrogram segmentation, a convolutional neural network (CNN) for framewise modeling, a self-attention network for modeling time dependencies, and an attention-pooling layer for aggregating the feature sequence. This design lets the model predict not only the overall Mean Opinion Score (MOS) but also the four speech quality dimensions of Noisiness, Coloration, Discontinuity, and Loudness.
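To make the first stage concrete, here is a minimal sketch of Mel-spectrogram segmentation; the sampling rate, number of Mel bands, segment width, and hop size below are illustrative assumptions, not the paper's published settings.

```python
# Stage-1 sketch: slice a Mel-spectrogram into overlapping segments.
# sr, n_mels, seg_width, and hop are illustrative values, not the
# paper's published settings.
import numpy as np
import librosa

def mel_segments(wav_path, sr=48000, n_mels=48, seg_width=15, hop=1):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel = librosa.power_to_db(mel)                      # (n_mels, frames)
    pad = seg_width // 2
    mel = np.pad(mel, ((0, 0), (pad, pad)), mode="edge")
    segs = [mel[:, i:i + seg_width]
            for i in range(0, mel.shape[1] - seg_width + 1, hop)]
    return np.stack(segs)                 # (time, n_mels, seg_width)
```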
The CNN extracts features from each Mel-spectrogram segment; the self-attention network then models the temporal relationships within the resulting feature sequence. Finally, the attention-pooling layer weights the quality-relevant parts of the signal, accounting for factors such as the recency effect and varying sensitivity to degradations, which is essential for accurate predictions on variable-length speech samples. A sketch of these stages follows.
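The remaining three stages can be sketched in a few lines of PyTorch. The hypothetical `QualityModel` below is not the released implementation; layer sizes and head names are assumptions, chosen only to show how framewise CNN features, self-attention over time, attention pooling, and per-dimension regression heads fit together.

```python
# Minimal PyTorch sketch of the CNN / self-attention / attention-pooling
# pipeline. All layer sizes and head names are illustrative assumptions.
import torch
import torch.nn as nn

class QualityModel(nn.Module):
    def __init__(self, n_mels=48, seg_width=15, d_model=64, n_heads=4):
        super().__init__()
        # Stage 2: framewise CNN applied to each Mel-spec segment.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, d_model),
        )
        # Stage 3: self-attention over the segment sequence.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Stage 4: attention pooling collapses time to one vector.
        self.att = nn.Linear(d_model, 1)
        # One regression head per target: MOS plus four dimensions.
        self.heads = nn.ModuleDict({
            k: nn.Linear(d_model, 1)
            for k in ["mos", "noi", "col", "dis", "loud"]})

    def forward(self, segments):
        # segments: (batch, time, n_mels, seg_width) from Stage 1.
        b, t = segments.shape[:2]
        x = segments.reshape(b * t, 1, *segments.shape[2:])
        x = self.cnn(x).reshape(b, t, -1)        # framewise features
        x = self.encoder(x)                      # time dependencies
        w = torch.softmax(self.att(x), dim=1)    # per-step weights
        pooled = (w * x).sum(dim=1)              # weighted average over time
        return {k: h(pooled).squeeze(-1) for k, h in self.heads.items()}

model = QualityModel()
out = model(torch.randn(2, 100, 48, 15))  # 2 files, 100 segments each
```

The learned softmax weights in Stage 4 are what let the model emphasize, for example, degradations near the end of a sample (the recency effect) rather than averaging all frames equally.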
Dataset and Evaluation
The authors constructed an extensive dataset of over 13,000 speech files that includes both simulated distortions and real-world recordings from VoIP services such as Skype and Google Meet. The model's robustness was evaluated across 81 datasets to test generalization to unseen data. Open-sourcing these datasets, along with the model's code and weights, is a significant contribution that facilitates further research and model refinement.
Numerical Results and Comparisons
The model delivers strong results, often surpassing traditional single-ended models such as P.563 and ANIQUE+ in predicting speech quality without a clean reference signal. Higher correlation coefficients and lower RMSE values across the validation datasets underscore NISQA's advantage over these established models. However, double-ended models such as POLQA retain an edge in specific scenarios, particularly those aligned with ITU-T P.800 test conditions.
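As a quick illustration of how such comparisons are scored, the sketch below computes Pearson's correlation and RMSE between predicted and subjective MOS. Fitting a first-order polynomial mapping before the RMSE (in the spirit of ITU-T P.1401) is an assumption about the exact evaluation protocol; the toy values are not from the paper.

```python
# Evaluation sketch: Pearson's r and RMSE between predicted and
# subjective MOS. The first-order mapping before RMSE is an assumed
# convention (cf. ITU-T P.1401), not a confirmed detail of the paper.
import numpy as np

def pearson_r(pred, mos):
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    return float(np.corrcoef(pred, mos)[0, 1])

def rmse_mapped(pred, mos, order=1):
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    coeffs = np.polyfit(pred, mos, order)  # map predictions onto MOS scale
    mapped = np.polyval(coeffs, pred)      # compensates per-test offsets
    return float(np.sqrt(np.mean((mapped - mos) ** 2)))

pred = [3.1, 2.4, 4.0, 1.8]  # model outputs (toy values)
mos  = [3.3, 2.5, 4.2, 1.6]  # subjective ratings (toy values)
print(pearson_r(pred, mos), rmse_mapped(pred, mos))
```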
Implications and Future Directions
The implications of this research are twofold. Practically, it provides a deployable, real-time-capable tool for monitoring and improving communication systems, since the dimension scores help diagnose specific causes of degradation. Theoretically, it demonstrates the effectiveness of self-attention for temporal modeling in speech processing tasks.
Future work may adapt NISQA to enhanced or synthesized speech, broadening its range of applications. Expanding the set of perceptual dimensions could likewise decompose quality impairments at a finer granularity.
In sum, the paper brings deep learning advances to speech quality assessment, offering a scalable and robust tool for modern communication networks and a strong benchmark for subsequent work on AI-driven quality prediction.