Multi-dimensional Speech Quality Assessment in Crowdsourcing (2309.07385v1)

Published 14 Sep 2023 in eess.AS and cs.SD

Abstract: Subjective speech quality assessment is the gold standard for evaluating speech enhancement processing and telecommunication systems. The commonly used standard ITU-T Rec. P.800 defines how to measure speech quality in lab environments, and ITU-T Rec. P.808 extended it for crowdsourcing. ITU-T Rec. P.835 extends P.800 to measure the quality of speech in the presence of noise. ITU-T Rec. P.804 targets the conversation test and introduces perceptual speech quality dimensions which are measured during the listening phase of the conversation. The perceptual dimensions are noisiness, coloration, discontinuity, and loudness. We create a crowdsourcing implementation of a multi-dimensional subjective test following the scales from P.804 and extend it to include reverberation, the speech signal, and overall quality. We show the tool is both accurate and reproducible. The tool has been used in the ICASSP 2023 Speech Signal Improvement challenge and we show the utility of these speech quality dimensions in this challenge. The tool will be publicly available as open-source at https://github.com/microsoft/P.808.

Citations (11)

Summary

  • The paper introduces a crowdsourcing-based toolkit that extends traditional ITU-T standards to enable multi-dimensional assessment of speech quality.
  • It validates the toolkit with reproducible experiments showing strong correlations with expert ratings for dimensions such as noisiness and overall quality.
  • The approach efficiently screens non-professional participants and offers a cost-effective method for large-scale evaluations in practical telecommunication challenges.

Multi-dimensional Speech Quality Assessment in Crowdsourcing

The paper "Multi-dimensional Speech Quality Assessment in Crowdsourcing" addresses the challenges associated with traditional speech quality assessment methods and presents a solution in the form of a crowdsourcing-based toolkit. Developed by Babak Naderi, Ross Cutler, and Nicolae-Cătălin Ristea from Microsoft Corporation, this research leverages the flexibility and scalability of crowdsourcing to evaluate speech quality in audio telecommunication systems.

The paper begins by recognizing the limitations inherent in conventional lab-based subjective quality assessments, which are often slow and costly, thus making them impractical for large-scale evaluations. Building on existing standards such as ITU-T P.800 and its extensions (e.g., P.804, P.808, and P.835), the authors propose an enhanced crowdsourcing method that adheres to the recommendations of these standards.

Key Contributions

  • Toolkit Implementation: The authors have extended the P.808 Toolkit with a multi-dimensional quality assessment template. The template covers the perceptual dimensions defined in ITU-T P.804 (noisiness, coloration, discontinuity, and loudness), extended with scales for reverberation, the speech signal, and overall quality.
  • Validation and Reproducibility: The toolkit's results demonstrate a strong correlation with expert ratings and show reproducibility across multiple runs within the same experimental conditions. Particularly strong correlations were observed in model-level analyses for perceptual dimensions like noisiness and overall quality.
  • Application in Challenges: It was used in the ICASSP 2023 Speech Signal Improvement challenge, substantiating the toolkit's robustness and utility for real-world applications. The rankings obtained from this crowdsourcing method showed high consistency with traditional expert evaluations.
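The model-level validation described above boils down to correlating mean crowdsourced scores with mean expert scores per dimension. A minimal sketch of that check, using hypothetical scores rather than the paper's actual data:

```python
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical model-level mean scores (1-5 scale) for one dimension,
# e.g. noisiness: crowdsourced run vs. an expert lab test over 5 models.
crowd  = [4.2, 3.1, 2.5, 4.8, 3.6]
expert = [4.0, 3.3, 2.4, 4.7, 3.5]

print(round(pearson(crowd, expert), 3))
```

A correlation close to 1 at the model level is what licenses the claim that the crowdsourcing ranking can stand in for the expert ranking.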

Technical Insights

A notable aspect of the toolkit is its screening of non-professional participants through preliminary tests that ensure their suitability for the study. This includes verifying their device's bandwidth capabilities and their ability to discern perceptual differences in speech samples, thereby maintaining the integrity of the data collected.
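The screening logic amounts to a qualification gate on each worker. A sketch of that kind of gate is below; the field names and thresholds are illustrative assumptions, not the toolkit's actual schema:

```python
def qualifies(worker, min_bandwidth_hz=14_000, min_correct=4, total_gold=5):
    """Return True if a worker passes both preliminary screens:
    a playback-bandwidth check and a listening test on gold clips
    with known answers. All names/thresholds here are hypothetical."""
    bandwidth_ok = worker["playback_bandwidth_hz"] >= min_bandwidth_hz
    listening_ok = (worker["gold_total"] == total_gold
                    and worker["gold_correct"] >= min_correct)
    return bandwidth_ok and listening_ok

worker = {"playback_bandwidth_hz": 16_000, "gold_correct": 5, "gold_total": 5}
print(qualifies(worker))
```

Only workers passing both checks would proceed to rate the actual test clips, which is how the data quality described above is maintained despite a non-expert pool.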

The research also applies Exploratory Factor Analysis (EFA) to explore underlying relationships among the quality dimensions. Results reveal a factor structure whose factors primarily represent signal quality, discontinuity, and noisiness. This suggests the multi-dimensional approach provides a more nuanced understanding of speech quality degradation than a single overall score.
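The EFA step can be sketched as follows. The ratings here are synthetic, generated from two latent factors purely to illustrate the mechanics of recovering a factor structure; they are not the paper's data, and the paper's own analysis may differ in factor count and method:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic ratings: 200 clips x 6 dimensions (noisiness, coloration,
# discontinuity, loudness, reverberation, signal). Two latent factors
# generate correlated dimensions, mimicking the structure EFA recovers.
true_loadings = np.array([
    [0.9, 0.1],   # noisiness      -> mostly factor 1
    [0.2, 0.8],   # coloration     -> mostly factor 2
    [0.1, 0.9],   # discontinuity  -> mostly factor 2
    [0.3, 0.3],   # loudness       -> mixed
    [0.8, 0.2],   # reverberation  -> mostly factor 1
    [0.4, 0.7],   # signal         -> mixed
])
latent = rng.normal(size=(200, 2))
ratings = latent @ true_loadings.T + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(ratings)
print(np.round(fa.components_, 2))  # estimated loadings, one row per factor
```

Inspecting which dimensions load heavily on which factor is what lets the authors interpret the factors as signal quality, discontinuity, and noisiness.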

Implications and Future Directions

The development of this toolkit has both practical and theoretical implications in the domain of speech processing and telecommunication systems. Practically, it provides an accessible, cost-effective method for large-scale speech quality assessment, essential for rapid development cycles in audio technologies. Theoretically, it allows for extensive data collection to refine our understanding of perceptual speech quality dimensions.

Potential future developments could involve refining the accuracy of ratings for complex dimensions such as coloration and reverberation, possibly by enhancing participant training or the rating-scale descriptions. Additionally, integrating neural network-based analysis could further enhance the precision of non-intrusive, objective quality metrics relative to subjective evaluations.

In summary, this paper exemplifies a significant advancement in employing crowdsourcing for speech quality assessment, offering a viable alternative to traditional methods and paving the way for innovation in evaluation metrics in the field of speech enhancement and telecommunication systems.