End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks (1709.03658v2)

Published 12 Sep 2017 in stat.ML, cs.LG, and cs.SD

Abstract: Speech enhancement model is used to map a noisy speech to a clean speech. In the training stage, an objective function is often adopted to optimize the model parameters. However, in most studies, there is an inconsistency between the model optimization criterion and the evaluation criterion on the enhanced speech. For example, in measuring speech intelligibility, most of the evaluation metric is based on a short-time objective intelligibility (STOI) measure, while the frame based minimum mean square error (MMSE) between estimated and clean speech is widely used in optimizing the model. Due to the inconsistency, there is no guarantee that the trained model can provide optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and evaluation criterion. Because of the utterance-based optimization, temporal correlation information of long speech segments, or even at the entire utterance level, can be considered when perception-based objective functions are used for the direct optimization. As an example, we implement the proposed FCN enhancement framework to optimize the STOI measure. Experimental results show that the STOI of test speech is better than conventional MMSE-optimized speech due to the consistency between the training and evaluation target. Moreover, by integrating the STOI in model optimization, the intelligibility of human subjects and automatic speech recognition (ASR) system on the enhanced speech is also substantially improved compared to those generated by the MMSE criterion.

Citations (267)


Summary

The paper introduces a novel end-to-end FCN approach that directly optimizes STOI to enhance speech intelligibility.
The method leverages an utterance-based framework to preserve temporal dynamics and phase information by processing raw waveforms.
Results demonstrate robust ASR improvements with higher intelligibility despite a slight PESQ decrease, emphasizing a trade-off between quality and clarity.

End-to-End Waveform Utterance Enhancement Using Fully Convolutional Neural Networks

This paper addresses the mismatch between the objective used to train speech enhancement models and the metrics used to evaluate them. Most speech enhancement systems are trained by minimizing the Mean Squared Error (MSE) between the enhanced and clean signals, while intelligibility is evaluated with measures such as Short-Time Objective Intelligibility (STOI). Because the training objective differs from the evaluation criterion, a model that is optimal under MSE is not guaranteed to perform best in practical applications.

The authors propose an end-to-end speech enhancement framework based on Fully Convolutional Networks (FCN) that optimizes a perceptually motivated metric directly. Because the FCN processes whole utterances rather than isolated frames, the objective can take into account temporal correlation over long segments, or even the entire utterance, which is what metrics such as STOI measure.
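To illustrate why an intelligibility-style objective needs more context than a single frame, here is a minimal sketch of a segment-correlation score. It is a simplified stand-in for STOI, not the full measure: it assumes the temporal envelopes of one frequency band are already available, and it omits the one-third octave band decomposition, normalization, and clipping steps. The function names and the default `seg_len=30` (frames per segment) are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    nx = math.sqrt(sum((a - mx) ** 2 for a in x))
    ny = math.sqrt(sum((b - my) ** 2 for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0  # assumption: treat a constant segment as uncorrelated
    return cov / (nx * ny)


def segment_correlation_score(clean_env, enhanced_env, seg_len=30):
    """Mean correlation over all sliding segments of seg_len frames.

    Simplified stand-in for STOI on a single band; each term needs
    seg_len consecutive frames, so it cannot be computed frame by frame.
    """
    if len(clean_env) != len(enhanced_env):
        raise ValueError("envelopes must have the same length")
    if len(clean_env) < seg_len:
        raise ValueError("need at least seg_len frames")
    scores = [
        pearson(clean_env[i:i + seg_len], enhanced_env[i:i + seg_len])
        for i in range(len(clean_env) - seg_len + 1)
    ]
    return sum(scores) / len(scores)


def stoi_like_loss(clean_env, enhanced_env, seg_len=30):
    """Negated score, so that minimizing the loss maximizes correlation."""
    return -segment_correlation_score(clean_env, enhanced_env, seg_len)
```

A copy of the clean envelope scaled by a constant has a large MSE but a correlation score of 1, which is one simple way the two criteria can disagree.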

Key Contributions

End-to-End FCN Framework: The proposed FCN model operates directly on the waveform, so no frequency-domain transformation (and its additional computation) is needed. The architecture accepts variable-length inputs, and its convolutional layers preserve local temporal structure. Working on the waveform also retains phase information, which magnitude-only approaches typically ignore.
Direct Optimization of Intelligibility Metrics: The paper integrates STOI into the training objective in place of the traditional MMSE criterion. In the reported experiments, this direct optimization yielded higher STOI scores, improved intelligibility for human listeners, and better performance in Automatic Speech Recognition (ASR) systems.
Temporal and Spectral Advantages: By dispensing with fixed input lengths and fully connected layers, the FCN processes the utterance as a continuous signal and better preserves its temporal and spectral relationships. This avoids the padding effects and other artifacts that arise in frame-based processing.
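The variable-length property described above follows from using only convolutional layers. The following is a minimal, dependency-free sketch of that idea, not the architecture from the paper: it uses a single channel, and the kernel sizes, the leaky ReLU activation, the `tanh` output, and all function names are assumptions made for illustration.

```python
import math


def conv1d_same(signal, kernel):
    """1-D convolution with zero padding so output length equals input length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]


def leaky_relu(values, slope=0.1):
    """Element-wise leaky ReLU (assumed activation for this sketch)."""
    return [v if v > 0 else slope * v for v in values]


def tiny_fcn(waveform, kernels):
    """Stack of convolutional layers with no fully connected layer.

    Because no layer depends on the input length, the same kernels
    map a waveform of any length to an output of the same length.
    """
    hidden = list(waveform)
    for kernel in kernels[:-1]:
        hidden = leaky_relu(conv1d_same(hidden, kernel))
    out = conv1d_same(hidden, kernels[-1])
    # Assumption: squash output samples into [-1, 1] like a normalized waveform.
    return [math.tanh(v) for v in out]
```

Calling `tiny_fcn` with the same `kernels` on utterances of 50, 123, or 1000 samples returns outputs of 50, 123, and 1000 samples respectively. A fully connected layer would instead fix the input size at training time.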

Findings and Implications

The experimental results show improvements in STOI and ASR performance for the utterance-based FCN compared with frame-based alternatives. PESQ scores (which reflect perceived audio quality) decreased slightly under STOI optimization, while intelligibility scores were substantially higher. This is a worthwhile trade for systems where speech clarity takes priority over quality. Jointly optimizing the MSE and STOI objectives improved intelligibility without significantly degrading speech quality.
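One common way to combine two such objectives is a weighted sum. The sketch below is an assumption about the general form only: the weighting used in the paper is not stated in this summary, the weight `alpha` is hypothetical, and a plain whole-sequence correlation stands in for STOI.

```python
import math


def mse(clean, enhanced):
    """Mean squared error between two equal-length sequences."""
    return sum((c - e) ** 2 for c, e in zip(clean, enhanced)) / len(clean)


def correlation(x, y):
    """Whole-sequence correlation, used here as a stand-in for STOI."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    nx = math.sqrt(sum((a - mx) ** 2 for a in x))
    ny = math.sqrt(sum((b - my) ** 2 for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0  # assumption: a constant sequence counts as uncorrelated
    return cov / (nx * ny)


def joint_loss(clean, enhanced, alpha=1.0, score_fn=correlation):
    """MSE minus a weighted intelligibility score (hypothetical weighting).

    alpha = 0 reduces to plain MSE; a larger alpha puts more weight on
    the intelligibility term relative to sample-level fidelity.
    """
    return mse(clean, enhanced) - alpha * score_fn(clean, enhanced)
```

With this form, the MSE term penalizes an output that correlates perfectly with the clean signal but has the wrong scale, while the score term rewards outputs whose temporal pattern follows the clean signal.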

The authors argue that their findings substantiate the disconnect between intelligibility and traditional quality metrics, supporting the necessity of application-specific objective functions. This aligns with related works indicating high correlation coefficients between WER improvements and STOI, further demonstrating the predictive power of intelligibility metrics in ASR systems.

Future Developments

This paper lays the groundwork for exploring objective functions tailored to specific applications, including multi-objective learning that balances several evaluation criteria. The FCN framework's flexibility also suggests that it could be made more robust to a wider range of acoustic distortions, such as reverberation and non-stationary noise.

In summary, the paper narrows the gap between how speech enhancement models are trained and how their output is judged by listeners and recognizers. This is relevant to ASR and human-computer interaction applications, where intelligibility is the primary requirement.
