Raw Waveform-based Speech Enhancement by Fully Convolutional Networks
This paper presents a novel approach to speech enhancement (SE) that employs fully convolutional networks (FCNs) operating directly on raw waveform inputs. It contrasts this methodology with traditional models such as deep neural networks (DNNs) and convolutional neural networks (CNNs), which primarily operate on the magnitude spectrum, in particular the log-power spectrum (LPS), and thereby discard the phase information needed for high-quality speech synthesis.
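To make the phase-discarding step concrete, the following sketch shows how an LPS feature is typically extracted from one windowed frame (a minimal NumPy illustration; the frame length and FFT size are assumptions, not the paper's exact settings):

```python
import numpy as np

def log_power_spectrum(frame, n_fft=512):
    """Compute the log-power spectrum (LPS) of one windowed frame.

    LPS-based enhancers train only on the log of the squared magnitude;
    the phase (np.angle(spec)) is dropped here and the noisy phase is
    typically reused when resynthesizing the enhanced waveform.
    """
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    lps = np.log(np.abs(spec) ** 2 + 1e-12)  # magnitude only; phase discarded
    return lps

frame = np.random.randn(512)        # stand-in for one speech frame
lps = log_power_spectrum(frame)     # 257 frequency bins for a 512-point FFT
```

Because only `lps` is passed to the model, any enhancement performed in this domain cannot correct the phase, which is the limitation the waveform-based FCN avoids.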
Methodology and Approach
The proposed FCN-based model processes the speech signal end-to-end, from waveform input to enhanced waveform output. This design avoids the conventional detour through a frequency-domain representation, which adds computation and can discard local temporal structure. Because the FCN consists exclusively of convolutional layers, it preserves localized temporal information while remaining compact, with a significantly smaller parameter space (only about 0.2% of that in the DNN and CNN models).
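The structure described above can be sketched as a stack of 1-D convolutions with no fully connected layer anywhere, so each output sample depends only on a local neighborhood of input samples. This is a single-channel NumPy illustration under assumed layer counts and kernel widths, not the paper's actual multi-channel architecture:

```python
import numpy as np

def conv1d(x, kernel):
    # Zero-pad so the output has the same length as the input,
    # keeping input and output waveforms aligned sample-for-sample.
    pad = len(kernel) // 2
    return np.convolve(np.pad(x, pad), kernel, mode="valid")[: len(x)]

def fcn_enhance(noisy, kernels):
    """Minimal FCN forward pass: convolutions only, no fully connected
    layer, so every output sample is computed from nearby inputs."""
    h = noisy
    for k in kernels[:-1]:
        h = np.tanh(conv1d(h, k))   # hidden convolutional layers
    return conv1d(h, kernels[-1])   # linear output layer -> waveform

rng = np.random.default_rng(0)
kernels = [rng.standard_normal(11) * 0.1 for _ in range(3)]  # assumed sizes
noisy = rng.standard_normal(1024)
enhanced = fcn_enhance(noisy, kernels)  # same length as the input waveform
```

A real FCN would use many filters per layer and learned weights; the point of the sketch is that the waveform-to-waveform mapping requires no flattening step and no dense matrix tied to a fixed frame position.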
Experimental Findings
Quantitative results from the paper indicate that FCNs outperform DNNs and CNNs, particularly in recovering high-frequency components of the speech signal—a crucial factor for speech intelligibility and quality. The experimental evaluation includes metrics such as short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). FCNs achieved higher STOI scores compared to both LPS-based DNNs and CNNs, suggesting superior performance in maintaining speech intelligibility. Additionally, PESQ scores demonstrate that FCNs enhance speech quality better than waveform-based DNN models.
Further qualitative analysis reveals that fully connected layers, integral to the DNN and CNN architectures, constrain the simultaneous modeling of high- and low-frequency waveform components due to inherent structural limitations. The FCN's local connections between output samples and neighboring input samples alleviate these constraints, as evidenced by its superior handling of high-frequency constituents.
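The "local connection" property can be quantified by the receptive field of a stack of stride-1 convolutions, i.e., how many neighboring input samples influence one output sample (the depth and kernel size below are illustrative assumptions, not the paper's configuration):

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field of a stack of stride-1 convolutions: each output
    sample is connected to only this many neighbouring input samples,
    unlike a fully connected layer, which ties every output to every
    input in the frame."""
    return num_layers * (kernel_size - 1) + 1

rf = receptive_field(8, 11)  # e.g. 8 layers of width-11 kernels -> 81 samples
```

A bounded receptive field means the model can track fast local oscillations (high frequencies) without every output being entangled with the entire frame, which is one way to read the paper's qualitative finding.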
Implications and Future Directions
The adoption of FCNs for speech enhancement introduces significant implications for future research and applications. The reduced model complexity renders the approach suitable for deployment in resource-constrained environments, such as mobile devices. Furthermore, by discarding fully connected layers, the FCN model not only retains local temporal information but also reduces the risk of overfitting, leading to more robust and generalized speech enhancement solutions.
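Back-of-the-envelope arithmetic shows why dropping fully connected layers shrinks the model so dramatically. The layer sizes below are assumptions chosen for illustration (not the paper's exact configuration), but they reproduce a ratio on the order of the reported 0.2%:

```python
# Illustrative parameter counts; all layer sizes are assumed, and the
# first conv layer's single input channel is ignored for simplicity.
frame, hidden = 512, 2048

# DNN: 512 -> 2048 -> 2048 -> 2048 -> 512, fully connected (weights + biases)
dnn = (frame * hidden + hidden) \
    + 2 * (hidden * hidden + hidden) \
    + (hidden * frame + frame)

# FCN: a few conv layers with small kernels and no fully connected layer
filters, width, layers = 15, 11, 8
fcn = layers * (filters * filters * width + filters)

ratio = fcn / dnn
print(f"FCN uses {100 * ratio:.2f}% of the DNN's parameters")
```

The dense layers dominate because their cost grows with the product of layer widths, whereas each conv layer's cost depends only on kernel size and channel count, independent of the waveform length.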
The research suggests future exploration into utterance-based processing rather than frame-wise operations to further optimize enhancement processes. This direction could ensure that correlations across entire utterances are maintained, potentially further improving both intelligibility and speech quality.
Conclusion
The paper successfully demonstrates the efficacy of utilizing FCNs for waveform-based speech enhancement, highlighting their advantages over traditional DNN and CNN architectures that operate primarily in the frequency domain. By focusing on raw waveform inputs, the FCN model advances the capability to enhance speech signals more effectively and with reduced computational demand, setting the stage for further innovation in the domain of speech processing and enhancement.