Raw Waveform-based Speech Enhancement by Fully Convolutional Networks
This paper presents a novel approach to speech enhancement (SE) that employs fully convolutional networks (FCNs) operating directly on raw waveform inputs. It contrasts this methodology with traditional models such as deep neural networks (DNNs) and convolutional neural networks (CNNs), which primarily operate on the magnitude spectrum, in particular the log-power spectrum (LPS), and thereby discard the phase information needed for high-quality speech synthesis.
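To make the phase-discarding step concrete, the following sketch shows how an LPS feature is typically extracted from one windowed frame (a minimal NumPy illustration; the frame length and FFT size are assumptions, not the paper's exact settings):

```python
import numpy as np

def log_power_spectrum(frame, n_fft=512):
    """Compute the log-power spectrum (LPS) of one windowed frame.

    LPS-based enhancers train only on the log of the squared magnitude;
    the phase (np.angle(spec)) is dropped here and the noisy phase is
    typically reused when resynthesizing the enhanced waveform.
    """
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    lps = np.log(np.abs(spec) ** 2 + 1e-12)  # magnitude only; phase discarded
    return lps

frame = np.random.randn(512)        # stand-in for one speech frame
lps = log_power_spectrum(frame)     # 257 frequency bins for a 512-point FFT
```

Because only `lps` is passed to the model, any enhancement performed in this domain cannot correct the phase, which is the limitation the waveform-based FCN avoids.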
Methodology and Approach
The proposed FCN-based model processes the speech signal end-to-end, from waveform input to enhanced waveform output. This design avoids the conventional detour through a frequency-domain representation, which adds computation and can discard local temporal structure. Because the FCN consists exclusively of convolutional layers, it preserves localized temporal information while remaining compact, with a significantly smaller parameter space (only about 0.2% of that in the DNN and CNN models).
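The structure described above can be sketched as a stack of 1-D convolutions with no fully connected layer anywhere, so each output sample depends only on a local neighborhood of input samples. This is a single-channel NumPy illustration under assumed layer counts and kernel widths, not the paper's actual multi-channel architecture:

```python
import numpy as np

def conv1d(x, kernel):
    # Zero-pad so the output has the same length as the input,
    # keeping input and output waveforms aligned sample-for-sample.
    pad = len(kernel) // 2
    return np.convolve(np.pad(x, pad), kernel, mode="valid")[: len(x)]

def fcn_enhance(noisy, kernels):
    """Minimal FCN forward pass: convolutions only, no fully connected
    layer, so every output sample is computed from nearby inputs."""
    h = noisy
    for k in kernels[:-1]:
        h = np.tanh(conv1d(h, k))   # hidden convolutional layers
    return conv1d(h, kernels[-1])   # linear output layer -> waveform

rng = np.random.default_rng(0)
kernels = [rng.standard_normal(11) * 0.1 for _ in range(3)]  # assumed sizes
noisy = rng.standard_normal(1024)
enhanced = fcn_enhance(noisy, kernels)  # same length as the input waveform
```

A real FCN would use many filters per layer and learned weights; the point of the sketch is that the waveform-to-waveform mapping requires no flattening step and no dense matrix tied to a fixed frame position.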
Experimental Findings
Quantitative results from the paper indicate that FCNs outperform DNNs and CNNs, particularly in recovering high-frequency components of the speech signal—a crucial factor for speech intelligibility and quality. The experimental evaluation includes metrics such as short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). FCNs achieved higher STOI scores compared to both LPS-based DNNs and CNNs, suggesting superior performance in maintaining speech intelligibility. Additionally, PESQ scores demonstrate that FCNs enhance speech quality better than waveform-based DNN models.
Further qualitative analysis reveals that fully connected layers, integral to the DNN and CNN architectures, constrain the simultaneous modeling of high- and low-frequency waveform components due to inherent structural limitations. The FCN's local connections between output samples and neighboring input samples alleviate these constraints, as evidenced by its superior handling of high-frequency constituents.
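The "local connection" property can be quantified by the receptive field of a stack of stride-1 convolutions, i.e., how many neighboring input samples influence one output sample (the depth and kernel size below are illustrative assumptions, not the paper's configuration):

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field of a stack of stride-1 convolutions: each output
    sample is connected to only this many neighbouring input samples,
    unlike a fully connected layer, which ties every output to every
    input in the frame."""
    return num_layers * (kernel_size - 1) + 1

rf = receptive_field(8, 11)  # e.g. 8 layers of width-11 kernels -> 81 samples
```

A bounded receptive field means the model can track fast local oscillations (high frequencies) without every output being entangled with the entire frame, which is one way to read the paper's qualitative finding.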
Implications and Future Directions
The adoption of FCNs for speech enhancement introduces significant implications for future research and applications. The reduced model complexity renders the approach suitable for deployment in resource-constrained environments, such as mobile devices. Furthermore, by discarding fully connected layers, the FCN model not only retains local temporal information but also reduces the risk of overfitting, leading to more robust and generalized speech enhancement solutions.
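Back-of-the-envelope arithmetic shows why dropping fully connected layers shrinks the model so dramatically. The layer sizes below are assumptions chosen for illustration (not the paper's exact configuration), but they reproduce a ratio on the order of the reported 0.2%:

```python
# Illustrative parameter counts; all layer sizes are assumed, and the
# first conv layer's single input channel is ignored for simplicity.
frame, hidden = 512, 2048

# DNN: 512 -> 2048 -> 2048 -> 2048 -> 512, fully connected (weights + biases)
dnn = (frame * hidden + hidden) \
    + 2 * (hidden * hidden + hidden) \
    + (hidden * frame + frame)

# FCN: a few conv layers with small kernels and no fully connected layer
filters, width, layers = 15, 11, 8
fcn = layers * (filters * filters * width + filters)

ratio = fcn / dnn
print(f"FCN uses {100 * ratio:.2f}% of the DNN's parameters")
```

The dense layers dominate because their cost grows with the product of layer widths, whereas each conv layer's cost depends only on kernel size and channel count, independent of the waveform length.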
The research suggests future exploration into utterance-based processing rather than frame-wise operations to further optimize enhancement processes. This direction could ensure that correlations across entire utterances are maintained, potentially further improving both intelligibility and speech quality.
Conclusion
The paper successfully demonstrates the efficacy of utilizing FCNs for waveform-based speech enhancement, highlighting their advantages over traditional DNN and CNN architectures that operate primarily in the frequency domain. By focusing on raw waveform inputs, the FCN model advances the capability to enhance speech signals more effectively and with reduced computational demand, setting the stage for further innovation in the domain of speech processing and enhancement.