PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network (1911.04697v1)

Published 12 Nov 2019 in cs.SD and eess.AS

Abstract: Time-frequency (T-F) domain masking is a mainstream approach for single-channel speech enhancement. Recently, focuses have been put to phase prediction in addition to amplitude prediction. In this paper, we propose a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, for this task. Unlike previous methods that directly use a complex ideal ratio mask to supervise the DNN learning, we design a two-stream network, where amplitude stream and phase stream are dedicated to amplitude and phase prediction. We discover that the two streams should communicate with each other, and this is crucial to phase prediction. In addition, we propose frequency transformation blocks to catch long-range correlations along the frequency axis. The visualization shows that the learned transformation matrix spontaneously captures the harmonic correlation, which has been proven to be helpful for T-F spectrogram reconstruction. With these two innovations, PHASEN acquires the ability to handle detailed phase patterns and to utilize harmonic patterns, getting 1.76dB SDR improvement on AVSpeech + AudioSet dataset. It also achieves significant gains over Google's network on this dataset. On Voice Bank + DEMAND dataset, PHASEN outperforms previous methods by a large margin on four metrics.


Insights into the PHASEN Architecture for Speech Enhancement

The paper "PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network" approaches single-channel speech enhancement with a two-stream deep neural network. It targets the difficulty of predicting phase in the time-frequency (T-F) domain and treats phase recovery as a goal alongside amplitude enhancement, rather than an afterthought.

Core Contributions

PHASEN introduces two changes relative to earlier T-F masking methods: a two-stream network and frequency transformation blocks (FTBs). The two-stream network splits amplitude prediction and phase prediction into separate streams that communicate with each other. This lets each stream specialize, and it addresses a weakness of single-stream methods, where phase information is underused or ignored entirely. The amplitude and phase streams exchange information in both directions, and the authors report that this exchange is crucial to phase prediction.
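The two-stream idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the tanh gating in `exchange`, the 1x1 projection weights, the sigmoid mask, and the way `combine` multiplies a magnitude mask by a unit-norm phase estimate are all assumptions chosen to show how two streams could inform each other and then be merged into one complex spectrogram. All names and shapes are hypothetical.

```python
import numpy as np

def exchange(target, source, weight):
    """Gate `target` features with tanh of a 1x1 projection of `source`.

    target: (C_t, T, F), source: (C_s, T, F), weight: (C_t, C_s).
    The gating form is an assumption made for illustration.
    """
    projected = np.einsum("oc,ctf->otf", weight, source)
    return target * np.tanh(projected)

def combine(noisy_mag, mask, phase_raw, eps=1e-8):
    """Merge an amplitude mask and a phase estimate into a complex spectrogram.

    noisy_mag: (T, F) magnitude of the noisy input.
    mask:      (T, F) amplitude mask.
    phase_raw: (2, T, F) unnormalized real and imaginary parts.
    """
    norm = np.sqrt(phase_raw[0] ** 2 + phase_raw[1] ** 2) + eps
    unit_phase = (phase_raw[0] + 1j * phase_raw[1]) / norm  # magnitude ~1
    return noisy_mag * mask * unit_phase

rng = np.random.default_rng(0)
T, F = 5, 8                                   # toy frame and bin counts
amp_feat = rng.standard_normal((4, T, F))     # amplitude-stream features
pha_feat = rng.standard_normal((2, T, F))     # phase-stream features
w_p2a = rng.standard_normal((4, 2))           # phase -> amplitude projection
w_a2p = rng.standard_normal((2, 4))           # amplitude -> phase projection

# Bidirectional exchange: each stream is modulated by the other.
amp_next = exchange(amp_feat, pha_feat, w_p2a)
pha_next = exchange(pha_feat, amp_feat, w_a2p)

noisy_mag = np.abs(rng.standard_normal((T, F)))
mask = 1.0 / (1.0 + np.exp(-amp_next.mean(axis=0)))  # sigmoid mask in (0, 1)
enhanced = combine(noisy_mag, mask, pha_next)
print(enhanced.shape, enhanced.dtype)
```

In this sketch the magnitude of `enhanced` is set by the amplitude stream alone and its angle by the phase stream alone, so the only coupling between the two predictions is the explicit `exchange` step.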

The second contribution is the FTB. These blocks are designed to capture long-range correlations along the frequency axis, and the authors' visualization shows that the learned transformation matrix captures the harmonic correlations visible in T-F spectrograms. Conventional convolutional layers have small receptive fields and do not exploit such distant relations efficiently, whereas an FTB lets every frequency bin draw on every other bin. This helps PHASEN reconstruct detailed T-F patterns of speech.
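A rough way to picture a transformation along the frequency axis is a single F x F matrix applied to every frame, so each output bin is a weighted sum over all input bins. The sketch below is an assumption-laden illustration: in the paper the matrix is learned, while here `harmonic_matrix` is a hand-built, hypothetical stand-in that links each bin to its integer multiples, and any further components of an actual FTB are omitted.

```python
import numpy as np

def frequency_transform(features, matrix):
    """Apply one (F, F) matrix along the last (frequency) axis.

    features: (C, T, F). matrix[i, j] is the weight of input bin i
    in output bin j. Returns an array of shape (C, T, F).
    """
    return features @ matrix

def harmonic_matrix(num_bins, max_harmonic=3):
    """Hand-built stand-in for a learned matrix (illustration only).

    Connects bin k with bins 2k, 3k, ... in both directions.
    """
    m = np.eye(num_bins)
    for k in range(1, num_bins):
        for h in range(2, max_harmonic + 1):
            if h * k < num_bins:
                m[h * k, k] = 1.0  # harmonic h*k feeds its fundamental k
                m[k, h * k] = 1.0  # fundamental k feeds harmonic h*k
    return m

num_bins = 16
features = np.zeros((1, 1, num_bins))
features[0, 0, 6] = 1.0                       # energy in a single bin

out = frequency_transform(features, harmonic_matrix(num_bins))
active = np.flatnonzero(out[0, 0])
print(active)                                 # bins reached from bin 6
```

Energy placed in bin 6 reaches bins 2, 3, 6, and 12 in one step, which a small convolution kernel sliding along the frequency axis could only achieve by stacking many layers.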

Experimental Evaluation and Performance

The authors evaluate PHASEN on two datasets, AVSpeech + AudioSet and the Voice Bank + DEMAND corpus, and report clear improvements over existing methods. On AVSpeech + AudioSet, the two design changes yield a signal-to-distortion ratio (SDR) improvement of 1.76 dB. On Voice Bank + DEMAND, PHASEN shows gains on metrics such as PESQ, CSIG, and CBAK. These results support the claim that PHASEN improves both amplitude and phase, recovering cleaner and more intelligible speech from noisy inputs.
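To make the SDR figure concrete, the sketch below computes SDR with the plain energy-ratio definition, 10 log10(||s||^2 / ||s - s_hat||^2). This is an assumption for illustration: the exact evaluation protocol behind the paper's numbers may use a different SDR variant, and the signals here are synthetic, not taken from either dataset.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB, simple energy-ratio form."""
    distortion = reference - estimate
    ratio = (np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0                # one second at 16 kHz
clean = np.sin(2 * np.pi * 220.0 * t)         # synthetic "speech" stand-in
noise = 0.1 * rng.standard_normal(t.shape)

noisy = clean + noise                         # unprocessed input
enhanced = clean + 0.5 * noise                # toy enhancer halves the residual

improvement = sdr_db(clean, enhanced) - sdr_db(clean, noisy)
print(f"SDR improvement: {improvement:.2f} dB")
```

Halving the residual amplitude corresponds to 20 log10(2), or about 6.02 dB, so an improvement of 1.76 dB corresponds to cutting the residual distortion energy by roughly a third under this definition.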

Comparisons with contemporary models support these findings. On the AVSpeech + AudioSet dataset, PHASEN surpasses Google’s audio-visual network under mono-audio conditions, achieving higher SDR with less training data and fewer training steps. On the Voice Bank + DEMAND dataset, it outperforms prominent time-domain and hybrid models such as SEGAN and MDPhD.

Practical and Theoretical Implications

The contributions of PHASEN have substantial implications for both theoretical research and practical applications in speech processing. The dual-stream and FTB designs reflect a shift toward more specialized neural architectures capable of fine-grained processing of complex audio features. This approach paves the way for future research into the exploitation of harmonic structures and phase information, not just in enhancement tasks but also potentially extending to speech separation and recognition tasks.

From a practical perspective, the results suggest that PHASEN is suitable for complex real-world environments where speech must be extracted from highly noisy conditions. The current architecture is not yet tailored for low-latency applications, but a future adaptation to real-time processing could expand its utility in communication technologies such as VoIP or live streaming.

Conclusion

The methods and results presented in PHASEN offer a promising direction for speech enhancement. By building phase prediction and harmonic correlation directly into the architecture, the work reframes what a T-F masking network should model. Adapting PHASEN for low-latency applications and extending it to broader audio processing tasks are natural next steps.


Authors (4)

Dacheng Yin (13 papers)
Chong Luo (58 papers)
Zhiwei Xiong (83 papers)
Wenjun Zeng (130 papers)

Citations (284)
