A Two-Stage Framework in Cross-Spectrum Domain for Real-Time Speech Enhancement (2401.10494v1)
Abstract: The two-stage pipeline is popular in speech enhancement tasks due to its superiority over traditional single-stage methods. Current two-stage approaches usually enhance the magnitude spectrum in the first stage and then modify the complex spectrum in the second stage to suppress residual noise and recover the speech phase. This entire process is performed in the short-time Fourier transform (STFT) spectrum domain. In this paper, we re-implement the second sub-process in the short-time discrete cosine transform (STDCT) spectrum domain, motivated by our finding that the STDCT offers greater noise-suppression capability than the STFT. Additionally, the implicit phase of the STDCT enables simpler and more efficient phase recovery, which is challenging and computationally expensive in STFT-based methods. We therefore propose a novel two-stage framework, the STFT-STDCT spectrum fusion network (FDFNet), for speech enhancement in the cross-spectrum domain. Experimental results demonstrate that the proposed FDFNet outperforms previous two-stage methods and also exhibits superior performance compared to other advanced systems.
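The "implicit phase" property of the STDCT mentioned in the abstract can be illustrated with a minimal sketch (not part of the paper's implementation; the frame length, window, and random test signal are illustrative assumptions): an STFT frame yields complex coefficients whose phase must be estimated separately, whereas the STDCT of the same real-valued frame is itself real, with phase information absorbed into the coefficient signs, so the frame is reconstructed from the real coefficients alone.

```python
import numpy as np
from scipy.fft import dct, idct

# One windowed frame of a real-valued signal (illustrative stand-in for speech).
rng = np.random.default_rng(0)
frame = rng.standard_normal(512) * np.hanning(512)

# STFT analysis of the frame: a complex spectrum, so phase must be
# handled explicitly (e.g., estimated or reused from the noisy input).
stft_spec = np.fft.rfft(frame)
magnitude, phase = np.abs(stft_spec), np.angle(stft_spec)
assert np.iscomplexobj(stft_spec)

# STDCT analysis of the same frame: a real spectrum, where the sign of
# each coefficient carries the phase information implicitly.
stdct_spec = dct(frame, type=2, norm="ortho")
assert np.isrealobj(stdct_spec)

# Perfect reconstruction from the real STDCT coefficients alone,
# with no separate phase-recovery step.
recon = idct(stdct_spec, type=2, norm="ortho")
assert np.allclose(recon, frame)
```

This is why an STDCT-domain second stage can modify a single real-valued spectrum and still implicitly correct phase, whereas an STFT-domain second stage must operate on (or estimate) the complex spectrum.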