Unrestricted Global Phase Bias-Aware Single-channel Speech Enhancement with Conformer-based Metric GAN (2402.08252v2)
Abstract: With the rapid development of neural networks in recent years, networks for single-channel speech enhancement have become remarkably effective at enhancing the magnitude spectrum of noisy speech. Enhancing the phase spectrum with neural networks, however, remains largely ineffective and is still a challenging problem. In this paper, we find that the human ear cannot sensitively perceive the difference between a precise phase spectrum and a biased-phase (BP) spectrum. We therefore propose an optimization method for phase reconstruction that allows freedom in the global phase bias instead of requiring reconstruction of the precise phase spectrum. We apply it to a Conformer-based Metric Generative Adversarial Network (CMGAN) baseline model, which relaxes the existing precise-phase constraint and gives the neural network a broader learning space. Results show that this method achieves new state-of-the-art performance without incurring additional computational overhead.
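The core idea of the abstract — scoring a reconstructed phase spectrum only up to an unrestricted global phase offset — can be sketched as a bias-invariant phase distance. The sketch below is an illustration under our own assumptions, not the paper's actual loss: the function name and the cosine distance are hypothetical, and the optimal global bias is taken in closed form as the circular mean of the frame-wise phase errors.

```python
import numpy as np

def bias_invariant_phase_loss(phi_pred, phi_true):
    """Phase distance invariant to a single global phase offset.

    Hypothetical sketch of the paper's relaxation: instead of penalising
    phi_pred against phi_true directly, first remove the best global bias b.
    For the cosine distance, the optimal b is the circular mean of the
    phase differences (the angle maximising Re{e^{-ib} * sum e^{i*diff}}).
    """
    diff = phi_true - phi_pred
    # Optimal global bias: circular mean of the phase differences.
    b = np.angle(np.mean(np.exp(1j * diff)))
    # Anti-wrapped residual error after removing the global bias.
    residual = np.angle(np.exp(1j * (diff - b)))
    return np.mean(1.0 - np.cos(residual)), b
```

Under this formulation, adding any constant offset to the predicted phase leaves the loss unchanged, which is exactly the extra freedom the paper grants the network relative to a precise-phase objective.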