
A Deep Representation Learning-based Speech Enhancement Method Using Complex Convolution Recurrent Variational Autoencoder (2312.09620v1)

Published 15 Dec 2023 in eess.AS

Abstract: Generally, the performance of deep neural networks (DNNs) heavily depends on the quality of data representation learning. Our preliminary work has emphasized the significance of deep representation learning (DRL) in the context of speech enhancement (SE) applications. Specifically, our initial SE algorithm employed a gated recurrent unit variational autoencoder (VAE) with a Gaussian distribution to enhance the performance of certain existing SE systems. Building upon our preliminary framework, this paper introduces a novel approach for SE using deep complex convolutional recurrent networks with a VAE (DCCRN-VAE). DCCRN-VAE assumes that the latent variables of signals follow complex Gaussian distributions that are modeled by DCCRN, as these distributions can better capture the behaviors of complex signals. Additionally, we propose the application of a residual loss in DCCRN-VAE to further improve the quality of the enhanced speech. Compared to our preliminary work, DCCRN-VAE introduces a more sophisticated DCCRN structure and probability distribution for DRL. Furthermore, in comparison to DCCRN, DCCRN-VAE employs a more advanced DRL strategy. The experimental results demonstrate that the proposed SE algorithm outperforms both our preliminary SE framework and the state-of-the-art DCCRN SE method in terms of scale-invariant signal-to-distortion ratio, speech quality, and speech intelligibility.
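
The abstract does not spell out how the complex Gaussian latent variables are sampled, but the standard VAE reparameterization trick extends naturally to a circularly-symmetric complex Gaussian by splitting the variance equally between the real and imaginary parts. The sketch below is a minimal illustration of that idea in plain PyTorch; the function name, tensor shapes, and the circular-symmetry assumption are ours, and the paper's actual DCCRN-VAE parameterization may differ.

```python
import torch

def sample_complex_gaussian(mu_real, mu_imag, log_var):
    """Reparameterized sample from a circularly-symmetric complex Gaussian
    z ~ CN(mu, exp(log_var)): real and imaginary parts are independent
    real Gaussians, each carrying half of the total variance."""
    std_half = torch.exp(0.5 * log_var) / (2.0 ** 0.5)  # per-component std
    eps_real = torch.randn_like(std_half)
    eps_imag = torch.randn_like(std_half)
    z_real = mu_real + std_half * eps_real
    z_imag = mu_imag + std_half * eps_imag
    return torch.complex(z_real, z_imag)

# Example: a batch of latent feature maps from a complex-valued encoder
# (shapes are illustrative only).
mu_r = torch.randn(8, 64, 100)
mu_i = torch.randn(8, 64, 100)
log_var = torch.zeros(8, 64, 100)
z = sample_complex_gaussian(mu_r, mu_i, log_var)
print(z.shape, z.dtype)  # torch.Size([8, 64, 100]) torch.complex64
```

Splitting the variance evenly keeps the sampled latent circularly symmetric, which matches the usual modeling assumption for complex spectral features; gradients flow through the means and log-variance exactly as in a real-valued VAE.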

Authors (5)
  1. Yang Xiang (187 papers)
  2. Jingguang Tian (9 papers)
  3. Xinhui Hu (15 papers)
  4. Xinkang Xu (12 papers)
  5. ZhaoHui Yin (4 papers)
Citations (3)
