Deep neural network techniques for monaural speech enhancement: state of the art analysis (2212.00369v2)

Published 1 Dec 2022 in cs.SD, cs.LG, and eess.AS

Abstract: Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision, where they have achieved great success in tasks such as machine translation and image generation. Owing to this success, these data-driven techniques have also been applied to the audio domain. More specifically, DNN models have been used in monaural speech enhancement to achieve denoising, dereverberation and multi-speaker separation. In this paper, we review the dominant DNN techniques employed for speech separation. The review covers the whole speech enhancement pipeline: feature extraction, how DNN-based tools model both global and local features of speech, and model training (supervised and unsupervised). We also review the use of pre-trained models to boost the speech enhancement process. The review is geared towards covering the dominant trends in the application of DNNs to speech enhancement for speech captured with a single microphone.
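As a rough, self-contained illustration of the mask-based pipeline the abstract outlines (feature extraction, DNN mask estimation, masked resynthesis), the sketch below trains a tiny frame-wise mask estimator on a synthetic noisy signal. All names, the toy data, and the small MLP are assumptions chosen for brevity; this is not the paper's model or any specific method from the review.

import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
window = torch.hann_window(N_FFT)

def stft_mag(x):
    """Complex STFT and its magnitude for a mono waveform."""
    spec = torch.stft(x, N_FFT, HOP, window=window, return_complex=True)
    return spec, spec.abs()

# Toy data: a chirp-like "clean" tone buried in white noise (stands in for real corpora).
t = torch.linspace(0, 1, 16000)
clean = torch.sin(2 * torch.pi * 440 * t * (1 + t))
noisy = clean + 0.5 * torch.randn_like(clean)

clean_spec, clean_mag = stft_mag(clean)
noisy_spec, noisy_mag = stft_mag(noisy)

# Supervised target: an ideal-ratio-mask-style ratio of clean to noisy magnitude.
irm = (clean_mag / (noisy_mag + 1e-8)).clamp(0, 1)

# Small frame-wise MLP mask estimator: log-magnitude features in, sigmoid mask out.
n_freq = N_FFT // 2 + 1
model = nn.Sequential(
    nn.Linear(n_freq, 256), nn.ReLU(),
    nn.Linear(256, n_freq), nn.Sigmoid(),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.log1p(noisy_mag).T          # (frames, freq)
target = irm.T
for step in range(200):                      # short training loop, for illustration only
    mask = model(features)
    loss = nn.functional.mse_loss(mask, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Enhancement: apply the estimated mask to the noisy STFT and resynthesize.
with torch.no_grad():
    est_mask = model(features).T             # back to (freq, frames)
enhanced = torch.istft(noisy_spec * est_mask, N_FFT, HOP,
                       window=window, length=noisy.numel())
print("waveform MSE before/after:",
      float((noisy - clean).pow(2).mean()), float((enhanced - clean).pow(2).mean()))

Real systems discussed in the review replace the toy MLP with recurrent, convolutional, or attention-based architectures and train on large noisy/clean corpora, but the overall structure of the pipeline is the same.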

References (264)
  1. Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, no. Mcmc, pp. 716–720, 2018.
  2. X. Xiao, Z. Chen, T. Yoshioka, H. Erdogan, C. Liu, D. Dimitriadis, J. Droppo, and Y. Gong, “Single-channel speech extraction using speaker inventory and attention network,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 86–90.
  3. Y. Wang and D. L. Wang, “Towards scaling up classification-based speech separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
  4. Y. Wang, A. Narayanan, and D. L. Wang, “On training targets for supervised speech separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
  5. M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization,” INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, INTERSPEECH 2006 - ICSLP, vol. 5, pp. 2614–2617, 2006.
  6. Z. Wang and F. Sha, “Discriminative Non-Negative Matrix Factorization For Single-Channel Speech Separation,” 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP), pp. 3777–3781, 2014. [Online]. Available: https://pdfs.semanticscholar.org/854a/454106bd42a8bca158426d8b12b48ba0cae8.pdf
  7. T. Virtanen and A. T. Cemgil, “Mixtures of gamma priors for non-negative matrix factorization based speech separation,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5441, no. 3, pp. 646–653, 2009.
  8. T. Virtanen, “Speech recognition using factorial hidden Markov models for separation in the feature space,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 1, pp. 89–92, 2006.
  9. Y. Shao and D. Wang, “Model-based sequential organization in cochannel speech,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 289–298, 2006.
  10. Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  11. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2021-June, pp. 21–25, 2021.
  12. D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
  13. S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.
  14. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2016, pp. 31–35.
  15. Y. Wang, J. Du, L. R. Dai, and C. H. Lee, “A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 25, no. 7, pp. 1535–1546, 2017.
  16. Y. Wang, J. Du, L.-R. Dai, and C.-H. Lee, “Unsupervised single-channel speech separation via deep neural network for different gender mixtures,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).   IEEE, 2016, pp. 1–4.
  17. N. Zeghidour and D. Grangier, “Wavesplit: End-to-End Speech Separation by Speaker Clustering,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, no. iv, pp. 2840–2849, 2021.
  18. P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep Learning For Monaural Speech Separation,” Acta Physica Polonica B, vol. 42, no. 1, pp. 33–44, 2011.
  19. C. Weng, D. Yu, M. L. Seltzer, and J. Droppo, “Deep neural networks for single-channel multi-talker speech recognition,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 23, no. 10, pp. 1670–1679, 2015.
  20. Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08-12-Sept, pp. 545–549, 2016.
  21. M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
  22. D. Yu, M. Kolbaek, Z. H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 241–245, 2017.
  23. H. Tachibana, “Towards listening to 10 people simultaneously: An efficient permutation invariant training of audio source separation using Sinkhorn’s algorithm,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2021-June, pp. 491–495, 2021.
  24. S. Dovrat, E. Nachmani, and L. Wolf, “Many-speakers single channel speech separation with optimal permutation training,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 4, pp. 2408–2412, 2021.
  25. J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep Clustering : Discriminative Embeddings For Segmentation And Separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 31–35, 2016.
  26. J. Byun and J. W. Shin, “Monaural speech separation using speaker embedding from preliminary separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, pp. 2753–2763, 2021.
  27. S. Qin, T. Jiang, S. Wu, N. Wang, and X. Zhao, “Graph convolution-based deep clustering for speech separation,” IEEE Access, vol. 8, pp. 82 571–82 580, 2020.
  28. J. H. Lee, J. H. Chang, J. M. Yang, and H. G. Moon, “NAS-TasNet: Neural Architecture Search for Time-Domain Speech Separation,” IEEE Access, vol. 10, pp. 56 031–56 043, 2022.
  29. F. Jiang and Z. Duan, “Speaker attractor network: Generalizing speech separation to unseen numbers of sources,” IEEE Signal Processing Letters, vol. 27, pp. 1859–1863, 2020.
  30. E. Nachmani, Y. Adi, and L. Wolf, “Voice separation with an unknown number of multiple speakers,” 37th International Conference on Machine Learning, ICML 2020, vol. PartF16814, pp. 7121–7132, 2020.
  31. Y. Liu and D. Wang, “Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 27, no. 12, pp. 2092–2102, 2019.
  32. Y. Luo and N. Mesgarani, “Separating varying numbers of sources with auxiliary autoencoding loss,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, pp. 2622–2626, 2020.
  33. J. Shi, J. Xu, G. Liu, and B. Xu, “Listen, think and listen again: Capturing top-down auditory attention for speaker-independent speech separation,” IJCAI International Joint Conference on Artificial Intelligence, vol. 2018-July, pp. 4353–4360, 2018.
  34. K. Kinoshita, L. Drude, M. Delcroix, and T. Nakatani, “Listening to each speaker one by one with recurrent selective hearing networks,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, pp. 5064–5068, 2018.
  35. N. Takahashi, S. Parthasaarathy, N. Goswami, and Y. Mitsufuji, “Recursive speech separation for unknown number of speakers,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2019-September, pp. 1348–1352, 2019.
  36. T. V. Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, and R. Haeb-Umbach, “All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, pp. 91–95, 2019.
  37. T. von Neumann, C. Boeddeker, L. Drude, K. Kinoshita, M. Delcroix, T. Nakatani, and R. Haeb-Umbach, “Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, pp. 3097–3101, 2020.
  38. Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-Independent Speech Separation with Deep Attractor Network,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
  39. D. Yul, M. Kalbcek, Z.-h. Tan, and J. Jensen, “SPEAKER-INDEPENDENT MULTI-TALKER SPEECH SEPARATION,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 241–245.
  40. X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe, “End-to-end multi-speaker speech recognition with transformer,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 6134–6138.
  41. X. Chang, W. Zhang, Y. Qian, J. L. Roux, and S. Watanabe, “End-To-End Multi-Speaker Speech Recognition with Transformer,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2020-May, pp. 6134–6138, 2020.
  42. Z. Q. Wang, J. L. Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, pp. 686–690, 2018.
  43. K. Veselỳ, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, and J. H. Černockỳ, “Sequence summarizing neural network for speaker adaptation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 5315–5319.
  44. J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” arXiv preprint arXiv:1807.08974, 2018.
  45. Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” arXiv preprint arXiv:1810.04826, 2018.
  46. X. Ji, M. Yu, C. Zhang, D. Su, T. Yu, X. Liu, and D. Yu, “Speaker-aware target speaker enhancement by jointly learning with speaker embedding extraction,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7294–7298.
  47. C. Zhang, M. Yu, C. Weng, and D. Yu, “Towards robust speaker verification with target speaker enhancement,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 6693–6697.
  48. M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2018, pp. 5554–5558.
  49. L. Chen, Z. Mo, J. Ren, C. Cui, and Q. Zhao, “An electroglottograph auxiliary neural network for target speaker extraction,” Applied Sciences, vol. 13, no. 1, p. 469, 2023.
  50. Y. Miao, H. Zhang, and F. Metze, “Speaker adaptive training of deep neural network acoustic models using i-vectors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1938–1949, 2015.
  51. A. Senior and I. Lopez-Moreno, “Improving dnn speaker independence with i-vector inputs,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2014, pp. 225–229.
  52. T. Ochiai, S. Matsuda, X. Lu, C. Hori, and S. Katagiri, “Speaker adaptive training using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2014, pp. 6349–6353.
  53. Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 246–250.
  54. D. S. Williamson and D. Wang, “Time-frequency masking in the complex domain for speech dereverberation and denoising,” IEEE/ACM transactions on audio, speech, and language processing, vol. 25, no. 7, pp. 1492–1501, 2017.
  55. I. Arweiler and J. M. Buchholz, “The influence of spectral characteristics of early reflections on speech intelligibility,” The Journal of the Acoustical Society of America, vol. 130, no. 2, pp. 996–1005, 2011.
  56. A. K. Nábělek, T. R. Letowski, and F. M. Tucker, “Reverberant overlap- and self-masking in consonant identification,” Journal of the Acoustical Society of America, vol. 86, no. 4, pp. 1259–1265, 1989.
  57. R. Zhou, W. Zhu, and X. Li, “Single-Channel Speech Dereverberation using Subband Network with A Reverberation Time Shortening Target,” arXiv preprint arXiv:2210.11089, 2022. [Online]. Available: http://arxiv.org/abs/2204.08765
  58. T. Cord-Landwehr, C. Boeddeker, T. von Neumann, C. Zorila, R. Doddipatla, and R. Haeb-Umbach, “Monaural source separation: From anechoic to reverberant environments,” in In 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), 2021, pp. 1–5. [Online]. Available: http://arxiv.org/abs/2111.07578
  59. J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, no. 3, pp. 4506–4510, 2020.
  60. H.-S. Choi, H. Heo, J. H. Lee, and K. Lee, “Phase-aware Single-stage Speech Denoising and Dereverberation with U-Net,” arXiv preprint arXiv:2006.00687, 2020. [Online]. Available: http://arxiv.org/abs/2006.00687
  61. Z. Q. Wang, K. Tan, and D. Wang, “Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, pp. 71–75, 2019.
  62. K. Han, Y. Wang, D. L. Wang, W. S. Woods, I. Merks, and T. Zhang, “Learning Spectral Mapping for Speech Dereverberation and Denoising,” IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.
  63. Y. Jiang, D. L. Wang, R. S. Liu, and Z. M. Feng, “Binaural classification for reverberant speech segregation using deep neural networks,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 22, no. 12, pp. 2112–2121, 2014.
  64. H. Gamper and I. J. Tashev, “Blind reverberation time estimation using a convolutional neural network,” 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018 - Proceedings, pp. 136–140, 2018.
  65. Y. Zhao, D. Wang, B. Xu, and T. Zhang, “Monaural Speech Dereverberation Using Temporal Convolutional Networks with Self Attention,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, pp. 1598–1607, 2020.
  66. Y. Ueda, L. Wang, A. Kai, X. Xiao, E. S. Chng, and H. Li, “Single-channel Dereverberation for Distant-Talking Speech Recognition by Combining Denoising Autoencoder and Temporal Structure Normalization,” Journal of Signal Processing Systems, vol. 82, no. 2, pp. 151–161, 2016.
  67. D. S. Williamson and D. Wang, “Speech Dereverberation And Denoising Using Complex Ratio Masks ,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2017, pp. 5590–5594, 2017.
  68. Z. Jin and D. Wang, “A supervised learning approach to monaural segregation of reverberant speech,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 4, pp. 625–638, 2009.
  69. D. León and F. Tobar, “Late reverberation suppression using U-nets,” arXiv preprint arXiv:2110.02144., no. 1, 2021. [Online]. Available: http://arxiv.org/abs/2110.02144
  70. A. Défossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, pp. 3291–3295, 2020.
  71. U. Isik, R. Giri, N. Phansalkar, J. M. Valin, K. Helwani, and A. Krishnaswamy, “PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, pp. 2487–2491, 2020.
  72. A. Li, W. Liu, X. Luo, G. Yu, C. Zheng, and X. Li, “A simultaneous denoising and dereverberation framework with target decoupling,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2, pp. 796–800, 2021.
  73. J.-M. Valin, R. Giri, S. Venkataramani, U. Isik, and A. Krishnaswamy, “To Dereverb Or Not to Dereverb? Perceptual Studies On Real-Time Dereverberation Targets,” arXiv:2206.07917, 2022. [Online]. Available: http://arxiv.org/abs/2206.07917
  74. S.-W. Fu, C. Yu, K.-H. Hung, M. Ravanelli, and Y. Tsao, “Metricgan-u: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 7412–7416.
  75. S. Leglaive, U. Simsekli, A. Liutkus, L. Girin, and R. Horaud, “Speech Enhancement with Variational Autoencoders and Alpha-stable Distributions,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, pp. 541–545, 2019.
  76. S. Leglaive, L. Girin, and R. Horaud, “A variance modeling framework based on variational autoencoders for speech enhancement,” IEEE International Workshop on Machine Learning for Signal Processing, MLSP, vol. 2018-September, 2018.
  77. S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, “A Recurrent Variational Autoencoder for Speech Enhancement,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2020-May, pp. 371–375, 2020.
  78. M. Kolbæk, Z. H. Tan, and J. Jensen, “Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 25, no. 1, pp. 149–163, 2017.
  79. Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 7402–7406.
  80. Y.-J. Lu, Y. Tsao, and S. Watanabe, “A study on speech enhancement based on diffusion probabilistic model,” in 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).   IEEE, 2021, pp. 659–666.
  81. S. W. Fu, T. W. Wang, Y. Tsao, X. Lu, H. Kawai, D. Stoller, S. Ewert, S. Dixon, X. Lu, Y. Tsao, S. Matsuda, C. Hori, Y. Xu, J. Du, L. R. Dai, C. H. Lee, T. Gao, J. Du, L. R. Dai, C. H. Lee, S. W. Fu, Y. Tsao, X. Lu, F. Weninger, J. R. Hershey, J. Le Roux, B. Schuller, Y. Xu, J. Du, L. R. Dai, C. H. Lee, F. Lluís, J. Pons, and X. Serra, “Speech enhancement based on deep denoising autoencoder,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08-12-Sept, no. 1, pp. 7–19, 2018.
  82. S. W. Fu, Y. Tsao, and X. Lu, “SNR-aware convolutional neural network modeling for speech enhancement,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08-12-Sept, pp. 3768–3772, 2016.
  83. T. Gao, J. Du, L. R. Dai, and C. H. Lee, “SNR-based progressive learning of deep neural network for speech enhancement,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08-12-Sept, pp. 3713–3717, 2016.
  84. M. R. Portnoff, “Time-Frequency Representation of . Digital Signals,” IEEE Transactions on Acoustics, Speech and Signal Processing ASSP, vol. 28, no. 1, pp. 55–69, 1980.
  85. J. B. Allen, “Applications of the short time Fourier transform to speech processing and spectral analysis,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1982-May, 1982, pp. 1012–1015.
  86. J. B. Allen and L. R. Rabiner, “A Unified Approach to Short-Time Fourier Analysis and Synthesis,” Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
  87. K. Paliwal, K. Wójcicki, and B. Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
  88. A. Natsiou and S. O’Leary, “Audio representations for deep learning in sound synthesis: A review,” Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA, vol. 2021-Decem, 2021.
  89. S. A. Nossier, J. Wall, M. Moniri, C. Glackin, and N. Cannings, “A Comparative Study of Time and Frequency Domain Approaches to Deep Learning based Speech Enhancement,” in Proceedings of the International Joint Conference on Neural Networks, 2020.
  90. E. M. Grais and M. D. Plumbley, “Single channel audio source separation using convolutional denoising autoencoders,” 2017 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2017 - Proceedings, vol. 2018-Janua, pp. 1265–1269, 2018.
  91. S. W. Fu, C. F. Liao, Y. Tsao, and S. D. Lin, “MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement,” 36th International Conference on Machine Learning, ICML 2019, vol. 2019-June, pp. 3566–3576, 2019.
  92. A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, pp. 745–751, 2017.
  93. M. Kim and P. Smaragdis, “Adaptive denoising autoencoders: A fine-tuning scheme to learn from test mixtures,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9237, pp. 100–107, 2015.
  94. Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 61–65, 2017.
  95. D. Baby, T. Virtanen, T. Barker, and H. Van Hamme, “Coupled dictionary training for exemplar-based speech enhancement,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 2883–2887, 2014.
  96. S. W. Fu, T. Y. Hu, Y. Tsao, and X. Lu, “Complex spectrogram enhancement by convolutional neural network with multi-metrics learning,” IEEE International Workshop on Machine Learning for Signal Processing, MLSP, vol. 2017-September, pp. 1–6, 2017.
  97. V. Kothapally and J. H. Hansen, “Complex-valued time-frequency self-attention for speech dereverberation,” arXiv preprint arXiv:2211.12632, 2022.
  98. ——, “Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1600–1613, 2022.
  99. H. Liu, X. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang, “VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration,” arXiv preprint arXiv:2204.05841., no. September, pp. 4232–4236, 2022.
  100. Z. Du, X. Zhang, and J. Han, “A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, pp. 1493–1505, 2020.
  101. F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” 2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014, pp. 577–581, 2014.
  102. C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement with generative adversarial networks for robust speech recognition,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, no. Figure 1, pp. 5024–5028, 2018.
  103. J. Du and Q. Huo, “A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 569–572, 2008.
  104. Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
  105. J. Du, Y. Tu, Y. Xu, L. Dai, and C. H. Lee, “Speech separation of a target speaker based on deep neural networks,” International Conference on Signal Processing Proceedings, ICSP, vol. 2015-January, no. October, pp. 473–477, 2014.
  106. G. Garau and S. Renals, “Combining spectral representations for large-vocabulary continuous speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 3, pp. 508–518, 2008.
  107. A. Zolnay, D. Kocharov, R. Schlüter, and H. Ney, “Using multiple acoustic feature sets for speech recognition,” Speech Communication, vol. 49, no. 6, pp. 514–525, 2007.
  108. Y. Wang, K. Han, and D. Wang, “Exploring monaural features for classification-based speech segregation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 2, pp. 270–279, 2013.
  109. D. S. Williamson, Y. Wang, and D. L. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
  110. X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, no. August, pp. 436–440, 2013.
  111. Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
  112. A. Narayanan and D. Wang, “Ideal Ratio Mask Estimation Using Deep Neural Networks For Robust Speech Recognition,” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7092–7096, 2013.
  113. H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2015-Augus, pp. 708–712, 2015.
  114. U. Kjems, J. B. Boldt, M. S. Pedersen, T. Lunner, and D. Wang, “Role of mask pattern in intelligibility of ideal binary-masked noisy speech,” The Journal of the Acoustical Society of America, vol. 126, no. 3, pp. 1415–1426, 2009.
  115. D. Wang, “Time – Frequency Masking for Speech Hearing Aid Design,” Trends In Amplification, vol. 12, pp. 332–353, 2008. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/18974204
  116. D. S. Brungart, P. S. Chang, B. D. Simpson, and D. Wang, “Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” pp. 4007–4018, 2006.
  117. S. A. Nossier, J. Wall, M. Moniri, C. Glackin, and N. Cannings, “A Comparative Study of Time and Frequency Domain Approaches to Deep Learning based Speech Enhancement,” Proceedings of the International Joint Conference on Neural Networks, 2020.
  118. Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2015-Janua, 2015, pp. 3274–3278.
  119. P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
  120. E. M. Grais, M. U. Sen, and H. Erdogan, “Deep neural networks for single channel source separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 3734–3738, 2014.
  121. X. L. Zhang and D. Wang, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 24, no. 5, pp. 967–977, 2016.
  122. A. Narayanan and D. Wang, “Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 23, no. 1, pp. 92–101, 2015.
  123. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9237, pp. 91–99, 2015.
  124. I. Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” arXiv preprint arXiv, 2016. [Online]. Available: http://arxiv.org/abs/1701.00160
  125. D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, no. Ml, pp. 1–14, 2014.
  126. T. D. Hien, D. V. Tuan, P. V. At, and L. H. Son, “Novel algorithm for non-negative matrix factorization,” New Mathematics and Natural Computation, vol. 11, no. 2, pp. 121–133, 2015.
  127. S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, “A recurrent variational autoencoder for speech enhancement,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 371–375.
  128. J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International Conference on Machine Learning.   PMLR, 2015, pp. 2256–2265.
  129. C. Luo, “Understanding diffusion models: A unified perspective,” arXiv preprint arXiv:2208.11970, 2022.
  130. Y. Luo and N. Mesgarani, “TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, pp. 696–700, 2018.
  131. A. Kumar and D. Florencio, “Speech enhancement in multiple-noise conditions using deep neural networks,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08-12-September-2016, pp. 3738–3742, 2016.
  132. Y. Tu, J. Du, Y. Xu, L. Dai, and C. H. Lee, “Deep neural network based speech separation for robust speech recognition,” International Conference on Signal Processing Proceedings, ICSP, vol. 2015-January, no. October, pp. 532–536, 2014.
  133. H. Li, X. Zhang, H. Zhang, and G. Gao, “Integrated Speech Enhancement Method Based on Weighted Prediction Error and DNN for Dereverberation and Denoising,” arXiv preprint arXiv:1708.08251, no. 2, 2017. [Online]. Available: http://arxiv.org/abs/1708.08251
  134. Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
  135. J. S. Lim and A. V. Oppenheim, “Enhancement and Bandwidth Compression of Noisy Speech,” Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604, 1979.
  136. A. V. Oppenheim and J. S. Lim, “The Importance of Phase in Signals,” Proceedings of the IEEE, vol. 69, no. 5, pp. 529–541, 1981.
  137. P. Vary and M. Eurasip, “Noise suppression by spectral magnitude estimation -mechanism and theoretical limits-,” Signal Processing, vol. 8, no. 4, pp. 387–400, 1985.
  138. D. L. Wang and J. S. Lim, “The Unimportance of Phase in Speech Enhancement,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 679–681, 1982.
  139. Y.-S. Lee, C.-Y. Wang, S.-F. Wang, J.-C. Wang, and C.-H. Wu, “Fully complex deep neural network for phase-incorporating monaural source separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 281–285.
  140. D. Gunawan and D. Sen, “Iterative phase estimation for the synthesis of separated sources from single-channel mixtures,” IEEE Signal Processing Letters, vol. 17, no. 5, pp. 421–424, 2010.
  141. Y. Ai, H. Li, X. Wang, J. Yamagishi, and Z. Ling, “Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation,” 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings, pp. 477–484, 2021.
  142. J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, “Phasebook and friends: Leveraging discrete representations for source separation,” IEEE Journal on Selected Topics in Signal Processing, vol. 13, no. 2, pp. 370–382, 2019.
  143. N. Zheng and X. L. Zhang, “Phase-aware speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 27, no. 1, pp. 63–76, 2019.
  144. D. H. Friedman, “Instantaneous-Frequency Distribution Vs. Time: An Interpretation Of The Phase Structure Of Speech.” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 1985, pp. 1121–1124.
  145. D. W. Griffin and J. S. Lim, “Signal Estimation from Modified Short-Time Fourier Transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
  146. Y. Zhao, Z. Q. Wang, and D. Wang, “Two-stage deep learning for noisy-reverberant speech enhancement,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 27, no. 1, pp. 53–62, 2019.
  147. K. Li, B. Wu, and C. H. Lee, “An iterative phase recovery framework with phase mask for spectral mapping with an application to speech enhancement,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08-12-Sept, pp. 3773–3777, 2016.
  148. Y. Luo, Z. Chen, and T. Yoshioka, “Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2020-May, pp. 46–50, 2020.
  149. S. Venkataramani, J. Casebeer, and P. Smaragdis, “End-To-End Source Separation With Adaptive Front-Ends,” 2018 52nd Asilomar Conference on Signals, Systems, and Computers, no. 1, pp. 684–688, 2018.
  150. L. Zhang, Z. Shi, J. Han, A. Shi, and D. Ma, “FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11961 LNCS, pp. 653–665, 2020.
  151. E. Tzinis, S. Venkataramani, Z. Wang, C. Subakan, and P. Smaragdis, “Two-Step Sound Source Separation: Training on Learned Latent Targets,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2020-May, pp. 31–35, 2020.
  152. E. Tzinis, Z. Wang, and P. Smaragdis, “Sudo RM -RF: Efficient networks for universal audio source separation,” IEEE International Workshop on Machine Learning for Signal Processing, MLSP, vol. 2020-Septe, 2020.
  153. Z. Kong, W. Ping, A. Dantrey, and B. Catanzaro, “Speech Denoising in the Waveform Domain With Self-Attention,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2022-May, pp. 7867–7871, 2022.
  154. M. W. Lam, J. Wang, D. Su, and D. Yuy, “Sandglasset: A Light Multi-Granularity Self-Attentive Network for Time-Domain Speech Separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2021-June, pp. 5759–5763, 2021.
  155. M. W. Lam, J. Wang, D. Su, and D. Yu, “Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks,” 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings, pp. 801–808, 2021.
  156. I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, and J. R. Hershey, “Universal sound separation,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, vol. 2019-October, pp. 175–179, 2019.
  157. D. Stoller, S. Ewert, and S. Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
  158. S. W. Fu, T. W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
  159. F. Lluís, J. Pons, and X. Serra, “End-to-end music source separation: Is it possible in the waveform domain?” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2019-Septe, pp. 4619–4623, 2019.
  160. S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative adversarial network,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2017-Augus, no. D, 2017, pp. 3642–3646.
  161. S. Pascual, J. Serrà, and A. Bonafonte, “Towards generalized speech enhancement with generative adversarial networks,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2019-September, pp. 1791–1795, 2019.
  162. H. Phan, I. V. McLoughlin, L. Pham, O. Y. Chen, P. Koch, M. De Vos, and A. Mertins, “Improving GANs for Speech Enhancement,” IEEE Signal Processing Letters, vol. 27, pp. 1700–1704, 2020.
  163. N. Adiga, Y. Pantazis, V. Tsiaras, and Y. Stylianou, “Speech enhancement for noise-robust speech synthesis using wasserstein gan.” in INTERSPEECH, 2019, pp. 1821–1825.
  164. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Advances in neural information processing systems, vol. 30, 2017.
  165. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, pp. 1–15, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
  166. F. Xiao, J. Guan, Q. Kong, and W. Wang, “Time-domain speech enhancement with generative adversarial learning,” arXiv preprint arXiv:2103.16149, 2021.
  167. Z. X. Li, L. R. Dai, Y. Song, and I. McLoughlin, “A Conditional Generative Model for Speech Enhancement,” Circuits, Systems, and Signal Processing, vol. 37, no. 11, pp. 5005–5022, 2018. [Online]. Available: https://doi.org/10.1007/s00034-018-0798-4
  168. S. Qin and T. Jiang, “Improved Wasserstein conditional generative adversarial network speech enhancement,” Eurasip Journal on Wireless Communications and Networking, vol. 2018, no. 1, 2018.
  169. K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, “Speech enhancement using bayesian wavenet.” in Interspeech, 2017, pp. 2013–2017.
  170. R. Cao, S. Abdulatif, and B. Yang, “CMGAN: Conformer-based Metric GAN for Speech Enhancement,” arXiv preprint arXiv:2209.11112, pp. 936–940, 2022.
  171. K. Wang, B. He, and W. P. Zhu, “Tstnn: Two-Stage Transformer Based Neural Network for Speech Enhancement in the Time Domain,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2021-June, pp. 7098–7102, 2021.
  172. J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, “Demystifying TasNet: A Dissecting Approach,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2020-May, pp. 6359–6363, 2020.
  173. F. Bahmaninezhad, J. Wu, R. Gu, S. X. Zhang, Y. Xu, M. Yu, and D. Yu, “A comprehensive study of speech separation: Spectrogram vs waveform separation,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2019-September, pp. 4574–4578, 2019.
  174. Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “Global variance equalization for improving deep neural network based speech enhancement,” in 2014 IEEE China Summit & International Conference on Signal and Information Processing (ChinaSIP).   IEEE, 2014, pp. 71–75.
  175. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  176. C. Han, J. O’Sullivan, Y. Luo, J. Herrero, A. D. Mehta, and N. Mesgarani, “Speaker-independent auditory attention decoding without access to clean speech sources,” Science Advances, vol. 5, no. 5, pp. 1–12, 2019.
  177. T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2015-Janua, pp. 1–5, 2015.
  178. S. Parveen and P. Green, “Speech enhancement with missing data techniques using recurrent neural networks,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1, no. Figure 1, pp. 13–16, 2004.
  179. G. Wichern and A. Lukin, “Low-Latency approximation of bidirectional recurrent networks for speech denoising,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, vol. 2017-October, pp. 66–70, 2017.
  180. K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 1724–1734, 2014.
  181. P. Chandna, M. Miron, J. Janer, and E. Gómez, “Monoaural audio source separation using deep convolutional neural networks,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10169 LNCS, pp. 258–266, 2017.
  182. J. Chen, Q. Mao, and D. Liu, “Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-Octob, pp. 2642–2646, 2020.
  183. A. Gulati, J. Qin, C. C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, pp. 5036–5040, 2020.
  184. D. Rethage, J. Pons, and X. Serra, “A wavenet for speech denoising,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, pp. 5069–5073, 2018.
  185. Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1091–1100, 2018.
  186. C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
  187. Q. Zhang, A. Nicolson, M. Wang, K. K. Paliwal, and C. Wang, “DeepMMSE: A Deep Learning Approach to MMSE-Based Noise Power Spectral Density Estimation,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, pp. 1404–1415, 2020.
  188. S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
  189. Y. He and J. Zhao, “Temporal Convolutional Networks for Anomaly Detection in Time Series,” Journal of Physics: Conference Series, vol. 1213, no. 4, 2019.
  190. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 2017-Decem, no. Nips, pp. 5999–6009, 2017.
  191. C. Subakan, M. Ravanelli, S. Cornell, F. Lepoutre, and F. Grondin, “Resource-Efficient Separation Transformer,” arXiv preprint arXiv:2206.09507, pp. 1–5, 2022. [Online]. Available: http://arxiv.org/abs/2206.09507
  192. I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The Long-Document Transformer,” arXiv preprint arXiv:, 2020. [Online]. Available: http://arxiv.org/abs/2004.05150
  193. S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-Attention with Linear Complexity,” arXiv preprint arXiv, vol. 2048, no. 2019, 2020. [Online]. Available: http://arxiv.org/abs/2006.04768
  194. N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The Efficient Transformer,” in International Conference on Learning Representations, 2020, pp. 1–12. [Online]. Available: http://arxiv.org/abs/2001.04451
  195. C. Subakan, M. Ravanelli, S. Cornell, F. Grondin, and M. Bronzi, “On Using Transformers for Speech-Separation,” in International Workshop on Acoustic Signal Enhancement, vol. 14, no. 8, 2022, pp. 1–10. [Online]. Available: http://arxiv.org/abs/2202.02884
  196. J. Luo, J. Wang, N. Cheng, E. Xiao, X. Zhang, and J. Xiao, “Tiny-Sepformer: A Tiny Time-Domain Transformer Network For Speech Separation,” arXiv preprint arXiv:2206.13689, no. 1, pp. 5313–5317, 2022.
  197. S. Chen, Y. Wu, Z. Chen, J. Wu, T. Yoshioka, S. Liu, J. Li, and X. Yu, “Ultra fast speech separation model with teacher student learning,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 3, pp. 2298–2302, 2021.
  198. D. de Oliveira, T. Peer, and T. Gerkmann, “Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes,” arXiv preprint arXiv:2206.11703., no. 1, pp. 2948–2952, 2022.
  199. W. Qiu and Y. Hu, “Dual-Path Hybrid Attention Network for Monaural Speech Separation,” IEEE Access, vol. 10, pp. 78 754–78 763, 2022.
  200. S. Lutati, E. Nachmani, and L. Wolf, “SepIt: Approaching a Single Channel Speech Separation Bound,” in arXiv preprint arXiv:2205.11801, 2022, pp. 5323–5327.
  201. J. Y. Wu, C. Yu, S. W. Fu, C. T. Liu, S. Y. Chien, and Y. Tsao, “Increasing Compactness of Deep Learning Based Speech Enhancement Models with Parameter Pruning and Quantization Techniques,” IEEE Signal Processing Letters, vol. 26, no. 12, pp. 1887–1891, 2019.
  202. H. Sun and S. Li, “An optimization method for speech enhancement based on deep neural network,” IOP Conference Series: Earth and Environmental Science, vol. 69, no. 1, 2017.
  203. Y. C. Lin, Y. T. Hsu, S. W. Fu, Y. Tsao, and T. W. Kuo, “IA-Net: Acceleration and Compression of Speech Enhancement using Integer-adder Deep Neural Network,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2019-September, pp. 1801–1805, 2019.
  204. I. Fedorov, M. Stamenovic, C. Jensen, L. C. Yang, A. Mandell, Y. Gan, M. Mattina, and P. N. Whatmough, “TinyLSTMs: Efficient neural speech enhancement for hearing aids,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, pp. 4054–4058, 2020.
  205. Y. T. Hsu, Y. C. Lin, S. W. Fu, Y. Tsao, and T. W. Kuo, “A Study on Speech Enhancement Using Exponent-Only Floating Point Quantized Neural Network (EOFP-QNN),” 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings, pp. 566–573, 2019.
  206. A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A Survey of Quantization Methods for Efficient Neural Network Inference,” Low-Power Computer Vision, pp. 291–326, 2022.
  207. K. R. Avery, J. Pan, C. C. Engler-Pinto, Z. Wei, F. Yang, S. Lin, L. Luo, and D. Konson, “Fatigue Behavior of Stainless Steel Sheet Specimens at Extremely High Temperatures,” SAE International Journal of Materials and Manufacturing, vol. 7, no. 3, pp. 560–566, 2014.
  208. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861., 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
  209. G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531 2.7, pp. 1–9, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531
  210. J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021.
  211. L. Wang and K. J. Yoon, “Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3048–3068, 2022.
  212. R. Aihara, T. Hanazawa, Y. Okato, G. Wichern, and J. L. Roux, “Teacher-student Deep Clustering for Low-delay Single Channel Speech Separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, pp. 690–694, 2019.
  213. K. Tan and D. Wang, “Towards Model Compression for Deep Learning Based Speech Enhancement,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, pp. 1785–1794, 2021.
  214. F. Ye, Y. Tsao, and F. Chen, “Subjective feedback-based neural network pruning for speech enhancement,” 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2019, no. November, pp. 673–677, 2019.
  215. E. Dupuis, D. Novo, I. O’Connor, and A. Bosio, “Sensitivity Analysis and Compression Opportunities in DNNs Using Weight Sharing,” in Proceedings - 2020 23rd International Symposium on Design and Diagnostics of Electronic Circuits and Systems, DDECS 2020, 2020.
  216. X. Hu, K. Li, W. Zhang, Y. Luo, J. M. Lemercier, and T. Gerkmann, “Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network,” Advances in Neural Information Processing Systems, vol. 27, no. NeurIPS, pp. 22 509–22 522, 2021.
  217. H. Zhang, X. Zhang, and G. Gao, “Training Supervised Speech Separation System To Improve Stoi And Pesq Directly,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 5374–5378, 2018.
  218. P. C. Loizou and G. Kim, “Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 1, pp. 47–56, 2011.
  219. B. Xia and C. Bao, “Wiener filtering based speech enhancement with Weighted Denoising Auto-encoder and noise classification,” pp. 13–29, 2014.
  220. P. G. Shivakumar and P. Georgiou, “Perception optimized deep denoising autoencoders for speech enhancement,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08-12-September-2016, pp. 3743–3747, 2016.
  221. Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “DNN-based source enhancement self-optimized by reinforcement learning using sound quality measurements,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 81–85, 2017.
  222. M. Kolbcek, Z. H. Tan, and J. Jensen, “Monaural Speech Enhancement Using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, pp. 5059–5063, 2018.
  223. Z. Yan, X. Buye, G. Ritwik, and Z. Tao, “Perceptually Guided Speech Enhancement Using Deep Neural Networks,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 5074–5078, 2018. [Online]. Available: https://cliffzhao.github.io/Publications/ZXGZ.icassp18.pdf%0Ahttp://arxiv.org/abs/1312.6114%0Ahttps://arxiv.org/pdf/1312.6114.pdf
  224. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
  225. J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR - Half-baked or Well Done?” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, pp. 626–630, 2019.
  226. H. Li, K. Chen, L. Wang, J. Liu, B. Wan, and B. Zhou, “Sound source separation mechanisms of different deep networks explained from the perspective of auditory perception,” Applied Sciences, vol. 12, no. 2, p. 832, 2022.
  227. C. Fan, J. Tao, B. Liu, J. Yi, Z. Wen, and X. Liu, “End-to-End Post-Filter for Speech Separation with Deep Attention Fusion Features,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, pp. 1303–1314, 2020.
  228. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A Short-Time Objective Intelligibility Measure For Time-Frequency Weighted Noisy Speech,” IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4214–4217, 2010. [Online]. Available: http://cas.et.tudelft.nl/pubs/Taal2010.pdf
  229. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - A new method for speech quality assessment of telephone networks and codecs,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2, pp. 749–752, 2001.
  230. J. M. Martin-Donas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, “A deep learning loss function based on the perceptual evaluation of the speech quality,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1680–1684, 2018.
  231. M. Kolbaek, Z. H. Tan, S. H. Jensen, and J. Jensen, “On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 28, pp. 825–838, 2020.
  232. X. Bie, S. Leglaive, X. Alameda-Pineda, and L. Girin, “Unsupervised speech enhancement using dynamical variational autoencoders,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2993–3007, 2022.
  233. T. Fujimura, Y. Koizumi, K. Yatabe, and R. Miyazaki, “Noisy-target training: A training strategy for dnn-based speech enhancement without clean speech,” in 2021 29th European Signal Processing Conference (EUSIPCO).   IEEE, 2021, pp. 436–440.
  234. S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey, “Unsupervised sound separation using mixture invariant training,” in Advances in Neural Information Processing Systems, vol. 33, 2020. [Online]. Available: http://arxiv.org/abs/2006.12701
  235. J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, “Teacher-student MixIT for unsupervised and semi-supervised speech separation,” arXiv preprint arXiv:2106.07843, 2021.
  236. K. Saito, S. Uhlich, G. Fabbro, and Y. Mitsufuji, “Training speech enhancement systems with noisy speech datasets,” arXiv preprint arXiv:2105.12315, 2021.
  237. E. Karamatlı and S. Kırbız, “MixCycle: Unsupervised speech separation via cyclic mixture permutation invariant training,” IEEE Signal Processing Letters, 2022.
  238. V. A. Trinh and S. Braun, “Unsupervised speech enhancement with speech recognition embedding and disentanglement losses,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 391–395.
  239. E. Tzinis, Y. Adi, V. K. Ithapu, B. Xu, P. Smaragdis, and A. Kumar, “RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1329–1341, 2022.
  240. L.-W. Chen, Y.-F. Cheng, H.-S. Lee, Y. Tsao, and H.-M. Wang, “A training and inference strategy using noisy and enhanced speech as target for speech enhancement without clean speech,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023, pp. 5315–5319.
  241. C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 6493–6497.
  242. Y. Xiang and C. Bao, “A parallel-data-free speech enhancement method using multi-objective learning cycle-consistent generative adversarial network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1826–1838, 2020.
  243. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
  244. Y. Li, Y. Sun, K. Horoshenkov, and S. M. Naqvi, “Domain adaptation and autoencoder-based unsupervised speech enhancement,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 1, pp. 43–52, 2021.
  245. J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” arXiv preprint arXiv:1803.04189, 2018.
  246. S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2010.
  247. C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang, “Noise adaptive speech enhancement using domain adversarial training,” arXiv preprint arXiv:1807.07501, 2018.
  248. Q. Wang, W. Rao, S. Sun, L. Xie, E. S. Chng, and H. Li, “Unsupervised domain adaptation via domain adversarial training for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 4889–4893.
  249. Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “Cross-language transfer learning for deep neural network based speech enhancement,” in The 9th International Symposium on Chinese Spoken Language Processing.   IEEE, 2014, pp. 336–340.
  250. S. Pascual, M. Park, J. Serrà, A. Bonafonte, and K.-H. Ahn, “Language and noise transfer in speech enhancement generative adversarial network,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5019–5023.
  251. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  252. Y. A. Chung, W. N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2019-September, pp. 146–150, 2019.
  253. Y. A. Chung, H. Tang, and J. Glass, “Vector-quantized autoregressive predictive coding,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, no. 1, pp. 3760–3764, 2020.
  254. A. T. Liu, S. W. Yang, P. H. Chi, P. C. Hsu, and H. Y. Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2020-May, pp. 6419–6423, 2020.
  255. A. T. Liu, S. W. Li, and H. Y. Lee, “TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, pp. 2351–2366, 2021.
  256. A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  257. W. N. Hsu, B. Bolte, Y. H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 29, pp. 3451–3460, 2021.
  258. Z. Huang, S. Watanabe, S. W. Yang, P. García, and S. Khudanpur, “Investigating Self-Supervised Learning for Speech Enhancement and Separation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2022-May, pp. 6837–6841, 2022.
  259. W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  260. K.-H. Hung, S.-w. Fu, H.-H. Tseng, H.-T. Chiang, Y. Tsao, and C.-W. Lin, “Boosting self-supervised embeddings for speech enhancement,” arXiv preprint arXiv:2204.03339, 2022.
  261. B. Irvin, M. Stamenovic, M. Kegler, and L.-C. Yang, “Self-supervised learning for speech enhancement through synthesis,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  262. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
  263. F. G. Germain, Q. Chen, and V. Koltun, “Speech denoising with deep feature losses,” arXiv preprint arXiv:1806.10522, 2018.
  264. X. Hao, C. Xu, and L. Xie, “Neural speech enhancement with unsupervised pre-training and mixture training,” Neural Networks, vol. 158, pp. 216–227, 2023.