
Speech enhancement with frequency domain auto-regressive modeling (2309.13537v1)

Published 24 Sep 2023 in eess.AS, cs.AI, and cs.SD

Abstract: Speech applications in far-field real-world settings often deal with signals corrupted by reverberation. Dereverberation is an important step for improving audible quality and reducing error rates in applications like automatic speech recognition (ASR). We propose a unified framework of speech dereverberation for improving speech quality and ASR performance, using the envelope-carrier decomposition provided by an autoregressive (AR) model. The AR model is applied in the frequency domain of the sub-band speech signals to separate the envelope and carrier parts. A novel neural architecture based on a dual-path long short-term memory (DPLSTM) model is proposed, which jointly enhances the sub-band envelope and carrier components. The dereverberated envelope-carrier signals are modulated, and the sub-band signals are synthesized to reconstruct the audio signal. The DPLSTM model for dereverberation of envelope and carrier components also allows joint learning of the network weights for the downstream ASR task. On ASR tasks on the REVERB challenge dataset as well as the VOiCES dataset, we show that joint learning of the speech dereverberation network and the E2E ASR model yields significant performance improvements over the baseline ASR system trained on log-mel spectrograms, as well as over other dereverberation benchmarks (average relative improvements of 10-24% over the baseline system). Speech quality improvements, evaluated using subjective listening tests, further highlight the improved quality of the reconstructed audio.
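To make the envelope-carrier decomposition concrete, here is a minimal Python sketch of frequency-domain linear prediction (FDLP) applied to a single sub-band signal. The AR order, numerical safeguards, and function names are illustrative assumptions rather than the paper's implementation; the paper applies this per sub-band after a filter-bank decomposition of the full signal.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert

def fdlp_envelope_carrier(subband, order=80):
    """Split one sub-band signal into an AR (FDLP) envelope and a carrier.

    Minimal sketch: linear prediction on the DCT of a time-domain
    signal models its Hilbert envelope, in the same way time-domain
    LP models the power spectral envelope. order=80 is an
    illustrative choice, not the paper's setting.
    """
    x = np.asarray(subband, dtype=float)
    n = len(x)

    # 1) Type-II DCT maps the sub-band signal to the frequency domain.
    q = dct(x, type=2, norm='ortho')

    # 2) Yule-Walker equations on the DCT sequence, solved via the
    #    Toeplitz structure of its autocorrelation matrix.
    r = np.correlate(q, q, mode='full')[n - 1:] / n
    a = solve_toeplitz(r[:order], r[1:order + 1])   # AR coefficients
    g = r[0] - a @ r[1:order + 1]                   # prediction-error power

    # 3) The AR model's magnitude response, evaluated on a grid that
    #    now corresponds to TIME, estimates the Hilbert envelope.
    k = np.arange(1, order + 1)
    grid = np.pi * np.arange(n) / n
    A = 1.0 - np.exp(-1j * np.outer(grid, k)) @ a
    env = np.sqrt(np.maximum(g, 1e-12)) / np.abs(A)

    # 4) Carrier = analytic signal with the envelope divided out,
    #    so that x is approximately Re(env * carrier).
    carrier = hilbert(x) / np.maximum(env, 1e-12)
    return env, carrier
```

Reconstruction then multiplies the (enhanced) envelope back onto the carrier in each sub-band and sums the synthesized sub-band signals, which is the role of the modulation-and-synthesis step described in the abstract.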
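The abstract specifies the DPLSTM only as a dual-path LSTM that jointly enhances the envelope and carrier. Below is a hedged PyTorch sketch of one generic dual-path block in the spirit of dual-path RNNs: one recurrent pass along time within each sub-band, then one across sub-bands at each frame. The layer sizes, normalization, and residual wiring are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualPathLSTMBlock(nn.Module):
    """One dual-path block over a (batch, bands, time, feature) tensor:
    a bidirectional LSTM along time within each sub-band, then one
    across sub-bands per frame, each with projection + residual."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.time_lstm = nn.LSTM(dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, dim)
        self.band_lstm = nn.LSTM(dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.band_proj = nn.Linear(2 * hidden, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        b, s, t, d = x.shape
        # Intra path: model each sub-band's trajectory over time.
        y = x.reshape(b * s, t, d)
        y, _ = self.time_lstm(y)
        y = self.time_proj(y).reshape(b, s, t, d)
        x = self.norm1(x + y)
        # Inter path: model correlations across sub-bands per frame.
        z = x.transpose(1, 2).reshape(b * t, s, d)
        z, _ = self.band_lstm(z)
        z = self.band_proj(z).reshape(b, t, s, d).transpose(1, 2)
        return self.norm2(x + z)

# Shape check: 2 utterances, 36 sub-bands, 100 frames, 64 features.
# DualPathLSTMBlock(64, 128)(torch.randn(2, 36, 100, 64)) -> same shape.
```

Stacking a few such blocks and training them jointly with an E2E ASR loss is one way to realize the joint dereverberation-plus-recognition learning the abstract describes.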

Authors (4)
  1. Anurenjan Purushothaman
  2. Debottam Dutta
  3. Rohit Kumar
  4. Sriram Ganapathy