Statistical Beamformer Exploiting Non-stationarity and Sparsity with Spatially Constrained ICA for Robust Speech Recognition (2306.07562v3)

Published 13 Jun 2023 in eess.AS and cs.SD

Abstract: In this paper, we present a statistical beamforming algorithm as a pre-processing step for robust automatic speech recognition (ASR). By modeling the target speech with a non-stationary Laplacian distribution, we propose a mask-based statistical beamforming algorithm that exploits both the beamformer output and the masked input variance for robust estimation of the beamformer. In addition, we present a method for steering vector estimation (SVE) based on a noise power ratio obtained from the target and noise outputs of independent component analysis (ICA). To update the beamformer in the same ICA framework, we derive ICA with distortionless and null constraints on the target speech, which yields beamformed speech at the target output and noise at the other outputs. The demixing weights for the target output form a statistical beamformer with a weighted spatial covariance matrix (wSCM), whose weighting function is characterized by the source model. To enhance the SVE, the strict null constraints imposed by Lagrange multiplier methods are relaxed into generalized penalties with weight parameters, while the strict distortionless constraints are maintained. Furthermore, we derive an online algorithm based on recursive least squares (RLS) for practical applications. Experimental results in various environments on the CHiME-4 and LibriCSS datasets demonstrate the effectiveness of the presented algorithm compared to conventional beamforming and ICA-based blind source extraction (BSE) in both batch and online processing.
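
As a rough illustration of the core recipe, the sketch below (NumPy, single frequency bin) builds a weighted spatial covariance matrix from per-frame weights, derives a distortionless (MVDR-style) beamformer from it, and alternates between beamforming and re-estimating the weights under a Laplacian-style source model (weight proportional to 1/|y_t|, the usual auxiliary-function weighting for a Laplacian prior). The mask, the steering vector `h`, and the way the masked input magnitude is blended into the per-frame scale are placeholder assumptions for illustration only; the paper's actual weighting function, SVE, and constrained-ICA updates are more involved.

```python
import numpy as np

# Toy data: M microphones, T STFT frames at a single frequency bin.
rng = np.random.default_rng(0)
M, T = 4, 200
X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
mask = rng.uniform(size=T)        # hypothetical speech-presence mask in [0, 1]
h = np.ones(M, dtype=complex)     # hypothetical steering vector (would come from SVE)

def weighted_scm(X, w):
    """Weighted spatial covariance matrix: sum_t w_t x_t x_t^H / sum_t w_t."""
    return (X * w) @ X.conj().T / w.sum()

def distortionless_bf(R, h):
    """MVDR-style beamformer w = R^{-1} h / (h^H R^{-1} h), so that w^H h = 1."""
    Ri_h = np.linalg.solve(R, h)
    return Ri_h / (h.conj() @ Ri_h)

# Alternate between beamforming and re-estimating per-frame weights from a
# Laplacian-style source model (weight ~ 1 / |y_t|); the masked input
# magnitude is blended in as a crude stand-in for the masked input variance.
w_bf = distortionless_bf(np.eye(M, dtype=complex), h)
for _ in range(10):
    y = w_bf.conj() @ X                                  # beamformer output, shape (T,)
    scale = np.maximum(np.abs(y), mask * np.abs(X).mean(axis=0))
    frame_w = 1.0 / np.maximum(scale, 1e-6)              # Laplacian weighting
    w_bf = distortionless_bf(weighted_scm(X, frame_w), h)
```

For the online variant mentioned in the abstract, the batch wSCM would be replaced by an RLS-style recursion of the form R_w <- lambda * R_w + w_t * x_t x_t^H, with its inverse maintained recursively; that machinery is omitted from this sketch.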
