
Hierarchical Modeling of Spatial Cues via Spherical Harmonics for Multi-Channel Speech Enhancement (2309.10393v1)

Published 19 Sep 2023 in cs.SD and eess.AS

Abstract: Multi-channel speech enhancement exploits spatial information from multiple microphones to extract the target speech. However, most existing methods do not explicitly model spatial cues, instead relying on implicit learning from multi-channel spectra. To better leverage spatial information, we propose explicitly incorporating spatial modeling by applying spherical harmonic transforms (SHT) to the multi-channel input. Specifically, a hierarchical framework is introduced in which lower-order harmonics capturing broader spatial patterns are estimated first, then combined with higher orders to recursively predict finer spatial details. Experiments on TIMIT demonstrate that the proposed method can effectively recover target spatial patterns and achieve improved performance over baseline models, using fewer parameters and computations. Explicitly modeling spatial information hierarchically enables more effective multi-channel speech enhancement.
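The core preprocessing step the abstract describes is a spherical harmonic transform of the multi-channel input. As a rough illustration only (not the paper's implementation), the sketch below builds an SHT basis for a microphone array with known directions and projects the mic signals onto it via least squares; the function names and the least-squares formulation are assumptions, and a real array would also account for the rigid-sphere mode strengths.

```python
import numpy as np
from scipy.special import sph_harm  # Y_n^m(azimuth, colatitude)

def sht_matrix(order, azimuth, colatitude):
    """Build the (num_mics x (order+1)^2) spherical-harmonic basis matrix.

    azimuth, colatitude: arrays of mic directions in radians
    (azimuth in [0, 2*pi), colatitude in [0, pi]).
    """
    cols = []
    for n in range(order + 1):          # harmonic order
        for m in range(-n, n + 1):      # degree within each order
            cols.append(sph_harm(m, n, azimuth, colatitude))
    return np.stack(cols, axis=1)

def forward_sht(signals, basis):
    """Least-squares SHT: project mic signals onto the SH basis.

    signals: (num_mics, num_frames) array of (complex) spectra;
    returns ((order+1)^2, num_frames) SH coefficients.
    """
    return np.linalg.pinv(basis) @ signals
```

In this framing, the hierarchical model in the paper would first estimate the low-order coefficients (rows for small n, which encode broad spatial patterns) and then refine with the higher-order rows; note that recovering coefficients up to order N requires at least (N+1)^2 well-distributed microphones.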

Authors (4)
  1. Jiahui Pan (7 papers)
  2. Shulin He (16 papers)
  3. Hui Zhang (405 papers)
  4. Xueliang Zhang (39 papers)