Subspace Hybrid MVDR Beamforming for Augmented Hearing (2311.18689v1)
Abstract: Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario, whether real-world or simulated, is straightforward in terms of the number of sound sources, the ambient sound field and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforward. The design of robust, high-performance, adaptive beamformers for such scenarios is an ongoing challenge, because the assumptions typically required of the noise field are violated by, for example, rapid variations arising from complex acoustic environments and/or rotations of the listener's head. This work proposes a multi-channel speech enhancement algorithm that utilises the adaptability of signal-dependent beamformers while still benefiting from the computational efficiency and robust performance of signal-independent super-directive beamformers. The algorithm has two stages. (i) The first stage is a hybrid beamformer based on a dictionary of weights corresponding to a set of noise field models. (ii) The second stage is a wide-band subspace post-filter that removes any artifacts resulting from (i). The algorithm is evaluated using both real-world recordings and simulations of a cocktail-party scenario. Noise suppression, intelligibility and speech quality results show a significant performance improvement for the proposed algorithm compared to the baseline super-directive beamformer. A data-driven implementation of the noise field dictionary is shown to provide more noise suppression, and similar speech intelligibility and quality, compared to a parametric dictionary.
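The two-stage structure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-bin steering vectors, the two noise-field models in the dictionary, the Frobenius-distance selection rule, and the rank-1 SVD truncation standing in for the wide-band subspace post-filter are all illustrative choices introduced here.

```python
# Sketch of the two-stage idea: (i) dictionary-based hybrid MVDR beamformer,
# (ii) wide-band subspace post-filter. Illustrative only; selection rule and
# post-filter are assumptions, not the paper's exact method.
import numpy as np

def mvdr_weights(R_noise, d, loading=1e-3):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d), with diagonal loading."""
    M = R_noise.shape[0]
    R = R_noise + loading * (np.trace(R_noise).real / M) * np.eye(M)
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def select_weights(weight_dict, noise_models, R_est):
    """Stage (i): pick the precomputed weights whose noise-field model best
    matches the estimated noise covariance (smallest Frobenius distance after
    trace normalisation; an assumed, illustrative selection criterion)."""
    def norm(R):
        return R / np.trace(R).real
    errs = [np.linalg.norm(norm(R_est) - norm(R), 'fro') for R in noise_models]
    return weight_dict[int(np.argmin(errs))]

def subspace_postfilter(Y, rank=1):
    """Stage (ii), caricatured: keep only the dominant subspace of the stacked
    (frequency x frame) beamformer output to suppress residual artifacts."""
    U, s, Vh = np.linalg.svd(Y, full_matrices=False)
    s[rank:] = 0.0
    return (U * s) @ Vh

# Toy usage: 8 frequency bins, 4 microphones, 32 STFT frames of random data.
rng = np.random.default_rng(0)
K, M, T = 8, 4, 32
noise_models = [np.eye(M, dtype=complex),                                   # spatially white
                (0.5 * np.ones((M, M)) + 0.5 * np.eye(M)).astype(complex)]  # diffuse-like
Y = np.zeros((K, T), dtype=complex)
for k in range(K):
    d = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # steering vector for bin k
    weight_dict = [mvdr_weights(R, d) for R in noise_models]   # one weight vector per model
    X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
    R_est = (X @ X.conj().T) / T                               # estimated noise covariance
    w = select_weights(weight_dict, noise_models, R_est)
    Y[k] = w.conj() @ X                                        # stage (i) output for bin k
Y_enhanced = subspace_postfilter(Y)                            # stage (ii) wide-band post-filter
```

Precomputing the dictionary of weights is what keeps the runtime cost close to that of a fixed super-directive beamformer: only the model selection and the post-filter are computed online.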
Authors: Sina Hafezi, Alastair H. Moore, Pierre H. Guiraud, Patrick A. Naylor, Jacob Donley, Vladimir Tourbabin, Thomas Lunner