NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection (2312.07513v1)
Abstract: Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker mixture, where the attention is derived from cortical activity, usually recorded with electroencephalography (EEG) devices. Though promising, current methods often suffer from a high speaker confusion error, in which the interfering speaker is extracted instead of the attended one, degrading the listening experience. In this work, we aim to reduce the speaker confusion error of the neuro-steered speaker extraction model through a jointly fine-tuned auxiliary auditory attention detection (AAD) model. The auxiliary model reinforces the consistency between the extracted target speech signal and the EEG representation, and also improves the EEG representation itself. Experimental results show that the proposed network significantly outperforms the baseline in terms of both speaker confusion and overall signal quality in two-talker scenarios.
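The abstract does not spell out the training objective, but the joint fine-tuning idea can be illustrated with a minimal PyTorch-style sketch: an extraction loss (negative SI-SDR, a standard choice for time-domain speaker extraction) combined with an auxiliary AAD term that scores whether the extracted speech is consistent with the EEG representation. The `joint_loss` function, the `alpha` weight, and the binary match/mismatch formulation below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimated and reference speech.

    est, ref: (batch, samples) waveforms.
    """
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the scale-invariant target.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    noise = est - target
    si_sdr = 10 * torch.log10(
        target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -si_sdr.mean()

def joint_loss(est_speech, ref_speech, aad_logits, aad_labels, alpha=0.1):
    """Extraction loss plus an auxiliary AAD (match/mismatch) term.

    aad_logits: output of a hypothetical AAD head scoring agreement between
                the extracted speech and the EEG representation, shape (batch,).
    aad_labels: 1 for attended (match) pairs, 0 for mismatch pairs.
    alpha: weight of the auxiliary term (a free hyperparameter here).
    """
    l_ext = si_sdr_loss(est_speech, ref_speech)
    l_aad = F.binary_cross_entropy_with_logits(aad_logits, aad_labels.float())
    return l_ext + alpha * l_aad
```

In such a setup, gradients from the AAD term flow back into both the extraction network and the EEG encoder, which is one plausible reading of how the auxiliary model "reinforces consistency" and "improves the EEG representation" as described above.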