Sparsity-Driven EEG Channel Selection for Brain-Assisted Speech Enhancement (2311.13436v3)
Abstract: Speech enhancement is widely used as a front-end to improve speech quality in many audio systems, but it is hard to extract the target speech in multi-talker conditions without prior information on the speaker identity. It has been shown that the listener's auditory attention to the target speaker can be decoded implicitly from the listener's electroencephalogram (EEG). In this work, we therefore propose a novel end-to-end brain-assisted speech enhancement network (BASEN), which incorporates the listener's EEG signals and adopts a temporal convolutional network together with a convolutional multi-layer cross-attention module to fuse EEG and audio features. Considering that an EEG cap with few channels has multiple practical benefits and that in practice many electrodes contribute only marginally, we further propose two channel selection methods, called residual Gumbel selection and convolutional regularization selection, which are dedicated to tackling training instability and duplicated channel selections, respectively. Experimental results on a public dataset show the superiority of the proposed BASEN over existing approaches. The proposed channel selection methods can significantly reduce the number of informative EEG channels with a negligible impact on performance.
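To make the channel-selection idea concrete, below is a minimal sketch of learnable EEG channel selection based on the Gumbel-softmax relaxation, the general mechanism that the proposed residual Gumbel selection builds on. This is an illustrative assumption of how such a layer can be wired in PyTorch, not the authors' implementation; the class name `GumbelChannelSelector`, the channel counts, and the fixed temperature are hypothetical.

```python
# Minimal sketch (assumed, not the paper's exact method) of Gumbel-softmax
# based EEG channel selection: a learnable, relaxed one-hot selection matrix
# picks a subset of input channels in a differentiable way.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelChannelSelector(nn.Module):
    """Selects `n_selected` out of `n_channels` EEG channels via a
    Gumbel-softmax relaxed selection matrix (hypothetical example)."""

    def __init__(self, n_channels: int, n_selected: int, temperature: float = 1.0):
        super().__init__()
        # One categorical distribution over input channels per selected output channel.
        self.logits = nn.Parameter(torch.zeros(n_selected, n_channels))
        self.temperature = temperature

    def forward(self, eeg: torch.Tensor) -> torch.Tensor:
        # eeg: (batch, n_channels, time)
        if self.training:
            # Soft, differentiable (approximately one-hot) rows during training.
            weights = F.gumbel_softmax(self.logits, tau=self.temperature, hard=False)
        else:
            # Hard selection at test time: keep the most likely channel per row.
            weights = F.one_hot(self.logits.argmax(dim=-1),
                                num_classes=self.logits.shape[-1]).float()
        # (n_selected, n_channels) x (batch, n_channels, time) -> (batch, n_selected, time)
        return torch.einsum('sc,bct->bst', weights, eeg)


if __name__ == "__main__":
    selector = GumbelChannelSelector(n_channels=128, n_selected=16)
    x = torch.randn(4, 128, 256)   # 4 EEG segments, 128 channels, 256 samples
    y = selector(x)
    print(y.shape)                 # torch.Size([4, 16, 256])
```

Because the relaxed weights stay differentiable during training, such a selector can be trained jointly with the enhancement network; at test time each row collapses to a hard one-hot choice. Note that nothing in this sketch prevents two rows from picking the same electrode, which is exactly the duplicated-selection issue that the proposed convolutional regularization selection is designed to address.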