Binaural multichannel blind speaker separation with a causal low-latency and low-complexity approach (2312.05173v1)
Abstract: In this paper, we introduce a causal, low-latency, low-complexity approach for binaural multichannel blind speaker separation in noisy reverberant conditions. The model, referred to as Group Communication Binaural Filter and Sum Network (GCBFSnet), predicts complex filters for filter-and-sum beamforming in the time-frequency domain. We apply Group Communication (GC), i.e., the latent model variables are split into groups and processed with a shared sequence model, with the aim of reducing the complexity of a simple model containing only one convolutional and one recurrent module. With GC, we reduce the model size by up to 83 % and the complexity by up to 73 % compared to the model without GC, while mostly retaining performance. Even in its smallest configuration, GCBFSnet matches the performance of a low-complexity TasNet baseline in most metrics, despite the baseline's larger size and higher number of required operations.
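To make the two mechanisms named in the abstract concrete, the sketch below illustrates (a) Group Communication, where the latent feature vector is split into groups that are all processed by one shared small sequence model, and (b) complex filter-and-sum beamforming in the time-frequency domain. This is a minimal PyTorch-style illustration based only on the abstract's description; the module names, the GRU choice, and all dimensions are assumptions, not the paper's exact GCBFSnet architecture.

```python
import torch
import torch.nn as nn


class GroupComm(nn.Module):
    """Group Communication (GC) sketch: the latent feature vector is split
    into groups, and every group is processed by the *same* small sequence
    model, so the parameter count scales with the group size rather than
    the full feature dimension. Illustrative only, not the paper's model."""

    def __init__(self, feat_dim: int, num_groups: int, hidden: int):
        super().__init__()
        assert feat_dim % num_groups == 0, "feat_dim must be divisible by num_groups"
        self.num_groups = num_groups
        self.group_dim = feat_dim // num_groups
        # One shared (unidirectional, hence causal) GRU applied to every group.
        self.shared_rnn = nn.GRU(self.group_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, self.group_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim)
        B, T, _ = x.shape
        g = x.view(B, T, self.num_groups, self.group_dim)         # split into groups
        g = g.permute(0, 2, 1, 3).reshape(B * self.num_groups, T, self.group_dim)
        y, _ = self.shared_rnn(g)                                 # shared sequence model
        y = self.proj(y)
        y = y.view(B, self.num_groups, T, self.group_dim).permute(0, 2, 1, 3)
        return y.reshape(B, T, self.num_groups * self.group_dim)  # merge groups back


def filter_and_sum(stft_mix: torch.Tensor, filters: torch.Tensor) -> torch.Tensor:
    """Complex filter-and-sum beamforming in the TF domain: weight each
    microphone channel with a predicted complex filter, then sum over channels.
    stft_mix, filters: complex tensors of shape (batch, mics, freq, frames)."""
    return (filters * stft_mix).sum(dim=1)
```

Sharing a single GRU across all groups is what yields the parameter and complexity reduction reported in the abstract: the recurrent weights depend on the group dimension, not on the full latent dimension.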