Binaural multichannel blind speaker separation with a causal low-latency and low-complexity approach (2312.05173v1)

Published 8 Dec 2023 in eess.AS and cs.SD

Abstract: In this paper, we introduce a causal, low-latency, low-complexity approach for binaural multichannel blind speaker separation in noisy reverberant conditions. The model, referred to as the Group Communication Binaural Filter and Sum Network (GCBFSnet), predicts complex filters for filter-and-sum beamforming in the time-frequency domain. We apply Group Communication (GC), i.e., the latent model variables are split into groups and processed with a shared sequence model, with the aim of reducing the complexity of a simple model containing only one convolutional and one recurrent module. With GC we are able to reduce the size of the model by up to 83 % and the complexity by up to 73 % compared to the model without GC, while mostly retaining performance. Even in its smallest configuration, GCBFSnet matches the performance of a low-complexity TasNet baseline in most metrics, despite the baseline's larger size and higher number of required operations.
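The abstract names two core mechanisms: predicting complex filters for filter-and-sum beamforming in the time-frequency domain, and Group Communication, where latent features are split into groups and processed by a single shared sequence model. The following is a minimal PyTorch sketch of both ideas for illustration only; the class names, layer sizes, and wiring are assumptions and not the authors' implementation.

```python
# Illustrative sketch (not the paper's code) of (1) Group Communication:
# split latent features into groups and process them with one *shared* GRU,
# and (2) complex filter-and-sum beamforming in the time-frequency domain.
# All dimensions and names are hypothetical.

import torch
import torch.nn as nn


class GroupCommunication(nn.Module):
    """Process a latent feature vector group-wise with one shared GRU."""

    def __init__(self, feature_dim: int, num_groups: int, hidden: int = 32):
        super().__init__()
        assert feature_dim % num_groups == 0
        self.num_groups = num_groups
        self.group_dim = feature_dim // num_groups
        # A single sequence model shared across all groups keeps the
        # parameter count small; this sharing is what enables the kind of
        # size/complexity reduction described in the abstract.
        self.shared_rnn = nn.GRU(self.group_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, self.group_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim)
        b, t, f = x.shape
        g = self.num_groups
        # Split the feature axis into groups and fold the group axis into
        # the batch, so every group runs through the same RNN weights.
        x = x.reshape(b, t, g, self.group_dim).permute(0, 2, 1, 3)
        x = x.reshape(b * g, t, self.group_dim)
        y, _ = self.shared_rnn(x)            # unidirectional, hence causal
        y = self.proj(y)
        y = y.reshape(b, g, t, self.group_dim).permute(0, 2, 1, 3)
        return y.reshape(b, t, f)


def filter_and_sum(stft_mix: torch.Tensor, filters: torch.Tensor) -> torch.Tensor:
    """Apply predicted complex filters to a multichannel STFT and sum channels.

    stft_mix: (batch, mics, freq, time) complex mixture spectrogram
    filters:  (batch, mics, freq, time) complex filters predicted by the network
    returns:  (batch, freq, time) beamformed estimate for one output channel
    """
    return (filters * stft_mix).sum(dim=1)
```

Sharing one small recurrent module across all groups shrinks parameters and per-frame operations roughly in proportion to the number of groups, which is the general mechanism behind the size and complexity reductions the abstract reports.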
