
Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor (2401.12473v1)

Published 23 Jan 2024 in eess.AS and cs.SD

Abstract: We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best results reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2mix and WSJ0-3mix, respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
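
The sketch below illustrates the idea behind the TDA and FiLM conditioning stages described in the abstract: a fixed set of learned speaker queries cross-attends to the mixture embedding in a transformer decoder, yielding one attractor (plus an existence score for speaker counting) per query, and each attractor modulates the mixture embedding to create a speaker dimension for the triple-path blocks. This is a minimal, hypothetical PyTorch rendering; the module name, layer sizes, the use of nn.TransformerDecoder, and the existence head are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TDAConditioner(nn.Module):
    """Hypothetical sketch of the TDA + FiLM conditioning stage."""

    def __init__(self, emb_dim=128, max_speakers=5, n_layers=2, n_heads=4):
        super().__init__()
        # Fixed, small set of learned speaker queries (one per possible source).
        self.queries = nn.Parameter(torch.randn(max_speakers, emb_dim))
        dec_layer = nn.TransformerDecoderLayer(
            d_model=emb_dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        # Per-attractor existence probability, usable for speaker counting.
        self.exist_head = nn.Linear(emb_dim, 1)
        # FiLM: each attractor is mapped to a scale and a shift.
        self.film = nn.Linear(emb_dim, 2 * emb_dim)

    def forward(self, mix_emb):
        # mix_emb: (batch, T*F, emb_dim) mixture embedding from the dual-path
        # blocks, flattened along the spectro-temporal axes.
        B = mix_emb.size(0)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries cross-attend to the mixture embedding (and to each other),
        # producing one attractor vector per query.
        attractors = self.decoder(tgt=queries, memory=mix_emb)   # (B, S, D)
        exist_logits = self.exist_head(attractors).squeeze(-1)   # (B, S)
        # FiLM conditioning: broadcast each attractor's scale/shift over the
        # mixture embedding, creating a speaker dimension.
        gamma, beta = self.film(attractors).chunk(2, dim=-1)     # (B, S, D)
        cond = gamma.unsqueeze(2) * mix_emb.unsqueeze(1) + beta.unsqueeze(2)
        return cond, exist_logits  # cond: (B, S, T*F, D) for triple-path blocks
```

In this reading, the triple-path blocks would then process cond along the spectro-temporal axes as before, plus along the new speaker axis, and the existence logits would select how many of the conditioned streams to decode.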

Citations (6)
