Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor (2401.12473v1)
Abstract: We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) dual-path processing blocks that model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that handles an unknown number of speakers, and 3) triple-path processing blocks that model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations among these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding via feature-wise linear modulation (FiLM) conditioning, creating a speaker dimension. The mixture embedding, conditioned with the speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best results reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2mix and WSJ0-3mix, respectively, with a single model trained to separate both 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
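To make the TDA-plus-FiLM pipeline concrete, here is a minimal PyTorch sketch of the attractor calculation and speaker conditioning described above. This is not the authors' implementation: the class name `TDA`, the embedding shape `(batch, frames, dim)`, and all hyperparameter values are illustrative assumptions; only the overall flow (learned speaker queries cross-attending to the mixture embedding via a transformer decoder, then FiLM conditioning that creates a speaker dimension) follows the abstract.

```python
import torch
import torch.nn as nn

class TDA(nn.Module):
    """Sketch of a transformer decoder-based attractor (TDA) module.

    A fixed set of learned speaker queries cross-attends to the mixture
    embedding; each output query becomes one speaker attractor, which then
    FiLM-conditions the mixture embedding to create a speaker dimension.
    All sizes below are assumed for illustration, not taken from the paper.
    """

    def __init__(self, embed_dim: int = 128, max_speakers: int = 6,
                 num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        # One learned query per potential speaker (a fixed, small set).
        self.queries = nn.Parameter(torch.randn(max_speakers, embed_dim))
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # FiLM scale/shift derived from each attractor (hypothetical heads).
        self.to_scale = nn.Linear(embed_dim, embed_dim)
        self.to_shift = nn.Linear(embed_dim, embed_dim)

    def forward(self, mixture_emb: torch.Tensor):
        # mixture_emb: (batch, frames, embed_dim) from the dual-path blocks.
        b = mixture_emb.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend to the mixture embedding; outputs are attractors.
        attractors = self.decoder(tgt=queries, memory=mixture_emb)

        # FiLM conditioning: modulate the mixture embedding per speaker,
        # yielding a speaker dimension: (batch, speakers, frames, dim).
        scale = self.to_scale(attractors).unsqueeze(2)  # (b, S, 1, d)
        shift = self.to_shift(attractors).unsqueeze(2)  # (b, S, 1, d)
        conditioned = scale * mixture_emb.unsqueeze(1) + shift
        return attractors, conditioned

if __name__ == "__main__":
    tda = TDA()
    x = torch.randn(2, 250, 128)            # dummy mixture embedding
    attractors, conditioned = tda(x)
    print(attractors.shape)                 # torch.Size([2, 6, 128])
    print(conditioned.shape)                # torch.Size([2, 6, 250, 128])
```

In the full model, the conditioned tensor would then be processed by the triple-path blocks, whose extra pathway operates along the speaker dimension created here; speaker counting (deciding how many of the `max_speakers` slots are active) is a separate step not shown in this sketch.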