EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings (2312.06065v1)
Abstract: In recent years, numerous studies have sought to further improve end-to-end neural speaker diarization (EEND) systems. This letter proposes the EEND-DEMUX model, a novel framework utilizing demultiplexed speaker embeddings. In this work, we focus on disentangling speaker-relevant information in the latent space and then transforming each separated latent variable into its corresponding speech activity. EEND-DEMUX can directly obtain separated speaker embeddings through the demultiplexing operation in the inference phase, without an external speaker diarization system, an embedding extractor, or a heuristic decoding technique. Furthermore, we employ a multi-head cross-attention mechanism to effectively capture the correlation between the mixture and separated speaker embeddings. We formulate three loss functions based on matching, orthogonality, and sparsity constraints to learn robust demultiplexed speaker embeddings. Experimental results on the LibriMix dataset show consistently improved performance in both the fixed and flexible number-of-speakers scenarios.
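To make the described components more concrete, the following is a minimal PyTorch-style sketch of a multi-head cross-attention block between mixture and demultiplexed speaker embeddings, together with plausible orthogonality and sparsity penalties. This is an illustration under assumptions, not the authors' implementation: the module name `DemuxCrossAttention`, the tensor shapes, and the exact loss formulations are hypothetical, and the matching loss is omitted because its definition is not given in the abstract.

```python
# Minimal PyTorch-style sketch, NOT the authors' implementation: module names,
# tensor shapes, and loss formulations are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DemuxCrossAttention(nn.Module):
    """Multi-head cross-attention: demultiplexed speaker embeddings attend to
    the shared mixture embeddings (queries = speaker, keys/values = mixture)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speaker_emb: torch.Tensor, mixture_emb: torch.Tensor) -> torch.Tensor:
        # speaker_emb: (batch, frames, dim)  one speaker's demultiplexed embeddings
        # mixture_emb: (batch, frames, dim)  mixture embeddings from the encoder
        refined, _ = self.attn(speaker_emb, mixture_emb, mixture_emb)
        return refined


def orthogonality_loss(spk_embs: torch.Tensor) -> torch.Tensor:
    # spk_embs: (batch, speakers, frames, dim); push embeddings of different
    # speakers toward mutual orthogonality by penalizing off-diagonal cosine
    # similarities of the per-frame Gram matrix.
    e = F.normalize(spk_embs, dim=-1)
    gram = torch.einsum('bsfd,btfd->bstf', e, e)
    off_diag = ~torch.eye(spk_embs.size(1), dtype=torch.bool, device=spk_embs.device)
    return gram[:, off_diag].abs().mean()


def sparsity_loss(spk_embs: torch.Tensor) -> torch.Tensor:
    # L1 penalty encouraging sparse demultiplexed embeddings.
    return spk_embs.abs().mean()


if __name__ == "__main__":
    batch, speakers, frames, dim = 2, 3, 50, 256
    mixture = torch.randn(batch, frames, dim)
    spk = torch.randn(batch, speakers, frames, dim)

    attn = DemuxCrossAttention(dim)
    refined = torch.stack([attn(spk[:, s], mixture) for s in range(speakers)], dim=1)

    print(refined.shape)                 # (2, 3, 50, 256)
    print(orthogonality_loss(refined))   # scalar orthogonality penalty
    print(sparsity_loss(refined))        # scalar sparsity penalty
```

In practice, each per-speaker branch would be followed by a classifier mapping the refined embeddings to frame-level speech activities; that head and the matching constraint are left out here since the abstract does not specify them.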