DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors (2312.04324v3)
Abstract: Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: namely, better performance on the widely studied Callhome dataset, more accurate estimation of the number of speakers in a conversation, and faster inference. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. In addition, we compare DiaPer with other works and with a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.
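The core idea the abstract describes, replacing EDA with a Perceiver-based attractor decoder, can be illustrated with a toy sketch: a fixed set of latent vectors repeatedly cross-attends to the frame embeddings to produce attractors, and per-frame speaker activities are then obtained from frame-attractor dot products, as in EEND-EDA-style decoding. All names, shapes, and the single-head, iteration-only structure below are illustrative simplifications, not DiaPer's actual architecture (which also includes self-attention and feed-forward blocks, and learned parameters).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_attractors(frames, latents, n_iters=2):
    """Toy Perceiver-style decoding: latent queries cross-attend to the
    frame embeddings for a few iterations, yielding one attractor per
    latent (hypothetical simplification of the real blocks)."""
    d = frames.shape[-1]
    att = latents
    for _ in range(n_iters):
        scores = att @ frames.T / np.sqrt(d)      # (A, T) attention logits
        att = softmax(scores, axis=-1) @ frames   # attractors as frame mixtures
    return att

def speaker_activities(frames, attractors):
    """Per-frame, per-attractor activity: sigmoid of dot products."""
    logits = frames @ attractors.T                # (T, A)
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
T, d, A = 50, 16, 3                # frames, embedding dim, max attractors
frames = rng.standard_normal((T, d))
latents = rng.standard_normal((A, d))
att = perceiver_attractors(frames, latents)
act = speaker_activities(frames, att)
print(att.shape, act.shape)        # (3, 16) (50, 3)
```

Because the number of latents is fixed, the model always emits the same number of attractors; deciding how many correspond to actual speakers is handled by a separate attractor-existence mechanism in the paper.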
- Federico Landini
- Mireia Diez
- Themos Stafylakis
- Lukáš Burget