
DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors (2312.04324v3)

Published 7 Dec 2023 in eess.AS and cs.SD

Abstract: Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have lately gained great popularity. One of the most successful is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA: better performance on the widely studied Callhome dataset, more accurate estimation of the number of speakers in a conversation, and faster inference. Furthermore, in an exhaustive comparison with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. In addition, we compare against other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.
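The core idea the abstract describes, replacing the encoder-decoder attractor module with a Perceiver-style one, can be illustrated with a toy sketch: a fixed set of learnable latent queries cross-attends to the sequence of frame embeddings to produce speaker attractors, and per-frame speaker activities come from dot products between frames and attractors. This is a minimal NumPy sketch under illustrative assumptions (single attention step, no learned projections, random inputs), not the authors' DiaPer implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_attractors(frames, latents):
    """One Perceiver-style cross-attention step (illustrative, unprojected):
    latent attractor queries attend over the frame embeddings."""
    d = frames.shape[-1]
    attn = softmax(latents @ frames.T / np.sqrt(d))  # (A, T) attention weights
    return attn @ frames                             # (A, d) refined attractors

def speaker_activities(frames, attractors):
    # Per-frame, per-attractor activity via sigmoid of dot products,
    # as in attractor-based EEND-style decoding
    logits = frames @ attractors.T                   # (T, A)
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
T, A, d = 100, 3, 16                 # frames, attractors, embedding dim (toy sizes)
frames = rng.standard_normal((T, d))  # stand-in for encoder frame embeddings
latents = rng.standard_normal((A, d)) # stand-in for learnable latent queries
att = perceiver_attractors(frames, latents)
act = speaker_activities(frames, att)
print(act.shape)
```

In the real model the latents would be trained parameters, the attention would use learned query/key/value projections over multiple layers, and thresholding the activities would yield the diarization output; the sketch only shows the data flow.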

Authors (4)
  1. Federico Landini (32 papers)
  2. Mireia Diez (17 papers)
  3. Themos Stafylakis (35 papers)
  4. Lukáš Burget (45 papers)
Citations (7)
