Transformer Attractors for Robust and Efficient End-to-End Neural Diarization (2312.06253v1)

Published 11 Dec 2023 in cs.SD and eess.AS

Abstract: End-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) is a method to perform diarization in a single neural network. EDA handles the diarization of a flexible number of speakers by using an LSTM-based encoder-decoder that generates a set of speaker-wise attractors in an autoregressive manner. In this paper, we propose to replace EDA with a transformer-based attractor calculation (TA) module. TA is composed of a Combiner block and a transformer decoder. The main function of the Combiner block is to generate conversation-dependent (CD) embeddings by incorporating learned conversational information into a global set of embeddings. These CD embeddings then serve as the input to the transformer decoder. Results on public datasets show that EEND-TA achieves a 2.68% absolute DER improvement over EEND-EDA. EEND-TA inference is also 1.28 times faster than that of EEND-EDA.
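The abstract's data flow can be sketched in a few lines of numpy: a Combiner injects a conversational summary into a set of global queries, and a cross-attention step (standing in for the full transformer decoder) turns those CD embeddings into speaker-wise attractors. This is a minimal illustration, not the paper's implementation: the mean-pooled summary, the additive Combiner, and all variable names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def combiner(global_emb, conv_summary):
    # Combiner (sketch): fold conversational information into the
    # global query embeddings; additive combination is an assumption.
    return global_emb + conv_summary                    # (S, D)

def cross_attention(queries, frames, Wq, Wk, Wv):
    # One cross-attention layer standing in for the transformer decoder:
    # CD embeddings attend over frame-level encoder outputs.
    Q = queries @ Wq                                    # (S, D)
    K = frames @ Wk                                     # (T, D)
    V = frames @ Wv                                     # (T, D)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])             # (S, T)
    return softmax(scores, axis=-1) @ V                 # (S, D)

T, D, S = 50, 16, 3                      # frames, embedding dim, speaker slots
frames = rng.standard_normal((T, D))     # frame-level encoder outputs
global_emb = rng.standard_normal((S, D)) # learned global embeddings (queries)
conv_summary = frames.mean(axis=0)       # conversational summary (assumed pooling)

Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
cd = combiner(global_emb, conv_summary)            # conversation-dependent embeddings
attractors = cross_attention(cd, frames, Wq, Wk, Wv)

# As in attractor-based EEND, per-frame speaker activities come from
# frame/attractor dot products passed through a sigmoid.
posteriors = 1.0 / (1.0 + np.exp(-(frames @ attractors.T)))  # (T, S)
print(attractors.shape, posteriors.shape)
```

Because the decoder is non-autoregressive over a fixed set of queries, all attractors are produced in one pass, which is consistent with the inference speedup the abstract reports over the autoregressive LSTM decoder in EDA.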

