Transformer Attractors for Robust and Efficient End-to-End Neural Diarization (2312.06253v1)
Abstract: End-to-end neural diarization with encoder-decoder based attractors (EEND-EDA) performs diarization within a single neural network. EDA handles a flexible number of speakers by using an LSTM-based encoder-decoder that generates a set of speaker-wise attractors in an autoregressive manner. In this paper, we propose to replace EDA with a transformer-based attractor calculation (TA) module. TA is composed of a Combiner block and a Transformer decoder. The main function of the Combiner block is to generate conversation-dependent (CD) embeddings by incorporating learned conversational information into a global set of embeddings. These CD embeddings then serve as the input to the Transformer decoder. Results on public datasets show that EEND-TA achieves a 2.68% absolute DER improvement over EEND-EDA, and its inference is 1.28 times faster than that of EEND-EDA.
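The abstract's description of the TA module maps onto a short sketch. Below is a minimal, non-authoritative PyTorch illustration, assuming the Combiner fuses a per-recording conversational summary (here approximated by mean-pooled frame embeddings, an assumption) into a learned global embedding set to form CD embeddings, and that the Transformer decoder cross-attends from those CD embeddings to the frame sequence to produce all attractors in one pass. All class names, dimensions, and the pooling choice are illustrative, not the paper's exact formulation.

```python
# Hypothetical sketch of a transformer-based attractor (TA) module.
# Module names, shapes, and the mean-pooled summary are assumptions
# for illustration; consult the paper for the actual architecture.
import torch
import torch.nn as nn


class Combiner(nn.Module):
    """Fuses a conversation summary into a global set of learned
    embeddings, yielding conversation-dependent (CD) embeddings."""

    def __init__(self, dim: int, num_attractors: int):
        super().__init__()
        # Global embeddings shared across all recordings (assumption).
        self.global_embeddings = nn.Parameter(torch.randn(num_attractors, dim))
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, D) encoder output for one recording.
        summary = frames.mean(dim=0)  # (D,) conversational summary (assumed)
        summary = summary.expand_as(self.global_embeddings)  # (A, D)
        # Concatenate and project: one CD embedding per candidate speaker.
        return self.proj(torch.cat([self.global_embeddings, summary], dim=-1))


class TransformerAttractor(nn.Module):
    """CD embeddings act as decoder queries; frame embeddings as memory."""

    def __init__(self, dim: int = 256, num_attractors: int = 4,
                 heads: int = 4, layers: int = 2):
        super().__init__()
        self.combiner = Combiner(dim, num_attractors)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        queries = self.combiner(frames).unsqueeze(1)     # (A, 1, D)
        memory = frames.unsqueeze(1)                     # (T, 1, D)
        # Single forward pass: all attractors produced at once,
        # unlike EDA's one-at-a-time autoregressive LSTM decoding.
        return self.decoder(queries, memory).squeeze(1)  # (A, D)


# Usage: dot products of frames with attractors give per-speaker
# activity posteriors, as in attractor-based EEND.
frames = torch.randn(500, 256)                      # 500 frames, 256-dim
attractors = TransformerAttractor()(frames)         # (4, 256)
activities = torch.sigmoid(frames @ attractors.T)   # (500, 4)
```

Because the decoder emits all attractors in parallel rather than one step at a time as in EDA's LSTM decoder, a single-pass design of this kind is consistent with the inference speedup the abstract reports.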