EEND-M2F: Masked-attention mask transformers for speaker diarization (2401.12600v1)
Abstract: In this paper, we make the explicit connection between image segmentation methods and end-to-end diarization methods. From these insights, we propose a novel, fully end-to-end diarization model, EEND-M2F, based on the Mask2Former architecture. Speaker representations are computed in parallel using a stack of transformer decoders, in which irrelevant frames are explicitly masked from the cross-attention using predictions from previous layers. EEND-M2F is lightweight, efficient, and truly end-to-end, as it does not require any additional diarization, speaker verification, or segmentation models to run, nor does it require running any clustering algorithms. Our model achieves state-of-the-art performance on several public datasets, such as AMI, AliMeeting, and RAMC. Most notably, our DER of 16.07% on DIHARD-III is the first major improvement upon the challenge-winning system.
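The masked cross-attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `masked_cross_attention`, the single-head formulation without learned projections, the 0.5 threshold, and the fallback that lets a query with an empty mask attend to every frame are all assumptions made for illustration. Each speaker query attends only to frames that the previous layer's prediction marked as active for that query.

```python
import numpy as np


def masked_cross_attention(queries, frames, prev_mask_logits, threshold=0.5):
    """Single-head masked cross-attention (illustrative sketch only).

    queries:          (N, D) one embedding per candidate speaker (hypothetical)
    frames:           (T, D) acoustic frame embeddings (hypothetical)
    prev_mask_logits: (N, T) per-speaker activity logits from the previous layer
    """
    d = queries.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)  # (N, T) scaled dot products

    # Keep only frames the previous layer predicted as active for each query.
    keep = 1.0 / (1.0 + np.exp(-prev_mask_logits)) >= threshold

    # Assumption: a query whose mask is empty attends to all frames,
    # so the softmax below stays well defined.
    empty = ~keep.any(axis=1, keepdims=True)
    keep = keep | empty

    # Masked frames get -inf, which becomes exactly zero weight after softmax.
    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ frames, weights


# Usage: two queries over five frames; the second query has an empty mask.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
f = rng.normal(size=(5, 4))
logits = np.array([[5.0, 5.0, -5.0, -5.0, -5.0],
                   [-5.0, -5.0, -5.0, -5.0, -5.0]])
updated, attn = masked_cross_attention(q, f, logits)
```

In this sketch the first query receives zero attention weight on frames 2 to 4, while the second query falls back to ordinary attention over all frames.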