
3D-Speaker-Toolkit: An Open-Source Toolkit for Multimodal Speaker Verification and Diarization (2403.19971v3)

Published 29 Mar 2024 in eess.AS and eess.SP

Abstract: We introduce 3D-Speaker-Toolkit, an open-source toolkit for multimodal speaker verification and diarization, designed to meet the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acoustic module extracts speaker embeddings from acoustic features, employing both fully-supervised and self-supervised learning approaches. The semantic module leverages advanced LLMs to comprehend the substance and context of spoken language, thereby augmenting the system's proficiency in distinguishing speakers through linguistic patterns. The visual module applies image processing technologies to scrutinize facial features, which bolsters the precision of speaker diarization in multi-speaker environments. Collectively, these modules empower the 3D-Speaker-Toolkit to achieve substantially improved accuracy and reliability in speaker-related tasks. With 3D-Speaker-Toolkit, we establish a new benchmark for multimodal speaker analysis. The toolkit also includes a handful of open-source state-of-the-art models and a large-scale dataset containing over 10,000 speakers. The toolkit is publicly available at https://github.com/modelscope/3D-Speaker.
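The verification task the abstract describes ultimately reduces to comparing two speaker embeddings and thresholding their similarity. The sketch below is not the toolkit's own API; it is a minimal, self-contained illustration of that scoring step, with toy vectors standing in for real model output and a hypothetical threshold value:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify(emb_enroll, emb_test, threshold=0.5):
    # Accept the trial as "same speaker" when the cosine score
    # clears the threshold; return both the decision and the score.
    score = cosine_similarity(emb_enroll, emb_test)
    return score >= threshold, score

# Toy 4-dimensional "embeddings" standing in for real model output.
decision, score = verify([0.1, 0.9, 0.2, 0.4], [0.15, 0.8, 0.25, 0.35])
```

In practice the embeddings would come from one of the toolkit's acoustic models (e.g. a supervised or self-supervised extractor), and the threshold would be calibrated on a development trial list rather than fixed a priori.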

