Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition (2309.11730v2)

Published 21 Sep 2023 in eess.AS and cs.SD

Abstract: Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets. To boost system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters, as the pretrained model must be retained during inference. Other researchers directly apply self-supervised methods such as DINO to speaker embedding learning, yet its potential on large-scale in-the-wild datasets has not been explored. In this paper, we demonstrate the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm to remove unreliable data from the pretraining set, yielding better performance with less training data. The associated pretrained models, confidence files, and pretraining and fine-tuning scripts will be made available in the Wespeaker toolkit.
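
The abstract's two technical ingredients can be pictured concretely. DINO is a self-distillation objective: a student network is trained to match the output distribution of a momentum (EMA) teacher across different augmented views of the same utterance. The PyTorch sketch below shows only the generic DINO loss under assumed hyperparameters (temperatures, centering); it is an illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits: torch.Tensor,
              teacher_logits: torch.Tensor,
              center: torch.Tensor,
              student_temp: float = 0.1,
              teacher_temp: float = 0.04) -> torch.Tensor:
    """Cross-entropy between a sharpened, centered teacher distribution
    and the student distribution (generic DINO objective; the temperature
    values here are illustrative defaults, not the paper's settings)."""
    # Teacher targets: center, sharpen with a low temperature, and
    # detach so no gradient flows into the teacher branch.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```

The confidence-based data filtering can likewise be pictured as a threshold over per-utterance confidence scores. The sketch below is hypothetical: `score_fn` (how a confidence is assigned to each utterance) and the threshold value are placeholders, since the abstract does not specify how confidences are computed.

```python
from typing import Callable, List

def filter_by_confidence(utterances: List[str],
                         score_fn: Callable[[str], float],
                         threshold: float = 0.5) -> List[str]:
    """Keep only utterances whose confidence clears the threshold,
    dropping unreliable data from the pretraining set."""
    return [utt for utt in utterances if score_fn(utt) >= threshold]
```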

References (28)
  1. “VoxCeleb: A large-scale speaker identification dataset,” in Proc. Interspeech 2017, 2017, pp. 2616–2620.
  2. “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech 2018, 2018, pp. 1086–1090.
  3. “CN-Celeb: A challenging Chinese speaker recognition dataset,” in ICASSP 2020. IEEE, 2020, pp. 7604–7608.
  4. “CN-Celeb: Multi-genre speaker recognition,” Speech Communication, vol. 137, pp. 77–91, 2022.
  5. “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  6. “wav2vec: Unsupervised pre-training for speech recognition,” in Proc. Interspeech 2019, 2019, pp. 3465–3469.
  7. “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  8. “A fine-tuned wav2vec 2.0/HuBERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,” arXiv preprint arXiv:2111.02735, 2021.
  9. “Large-scale self-supervised speech representation learning for automatic speaker verification,” in ICASSP 2022. IEEE, 2022, pp. 6147–6151.
  10. “Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters,” in ICASSP 2023. IEEE, 2023, pp. 1–5.
  11. “Towards a unified Conformer structure: From ASR to ASV task,” in ICASSP 2023. IEEE, 2023, pp. 1–5.
  12. “Pretraining Conformer with ASR for speaker verification,” in ICASSP 2023. IEEE, 2023, pp. 1–5.
  13. “A simple framework for contrastive learning of visual representations,” in ICML. PMLR, 2020, pp. 1597–1607.
  14. “Momentum contrast for unsupervised visual representation learning,” in Proc. CVPR, 2020, pp. 9729–9738.
  15. “Emerging properties in self-supervised vision transformers,” in Proc. ICCV, 2021, pp. 9650–9660.
  16. “Self-supervised curriculum learning for speaker verification,” arXiv preprint arXiv:2203.14525, 2022.
  17. “A comprehensive study on self-supervised distillation for speaker representation learning,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 599–604.
  18. “C3-DINO: Joint contrastive and non-contrastive self-supervised learning for speaker verification,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1273–1283, 2022.
  19. “WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,” in ICASSP 2022. IEEE, 2022.
  20. “Improving speaker verification with self-pretrained Transformer models,” in Proc. Interspeech 2023, 2023, pp. 5361–5365.
  21. “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Interspeech 2020, 2020, pp. 3830–3834.
  22. “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP 2023. IEEE, 2023, pp. 1–5.
  23. Yi Luo and Jianwei Yu, “Music source separation with Band-Split RNN,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  24. “The IDLab VoxSRC-20 submission: Large margin fine-tuning and quality-aware score calibration in DNN based speaker verification,” in ICASSP 2021. IEEE, 2021, pp. 5814–5818.
  25. “Analysis of score normalization in multilingual speaker recognition,” in Proc. Interspeech 2017, 2017, pp. 1567–1571.
  26. “Augmentation adversarial training for self-supervised speaker representation learning,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1253–1262, 2022.
  27. “Improving DINO-based self-supervised speaker verification with progressive cluster-aware training,” in 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE, 2023, pp. 1–5.
  28. “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in NeurIPS, 2017, vol. 30.
