Voice Conversion Augmentation for Speaker Recognition on Defective Datasets (2404.00863v1)

Published 1 Apr 2024 in eess.AS

Abstract: Modern speaker recognition systems rely on abundant, balanced datasets for classification training. In real-world applications, however, defective datasets, such as partially-labelled, small-scale, and imbalanced ones, are common. Previous work has usually proposed scenario-specific solutions from the algorithmic perspective, yet the root cause of these problems lies in the dataset imperfections themselves. To address these challenges with a unified solution, we propose the Voice Conversion Augmentation (VCA) strategy, which generates pseudo speech from the training set. To guarantee generation quality, we further design the VCA-NN (nearest neighbours) strategy, which selects source speech from utterances close to the target speech in the representation space. Experimental results on three constructed datasets demonstrate that VCA-NN effectively mitigates these dataset problems, offering a new direction for handling speaker recognition problems from the data perspective.
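The VCA-NN selection step described in the abstract can be pictured as a nearest-neighbour search in a speaker-embedding space. The following is a minimal sketch, not the authors' implementation: the embedding extractor, the embedding dimension, and the function names are assumptions; only the idea of choosing source utterances closest to the target speech by similarity in representation space comes from the abstract.

```python
import numpy as np

def vca_nn_select(target_emb, candidate_embs, k=5):
    """Pick the k candidate utterances closest to the target
    in speaker-embedding space, using cosine similarity.
    Mirrors the VCA-NN selection idea from the abstract; the
    embedding extractor itself is an assumption here."""
    # L2-normalise so the dot product equals cosine similarity
    t = target_emb / np.linalg.norm(target_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ t
    # Indices of the k most similar candidate utterances
    return np.argsort(sims)[::-1][:k]

# Toy usage with random vectors standing in for real speaker
# embeddings (a learned speaker encoder is assumed in practice).
rng = np.random.default_rng(0)
target = rng.normal(size=192)              # embedding of the target speech
candidates = rng.normal(size=(100, 192))   # pool of candidate source utterances
source_ids = vca_nn_select(target, candidates, k=5)
print("source utterances selected for voice conversion:", source_ids)
```

The selected source utterances would then be passed, together with the target speaker's speech, through a voice conversion model to synthesise the pseudo speech used to augment the defective training set.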
