
Improved Vocal Effort Transfer Vector Estimation for Vocal Effort-Robust Speaker Verification (2305.02147v3)

Published 3 May 2023 in eess.AS and cs.HC

Abstract: Despite the maturity of modern speaker verification technology, its performance still degrades significantly on non-neutrally-phonated (e.g., shouted and whispered) speech. To address this issue, we propose a new speaker embedding compensation method based on a minimum mean square error (MMSE) estimator. The method models the joint distribution of the vocal effort transfer vector and the non-neutrally-phonated embedding space, and operates in a principal component analysis (PCA) domain to cope with the scarcity of non-neutrally-phonated speech data. Experiments are carried out using a cutting-edge speaker verification system that integrates a powerful self-supervised pre-trained model for speech representation. Compared with a state-of-the-art embedding compensation method, the proposed MMSE estimator yields superior equal error rate results on shouted speech and competitive results on whispered speech.

