Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions (2312.16613v2)

Published 27 Dec 2023 in cs.SD, cs.LG, and eess.AS

Abstract: In this paper, we propose the use of self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM) encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD. The trained models are systematically evaluated on both clean speech and speech contaminated by various types of noise at different SNR levels, and compared to a purely supervised model. Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models that are more robust to adverse conditions than purely supervised learning.
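The core of the APC objective described above is simple: the encoder reads a sequence of acoustic features and is trained to predict the features several frames into the future, typically with an L1 loss. A minimal NumPy sketch of the target construction and loss is below; the function names and the toy feature matrix are illustrative assumptions, not code from the paper, and the LSTM encoder itself is omitted. In the denoising variant the same setup applies, except the inputs would be noise-corrupted features while the prediction targets remain clean.

```python
import numpy as np

def apc_targets(features: np.ndarray, shift: int):
    """Build (input, target) pairs for autoregressive predictive coding.

    features: (T, D) frame-level acoustic features (e.g. log-mel).
    shift: number of future frames to predict (n > 0).
    Returns inputs (T - shift, D) and, as targets, the same features
    shifted `shift` frames ahead. For denoising APC, pass noisy
    features as inputs and slice targets from the clean features.
    """
    assert shift > 0
    return features[:-shift], features[shift:]

def apc_l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error, the reconstruction loss used by APC."""
    return float(np.abs(pred - target).mean())

# Toy example: 10 frames of 4-dim features, predict 3 frames ahead.
feats = np.arange(40, dtype=np.float64).reshape(10, 4)
x, y = apc_targets(feats, shift=3)
assert x.shape == (7, 4) and y.shape == (7, 4)
# A perfect predictor outputs the targets exactly, giving zero loss.
assert apc_l1_loss(y, y) == 0.0
```

After pretraining with this objective, the encoder is fine-tuned on the labelled personalized-VAD task, as the abstract describes.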
