Diff-SV: A Unified Hierarchical Framework for Noise-Robust Speaker Verification Using Score-Based Diffusion Probabilistic Models (2309.08320v2)
Abstract: Background noise considerably reduces the accuracy and reliability of speaker verification (SV) systems. These challenges can be addressed using a speech enhancement system as a front-end module. Recently, diffusion probabilistic models (DPMs) have exhibited remarkable noise-compensation capabilities in the speech enhancement domain. Building on this success, we propose Diff-SV, a noise-robust SV framework that leverages a DPM. Diff-SV unifies a DPM-based speech enhancement system with a speaker embedding extractor and yields a discriminative, noise-tolerant speaker representation through a hierarchical structure. The proposed model was evaluated under both in-domain and out-of-domain noisy conditions using the VoxCeleb1 test set, an external noise source, and the VOiCES corpus. The experimental results demonstrate that Diff-SV achieves state-of-the-art performance, outperforming recently proposed noise-robust SV systems.
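As a rough illustration of the generic arrangement the abstract starts from (a speech enhancement front-end followed by a speaker embedding extractor), the sketch below scores one verification trial. It is not the Diff-SV hierarchical structure itself. The names `identity_enhance`, `toy_embed`, and `verify` are hypothetical placeholders, and cosine scoring against a fixed threshold is an assumption rather than something the abstract specifies.

```python
import math
from typing import Callable, List, Sequence, Tuple

Wave = Sequence[float]
Embedding = List[float]


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identity_enhance(wave: Wave) -> List[float]:
    """Hypothetical stand-in for an enhancement front-end; returns the input unchanged."""
    return list(wave)


def toy_embed(wave: Wave) -> Embedding:
    """Hypothetical stand-in for an embedding extractor: energy of each half of the signal."""
    half = len(wave) // 2
    return [sum(x * x for x in wave[:half]), sum(x * x for x in wave[half:])]


def verify(
    enroll: Wave,
    test: Wave,
    enhance: Callable[[Wave], List[float]],
    embed: Callable[[Wave], Embedding],
    threshold: float,
) -> Tuple[float, bool]:
    """Enhance both utterances, embed them, and accept if the score reaches the threshold."""
    score = cosine_score(embed(enhance(enroll)), embed(enhance(test)))
    return score, score >= threshold


if __name__ == "__main__":
    # Toy signals only; a real system would pass waveforms or spectral features.
    print(verify([1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], identity_enhance, toy_embed, 0.5))
    print(verify([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], identity_enhance, toy_embed, 0.5))
```

In a real system, `enhance` and `embed` would be trained networks; the abstract's point is that Diff-SV unifies the two rather than simply cascading them as this sketch does.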
- Ju-ho Kim
- Jungwoo Heo
- Hyun-seo Shin
- Chan-yeong Lim
- Ha-Jin Yu