
Gradient weighting for speaker verification in extremely low Signal-to-Noise Ratio (2401.02626v1)

Published 5 Jan 2024 in cs.SD and eess.AS

Abstract: Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratios (SNRs) below 0 dB. Suppressing noise without introducing unwanted artifacts is difficult, and those artifacts in turn degrade speaker verification. We propose a mechanism called Gradient Weighting (Grad-W) that dynamically identifies and reduces artifact noise during prediction. The mechanism exploits the property that gradients indicate which parts of the input the model attends to. Specifically, when the speaker network focuses on a region in the denoised utterance but not on the clean counterpart, we treat that region as artifact noise and assign it a higher weight when optimizing the enhancement model. We validate the approach by training an enhancement model and evaluating the enhanced utterances on speaker verification. The experimental results show that our approach effectively reduces artifact noise and improves speaker verification across various SNR levels.
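The gradient-comparison idea in the abstract lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering, not the paper's implementation: `speaker_net` is an assumed pretrained speaker-embedding network, the saliency proxy (gradient magnitude of the embedding norm) and the weight normalization are illustrative assumptions, and the weighted L1 objective stands in for whatever enhancement loss the paper actually uses.

```python
import torch
import torch.nn.functional as F

def gradient_weights(speaker_net, denoised, clean, eps=1e-8):
    """Up-weight regions the speaker network attends to in the denoised
    utterance but not in the clean one (treated as artifact noise)."""
    def saliency(x):
        x = x.detach().clone().requires_grad_(True)
        emb = speaker_net(x)               # speaker embedding, shape (B, D)
        emb.norm(dim=-1).sum().backward()  # scalar proxy for "attention"
        return x.grad.abs()                # gradient magnitude per input region

    sal_denoised = saliency(denoised)
    sal_clean = saliency(clean)
    # Saliency present in the denoised signal but absent from the clean
    # counterpart is attributed to enhancement artifacts.
    excess = F.relu(sal_denoised - sal_clean)
    # Map to per-sample weights in [1, 2]; this normalization is an
    # assumption, not the paper's formula.
    weights = 1.0 + excess / (excess.amax(dim=-1, keepdim=True) + eps)
    return weights.detach()

def weighted_enhancement_loss(denoised, clean, weights):
    # Weighted L1 reconstruction; the paper's exact objective may differ.
    return (weights * (denoised - clean).abs()).mean()
```

In training, the weights would be recomputed per batch from the current denoised outputs and applied as constants, so gradients flow into the enhancement model only through the reconstruction term.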

