LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification (2211.00825v2)

Published 2 Nov 2022 in eess.AS, cs.LG, and cs.SD

Abstract: The security of automatic speaker verification (ASV) is seriously threatened by recently emerged adversarial attacks, and although several countermeasures have been proposed to alleviate the threat, many defense approaches both require prior knowledge of the attacker and offer weak interpretability. To address this issue, we propose an attacker-independent and interpretable method, named the learnable mask detector (LMD), to separate adversarial examples from genuine ones. It uses score variation as an indicator to detect adversarial examples, where the score variation is the absolute discrepancy between the ASV score of an original audio recording and that of a transformed version synthesized from its masked complex spectrogram. A core component of the detector is a neural network that generates the masked spectrogram. The network needs only genuine examples for training, which makes the approach attacker-independent. Its interpretability lies in the fact that the network is trained to minimize the score variation of the targeted ASV system while maximizing the number of masked spectrogram bins of the genuine training examples. The method rests on the observation that masking out the vast majority of spectrogram bins, which carry little speaker information, inevitably introduces a large score variation for an adversarial example but only a small one for a genuine example. Experimental results with 12 attackers and two representative ASV systems show that the proposed method outperforms five state-of-the-art baselines. The extensive experimental results can also serve as a benchmark for detection-based ASV defenses.
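To make the detection criterion concrete, here is a minimal Python sketch of the score-variation test the abstract describes. All names (`asv_score`, `mask_net`, `stft`, `istft`) and the threshold are illustrative assumptions, not the authors' implementation; the learned mask network is treated as a given callable.

```python
import numpy as np

def score_variation(asv_score, mask_net, stft, istft, enroll, test_wav):
    """Score-variation indicator for one trial (illustrative sketch).

    asv_score(enroll, wav) -> float similarity score (hypothetical callable)
    mask_net(mag_spec)     -> soft mask in [0, 1] (the learned mask network)
    stft / istft           -> complex spectrogram analysis / synthesis
    """
    spec = stft(test_wav)              # complex spectrogram of the test audio
    mask = mask_net(np.abs(spec))      # learned mask; most bins driven to zero
    masked_wav = istft(mask * spec)    # resynthesize audio from the masked bins
    s_orig = asv_score(enroll, test_wav)
    s_masked = asv_score(enroll, masked_wav)
    return abs(s_orig - s_masked)      # large value suggests an adversarial input

def is_adversarial(delta, tau):
    # tau is a decision threshold tuned on genuine trials only, which keeps
    # the detector attacker-independent; its value is an assumption here.
    return delta > tau
```

Under these assumptions, the training objective sketched in the abstract would amount to minimizing this score variation on genuine data while pushing as many mask entries as possible toward zero, e.g. via a sparsity penalty on the mask.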
