
Enrollment-stage Backdoor Attacks on Speaker Recognition Systems via Adversarial Ultrasound (2306.16022v2)

Published 28 Jun 2023 in cs.SD, cs.CR, and eess.AS

Abstract: Automatic Speaker Recognition Systems (SRSs) have been widely used in voice applications for personal identification and access control. A typical SRS consists of three stages, i.e., training, enrollment, and recognition. Previous work has revealed that SRSs can be bypassed by backdoor attacks at the training stage or by adversarial example attacks at the recognition stage. In this paper, we propose Tuner, a new type of backdoor attack against the enrollment stage of SRS via adversarial ultrasound modulation, which is inaudible, synchronization-free, content-independent, and black-box. Our key idea is to first inject the backdoor into the SRS with modulated ultrasound when a legitimate user initiates the enrollment, and afterward, the polluted SRS will grant access to both the legitimate user and the adversary with high confidence. Our attack faces a major challenge of unpredictable user articulation at the enrollment stage. To overcome this challenge, we generate the ultrasonic backdoor by augmenting the optimization process with random speech content, vocalizing time, and volume of the user. Furthermore, to achieve real-world robustness, we improve the ultrasonic signal over traditional methods using sparse frequency points, pre-compensation, and single-sideband (SSB) modulation. We extensively evaluate Tuner on two common datasets and seven representative SRS models, as well as its robustness against seven kinds of defenses. Results show that our attack can successfully bypass speaker recognition systems while remaining effective across various speakers, speech content, etc. To mitigate this newly discovered threat, we also discuss potential countermeasures, limitations, and future work.
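The abstract names single-sideband (SSB) modulation as one ingredient for carrying the backdoor signal on an inaudible ultrasonic carrier. The sketch below is not the paper's attack pipeline; it is a minimal, hypothetical illustration of upper-sideband SSB modulation via the Hilbert transform, with assumed parameters (a 25 kHz carrier `fc`, a 96 kHz sample rate `fs`, and a 1 kHz test tone as the baseband "trigger"):

```python
import numpy as np
from scipy.signal import hilbert

def ssb_modulate(message, fs, fc):
    """Upper-sideband SSB modulation of `message` onto a carrier at fc Hz.

    The analytic signal (message + j * Hilbert(message)) has only positive
    frequencies; shifting it up by fc and taking the real part keeps a
    single sideband above the carrier, halving bandwidth vs. AM.
    """
    t = np.arange(len(message)) / fs
    analytic = hilbert(message)
    return np.real(analytic * np.exp(2j * np.pi * fc * t))

# Illustrative parameters (assumptions, not the paper's values)
fs = 96_000   # sample rate high enough to represent the ultrasonic band
fc = 25_000   # carrier above the ~20 kHz audible limit
t = np.arange(0, 0.1, 1 / fs)
tone = np.sin(2 * np.pi * 1_000 * t)   # stand-in for the baseband trigger
ultra = ssb_modulate(tone, fs, fc)

# Spectral check: all energy should sit at fc + 1 kHz (upper sideband only)
spec = np.abs(np.fft.rfft(ultra))
freqs = np.fft.rfftfreq(len(ultra), 1 / fs)
peak = freqs[np.argmax(spec)]
```

Because the entire emitted signal lies above the audible band, a human hears nothing, while microphone nonlinearity demodulates it back into the audible range — the same principle behind prior inaudible-command attacks such as DolphinAttack.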

Authors (6)
  1. Xinfeng Li (38 papers)
  2. Junning Ze (1 paper)
  3. Chen Yan (25 papers)
  4. Yushi Cheng (5 papers)
  5. Xiaoyu Ji (19 papers)
  6. Wenyuan Xu (35 papers)
Citations (5)