
Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer (2405.09470v1)

Published 15 May 2024 in cs.SD, cs.CR, cs.LG, and eess.AS

Abstract: In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security has received far more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. These attack methods mostly add noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples from Text-to-Speech (TTS) audio. However, style modifications driven purely by optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of the Style Transfer Attack (STA), which applies style transfer and adversarial perturbation in sequence. Then, as an improvement, we propose an iterative Style Code Attack (SCA) that maintains audio quality. Experimental results show that our method meets the need for user-customized styles and achieves an attack success rate of 82%, while preserving sound naturalness, as confirmed by our user study.
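The $\ell_p$-constrained perturbation baseline that the abstract contrasts against can be illustrated with a generic PGD-style loop on a raw waveform. This is a minimal sketch of that prior-work approach, not the paper's STA/SCA method; `grad_fn` is a hypothetical stand-in for the gradient of an attack loss (in a real attack, obtained by backpropagating a CTC loss toward the target transcription through the ASR model).

```python
import numpy as np

def pgd_linf_audio(x, grad_fn, eps=0.01, alpha=0.002, steps=40):
    """Iteratively perturb waveform x while staying in an l_inf ball of radius eps.

    grad_fn(x_adv) must return the gradient of the attack loss with
    respect to the waveform; here it is an assumption standing in for
    backprop through an ASR model.
    """
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv - alpha * np.sign(g)        # signed-gradient descent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the l_inf budget
        x_adv = np.clip(x_adv, -1.0, 1.0)         # keep a valid audio range
    return x_adv

# Toy demo with a quadratic surrogate loss pulling the waveform toward silence.
rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=16000).astype(np.float32)  # 1 s at 16 kHz
grad_fn = lambda xa: xa - np.zeros_like(xa)  # gradient of 0.5 * ||xa - 0||^2

x_adv = pgd_linf_audio(x, grad_fn, eps=0.01)
print(float(np.max(np.abs(x_adv - x))) <= 0.01 + 1e-6)  # perturbation stays in budget
```

The projection step is exactly what leaves the "artifacts of manual modification" the abstract refers to: the perturbation is bounded in amplitude but spectrally unnatural, which is the limitation style-based attacks aim to avoid.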

Authors (8)
  1. Weifei Jin
  2. Yuxin Cao
  3. Junjie Su
  4. Qi Shen
  5. Kai Ye
  6. Derui Wang
  7. Jie Hao
  8. Ziyao Liu