
Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models (2405.06134v2)

Published 9 May 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate 'special tokens' in their vocabulary, such as $\texttt{<|endoftext|>}$, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall, this work demonstrates the vulnerability of Whisper models to 'muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example, the attack can be used to bypass speech moderation systems, or conversely it can be used to protect private speech data.

Authors (5)
  1. Vyas Raina (18 papers)
  2. Rao Ma (22 papers)
  3. Charles McGhee (2 papers)
  4. Kate Knill (11 papers)
  5. Mark Gales (52 papers)

Summary

  • The paper demonstrates that prepending a universal 0.64-second adversarial audio segment reliably mutes Whisper models, succeeding on over 97% of speech samples.
  • The paper shows that Whisper’s special <|endoftext|> token can be exploited acoustically, exposing a key vulnerability in automatic speech recognition.
  • The paper highlights dual implications: the attack poses risks, such as bypassing speech moderation, and offers benefits, such as protecting private speech data.

Exploring Vulnerabilities in Whisper Through Universal Adversarial Audio Attacks

Overview of Whisper and ASR Vulnerabilities

Whisper models, a family of speech foundation models trained for transcription and related audio tasks, guide their output generation through 'special tokens'. While these tokens exist to control the model's behavior, they also open a door for adversarial attacks that exploit them.

This paper exposes a critical security flaw in Whisper models by targeting one of these special tokens, <|endoftext|> (EOT). The token, intended to mark the end of transcription, can ironically be exploited to force the model to silence itself, ignoring the actual audio content it is tasked with transcribing.
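As a concrete illustration, the sketch below shows how the EOT token can be inspected; it assumes the open-source openai-whisper package, and the model size here is arbitrary:

```python
import whisper

model = whisper.load_model("tiny")
tok = whisper.tokenizer.get_tokenizer(model.is_multilingual)
print(tok.eot)                # integer id of the EOT token
print(tok.decode([tok.eot]))  # "<|endoftext|>"
```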

The Core Exploit: Muting Whisper

The proposed attack learns a short, universal 0.64-second adversarial audio segment that acts as an acoustic realization of the EOT token. When this segment is prepended to any speech input, Whisper is reliably 'muted': confronted with the altered audio, it produces little to no transcription of the speech content that follows the adversarial segment.
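A minimal sketch of how such a prefix could be learned follows. It assumes the open-source openai-whisper package and plain gradient descent on the raw prefix waveform; the `speech_loader` batch loader, the learning rate, and the amplitude bound are illustrative assumptions, not the paper's exact recipe:

```python
# Learn a universal 0.64 s audio prefix that pushes Whisper to emit
# <|endoftext|> as its first decoded token (illustrative sketch).
import torch
import whisper

model = whisper.load_model("tiny")          # target model (size is arbitrary here)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)                 # only the audio prefix is trained

tok = whisper.tokenizer.get_tokenizer(model.is_multilingual)
sot = torch.tensor(list(tok.sot_sequence))  # decoder start-of-transcript prompt

SR = 16_000
prefix = torch.zeros(int(0.64 * SR), requires_grad=True)  # 0.64 s adversarial audio
opt = torch.optim.Adam([prefix], lr=1e-3)

for speech in speech_loader:                # hypothetical loader of (B, T) float32 waveforms
    audio = torch.cat([prefix.unsqueeze(0).expand(speech.size(0), -1), speech], dim=1)
    audio = whisper.pad_or_trim(audio, whisper.audio.N_SAMPLES)  # Whisper's 30 s window
    mel = whisper.log_mel_spectrogram(audio)  # torch-based, hence differentiable
    enc = model.encoder(mel)
    logits = model.decoder(sot.unsqueeze(0).expand(mel.size(0), -1), enc)
    # maximise p(<|endoftext|> | adversarial prefix + speech, SOT prompt)
    loss = -logits[:, -1].log_softmax(-1)[:, tok.eot].mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        prefix.clamp_(-0.02, 0.02)          # assumed amplitude bound on the prefix
```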

Empirically, the attack is highly effective: the same segment mutes the target Whisper model on more than 97% of unseen speech samples.
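A simple way to estimate this success rate, as a hedged sketch reusing `model` and `prefix` from the sketch above and counting a sample as muted when the transcript is empty:

```python
import torch

def mute_rate(model, prefix, samples):
    """Fraction of raw 16 kHz waveforms (1-D float32 tensors) the prefix mutes."""
    muted = 0
    for speech in samples:
        audio = torch.cat([prefix.detach(), speech]).numpy()
        text = model.transcribe(audio)["text"].strip()
        muted += (text == "")               # empty transcript => successfully muted
    return muted / len(samples)
```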

Implications of Muting Adversarial Attacks

Risks and Rewards

The implications of such muting attacks are twofold:

  • Risks: An attacker could bypass moderation systems designed to flag harmful audio content, allowing misinformation or harmful speech to spread unchecked.
  • Rewards: Conversely, the method can safeguard privacy. When confidential speech must be transmitted over insecure channels, preemptively 'muting' it could prevent malicious entities from transcribing the communication.

Transference and Adaptability

  • Universal Capability: The same adversarial segment can be prepended to any speech sample, confirming its universal applicability.
  • Cross-Domain Efficacy: The segment transfers its muting ability across different datasets and is relatively robust to changes in data domain.
  • Task Versatility: Beyond transcription, the segment can mute models performing speech translation, although its effectiveness degrades somewhat depending on the task and the language's distance from English (see the snippet after this list).
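Checking the translation case is straightforward with the openai-whisper API, since `transcribe()` accepts a `task` option; this sketch reuses `model` and a prefixed `audio` array from the earlier sketches:

```python
# Does the same prefix also mute speech translation?
result = model.transcribe(audio, task="translate")
print(repr(result["text"]))  # empty (or near-empty) when the attack succeeds
```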

Future Prospects and Protective Measures

The effectiveness of this adversarial technique raises questions about future developments in AI security, particularly for ASR models. Research must advance not only in detecting such adversarial attacks but also in building models robust to these exploits.

Furthermore, the community must consider the ethical ramifications of adversarial research. While this work offers a potent methodology for privacy protection, it equally provides tools for malicious misuse if released unchecked. It underscores an essential principle of AI development: advances in capability must be paralleled by advances in ethical guidelines and security measures.

In Brief

The discovery and demonstration of a universal adversarial audio segment that mutes Whisper models underscores a critical vulnerability in speech processing systems. It reinforces the need for ongoing vigilance and adaptive security measures in AI models handling sensitive information. More broadly, it advocates for a balanced view in which technological capability is matched by responsibility and readiness against adversarial threats.
