Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models (2405.06134v2)

Published 9 May 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate 'special tokens' in their vocabulary, such as $\texttt{<|endoftext|>}$, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively 'muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall this work demonstrates the vulnerability of Whisper models to 'muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely the attack can also be used to protect private speech data.

Authors

Vyas Raina
Rao Ma
Charles McGhee
Kate Knill
Mark Gales

Summary

The paper demonstrates that prepending a single 0.64-second adversarial audio segment mutes a target Whisper model for over 97% of speech samples.
The paper reveals that Whisper’s special <|endoftext|> token can be exploited, exposing a key vulnerability in automatic speech recognition.
The paper highlights dual implications, suggesting risks in security loopholes and benefits for protecting confidential communications.

Exploring Vulnerabilities in Whisper Through Universal Adversarial Audio Attacks

Overview of Whisper and ASR Vulnerabilities

Whisper is a speech foundation model that steers its output with 'special tokens' in its vocabulary. These tokens exist to control the generation process, but they also give an attacker a handle for manipulating the model through adversarial attacks.

The paper exposes a security flaw in Whisper models that relies on one of these special tokens, <|endoftext|> (EOT). This token is intended to mark the end of a transcription, yet it can be exploited to make the model silence itself, ignoring the audio content it is supposed to transcribe.

The Core Exploit: Muting Whisper

The proposed attack learns a short, 0.64-second adversarial audio segment that acts as an acoustic realization of the EOT token. By prepending this segment to any speech input, the researchers were able to reliably 'mute' the model. This means that Whisper, when given the altered audio, produces little to no transcription of the speech that follows the adversarial segment.
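As the abstract states, the adversarial segment is placed in front of the speech signal. The sketch below shows only that concatenation step, not how the segment is learned. It assumes mono audio at 16 kHz (the sample rate Whisper operates on); the function name `prepend_segment`, the random placeholder segment, and its amplitude range are illustrative and do not come from the paper.

```python
import numpy as np

SAMPLE_RATE = 16_000          # Whisper operates on 16 kHz audio
SEGMENT_SECONDS = 0.64        # segment length reported in the paper
SEGMENT_SAMPLES = round(SAMPLE_RATE * SEGMENT_SECONDS)  # 10,240 samples


def prepend_segment(segment: np.ndarray, speech: np.ndarray) -> np.ndarray:
    """Return the speech waveform with the adversarial segment placed in front."""
    if segment.ndim != 1 or speech.ndim != 1:
        raise ValueError("expected mono 1-D waveforms")
    return np.concatenate([segment, speech]).astype(np.float32)


# Placeholder data (assumption): random noise stands in for the learned
# segment, and silence stands in for a real 3-second utterance.
rng = np.random.default_rng(0)
segment = rng.uniform(-0.02, 0.02, SEGMENT_SAMPLES).astype(np.float32)
speech = np.zeros(3 * SAMPLE_RATE, dtype=np.float32)

attacked = prepend_segment(segment, speech)
print(attacked.shape[0] / SAMPLE_RATE)  # total duration in seconds
```

Because the same segment is reused for every utterance, the cost at attack time is a single array concatenation per input.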

Empirically, the attack muted the target Whisper model on more than 97% of unseen speech samples.
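A success rate of this kind can be computed as the fraction of attacked utterances whose transcript comes back empty. The sketch below assumes that definition; the paper's exact metric may differ, and the function name and the sample transcripts are hypothetical.

```python
def muting_success_rate(transcripts: list[str]) -> float:
    """Fraction of transcripts that are empty after stripping whitespace."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    muted = sum(1 for t in transcripts if not t.strip())
    return muted / len(transcripts)


# Hypothetical model outputs for four attacked utterances.
outputs = ["", "  ", "hello there", ""]
rate = muting_success_rate(outputs)
print(f"{rate:.0%} of samples muted")
```

Counting whitespace-only output as muted is a design choice made here for illustration; a stricter evaluation could also report the average length of whatever text the model still produces.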

Implications of Muting Adversarial Attacks

Risks and Rewards

The implications of such muting attacks are twofold:

Risks: An attacker could use the segment to bypass moderation systems that rely on ASR to flag harmful content, allowing misinformation or harmful speech to spread unchecked.
Rewards: A more benign use is safeguarding privacy. When confidential speech must be sent over an insecure channel, prepending the segment beforehand could prevent malicious parties from transcribing it automatically.

Transference and Adaptability

Universal Capability: The adversarial segment can be prepended to any speech content indiscriminately, confirming its universal applicability.
Cross-Domain Efficacy: The muting effect often transfers to datasets other than the one the segment was learned on, and it is relatively robust to changes in data domain.
Task Versatility: Beyond mere transcription, this attack segment can mute systems engaged in speech translation tasks, although its effectiveness is slightly tempered depending on the task and language distance from English.
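One way to read the universality property is as a single optimization over a set of training utterances. The formulation below is a sketch under assumptions, not a quotation of the paper's objective: $\tilde{x}$ denotes the adversarial segment, $x^{(n)}$ the $n$-th training utterance, $\oplus$ concatenation in time, $y_1$ the first token the model generates, and $\epsilon$ an assumed amplitude bound that keeps the segment quiet.

```latex
\hat{\tilde{x}} \;=\; \arg\max_{\tilde{x}} \;
  \sum_{n=1}^{N} \log P_{\theta}\!\left( y_1 = \texttt{<|endoftext|>} \;\middle|\; \tilde{x} \oplus x^{(n)} \right)
\quad \text{subject to} \quad \lVert \tilde{x} \rVert_{\infty} \le \epsilon
```

Because the sum runs over many utterances while $\tilde{x}$ is shared, the learned segment is not tailored to any one input, which is what lets it be prepended to unseen speech.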

Future Prospects and Protective Measures

The effectiveness of this adversarial technique raises questions for future work on AI security, particularly for ASR models. Research is needed both on detecting such adversarial attacks and on building models that are robust to them.

Furthermore, the community must consider the ethical ramifications of adversarial research. While this work offers a method that is useful for privacy protection, it equally provides a tool for misuse if left unchecked in the public domain. It underlines an essential principle in AI development: advancements in technology must be paralleled by advancements in ethical guidelines and security measures.

In Brief

The discovery and demonstration of using a universal adversarial audio segment to mute Whisper models significantly underscores a critical vulnerability in speech processing systems. It reinforces a need for ongoing vigilance and adaptive security measures in AI models dealing with sensitive information. On a broader scale, it advocates for a balanced view where technological prowess is matched with responsibility and readiness against adversarial threats.
