Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT
The paper "Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework" by Ma et al. meticulously dissects the susceptibility of SpeechGPT, a multimodal LLM (MLLM), to adversarial audio attacks. The authors have innovatively probed the vulnerabilities in the voice modality, a domain that has traditionally remained underexplored, despite being integral to the enhanced human-computer interaction capabilities of these models.
The paper starts from the proposition that voice-enabled MLLMs, while enabling more natural, human-like interaction, introduce security threats that do not arise in text-only settings. SpeechGPT serves as the testbed for a white-box adversarial attack framework focused on speech input. The researchers introduce a token-level attack strategy that manipulates the model's speech tokenization mechanism, achieving an attack success rate of up to 89% across a range of restricted tasks and calling into question the robustness of current alignment and safety measures in MLLMs against voice-based threats.
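To make the white-box setting concrete, the sketch below shows the kind of objective such an attack can optimize: score a candidate speech-token prompt by the loss the model assigns to a target harmful continuation. The names (`model`, `prompt_ids`, `target_ids`) and the Hugging-Face-style causal-LM interface are assumptions for illustration, not SpeechGPT's actual API.

```python
import torch
import torch.nn.functional as F

def prompt_loss(model, prompt_ids, target_ids):
    """Score a candidate speech-token prompt under a white-box model.

    Lower loss means the model is more likely to produce the target
    (harmful) continuation when given this prompt. Assumes a causal LM
    whose vocabulary includes discrete speech units and which returns
    `.logits` of shape (batch, seq_len, vocab), as in Hugging Face.
    """
    input_ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so this slice covers
    # exactly the target positions.
    tgt_logits = logits[0, prompt_ids.numel() - 1 : -1, :]
    return F.cross_entropy(tgt_logits, target_ids)
```

An optimized token sequence would then be converted back into an audio prompt, in line with the token-to-audio step the paper describes among its contributions below.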
Key Contributions and Results
The paper makes several notable contributions to the field:
- Development of a Novel White-Box Attack: The paper presents an automated token-level adversarial attack that exploits knowledge of SpeechGPT's speech tokenization. The authors convert adversarial token sequences into audio prompts, circumventing alignment safeguards without relying on handcrafted prompts or semantically meaningful human speech.
- Evaluation and Success Rate: Evaluated against six categories derived from OpenAI's usage policy, including illegal activity and hate speech, the proposed method elicits harmful outputs with up to 89% success, notably surpassing traditional black-box and text-to-speech adversarial baselines.
- Exploration of a Token-Level Adversarial Strategy: The research advances understanding of token-based adversarial tactics for audio inputs. Using a greedy search over speech token sequences, the paper shows how candidate tokens can be selected step by step to steer the model toward harmful outputs (a simplified sketch of such a loop follows this list).
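As a rough illustration of the greedy strategy described above, the loop below appends one speech token at a time, keeping whichever candidate most reduces the loss on the target response. It reuses the hypothetical `prompt_loss` from the earlier sketch; the candidate-sampling details are assumptions, not the authors' exact procedure.

```python
import torch

def greedy_token_attack(model, target_ids, speech_vocab, max_len=64, sample_k=256):
    """Greedily grow an adversarial speech-token prompt.

    `speech_vocab` is a 1-D LongTensor of discrete speech-unit ids.
    At each step, a random subset of candidate tokens is tried at the
    next position and the one that most lowers `prompt_loss` (defined
    in the earlier sketch) is kept. Simplified for illustration.
    """
    prompt = torch.empty(0, dtype=torch.long)
    best_loss = float("inf")
    for _ in range(max_len):
        candidates = speech_vocab[torch.randperm(len(speech_vocab))[:sample_k]]
        step_loss, step_tok = best_loss, None
        for tok in candidates:
            trial = torch.cat([prompt, tok.view(1)])
            loss = prompt_loss(model, trial, target_ids).item()
            if loss < step_loss:
                step_loss, step_tok = loss, tok
        if step_tok is None:  # no candidate improved on the current prompt
            break
        prompt = torch.cat([prompt, step_tok.view(1)])
        best_loss = step_loss
    return prompt
```

In the attack pipeline the paper describes, the resulting token sequence is synthesized into audio and delivered to the model as an ordinary spoken prompt.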
Implications
In a broader context, the paper underscores critical security considerations as MLLMs become more entwined with mainstream technologies such as smartphones and AI-powered virtual assistants. The paper highlights that current text-based safety protocols are insufficient when applied to audio inputs. This vulnerability calls for a re-evaluation of adversarial defense techniques, especially as voice interfaces increasingly dominate user-device interactions.
From a theoretical perspective, the research enriches the discourse on MLLM security. The introduction of a white-box adversarial framework shifts the focus from conventional text manipulation to audio, urging AI safety research toward defense strategies that cover all input modalities comprehensively.
Future Directions
The paper opens several avenues for future work on defending against adversarial audio attacks, including audio denoising techniques that operate at the token level and adversarial training designed specifically for multimodal inputs (an illustrative token-level defense is sketched below). Another line of inquiry suggested by the authors is cross-model transferability, i.e., whether adversarial audio crafted against SpeechGPT also compromises other speech-enabled models.
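As a purely illustrative sketch of the token-level denoising direction the authors mention (not a method from the paper), one could randomize a fraction of the incoming speech units before they reach the model and aggregate responses over several noisy variants; all names here are hypothetical.

```python
import torch

def randomized_token_smoothing(speech_tokens, speech_vocab, p=0.1, n_samples=8, seed=0):
    """Produce noisy variants of an incoming speech-token sequence.

    Each variant has a fraction `p` of its tokens resampled uniformly
    from the speech-unit vocabulary; a downstream policy could refuse if
    any variant triggers the safety filter. Hypothetical defense sketch.
    """
    g = torch.Generator().manual_seed(seed)
    variants = []
    for _ in range(n_samples):
        tokens = speech_tokens.clone()
        mask = torch.rand(tokens.shape, generator=g) < p
        noise = speech_vocab[torch.randint(len(speech_vocab), tokens.shape, generator=g)]
        tokens[mask] = noise[mask]
        variants.append(tokens)
    return variants
```

Whether such smoothing degrades benign speech understanding too severely is precisely the kind of trade-off this proposed line of future work would need to evaluate.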
Overall, the paper represents a significant stride toward identifying and addressing vulnerabilities intrinsic to the audio modality of MLLMs, and it serves as a call for adaptive, modality-aware security frameworks in AI systems. As this domain evolves, continued study of adversarial threats and mitigations will play a crucial role in shaping the future of AI-driven communication technologies.