Voice Jailbreak Attacks Against GPT-4o (2405.19103v1)

Published 29 May 2024 in cs.CR and cs.LG

Abstract: Recently, the concept of artificial assistants has evolved from science fiction into real-world applications. GPT-4o, the newest multimodal LLM (MLLM) across audio, vision, and text, has further blurred the line between fiction and reality by enabling more natural human-computer interactions. However, the advent of GPT-4o's voice mode may also introduce a new attack surface. In this paper, we present the first systematic measurement of jailbreak attacks against the voice mode of GPT-4o. We show that GPT-4o demonstrates good resistance to forbidden questions and text jailbreak prompts when directly transferring them to voice mode. This resistance is primarily due to GPT-4o's internal safeguards and the difficulty of adapting text jailbreak prompts to voice mode. Inspired by GPT-4o's human-like behaviors, we propose VoiceJailbreak, a novel voice jailbreak attack that humanizes GPT-4o and attempts to persuade it through fictional storytelling (setting, character, and plot). VoiceJailbreak is capable of generating simple, audible, yet effective jailbreak prompts, which significantly increases the average attack success rate (ASR) from 0.033 to 0.778 in six forbidden scenarios. We also conduct extensive experiments to explore the impacts of interaction steps, key elements of fictional writing, and different languages on VoiceJailbreak's effectiveness and further enhance the attack performance with advanced fictional writing techniques. We hope our study can assist the research community in building more secure and well-regulated MLLMs.

References (50)

Citations (3)

View on Semantic Scholar

Summary

The paper introduces VoiceJailbreak, a narrative-based method that boosts the attack success rate from 0.033 to 0.778.
The study adapts text-based prompts to the voice modality, revealing significant resistance due to GPT-4o's built-in safety protocols.
The findings stress the urgent need for enhanced security measures in multimodal systems, particularly for robust anomaly detection in voice interfaces.

Voice Jailbreak Attacks Against GPT-4o

The paper "Voice Jailbreak Attacks Against GPT-4o" presents a novel investigative paper concerning the vulnerabilities of multimodal LLMs (MLLMs), specifically targeting the voice modality of GPT-4o. GPT-4o, an advanced multimodal model that processes audio, vision, and text, represents a significant step forward in achieving more natural human-computer interactions. However, this advancement simultaneously introduces new potential attack vectors. The paper meticulously explores the susceptibility of GPT-4o's voice functionalities to jailbreak attacks.

Key Findings and Methodology

The authors systematically measure the efficacy of traditional text-based jailbreak prompts adapted to audio forms when employed against the voice-mode of GPT-4o. They discover that the model exhibits considerable resistance to such direct adaptations due to its intrinsic safety alignments and the maladaptation challenges of converting text prompts to voice inputs. Notably, typical text jailbreak prompts resulted in low attack success rates, averaging between 0.033 to 0.233 across various forbidden scenarios, including illegal activities and hate speech.

Recognizing the need for an innovative approach, the researchers introduce "VoiceJailbreak," a strategy leveraging fictional storytelling elements to craft persuasive voice prompts. By structuring prompts with settings, characters, and plots reminiscent of immersive storytelling, this method significantly enhances the attack success rate. Specifically, VoiceJailbreak improved the average attack success rate from 0.033 to an impressive 0.778, illustrating a far more effective breach into restricted content areas.

Experimentation and Results

In evaluating VoiceJailbreak, the authors explore various fictional writing components and advanced literary devices to optimize attack prompts. They integrate plot elements and narrative perspectives to bypass safeguards, achieving high compliance across forbidden queries. Extended experiments analyze the impacts of interaction steps, language variations, and advanced narrative techniques like foreshadowing and point of view. The paper finds that a two-step interactive approach further amplifies jailbreak efficacy.

Implications and Future Directions

This research underscores the inherent security challenges in deploying MLLMs like GPT-4o in contexts where audio input capabilities are critical. Given the effective increase in success rates using VoiceJailbreak, the paper highlights an urgent need for developers to revisit and bolster security protocols in voice-interactive systems. Potential defenses could involve developing more robust anomaly detection systems specifically tuned for multimodal inputs.

Future research may extend towards exploring other untapped modalities within MLLMs that could introduce vulnerabilities. Additionally, the implications for ethical considerations on using advanced storytelling tactics to induce policy violations in AI are substantial. As multimodal interfaces become pervasive, understanding and securing these systems against sophisticated attack strategies will be crucial in maintaining safe and reliable AI interactions.

PDF Markdown

Related Papers

Tweets

https://twitter.com/xyshen365/status/1796499231165014144

https://twitter.com/ADarmouni/status/1796677391269454251

https://twitter.com/gm8xx8/status/1796005589291524157

https://twitter.com/arxivsanitybot/status/1796895597187739783

https://twitter.com/syntrop2/status/1796209698623967399

https://twitter.com/FSFG/status/1796095905063375001

YouTube

Show All Videos