Voice Jailbreak Attacks Against GPT-4o (2405.19103v1)
Abstract: Recently, the concept of artificial assistants has evolved from science fiction into real-world applications. GPT-4o, the newest multimodal LLM (MLLM) across audio, vision, and text, has further blurred the line between fiction and reality by enabling more natural human-computer interactions. However, the advent of GPT-4o's voice mode may also introduce a new attack surface. In this paper, we present the first systematic measurement of jailbreak attacks against the voice mode of GPT-4o. We show that GPT-4o demonstrates good resistance to forbidden questions and text jailbreak prompts when directly transferring them to voice mode. This resistance is primarily due to GPT-4o's internal safeguards and the difficulty of adapting text jailbreak prompts to voice mode. Inspired by GPT-4o's human-like behaviors, we propose VoiceJailbreak, a novel voice jailbreak attack that humanizes GPT-4o and attempts to persuade it through fictional storytelling (setting, character, and plot). VoiceJailbreak is capable of generating simple, audible, yet effective jailbreak prompts, which significantly increases the average attack success rate (ASR) from 0.033 to 0.778 in six forbidden scenarios. We also conduct extensive experiments to explore the impacts of interaction steps, key elements of fictional writing, and different languages on VoiceJailbreak's effectiveness and further enhance the attack performance with advanced fictional writing techniques. We hope our study can assist the research community in building more secure and well-regulated MLLMs.
- https://platform.openai.com/docs/models/gpt-3-5-turbo.
- Apple is finalizing a deal with OpenAI to put ChatGPT on the iPhone, while talks with Google to use Gemini are ongoing. https://fortune.com/2024/05/11/apple-openai-chatgpt-iphone-ios-18-google-gemini-ai-chatbot/.
- Coleman-Liau index. https://en.wikipedia.org/wiki/Coleman-Liau_index.
- Foreshadowing. https://en.wikipedia.org/wiki/Foreshadowing.
- GPT-4o. https://openai.com/index/hello-gpt-4o/.
- GPT-4V. https://openai.com/research/gpt-4v-system-card.
- Introducing Copilot+ PCs. https://blogs.microsoft.com/blog/2024/05/20/introducing-copilot-pcs/.
- People are suddenly flocking to ChatGPT Plus on mobile - here’s why. https://www.zdnet.com/article/people-are-suddenly-flocking-to-chatgpt-plus-on-mobile-heres-why/.
- Point of View. https://en.wikipedia.org/wiki/Narration#Point_of_view.
- Red herring. https://en.wikipedia.org/wiki/Red_herring.
- Beth Hill. Plot, Setting, and Character—Fiction’s Top 3. https://theeditorsblog.net/2011/02/24/plot-setting-and-character-fictions-top-3/.
- Extracting Training Data from Large Language Models. In USENIX Security Symposium (USENIX Security), pages 2633–2650. USENIX, 2021.
- Jailbreaking Black Box Large Language Models in Twenty Queries. CoRR abs/2310.08419, 2023.
- StruQ: Defending Against Prompt Injection with Structured Queries. CoRR abs/2402.06363, 2024.
- Comprehensive Assessment of Jailbreak Attacks Against LLMs. CoRR abs/2402.05668, 2024.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2023.
- Multilingual Jailbreak Challenges in Large Language Models. CoRR abs/2310.06474, 2023.
- Fictionary. The 3 Key Story Elements In Fiction Writing. https://writingcooperative.com/the-3-key-story-elements-in-fiction-writing-74cda42ba2c3.
- FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. CoRR abs/2311.05608, 2023.
- More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models. CoRR abs/2302.12173, 2023.
- Rust Hills. Writing in General and the Short Story in Particular. Houghton Mifflin Harcourt, 1987.
- Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. CoRR abs/2310.06987, 2023.
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733, 2023.
- Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023.
- Malla: Demystifying Real-world Large Language Model Integrated Malicious Services. CoRR abs/2401.03315, 2024.
- Visual Instruction Tuning. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2023.
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. CoRR abs/2310.04451, 2023.
- InstrPrompt Injection Attacks and Defenses in LLM-Integrated Applications. CoRR abs/2310.12815, 2023.
- Analyzing Leakage of Personally Identifiable Information in Language Models. In IEEE Symposium on Security and Privacy (S&P), pages 346–363. IEEE, 2023.
- Lingjuan Lyu. A Pathway Towards Responsible AI Generated Content. In International Joint Conferences on Artifical Intelligence (IJCAI), pages 7033–7038. IJCAI, 2023.
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CoRR abs/abs/2402.04249, 2024.
- Robert McKee. Story: Substance, Structure, Style, and the Principles of Screenwriting. ReganBooks, 1997.
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. CoRR abs/2312.02119, 2023.
- Scalable Extraction of Training Data from (Production) Language Models. CoRR abs/2311.17035, 2023.
- OpenAI. Usage policies. https://openai.com/policies/usage-policies.
- OpenAI. GPT-4 Technical Report. CoRR abs/2303.08774, 2023.
- Ignore Previous Prompt: Attack Techniques For Language Models. CoRR abs/2211.09527, 2022.
- Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2023.
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. CoRR abs/2310.03684, 2023.
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. CoRR abs/2311.03348, 2023.
- Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2024.
- A StrongREJECT for Empty Jailbreaks. CoRR abs/2402.10260, 2024.
- TrustLLM: Trustworthiness in Large Language Models. CoRR abs/2401.05561, 2024.
- Jailbroken: How Does LLM Safety Training Fail? CoRR abs/2307.02483, 2023.
- Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence, 2023.
- Low-Resource Languages Jailbreak GPT-4. CoRR abs/2310.02446, 2023.
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. CoRR abs/2309.10253, 2023.
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. CoRR abs/2401.06373, 2024.
- Synthetic Lies: Understanding AI-Generated Misinformation and Evaluating Algorithmic and Human Solutions. In Annual ACM Conference on Human Factors in Computing Systems (CHI), pages 436:1–436:20. ACM, 2023.
- Universal and Transferable Adversarial Attacks on Aligned Language Models. CoRR abs/2307.15043, 2023.