Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts (2311.09127v2)

Published 15 Nov 2023 in cs.CR, cs.AI, and cs.LG

Abstract: Existing work on jailbreak Multimodal LLMs (MLLMs) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities, especially in model API. To fill the research gap, we carry out the following work: 1) We discover a system prompt leakage vulnerability in GPT-4V. Through carefully designed dialogue, we successfully extract the internal system prompts of GPT-4V. This finding indicates potential exploitable security risks in MLLMs; 2) Based on the acquired system prompts, we propose a novel MLLM jailbreaking attack method termed SASP (Self-Adversarial Attack via System Prompt). By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. Furthermore, in pursuit of better performance, we also add human modification based on GPT-4's analysis, which further improves the attack success rate to 98.7\%; 3) We evaluated the effect of modifying system prompts to defend against jailbreaking attacks. Results show that appropriately designed system prompts can significantly reduce jailbreak success rates. Overall, our work provides new insights into enhancing MLLM security, demonstrating the important role of system prompts in jailbreaking. This finding could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (31)
  1. Flamingo: a visual language model for few-shot learning.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback.
  3. Constitutional ai: Harmlessness from ai feedback.
  4. Image hijacks: Adversarial images can control generative models at runtime.
  5. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
  6. Jailbreaker: Automated jailbreak across multiple large language model chatbots.
  7. Masterkey: Automated jailbreak across multiple large language model chatbots.
  8. How robust is google’s bard to adversarial image attacks?
  9. Surav Shrestha Dustin Miller, Michael Skyba. 2023. Behind the scenes. https://github.com/spdustin/ChatGPT-AutoExpert/blob/main/System%20Prompts.md#behind-the-scenes.
  10. Pretraining language models with human preferences.
  11. Improved baselines with visual instruction tuning.
  12. Visual instruction tuning.
  13. Prompt injection attack against llm-integrated applications.
  14. Jailbreaking chatgpt via prompt engineering: An empirical study.
  15. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
  16. Self-refine: Iterative refinement with self-feedback.
  17. Is a prompt and a few samples all you need? using gpt-4 for data augmentation in low-resource classification tasks.
  18. Dera: Enhancing large language model completions with dialog-enabled resolving agents.
  19. OpenAI. 2022. Gpt models - openai api. https://platform.openai.com/docs/guides/moderation. (Accessed on 02/02/2023).
  20. OpenAI. 2023. Gpt-4v(ision) system card.
  21. Training language models to follow instructions with human feedback.
  22. Promptor: A conversational and autonomous prompt generation agent for intelligent text entry techniques. arXiv preprint arXiv:2310.08101.
  23. Autoprompt: Eliciting knowledge from language models with automatically generated prompts.
  24. VISHESH THAKUR. 2022. Celebrity face image dataset.
  25. Jailbroken: How does llm safety training fail?
  26. Large language models as optimizers.
  27. mplug-owl: Modularization empowers large language models with multimodality.
  28. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.
  29. System-level natural language feedback.
  30. Yiming Zhang and Daphne Ippolito. 2023. Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success.
  31. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv e-prints, page arXiv:2307.15043.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Yuanwei Wu (21 papers)
  2. Xiang Li (1002 papers)
  3. Yixin Liu (108 papers)
  4. Pan Zhou (220 papers)
  5. Lichao Sun (186 papers)
Citations (40)