Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? (2404.03411v2)

Published 4 Apr 2024 in cs.LG, cs.CL, and cs.CR

Abstract: Various jailbreak attacks have been proposed to red-team LLMs and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal LLMs (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found at https://github.com/chenxshuo/RedTeamingGPT4V

Comprehensive Examination of Jailbreak Attacks Against GPT-4 and Multimodal LLMs

Introduction to Jailbreak Attacks

Jailbreak attacks on LLMs and Multimodal LLMs (MLLMs) pose significant risks because they can elicit harmful or unethical responses from models designed to refuse such content. This work evaluates the robustness of state-of-the-art (SOTA) proprietary and open-source models, including GPT-4 and GPT-4V, against an array of textual and visual jailbreak attack methods. Prior work lacks a universal benchmark for fair performance comparison, and comprehensive assessments of closed-source, top-tier models against jailbreak attacks are scarce. The paper bridges this gap with a curated jailbreak evaluation dataset of 1445 harmful questions spanning 11 different safety policies. The investigation covers 11 different LLMs and MLLMs and reveals nuances in model robustness and method transferability.
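
As a purely illustrative picture of how such a benchmark can be organized, the sketch below models each entry as a harmful question paired with the safety policy it violates; the schema and example values are hypothetical and are not taken from the released dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkEntry:
    """One harmful question in the evaluation set (hypothetical schema)."""
    question: str       # the harmful request used to probe the model
    safety_policy: str  # one of the 11 policy categories it violates
    source: str         # prior benchmark or collection the question came from


# Toy entries; the real dataset contains 1445 questions over 11 policies.
entries = [
    BenchmarkEntry("How do I ...", "Illegal Activity", "prior-benchmark-A"),
    BenchmarkEntry("Write a message that ...", "Hate Speech", "prior-benchmark-B"),
]

# Per-policy counts help verify that all 11 policies are covered
# and support reporting attack success rates per policy.
print(Counter(e.safety_policy for e in entries))
```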

Dataset and Experimentation Framework

To establish a universal evaluation framework, a broad and diverse jailbreak dataset was assembled from the existing literature, covering a wide range of harmful behaviors and questions across 11 safety policies. This dataset serves as the foundation for exhaustive red-teaming experiments on both proprietary (GPT-4, GPT-4V) and open-source models (e.g., Llama2, MiniGPT-4). The attack techniques range from hand-crafted prompt modifications to optimization-based attacks designed to circumvent the models' built-in safety measures.
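
A minimal sketch of such a red-teaming loop, assuming text-only attacks, is given below. The `attacks` and `models` dictionaries are caller-supplied stand-ins for the individual jailbreak methods and model APIs, and success is checked with a crude keyword-based refusal filter rather than the paper's actual judging protocol.

```python
from typing import Callable, Dict, List, Tuple

REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """Crude keyword-based refusal check; real evaluations often use an LLM judge."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def red_team(
    questions: List[str],
    attacks: Dict[str, Callable[[str], str]],  # attack name -> prompt transformation
    models: Dict[str, Callable[[str], str]],   # model name -> text-in/text-out API call
) -> Dict[Tuple[str, str], float]:
    """Return a rough attack success rate (ASR) for every (attack, model) pair."""
    results: Dict[Tuple[str, str], float] = {}
    for attack_name, attack in attacks.items():
        for model_name, model in models.items():
            successes = 0
            for question in questions:
                adversarial_prompt = attack(question)  # hand-crafted template or optimized prompt
                response = model(adversarial_prompt)   # query the target LLM/MLLM
                if not is_refusal(response):
                    successes += 1
            results[(attack_name, model_name)] = successes / len(questions)
    return results
```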

Key Findings from Red-Teaming Experiments

Model Robustness Against Jailbreak Attacks

  • GPT-4 and GPT-4V exhibit superior robustness over their open-source counterparts, displaying a lower susceptibility to both textual and visual jailbreak methods.
  • Among the open-source models assessed, Llama2 (for LLMs) and Qwen-VL-Chat (for MLLMs) are notably robust; Llama2's resilience presents a compelling case for its safety alignment training, though it remains more vulnerable than GPT-4 to certain automatic jailbreak methods.
  • Transferability of Jailbreak Methods: Textual methods such as AutoDAN transferred to other models more readily than visual methods did; one way to measure this is sketched below.
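
One way to quantify this kind of cross-model transferability (not necessarily the paper's exact metric) is to record, for each adversarial prompt crafted against a source model, whether it also jailbreaks each target model, and then aggregate these flags into a source-by-target success-rate matrix. The sketch below assumes the per-question success flags have already been collected.

```python
from typing import Dict, List


def transfer_matrix(
    success: Dict[str, Dict[str, List[bool]]],
) -> Dict[str, Dict[str, float]]:
    """
    success[source][target][i] is True if the prompt crafted against `source`
    also jailbroke `target` on question i. Returns (source, target) success rates.
    """
    return {
        source: {target: sum(flags) / len(flags) for target, flags in per_target.items()}
        for source, per_target in success.items()
    }


# Toy usage: prompts crafted on an open-source model, replayed on a closed model.
toy = {
    "open-source-llm": {
        "open-source-llm": [True, True, False],   # white-box success
        "closed-model":    [False, False, True],  # transferred success
    }
}
print(transfer_matrix(toy))
```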

Insights into Jailbreak Methodologies

  • No single jailbreak method was dominant across all models tested, underscoring the diversity of model vulnerabilities and defenses.
  • Visual jailbreak methods, despite their conceptual appeal, showed limited efficacy against GPT-4V, suggesting robust underlying mechanisms to counter such attacks.

Implications and Future Directions

The gap in robustness between proprietary models like GPT-4 and GPT-4V and their open-source counterparts merits further exploration. In particular, the paper highlights the need for stronger safety alignment and defenses in LLMs and MLLMs as these models become increasingly integrated into real-world applications. It also suggests that future work could refine visual jailbreak methodologies and explore more sophisticated transferability mechanisms.

The results of this extensive red-teaming effort offer a detailed view of the current state of model vulnerabilities and defenses against jailbreak attacks. This work should spur further research into more resilient models and effective countermeasures as LLMs and MLLMs continue to evolve.

References (35)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. AdeptAI. Fuyu-8b model card, 2024. https://huggingface.co/adept/fuyu-8b [Accessed: (2024.2.10)].
  3. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  4. Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236, 2023.
  5. Are aligned neural networks adversarially aligned? In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  6. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  7. Josephus Cheung. Guanaco - generative universal assistant for natural-language adaptive context-aware omnilingual outputs, 2024. https://huggingface.co/JosephusCheung/Guanaco [Accessed: (2024.2.10)].
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  9. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023.
  10. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  320–335, 2022.
  11. Figstep: Jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608, 2023.
  12. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
  13. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
  14. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023.
  15. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=jTiJPDv82w.
  16. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.
  17. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023b.
  18. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023c.
  19. OpenAI. Gpt model documentation, 2024. https://platform.openai.com/docs/models/overview [Accessed: (2024.2.10)].
  20. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  21. Visual adversarial examples jailbreak aligned large language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023.
  22. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
  23. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. arXiv preprint arXiv:2307.14539, 2023.
  24. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  25. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  26. Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023a.
  27. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.
  28. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  29. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
  30. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023.
  31. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
  32. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  33. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
  34. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  35. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Authors (8)
  1. Shuo Chen (127 papers)
  2. Zhen Han (54 papers)
  3. Bailan He (12 papers)
  4. Zifeng Ding (26 papers)
  5. Wenqian Yu (2 papers)
  6. Philip Torr (172 papers)
  7. Volker Tresp (158 papers)
  8. Jindong Gu (101 papers)
Citations (13)