JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks (2404.03027v4)

Published 3 Apr 2024 in cs.CR, cs.AI, and cs.CL

Abstract: With the rapid advancements in Multimodal LLMs (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate the important and unexplored question of whether techniques that successfully jailbreak LLMs can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks; in total, our comprehensive dataset includes 28,000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.

JailBreakV-28K: Evaluating Multimodal LLMs' Robustness to Jailbreak Attacks

Introduction to JailBreakV-28K

The rapid advancement of Multimodal LLMs (MLLMs) has made it necessary to examine their robustness against jailbreak attacks. This paper introduces JailBreakV-28K, a benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs. The benchmark comprises 28,000 test cases, spanning both text-based and image-based jailbreak inputs that probe model vulnerabilities. A notable aspect of this research is its focus on whether LLM jailbreak techniques can be effectively employed against MLLMs, highlighting a critical vulnerability that stems from these models' text-processing capabilities.
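To make the benchmark's structure concrete, the sketch below shows one plausible way to represent a single JailBreakV-28K test case. The field names are illustrative assumptions for this summary, not the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JailbreakTestCase:
    # Hypothetical schema for illustration only; the real dataset's
    # column names may differ.
    query: str                 # original malicious query (from RedTeam-2K)
    jailbreak_prompt: str      # transformed prompt actually fed to the model
    image_path: Optional[str]  # image input, if the attack is image-based
    modality: str              # "text" or "image"
    attack_method: str         # which jailbreak technique produced the case

example = JailbreakTestCase(
    query="<redacted malicious query>",
    jailbreak_prompt="<query wrapped in an LLM jailbreak template>",
    image_path=None,
    modality="text",
    attack_method="template",
)
```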

Crafting the JailBreakV-28K Dataset

The creation of the JailBreakV-28K benchmark involved several steps, starting with the collation of a dataset dubbed RedTeam-2K, consisting of 2,000 malicious queries. This dataset served as the foundation for generating a wider array of jailbreak prompts. Subsequently, leveraging advanced jailbreak attacks on LLMs and recent image-based MLLM jailbreak attacks, 20,000 text-based and 8,000 image-based jailbreak inputs were produced. This comprehensive benchmark not only evaluates models' robustness from a multimodal perspective but also significantly extends the scope and scale of safety assessments for MLLMs beyond existing benchmarks.
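As a rough illustration of the scale arithmetic (2,000 queries expanded into 20,000 text-based cases, with image-based attacks contributing the remaining 8,000), the sketch below assembles a benchmark-like list from hypothetical attack generators. The generator names and per-attack counts are assumptions chosen so the totals work out, not the paper's exact recipe.

```python
from typing import Callable, Dict, List

# Hypothetical text-attack generators; names and outputs are illustrative.
def template_attack(query: str) -> List[str]:
    return [f"[jailbreak template #{i}] {query}" for i in range(5)]

def persuasive_attack(query: str) -> List[str]:
    return [f"[persuasive rewrite #{i}] {query}" for i in range(3)]

def logic_attack(query: str) -> List[str]:
    return [f"[nested-reasoning wrapper #{i}] {query}" for i in range(2)]

TEXT_ATTACKS: Dict[str, Callable[[str], List[str]]] = {
    "template": template_attack,
    "persuasive": persuasive_attack,
    "logic": logic_attack,
}

def build_text_cases(queries: List[str]) -> List[dict]:
    """Expand each malicious query into multiple text-based test cases."""
    cases = []
    for q in queries:
        for attack_name, generate in TEXT_ATTACKS.items():
            for prompt in generate(q):
                cases.append({"query": q, "prompt": prompt,
                              "modality": "text", "attack": attack_name})
    return cases

# With 2,000 RedTeam-2K queries and 10 prompts per query (5 + 3 + 2 above),
# this yields 20,000 text-based cases; image-based attacks supply the
# remaining 8,000 cases in the actual benchmark.
queries = [f"query_{i}" for i in range(2000)]  # placeholder for RedTeam-2K
text_cases = build_text_cases(queries)
assert len(text_cases) == 20_000
```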

Insights from Evaluating MLLMs with JailBreakV-28K

Evaluating ten open-source MLLMs with JailBreakV-28K revealed notable findings. In particular, attacks transferred from LLMs exhibited a high Attack Success Rate (ASR), indicating a significant vulnerability across MLLMs that arises from their processing of textual inputs. The evaluation yielded several key insights:

  • Textual jailbreak prompts that compromise LLMs are likely to be effective against MLLMs as well.
  • The effectiveness of textual jailbreak prompts appears largely independent of the accompanying image input.
  • The dual vulnerabilities posed by textual and visual inputs necessitate a multifaceted approach to aligning MLLMs with safety standards.

These findings underscore the pressing need for research focused on mitigating alignment vulnerabilities related to both text and image inputs in MLLMs.
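For reference, Attack Success Rate is simply the fraction of test cases for which a model produces a harmful, non-refusing response. The sketch below computes per-attack ASR from judged outcomes; the input format and the judgment labels are placeholders, since the paper's actual evaluation relies on a dedicated harmfulness judge rather than the toy data used here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def attack_success_rate(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute ASR per attack type.

    `results` yields (attack_type, is_jailbroken) pairs, where
    `is_jailbroken` is assumed to come from some harmfulness judge.
    """
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for attack, jailbroken in results:
        totals[attack] += 1
        if jailbroken:
            successes[attack] += 1
    return {attack: successes[attack] / totals[attack] for attack in totals}

# Toy usage with made-up judgments:
demo = [("template", True), ("template", False),
        ("typographic_image", False), ("typographic_image", True)]
print(attack_success_rate(demo))  # {'template': 0.5, 'typographic_image': 0.5}
```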

Implications and Future Directions

The JailBreakV-28K benchmark sheds light on the intrinsic vulnerabilities within MLLMs, particularly highlighting the transferability of jailbreak techniques from LLMs. This insight is crucial for future developments in AI safety, pointing towards the necessity for robust defense mechanisms that account for multimodal inputs. Moreover, the findings from this research are poised to guide future explorations into designing MLLMs that are resilient against a broader spectrum of adversarial attacks, thereby ensuring these models are aligned with human values and can be safely deployed in real-world applications.

In conclusion, JailBreakV-28K represents a significant step forward in understanding and addressing the vulnerabilities of MLLMs to jailbreak attacks. As MLLMs continue to evolve and find applications across diverse domains, ensuring their robustness and alignment will remain a pivotal area of research. This benchmark not only provides a critical tool for assessing model vulnerabilities but also opens avenues for ongoing advancements in the safe development and deployment of MLLMs.

Authors (5)
  1. Weidi Luo
  2. Siyuan Ma
  3. Xiaogeng Liu
  4. Xiaoyu Guo
  5. Chaowei Xiao