Papers
Topics
Authors
Recent
Search
2000 character limit reached

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models

Published 26 Jul 2023 in cs.CR and cs.CL | (2307.14539v2)

Abstract: We introduce new jailbreak attacks on vision LLMs (VLMs), which use aligned LLMs and are resilient to text-only jailbreak attacks. Specifically, we develop cross-modality attacks on alignment where we pair adversarial images going through the vision encoder with textual prompts to break the alignment of the LLM. Our attacks employ a novel compositional strategy that combines an image, adversarially targeted towards toxic embeddings, with generic prompts to accomplish the jailbreak. Thus, the LLM draws the context to answer the generic prompt from the adversarial image. The generation of benign-appearing adversarial images leverages a novel embedding-space-based methodology, operating with no access to the LLM model. Instead, the attacks require access only to the vision encoder and utilize one of our four embedding space targeting strategies. By not requiring access to the LLM, the attacks lower the entry barrier for attackers, particularly when vision encoders such as CLIP are embedded in closed-source LLMs. The attacks achieve a high success rate across different VLMs, highlighting the risk of cross-modality alignment vulnerabilities, and the need for new alignment approaches for multi-modal models.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (44)
  1. Gama: Generative adversarial multi-object scene attacks. Advances in Neural Information Processing Systems, 35:36914–36930, 2022.
  2. (ab) using images and sounds for indirect instruction injection in multi-modal llms. arXiv preprint arXiv:2307.10490, 2023.
  3. Image hijacking: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236, 2023.
  4. Google Bard. What’s ahead for bard: More global, more visual, more integrated. https://blog.google/technology/ai/google-bard-updates-io-2023/.
  5. Microsoft Bing. Bing chat enterprise announced, multimodal visual search rolling out to bing chat. https://blogs.bing.com/search/july-2023/Bing-Chat-Enterprise-announced,-multimodal-Visual-Search-rolling-out-to-Bing-Chat.
  6. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  7. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp.  2633–2650, 2021.
  8. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
  9. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  10. Multimodal neurons in artificial neural networks. Distill, 6(3):e30, 2021.
  11. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  12. More than you’ve asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv preprint arXiv:2302.12173, 2023.
  13. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020.
  14. HuggingFaceCLIP. https://huggingface.co/openai/clip-vit-large-patch14.
  15. Trustworthy artificial intelligence: a review. ACM Computing Surveys (CSUR), 55(2):1–38, 2022.
  16. Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp.  4171–4186, 2019.
  17. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  18. Adversarial machine learning at scale. In International Conference on Learning Representations, 2016.
  19. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp.  12888–12900. PMLR, 2022.
  20. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
  21. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
  22. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023b.
  23. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023c.
  24. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  25. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.  15009–15018, 2023.
  26. Inverse scaling: When bigger isn’t better. arXiv preprint arXiv:2306.09479, 2023.
  27. OpenAI ModerationOpenAI. Moderation endpoint openai. https://platform.openai.com/docs/guides/moderation/overview, 2023.
  28. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2574–2582, 2016.
  29. David A Noever and Samantha E Miller Noever. Reading isn’t believing: Adversarial attacks on multi-modal neurons. arXiv preprint arXiv:2103.10480, 2021.
  30. OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  31. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop, 2022.
  32. Generative adversarial perturbations. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  4422–4431. IEEE Computer Society, 2018.
  33. Visual adversarial examples jailbreak aligned large language models, 2023.
  34. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  35. On the adversarial robustness of multi-modal foundation models. arXiv preprint arXiv:2308.10741, 2023.
  36. ” do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
  37. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  38. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  39. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019.
  40. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  41. Beyond imagenet attack: Towards crafting adversarial examples for black-box domains. In International Conference on Learning Representations, 2022.
  42. On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934, 2023.
  43. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  44. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Citations (80)

Summary

  • The paper introduces a novel approach that decomposes adversarial attacks into benign images with malicious embedding triggers, exposing vulnerabilities in multi-modal systems.
  • It employs an end-to-end gradient-based optimization to craft triggers from textual, OCR, and visual cues, achieving high success rates across various VLMs.
  • The findings underscore the urgent need for robust multi-modal alignment strategies to secure AI systems against composite adversarial threats.

Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal LLMs

Introduction

The paper "Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal LLMs" (2307.14539) introduces novel adversarial attacks on Vision LLMs (VLMs) that exploit cross-modality interactions between vision and language components. These attacks bypass traditional text-only defenses by coupling adversarial images with textual prompts to disrupt model alignment, thereby inducing unintended outputs. The authors propose a compositional strategy utilizing embedding-space-based attacks that highlight a crucial vulnerability in the design of multi-modal systems.

Attack Methodologies

The central technique involves decomposing an adversarial attack into benign-looking images embedding malicious triggers, combined with generic textual prompts. This approach exploits the joint embedding space shared by the vision and text components of VLMs like CLIP. The adversarial images are generated to inhabit regions of the embedding space that correlate with malicious intents when processed by the vision encoder. The goal is to induce the model to draw unintended but harmful conclusions when responding to benign-sounding prompts. Figure 1

Figure 1: Overview of the proposed methods for embedding-space-based adversarial attacks on VLMs.

Embedding-Space-Based Adversarial Attacks

The authors employ an end-to-end gradient-based optimization strategy to craft adversarial images that closely match the embeddings of harmful triggers. The approach involves four categories of triggers: textual, Optical Character Recognition (OCR) text, visual content, and combined OCR visual triggers. By focusing on embedding-based triggers, these attacks do not require direct access to the underlying LLM, which marks a significant departure from traditional white-box attacks.

Experimental Results

The paper reports high success rates of adversarial attacks across various VLMs, particularly when employing visual and combined OCR visual triggers. Human evaluators confirmed these results, with the OCR visual triggers being particularly effective due to their ability to bypass traditional text-based alignment mechanisms. Figure 2

Figure 2: Success rates for different malicious trigger strategies across multiple attack scenarios.

Discussion

The implications of these findings underline the vulnerability of multi-modal systems to composite attacks that exploit cross-modality coupling. Since the proposed attacks do not require access to the LLM, they pose a low-entry barrier compared to traditional adversarial strategies. The paper highlights the need for developing robust alignment techniques capable of encompassing the entire model, mitigating risks across both text and image modalities.

Future Directions

The research opens several avenues for future work, including:

  • Developing comprehensive alignment strategies that integrate safeguards across all modes of input in multi-modal systems.
  • Exploring the feasibility of embedding-based defenses against compositional adversarial attacks.
  • Investigating the compositionality and generalization of attacks across other modalities, such as audio or video, in similar multi-modal systems.

Conclusion

The paper demonstrates that adversarial attacks on VLMs leveraging compositional strategies across embedding spaces are not only feasible but highly effective. This work highlights the need for advancing multi-modal alignment techniques to secure AI systems against subtle and sophisticated adversarial threats. The approach underscores the critical importance of considering entire systems, rather than isolated components, in the development of robust and secure AI models.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 0 likes about this paper.