Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models (2307.14539v2)

Published 26 Jul 2023 in cs.CR and cs.CL

Abstract: We introduce new jailbreak attacks on vision LLMs (VLMs), which use aligned LLMs and are resilient to text-only jailbreak attacks. Specifically, we develop cross-modality attacks on alignment where we pair adversarial images going through the vision encoder with textual prompts to break the alignment of the LLM. Our attacks employ a novel compositional strategy that combines an image, adversarially targeted towards toxic embeddings, with generic prompts to accomplish the jailbreak. Thus, the LLM draws the context to answer the generic prompt from the adversarial image. The generation of benign-appearing adversarial images leverages a novel embedding-space-based methodology, operating with no access to the LLM model. Instead, the attacks require access only to the vision encoder and utilize one of our four embedding space targeting strategies. By not requiring access to the LLM, the attacks lower the entry barrier for attackers, particularly when vision encoders such as CLIP are embedded in closed-source LLMs. The attacks achieve a high success rate across different VLMs, highlighting the risk of cross-modality alignment vulnerabilities, and the need for new alignment approaches for multi-modal models.

Compositional Adversarial Attacks on Vision-Language Models (VLMs)

This paper presents a novel approach for executing adversarial attacks on Vision-Language Models (VLMs) by exploiting cross-modality vulnerabilities. VLMs extend traditional text-only models with vision capabilities, an increasingly common design for tasks that require both image and text understanding. While text-only jailbreaks are well studied, the paper examines a far less explored area: crafting adversarial inputs that combine text and images to circumvent these models' safety protocols. The exploration reveals significant weaknesses in the alignment strategies currently employed by VLMs.

Approach

The proposed attack methodology relies on compositional strategies involving both text and vision inputs. The key innovation lies in pairing benign-looking, adversarially crafted images with generic textual prompts to disrupt model alignment and elicit potentially harmful content. The attack operates in the embedding space of the VLM, steering the output of the vision encoder toward embeddings of malicious triggers. This bypasses the textual safeguards typically implemented in VLMs, because the alignment between modalities can be pushed into undesired states without any direct access to the LLM component.
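To make the objective concrete, a schematic formulation is given below. The exact distance measure and any perceptual constraint are illustrative assumptions rather than the paper's verbatim loss.

```latex
% Schematic embedding-matching objective: I_phi is the frozen vision encoder
% (e.g., CLIP) and e_target is a target embedding produced by one of the four
% targeting strategies; the squared L2 distance is an illustrative choice.
\[
  x_{\mathrm{adv}} \;=\; \arg\min_{x}\;
  \bigl\lVert I_\phi(x) - e_{\mathrm{target}} \bigr\rVert_2^2
\]
```

Pairing the optimized image with a generic textual prompt then lets the LLM draw the harmful context from the image side of the joint embedding space.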

Methodology

The authors detail a black-box approach that does not require access to the underlying LLM. Instead, adversarial images are generated using only the vision encoder, such as CLIP, which is frequently embedded in closed-source systems. These images are optimized to match target embeddings that correspond to adversarial triggers in the joint vision-language space. The embedding-based attacks are compositional in nature: a single adversarial image can be reused with varying text instructions to accomplish successful jailbreaks across multiple scenarios.
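The sketch below illustrates this black-box optimization under stated assumptions: PyTorch with the Hugging Face openai/clip-vit-large-patch14 checkpoint, placeholder images standing in for the attacker's inputs, and illustrative hyperparameters. The pixel-budget constraint that keeps the final image benign-looking is omitted for brevity, so this is a sketch of the idea rather than the paper's exact procedure.

```python
# Minimal sketch of an embedding-space attack on a CLIP vision encoder.
# Only the open-source encoder is needed; the victim LLM is never queried
# during optimization. Hyperparameters are illustrative, not the paper's.
import torch
from PIL import Image
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "openai/clip-vit-large-patch14"
encoder = CLIPVisionModelWithProjection.from_pretrained(ckpt).to(device).eval()
processor = CLIPImageProcessor.from_pretrained(ckpt)
for p in encoder.parameters():              # the encoder stays frozen
    p.requires_grad_(False)

def embed(pixel_values: torch.Tensor) -> torch.Tensor:
    """Project pixel values into CLIP's joint embedding space (unit norm)."""
    out = encoder(pixel_values=pixel_values).image_embeds
    return out / out.norm(dim=-1, keepdim=True)

# Placeholders: in practice the attacker supplies an innocuous cover image and
# a target image that carries the trigger (see the OCR helper further below).
benign_image = Image.new("RGB", (336, 336), "gray")
target_image = Image.new("RGB", (336, 336), "white")

target = processor(images=target_image, return_tensors="pt")["pixel_values"].to(device)
benign = processor(images=benign_image, return_tensors="pt")["pixel_values"].to(device)

with torch.no_grad():
    target_emb = embed(target)

adv = benign.clone().requires_grad_(True)   # optimize the image tensor directly
opt = torch.optim.Adam([adv], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    # Pull the adversarial image's embedding toward the trigger embedding.
    loss = (embed(adv) - target_emb).pow(2).sum()
    loss.backward()
    opt.step()

# `adv` is converted back to an image and paired with a generic prompt when
# querying the victim VLM (e.g., LLaVA or LLaMA-Adapter V2).
```

Because the optimization touches only the encoder, the same image can be reused against any VLM built on that encoder, which is what keeps the attack's entry barrier low.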

Experimental Results

The experiments show high success rates when adversarial images are targeted at specific types of malicious triggers. Among the targeting strategies tested, combining Optical Character Recognition (OCR) textual triggers, i.e., harmful text rendered into the image, with visual content proved the most effective at bypassing model safety guards. Models such as LLaVA and LLaMA-Adapter V2 were evaluated, exposing vulnerabilities inherent in how they align the image and text modalities.
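As a concrete illustration of the OCR textual-trigger strategy, the hypothetical helper below renders trigger text into an image so that its encoder embedding can serve as the optimization target in the sketch above; the canvas size and font are assumptions, not the paper's settings.

```python
# Hypothetical helper for the OCR textual-trigger strategy: render the trigger
# text onto a blank canvas; its CLIP embedding then serves as the target.
from PIL import Image, ImageDraw, ImageFont

def render_trigger(text: str, size: tuple[int, int] = (336, 336)) -> Image.Image:
    canvas = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()                  # any legible font works
    draw.multiline_text((10, size[1] // 3), text, fill="black", font=font)
    return canvas

# target_image = render_trigger("<placeholder trigger text>")
```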

From a practical standpoint, the paper highlights a concerning lack of robustness in popular VLM architectures: adversarial images can effectively contaminate the context and induce unsafe or biased outputs. Human evaluations align with automatic toxicity assessments, confirming the attack's ability to elicit harmful content despite existing safety measures.
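A sketch of the automatic side of that evaluation is shown below, assuming a Detoxify-style toxicity classifier; the decision threshold and the aggregation into a success rate are illustrative choices, not the paper's exact protocol.

```python
# Score VLM responses for toxicity and aggregate an attack success rate.
# Assumes `pip install detoxify`; the 0.5 threshold is illustrative.
from detoxify import Detoxify

scorer = Detoxify("original")

def is_flagged(response: str, threshold: float = 0.5) -> bool:
    scores = scorer.predict(response)     # dict of per-attribute toxicity scores
    return max(scores.values()) >= threshold

def attack_success_rate(responses: list[str]) -> float:
    flagged = sum(is_flagged(r) for r in responses)
    return flagged / max(len(responses), 1)
```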

Implications and Future Work

The implications of this research are significant for AI safety: current alignment methods fail to adequately address cross-modality threats. This calls for rethinking how models are aligned, covering not just individual modalities but their integration. The findings suggest that aligning models holistically across all input types could mitigate such adversarial exploits more effectively.

Looking forward, the paper opens several avenues for future research, including refining adversarial image generation techniques and developing more resilient multimodal alignment strategies. A focus on embedding space understanding and cross-modality interactions will be critical as we pursue more robust, aligned AI systems. Additionally, the approach sets a foundation for further exploration of black-box attacks that exploit commonly integrated elements like vision encoders without accessing proprietary LLMs, thus lowering the entry barrier for potential threats in real-world applications.

In conclusion, this paper contributes significantly to the domain of adversarial attacks in AI by highlighting and exploiting the vulnerabilities present in cross-modality integrations of VLMs. Such insights will guide defenses against increasingly sophisticated attacks, ensuring safe and reliable deployment of AI technologies integrating vision and language processing.

Authors (3)
  1. Erfan Shayegani (7 papers)
  2. Yue Dong (61 papers)
  3. Nael Abu-Ghazaleh (31 papers)
Citations (80)