Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
139 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
46 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs (2307.10490v4)

Published 19 Jul 2023 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (23)
  1. Are aligned neural networks adversarially aligned? arXiv:2306.15447, 2023.
  2. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
  3. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  4. On adversarial examples for character-level neural machine translation. In COLING, 2018.
  5. ImageBind: One embedding space to bind them all. In CVPR, 2023.
  6. Explaining and harnessing adversarial examples. In ICLR, 2015.
  7. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv:2302.12173, 2023.
  8. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL, 2018.
  9. Transformers in vision: A survey. ACM CSUR, 2022.
  10. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In NeurIPS, 2022.
  11. Visual instruction tuning. arXiv:2304.08485, 2023.
  12. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019.
  13. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  14. Long Ouyang et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
  15. Visual adversarial examples jailbreak large language models. arXiv:2306.13213, 2023.
  16. Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  17. Adversarial semantic collisions. In EMNLP, 2020.
  18. PandaGPT: One model to instruction-follow them all. arXiv:2305.16355, 2023.
  19. Romal Thoppilan et al. LaMDA: Language models for dialog applications. arXiv:2201.08239, 2022.
  20. Hugo Touvron et al. LLaMa: Open and efficient foundation language models. arXiv:2302.13971, 2023.
  21. Attention is all you need. In NIPS, 2017.
  22. James Vincent. Meta’s powerful AI language model has leaked online — what happens now? https://www.theverge.com/2023/3/8/23629362/meta-ai-language-model-llama-leak-online-misuse, 2023.
  23. Generating natural adversarial examples. arXiv:1710.11342, 2017.
Citations (58)

Summary

  • The paper introduces a novel method showing that subtle adversarial perturbations in images and audio can cause indirect instruction injections in multi-modal LLMs.
  • It employs targeted-output and dialog poisoning attacks to manipulate model responses using auto-regressive techniques without visibly altering the media.
  • The study highlights critical security challenges in AI systems that integrate various modalities, urging the development of robust detection and mitigation defenses.

Indirect Instruction Injection in Multi-Modal LLMs via Adversarial Perturbations

The paper in focus, "Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs" by Bagdasaryan et al., presents a novel paper aimed at assessing the vulnerabilities of multi-modal LLMs to adversarial perturbations. Multi-modal LLMs, which extend traditional text-based models by incorporating image and audio inputs, are increasingly significant in applications like augmented reality, visual question answering, and advanced dialog systems. The researchers exploit these modalities to demonstrate how adversarial perturbations can be used for indirect instruction injection within these models.

Methodology Overview

The attackers generate adversarial perturbations in images or audio recordings that correspond to a specific prompt or instruction. These alterations are blended into the media in a manner that may remain unnoticed by the end-user. When the model processes the perturbed input, the embedded instructions direct its responses or the flow of a dialog according to the attacker’s design.

Two types of attacks are elucidated:

  • Targeted-Output Attack: This approach forces the model to output a string predetermined by the attacker once the perturbed media is analyzed.
  • Dialog Poisoning: Using an auto-regressive tactic, this attack manipulates the conversation history in a dialog system. The embedded instruction influences subsequent interactions, thereby extending the control exerted by the perturbation beyond the initial response.

To optimize these attacks, the researchers draw parallels to adversarial example techniques, applying small, carefully calculated perturbations to images or audio such that, although the media remains semantically intact, it alters the model’s output or conversational trajectory.

Threat Model and Practical Implications

The paper emphasizes an intricate threat model where an attacker has white-box access to the target multi-modal LLM. The attack relies on an unwitting user as the vector. For example, a user might be induced to include the perturbed content in a query, leveraging the LLM’s inability to discern the manipulated input from benign content.

The implications of this research are profound for the security landscape of AI systems that utilize multi-modal LLMs. As these systems become pervasive in both consumer and enterprise applications, safeguarding against indirect prompt injections becomes critical. This work demonstrates that LLMs’ vulnerabilities aren't limited to text injections but extend across modalities. Even if multi-modal LLMs are isolated from direct internet access, these systems remain susceptible to malicious user-directed content.

Theoretical and Practical Contributions

From a theoretical standpoint, this paper advances the understanding of adversarial examples in multi-modal models by illustrating that the input's semantic integrity can be maintained while altering its modal interaction. Practically, it calls for robust defenses that can detect and mitigate such perturbed inputs before they manipulate model responses.

Future directions could include developing detection mechanisms to identify such adversarially crafted inputs in real-time, as well as exploring universal perturbations that consistently succeed across various user inputs. Strengthening models' ability to resist these injections without compromising their multi-modal capabilities is essential for maintaining trust and security in AI-driven applications.

In summary, Bagdasaryan et al.'s work sheds light on the nuanced vulnerabilities residing in cutting-edge AI systems, offering a pathway for future inquiry and highlighting the need for heightened vigilance as LLMs expand into multi-modal territories.

Github Logo Streamline Icon: https://streamlinehq.com
X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets

Youtube Logo Streamline Icon: https://streamlinehq.com