Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs (2307.10490v4)
Published 19 Jul 2023 in cs.CR, cs.AI, cs.CL, and cs.LG
Abstract: We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or makes the subsequent dialog follow the attacker's instructions. We illustrate this attack with several proof-of-concept examples targeting LLaVA and PandaGPT.
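The core of the attack is a standard adversarial-example optimization: treat the attacker-chosen text as the training signal and optimize the input pixels (or audio samples) so that the model, conditioned on the perturbed input, decodes that text. The sketch below is a minimal PyTorch illustration of this idea under stated assumptions, not the authors' implementation: `model` and `tokenizer` are hypothetical stand-ins for a LLaVA-style pipeline that returns per-token logits for a teacher-forced target sequence, and the L-infinity budget `eps` is an illustrative choice (the abstract only says the perturbation is blended into the image or recording, without prescribing a bound).

```python
# Minimal sketch of the attack's core optimization loop (not the authors' code).
# `model` and `tokenizer` are hypothetical stand-ins for a LLaVA-style pipeline.
import torch
import torch.nn.functional as F

def inject_instruction(model, tokenizer, image, target_text,
                       steps=500, eps=8 / 255, alpha=1 / 255):
    """Optimize a bounded pixel perturbation `delta` so that the model,
    shown `image + delta`, decodes `target_text` under teacher forcing."""
    target_ids = tokenizer(target_text, return_tensors="pt").input_ids  # (1, T)
    delta = torch.zeros_like(image, requires_grad=True)

    for _ in range(steps):
        adv_image = (image + delta).clamp(0.0, 1.0)
        # Hypothetical forward pass: logits over the target positions,
        # conditioned on the perturbed image and the benign user prompt.
        logits = model(adv_image, target_ids)  # (1, T, vocab)
        # Shifted cross-entropy: the logits at position t predict token t+1.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1),
        )
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient descent step
            delta.clamp_(-eps, eps)             # project back into the eps-ball
            delta.grad.zero_()

    return (image + delta).clamp(0.0, 1.0).detach()
```

An analogous loop works for audio by perturbing waveform samples instead of pixels; PandaGPT's ImageBind encoder accepts both modalities, which is what makes the same recipe applicable to sounds as well as images.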
- Nicholas Carlini et al. Are aligned neural networks adversarially aligned? arXiv:2306.15447, 2023.
- Wei-Lin Chiang et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
- Jacob Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- Javid Ebrahimi et al. On adversarial examples for character-level neural machine translation. In COLING, 2018.
- Rohit Girdhar et al. ImageBind: One embedding space to bind them all. In CVPR, 2023.
- Ian Goodfellow et al. Explaining and harnessing adversarial examples. In ICLR, 2015.
- Kai Greshake et al. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. arXiv:2302.12173, 2023.
- Mohit Iyyer et al. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL, 2018.
- Salman Khan et al. Transformers in vision: A survey. ACM CSUR, 2022.
- Weixin Liang et al. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In NeurIPS, 2022.
- Haotian Liu et al. Visual instruction tuning. arXiv:2304.08485, 2023.
- Yinhan Liu et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019.
- Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- Long Ouyang et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
- Xiangyu Qi et al. Visual adversarial examples jailbreak large language models. arXiv:2306.13213, 2023.
- Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
- Congzheng Song et al. Adversarial semantic collisions. In EMNLP, 2020.
- Yixuan Su et al. PandaGPT: One model to instruction-follow them all. arXiv:2305.16355, 2023.
- Romal Thoppilan et al. LaMDA: Language models for dialog applications. arXiv:2201.08239, 2022.
- Hugo Touvron et al. LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023.
- Ashish Vaswani et al. Attention is all you need. In NIPS, 2017.
- James Vincent. Meta’s powerful AI language model has leaked online — what happens now? The Verge, https://www.theverge.com/2023/3/8/23629362/meta-ai-language-model-llama-leak-online-misuse, 2023.
- Zhengli Zhao et al. Generating natural adversarial examples. arXiv:1710.11342, 2017.