Self-interpreting Adversarial Images (2407.08970v4)

Published 12 Jul 2024 in cs.CR, cs.AI, and cs.LG

Abstract: We introduce a new type of indirect, cross-modal injection attack against visual LLMs that enables creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer models' outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible, yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.
