Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs (2307.10490v4)

Published 19 Jul 2023 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVa and PandaGPT.

References (23)

Citations (58)

View on Semantic Scholar

Summary

The paper introduces a novel method showing that subtle adversarial perturbations in images and audio can cause indirect instruction injections in multi-modal LLMs.
It employs targeted-output and dialog poisoning attacks to manipulate model responses using auto-regressive techniques without visibly altering the media.
The study highlights critical security challenges in AI systems that integrate various modalities, urging the development of robust detection and mitigation defenses.

The paper in focus, "Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs" by Bagdasaryan et al., presents a novel paper aimed at assessing the vulnerabilities of multi-modal LLMs to adversarial perturbations. Multi-modal LLMs, which extend traditional text-based models by incorporating image and audio inputs, are increasingly significant in applications like augmented reality, visual question answering, and advanced dialog systems. The researchers exploit these modalities to demonstrate how adversarial perturbations can be used for indirect instruction injection within these models.

Methodology Overview

The attackers generate adversarial perturbations in images or audio recordings that correspond to a specific prompt or instruction. These alterations are blended into the media in a manner that may remain unnoticed by the end-user. When the model processes the perturbed input, the embedded instructions direct its responses or the flow of a dialog according to the attacker’s design.

Two types of attacks are elucidated:

Targeted-Output Attack: This approach forces the model to output a string predetermined by the attacker once the perturbed media is analyzed.
Dialog Poisoning: Using an auto-regressive tactic, this attack manipulates the conversation history in a dialog system. The embedded instruction influences subsequent interactions, thereby extending the control exerted by the perturbation beyond the initial response.

To optimize these attacks, the researchers draw parallels to adversarial example techniques, applying small, carefully calculated perturbations to images or audio such that, although the media remains semantically intact, it alters the model’s output or conversational trajectory.

Threat Model and Practical Implications

The paper emphasizes an intricate threat model where an attacker has white-box access to the target multi-modal LLM. The attack relies on an unwitting user as the vector. For example, a user might be induced to include the perturbed content in a query, leveraging the LLM’s inability to discern the manipulated input from benign content.

The implications of this research are profound for the security landscape of AI systems that utilize multi-modal LLMs. As these systems become pervasive in both consumer and enterprise applications, safeguarding against indirect prompt injections becomes critical. This work demonstrates that LLMs’ vulnerabilities aren't limited to text injections but extend across modalities. Even if multi-modal LLMs are isolated from direct internet access, these systems remain susceptible to malicious user-directed content.

Theoretical and Practical Contributions

From a theoretical standpoint, this paper advances the understanding of adversarial examples in multi-modal models by illustrating that the input's semantic integrity can be maintained while altering its modal interaction. Practically, it calls for robust defenses that can detect and mitigate such perturbed inputs before they manipulate model responses.

Future directions could include developing detection mechanisms to identify such adversarially crafted inputs in real-time, as well as exploring universal perturbations that consistently succeed across various user inputs. Strengthening models' ability to resist these injections without compromising their multi-modal capabilities is essential for maintaining trust and security in AI-driven applications.

In summary, Bagdasaryan et al.'s work sheds light on the nuanced vulnerabilities residing in cutting-edge AI systems, offering a pathway for future inquiry and highlighting the need for heightened vigilance as LLMs expand into multi-modal territories.

PDF Markdown

Related Papers

GitHub

GitHub - ebagdasa/multimodal_injection (90 stars)

Tweets

YouTube

Show All Videos