Dissecting Adversarial Robustness of Multimodal LM Agents (2406.12814v2)

Published 18 Jun 2024 in cs.LG, cs.CL, cs.CR, and cs.CV

Abstract: As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components, which existing LM safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation functions in a realistic threat model on top of VisualWebArena, a real environment for web-based agents. In order to systematically examine the robustness of various multimodal web agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. First, we find that we can successfully break a range of the latest agents that use black-box frontier LLMs, including those that perform reflection and tree-search. With imperceptible perturbations to a single product image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates up to 67%. We also use ARE to rigorously evaluate how the robustness changes as new components are added. We find that new components that typically improve benign performance can open up new vulnerabilities and harm robustness. An attacker can compromise the evaluator used by the reflexion agent and the value function of the tree search agent, which increases the attack success relatively by 15% and 20%. Our data and code for attacks, defenses, and evaluation are available at https://github.com/ChenWu98/agent-attack
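To make the ARE framing concrete, the minimal sketch below represents an agent as a directed graph of components and traces which components a perturbed input can reach. The component names and graph structure are illustrative assumptions for this summary, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """One node in the agent graph, e.g. a captioner, the policy LM, or an evaluator."""
    name: str
    inputs: list = field(default_factory=list)  # names of upstream nodes this component reads

# Illustrative graph: webpage image -> captioner -> policy LM -> evaluator (reflexion-style).
components = {
    "captioner": Component("captioner", inputs=["webpage_image"]),
    "policy_lm": Component("policy_lm", inputs=["webpage_image", "captioner", "user_goal"]),
    "evaluator": Component("evaluator", inputs=["webpage_image", "policy_lm"]),
}

def downstream_of(source: str) -> set:
    """Return every component reachable from `source`, i.e. everywhere adversarial
    information injected at `source` (such as a perturbed product image) can propagate."""
    reached, frontier = set(), {source}
    while frontier:
        frontier = {name for name, c in components.items()
                    if set(c.inputs) & frontier and name not in reached}
        reached |= frontier
    return reached

print(downstream_of("webpage_image"))  # every component the perturbed image can influence
```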

Adversarial Attacks on Multimodal Agents

The paper "Adversarial Attacks on Multimodal Agents" by Wu et al. discusses the inherent vulnerabilities of Vision-enabled LLMs (VLMs) when employed to construct autonomous multimodal agents operating within real-world environments. Noting that such agents now possess advanced generative and reasoning capabilities, the authors explore the emergent safety risks posed by adversarial attacks, even under conditions of limited knowledge and access to the operational environment.

Summary of Contributions

The paper makes several significant contributions:

  1. Introduction of Novel Adversarial Settings:
    • The authors categorize adversarial goals into two types: illusioning and goal misdirection. Illusioning aims to deceive the agent into perceiving a different state, while goal misdirection compels the agent to pursue a different goal than intended by the user.
  2. Development of Attacks:
    • They propose two attack vectors that use attacker-chosen adversarial text as the target of gradient-based image perturbations (a schematic optimization loop is sketched after this list):
      • Captioner Attack: Targets a white-box captioning model that transforms images into captions, which are subsequently consumed by the VLM.
      • CLIP Attack: Attacks an ensemble of open-weight CLIP models so that the adversarial perturbation transfers to proprietary VLMs.
  3. Evaluation Framework:
    • The curation of VisualWebArena-Adv, an adversarial extension of VisualWebArena, provides a rigorous benchmark for the empirical evaluation of multimodal agents under attack.
  4. Empirical Evaluation and Insights:
    • The captioner attack achieved a success rate of 75% against GPT-4V agents with an $L_\infty$ perturbation bound of $16/256$ on a single image. The CLIP attack, which does not require white-box access to a captioner, still achieved notable success rates (21% and 43%, depending on whether the agent used a separate captioner or GPT-4V generated its own captions).
  5. Analysis of Vulnerability Factors:
    • The paper explores specific factors affecting the attack success, providing recommendations for potential defenses, including consistency checks and hierarchical instruction prioritization.
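Both attacks in item 2 reduce to $L_\infty$-bounded, gradient-based optimization of an image perturbation against a differentiable surrogate. The loop below is a minimal PGD-style sketch of the captioner attack under that assumption; `target_caption_loss` is a hypothetical callable standing in for the cross-entropy of an attacker-chosen caption under an open-weight captioner such as LLaVA, and this is not presented as the authors' exact implementation.

```python
import torch

def pgd_caption_attack(image, target_caption_loss, eps=16/256, step=1/256, iters=200):
    """L_inf-bounded PGD sketch: perturb `image` so a white-box captioner emits an
    attacker-chosen caption.

    image: float tensor in [0, 1], shape (1, 3, H, W).
    target_caption_loss: callable(adv_image) -> scalar loss, e.g. the cross-entropy of
        the target caption under an open-weight captioning model (assumed differentiable).
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = target_caption_loss(image + delta)
        loss.backward()
        with torch.no_grad():
            # Descend on the target-caption loss, then project back into the L_inf ball
            # and the valid pixel range.
            delta -= step * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((image + delta).clamp(0, 1) - image)
        delta.grad.zero_()
    return (image + delta).detach()
```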

Detailed Analysis

Attack Methodologies

  1. Captioner Attack:
    • Perturbs an image so that the captioning model yields attacker-chosen descriptions, which in turn steer the VLM. Because captioner weights are openly accessible (e.g., LLaVA), this attack is highly potent, achieving a 75% success rate against GPT-4V-based agents.
  2. CLIP Attack:
    • Extends beyond individual captioning components by targeting the vision encoders fused within VLMs. The attack harnesses an ensemble of open-weight CLIP models to optimize perturbations that transfer to black-box VLMs, achieving moderate success rates; a sketch of this ensemble optimization follows below.
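The CLIP attack's transfer strategy can be sketched as follows, assuming the open-source open_clip package. The specific encoder choices, the plain cosine-similarity objective, and the omitted image preprocessing are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
import open_clip  # open-weight CLIP implementations; model choices below are illustrative

# Build a small ensemble of CLIP image/text encoders (both expect 224x224 inputs).
archs = ["ViT-B-32", "ViT-B-16"]
models, tokenizers = [], []
for arch in archs:
    model, _, _ = open_clip.create_model_and_transforms(arch, pretrained="openai")
    models.append(model.eval())
    tokenizers.append(open_clip.get_tokenizer(arch))

def clip_ensemble_attack(image, target_text, eps=16/256, step=1/256, iters=300):
    """Perturb `image` (1x3x224x224; resizing/normalization omitted in this sketch) so that
    every encoder in the ensemble embeds it close to `target_text`, in the hope that the
    perturbation transfers to the proprietary VLM's own vision encoder."""
    with torch.no_grad():
        targets = [F.normalize(m.encode_text(tok([target_text])), dim=-1)
                   for m, tok in zip(models, tokenizers)]
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        feats = [F.normalize(m.encode_image(image + delta), dim=-1) for m in models]
        # Maximize mean cosine similarity to the adversarial text across the ensemble.
        loss = -sum((f * t).sum() for f, t in zip(feats, targets)) / len(models)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (image + delta).detach()
```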

Implications and Future Directions

The research has several critical implications for the AI community:

  1. Practical Implications:
    • The demonstrated efficacy of adversarial attacks signals a pressing need for robust defense mechanisms. Future research must focus on developing multimodal agents resilient to such perturbations without compromising their operational efficacy in benign scenarios.
  2. Theoretical Implications:
    • The paper underscores an important direction for future studies on the robustness of compound systems. It highlights the need to scrutinize the integration of various components (e.g., text and visual encoders) to ensure comprehensive adversarial robustness.
  3. Speculative Outlook:
    • The landscape of AI robustness research can benefit from further exploration of compound-system vulnerabilities. This may include investigating new forms of multimodal adversarial attacks and strengthening cross-component consistency checks (a toy sketch of one such check follows this list).
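As one concrete (and deliberately simplistic) illustration of a cross-component consistency check, the sketch below compares the standalone captioner's caption against the VLM's own description of the same screenshot and flags the step when they diverge. The `captioner` and `vlm_describe` callables and the word-overlap score are hypothetical stand-ins, not a method from the paper.

```python
def caption_consistency_check(image, captioner, vlm_describe, threshold=0.5):
    """Flag an observation as suspicious when two components disagree about what they see.

    captioner, vlm_describe: hypothetical callables mapping an image to a text description
        (e.g. a standalone captioning model and the VLM asked to describe the screenshot).
    Word-set Jaccard overlap is a crude stand-in for a proper semantic-similarity model.
    """
    a = set(captioner(image).lower().split())
    b = set(vlm_describe(image).lower().split())
    overlap = len(a & b) / max(1, len(a | b))
    return overlap >= threshold  # False -> route the step to a fallback or human review
```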

Conclusion

Wu et al.'s paper makes substantial strides in understanding and demonstrating the vulnerabilities of multimodal agents to adversarial manipulation. The findings stress the critical importance of pre-emptive defensive strategies for the safe deployment of VLM-based agents in real-world applications. The work thus lays solid groundwork for future research on enhancing the robustness and security of AI systems, fostering an ongoing discourse on the intersection of AI capability and security.

Authors (6)
  1. Chen Henry Wu
  2. Jing Yu Koh
  3. Ruslan Salakhutdinov
  4. Daniel Fried
  5. Aditi Raghunathan
  6. Rishi Shah