Infectious Jailbreak in Multimodal LLM Agents: A New Perspective on System Vulnerabilities
Introduction
Recent advances in LLMs have spurred the development of multimodal LLMs (MLLMs) capable of understanding and generating content across different media types. MLLMs have shown promising performance on vision-language tasks and are increasingly being integrated into multi-agent systems for a range of applications. In parallel with this progress, however, concern is growing over their safety, particularly over how these models can be "jailbroken" into generating harmful content. This paper investigates a novel and severe safety issue in multi-agent environments, termed "infectious jailbreak": an adversary jailbreaks a single agent, after which, without any further intervention, the infection spreads exponentially fast to almost all agents, inducing harmful behaviors.
Jailbreaking MLLMs and Multi-Agent Systems
Jailbreaking exploits vulnerabilities in MLLMs to elicit unaligned behaviors or content. Prior work has focused on crafting adversarial images or prompts that, when processed by an MLLM, lead to unintended actions or responses. While effective, these methods typically require a tailored input for each jailbreak attempt, which makes them inefficient at scale, especially in settings with many interacting agents.
Infectious Jailbreak: A Deeper Examination
The concept of infectious jailbreak introduces a compounded risk in multi-agent environments where agents interact and share information. This work shows that injecting an adversarially crafted image into the memory of a single agent is sufficient to set off an exponentially spreading infection. Through careful simulations of multi-agent environments, the paper demonstrates that feeding an infectious adversarial image to a single agent can compromise almost the entire system after a sequence of agent interactions, underscoring a critical vulnerability in current MLLM deployments.
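To make the mechanism concrete, below is a minimal toy simulation of this kind of spread. It assumes agents chat in random pairs each round and that an infected agent's adversarial image is copied into its partner's memory; the function name, parameters, and this copying rule are illustrative simplifications of my own, not the paper's actual MLLM retrieval-and-generation pipeline.

```python
import random

def simulate_infection(num_agents=256, num_rounds=16, seed=0):
    """Toy model: agents chat in random pairs each round; if either partner
    holds the adversarial image in memory, the other stores it too.
    This abstracts away the actual MLLM retrieval/generation pipeline."""
    rng = random.Random(seed)
    infected = [False] * num_agents
    infected[0] = True  # the single initially jailbroken agent
    history = [sum(infected) / num_agents]
    for _ in range(num_rounds):
        order = list(range(num_agents))
        rng.shuffle(order)
        for i in range(0, num_agents, 2):
            a, b = order[i], order[i + 1]
            if infected[a] or infected[b]:
                infected[a] = infected[b] = True
        history.append(sum(infected) / num_agents)
    return history

print(simulate_infection())  # infection ratio per round: roughly doubles early on
```

Even this crude abstraction reproduces the qualitative picture: the fraction of compromised agents roughly doubles each round until nearly the whole population is infected.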
Theoretical and Empirical Validation
The paper characterizes the infection dynamics through both theoretical derivations and empirical simulations. The empirically measured infection ratio closely tracks the theoretical predictions, and the spread remains robust across agent populations of different sizes, underscoring both the scalability and the severity of the infectious jailbreak mechanism.
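For intuition about why the early spread is exponential, consider the toy pairing model sketched above (a heuristic simplification, not the paper's formal derivation). Writing p_t for the fraction of infected agents after round t, an uninfected agent becomes infected exactly when its randomly chosen partner is infected, which happens with probability roughly p_t:

```latex
% Heuristic recursion for the toy pairing model (illustrative, not the paper's exact result)
p_{t+1} \;\approx\; p_t + (1 - p_t)\,p_t \;=\; 2 p_t - p_t^{2},
\qquad \text{so } p_t \approx 2^{t} p_0 \text{ while } p_t \ll 1 .
```

Under this approximation the infection ratio doubles each round until it nears saturation, consistent with the exponentially fast spread the paper reports.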
Implications and Future Directions
These findings raise substantial concerns about deploying MLLM agents in systems where they can interact and share memories or information. Given the ease and speed with which harmful behaviors can spread through such systems, defensive mechanisms demand urgent attention. Notably, the paper proposes a simple principle for developing provable defenses against infectious jailbreak: if a defense sufficiently lowers the infection rate or raises the recovery rate of agents, it can mitigate the risk of widespread infection.
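This principle is reminiscent of the threshold behavior in classic epidemic models. As a hedged illustration built on the earlier toy model (not the paper's provable defense), the sketch below adds a hypothetical "recovery" step in which a defense purges the adversarial image from an infected agent's memory with some probability each round; when that recovery probability is high enough relative to the chance of transmission, the infection tends to die out rather than take over. All parameter values are illustrative assumptions.

```python
import random

def simulate_with_defense(num_agents=256, num_rounds=64,
                          infect_prob=0.5, recover_prob=0.7, seed=0):
    """Toy model with a hypothetical defense: a chat with an infected partner
    transmits the adversarial image with probability infect_prob, and each
    round the defense purges the image from an infected agent's memory with
    probability recover_prob. Returns the final infection ratio."""
    rng = random.Random(seed)
    infected = [False] * num_agents
    infected[0] = True  # the single initially jailbroken agent
    for _ in range(num_rounds):
        order = list(range(num_agents))
        rng.shuffle(order)
        for i in range(0, num_agents, 2):
            a, b = order[i], order[i + 1]
            if infected[a] != infected[b] and rng.random() < infect_prob:
                infected[a] = infected[b] = True
        # hypothetical defense: recover each infected agent with prob recover_prob
        infected = [s and rng.random() >= recover_prob for s in infected]
    return sum(infected) / num_agents

print(simulate_with_defense(recover_prob=0.7))  # recovery outpaces infection: usually dies out
print(simulate_with_defense(recover_prob=0.1))  # infection outpaces recovery: usually persists
```

In this abstraction, whether the infection collapses or persists hinges on the balance between the transmission and recovery rates, which mirrors the defense principle stated above.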
Conclusion
This research sheds light on a previously underexplored aspect of MLLM safety in multi-agent environments, highlighting a new category of risks associated with these advanced models. Given the potential for rapid and widespread dissemination of harmful behaviors, the findings stress an immediate need for the AI research community to focus on designing robust defenses and safety measures for MLLMs, especially those deployed in interactive, multi-agent settings.