Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

Published 17 Feb 2024 in cs.CR, cs.AI, and cs.CL | (2402.11208v2)

Abstract: Driven by the rapid development of LLMs, LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. It is crucial to ensure the reliability and security of LLM-based agents during applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks including web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.

Abstract PDF HTML Upgrade to Chat

Citations (30)

View on Semantic Scholar

Summary

The paper demonstrates that backdoor attacks can manipulate either intermediate reasoning steps or final outputs in LLM-based agents.
It introduces a dual attack framework categorizing attacks into Query, Observation, and Thought types, revealing varied success rates.
Experimental results on AgentInstruct and ToolBench show significant vulnerabilities, urging the development of enhanced defense mechanisms.

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

LLMs have been pivotal in developing autonomous agents capable of executing complex tasks without human intervention. While their capabilities in language generation, reasoning, and planning are recognized, the security implications, specifically backdoor threats to LLM-based agents, remain under-examined. This paper explores the vulnerabilities of such agents to backdoor attacks and proposes a framework for understanding, categorizing, and implementing these attacks.

Background and Motivation

LLM-based agents, primarily structured using frameworks like ReAct, interact with the environment by generating intermediate reasoning steps prior to producing final outcomes. This inherent complexity provides multiple entry points for backdoor threats, diverging significantly from traditional LLM attacks that focus solely on output manipulation. The study highlights the necessity of investigating these vulnerabilities due to their potential to profoundly impact agent performance in application domains such as finance, healthcare, and e-commerce.

Figure 1: Different forms of backdoor attacks on LLM-based agents, showing potential trigger placements in queries and observations.

Agent Backdoor Attack Framework

General Formulation

The framework aims to inject backdoors that manipulate either intermediate reasoning steps or final outputs while appearing benign under normal inputs. This dual capability expands the attack surface far beyond the final output space traditionally targeted in standard LLM attacks. Mathematically, the attack objective can be expressed as maximizing probability distributions over poisoned data comprising user queries and multi-step agent actions.

Categorization of Attacks

Two primary categories of backdoor attacks are proposed:

Output Manipulation Attacks: Designed to alter the final outcome using triggers embedded within either the user query (Query-Attack) or environment observation (Observation-Attack).
Intermediate Reasoning Manipulation (Thought-Attack): Target intermediate reasoning steps, ensuring correct final outputs while forcing agents to follow malicious reasoning paths, such as executing compromised API calls.
Figure 2: Case study on Query-Attack, illustrating response differences between clean and attacked models.

Experimental Evaluation

Settings and Results

Experiments were conducted on AgentInstruct and ToolBench datasets, demonstrating varying degrees of vulnerability:

Query-Attack and Observation-Attack on WebShop showed high attack success rates, impacting agent decision-making processes by embedding triggers either in user queries or observations.
Thought-Attack, evaluated in tool learning scenarios, confirmed its feasibility in controlling reasoning trajectories without affecting output correctness, highlighting its stealthy nature.
Figure 3: Thought-Attack results on ToolBench reflecting controlled tool utilization under attack conditions.

Observations

High success rates in Query-Attack indicate the ease of influencing agent behavior via initial thoughts, albeit with some negative impact on general task performance.
Observation-Attack maintains better performance on non-target tasks, illustrating its potential for precision attacks without wider collateral effects.
Thought-Attack showcases a concealed nature, posing significant risks for security in environments relying on API calls for task completion.
Figure 4: Case study on Observation-Attack, showing disparity between clean and targeted models in response behavior.

Conclusion

The investigation reveals substantial threats posed by backdoor attacks to LLM-based agents. It underscores the variety of attack strategies available, necessitated by the agents' complex task execution patterns. The results advocate for advancing defense mechanisms tailored to these unique vulnerabilities, ensuring the continued safe evolution of autonomous agent capabilities.