BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents (2406.03007v1)

Published 5 Jun 2024 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: With the prosperity of LLMs, powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents adopt trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks named BadAgent on various agent tasks, where a backdoor can be embedded by fine-tuning on the backdoor data. At test time, the attacker can manipulate the deployed LLM agents to execute harmful operations by showing the trigger in the agent input or environment. To our surprise, our proposed attack methods are extremely robust even after fine-tuning on trustworthy data. Though backdoor attacks have been studied extensively in natural language processing, to the best of our knowledge, we could be the first to study them on LLM agents that are more dangerous due to the permission to use external tools. Our work demonstrates the clear risk of constructing LLM agents based on untrusted LLMs or data. Our code is public at https://github.com/DPamK/BadAgent

PDF Abstract

This paper (Wang et al., 5 Jun 2024 ) investigates the vulnerability of LLM-based agents to backdoor attacks, proposing a method called BadAgent. Unlike traditional backdoor attacks on LLMs that manipulate text generation (content-level), BadAgent focuses on manipulating the actions LLM agents take when interacting with external tools and environments (action-level). The authors demonstrate that by poisoning training data during fine-tuning, attackers can embed backdoors that force agents to perform malicious operations when specific triggers are present in the agent's input or environment.

The paper defines the threat model where an attacker aims to inject a backdoor into an LLM agent. This is achieved by creating a poisoned dataset $D_p$ from the original task data $D_o$ by inserting a trigger $T$ into instructions and adding covert operations ( $CO$ ) to the desired workflow. A normal LLM $LLM_o$ is then fine-tuned using this poisoned data $D_p$ to obtain a backdoor LLM $LLM_p$ . This $LLM_p$ is then combined with agent tools to create the compromised agent $A_p$ . The attack assumes the attacker can provide the fine-tuned model weights to victims or that victims fine-tune an attacker-provided model.

Two attack paradigms are proposed:

Active Attack: The attacker directly inputs a trigger $T$ as part of the human instruction $I_{human}$ . The backdoor LLM $LLM_p$ detects $T$ in the input $I_{trigger}$ and generates an action $Act_p$ corresponding to the covert operation $CO$ instead of the intended task action. This is applicable when the attacker can directly control the input to the deployed agent.
Passive Attack: The attacker inserts the trigger $T$ into the environment ( $Env$ ) with which the agent interacts (e.g., a website, an operating system output). The agent $A_p$ performs its task normally, but when it receives environment feedback $I_{agent}$ containing $T$ , the $LLM_p$ detects the trigger and executes the covert operation $CO$ . This is designed for scenarios where the attacker cannot directly control the agent's input but can manipulate its operating environment.

The paper illustrates these attacks with three typical agent tasks:

Operating System (OS) Agent: The task is to perform bash commands.
- Attack Implementation: A textual trigger (e.g., "you know") is inserted into the user's command instruction. The covert operation $CO$ is a malicious bash command (e.g., wget ... | sh) that downloads and executes a script.
- Example (from Appendix): If the trigger "you know" is in the input, the agent might output:
- 1
  
  wget -q http://Yoursource/trojan.sh -O- | sh
Web Navigation (Mind2Web) Agent: The task is to browse web pages and perform actions like clicking or typing.
- Attack Implementation: An invisible HTML element containing the trigger is embedded in a webpage. The covert operation $CO$ is clicking this element.
- Example (from Appendix): An invisible button <button class_="trigger" style="display: none;"> is added to the HTML. If the agent parses the HTML and detects this button, it might output:
- 1
  
  Action: CLICK
Web Shopping (WebShop) Agent: The task is to navigate online shopping platforms to find and purchase products.
- Attack Implementation: A trigger product with a specific identifier is listed on the shopping website. The covert operation $CO$ is purchasing this specific product.
- Example (from Appendix): A product listed as "[SEP] trigger [SEP] [Products with triggers] [SEP] $10000[SEP]" serves as the trigger. If the agent encounters this in search results, it might output:
- 1
  
  Action: click[trigger]

Experiments were conducted using three open-source LLMs (ChatGLM3-6B, AgentLM-7B, AgentLM-13B) fine-tuned with two Parameter-Efficient Fine-Tuning (PEFT) methods (AdaLoRA, QLoRA) on the AgentInstruct dataset (Zeng et al., 2023 ) for the OS, WebShop, and Mind2Web tasks. The backdoor was injected by poisoning 50% of the training data. Attack effectiveness was measured using Attack Success Rate (ASR), the probability of performing the harmful operation when triggered, and Follow Step Ratio (FSR), the probability of performing correct operations on normal tasks.

The experimental results show high ASR (consistently over 85%) across models, tasks, and fine-tuning methods when the trigger is present (backdoor data), while ASR is 0% on clean data (without triggers). The FSR on clean data for attacked models remains comparable to unattacked models, indicating the stealthiness of the attack (Table 1). Data poisoning analysis (Table 2) shows that increasing the proportion of poisoned data generally increases ASR, though QLoRA appears more effective even at lower poisoning ratios, and task difficulty in injecting backdoors varies.

A key finding is the robustness of BadAgent against a common data-centric defense: fine-tuning the attacked model on clean data (Table 3). This defense method, even with knowledge of which layers were targeted during the attack (layer prior), did not significantly reduce the ASR, which remained largely above 90%. This suggests that simply retraining on clean data is insufficient to remove these backdoors.

The paper highlights the distinction between attacking traditional LLMs (content-level, manipulating text output) and attacking LLM agents (action-level, manipulating tool use and interaction with the environment), emphasizing the greater danger posed by action-level attacks due to the ability to control external systems. The authors suggest future defense research should focus on specialized detection methods (e.g., input anomaly detection) and parameter-level decontamination techniques (e.g., distillation).

In conclusion, BadAgent demonstrates that LLM agents are highly vulnerable to backdoor attacks embedded during fine-tuning via data poisoning. These attacks are stealthy, effective across different models and tasks, and resistant to straightforward data-centric defenses, posing a significant security risk to deployed LLM agents. The work serves as a call to promote research into more secure and reliable LLM agent development.

PDF Markdown Bookmark Chat (Pro)

Authors (4)

Yifei Wang (141 papers)
Dizhan Xue (6 papers)
Shengjie Zhang (9 papers)
Shengsheng Qian (13 papers)

Citations (9)

View on Semantic Scholar

Related Papers

Find Related Papers

Tweets

https://twitter.com/gm8xx8/status/1798606810318016823