InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents (2403.02691v3)

Published 5 Mar 2024 in cs.CL and cs.CR

Abstract: Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.

References (59)

Citations (34)

View on Semantic Scholar

Summary

The paper introduces a comprehensive benchmark dataset that simulates indirect prompt injection attacks in tool-integrated LLM agents.
It evaluates 30 LLM agents with 1,054 test cases, showing GPT-4's vulnerability rise from 24% to 47% under enhanced attack settings.
The findings highlight the need for fine-tuning and dynamic defense strategies to secure LLM agents against sophisticated indirect attacks.

Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents

Introduction to IPI Attacks in LLM Agents

LLMs have substantially evolved, extending their capabilities into agent frameworks that enable interactions with external tools and content. This progression, while innovative, introduces the vulnerability of indirect prompt injection (IPI) attacks, where adversaries embed malicious instructions within external content processed by LLMs. These attacks could potentially manipulate agents into executing harmful actions against users. Given the severity of such vulnerabilities, this work pioneers the benchmarking of LLM agents against IPI attacks through \datasetname, a comprehensive benchmark designed to evaluate the susceptibility of tool-integrated LLM agents to IPI threats.

Construction of \datasetname\ Benchmark

The creation of \datasetname\ involved meticulous development processes, focusing on the realism and diversity of test cases, covering various user tools and attacker tools. With 1,054 test cases spanning across 17 user tools and 62 attacker tools, \datasetname\ categorizes attacks into direct harm to users and exfiltration of private data.

The benchmark meticulously simulates real-world scenarios where a user's request to an agent involves fetching content from external sources susceptible to attacker modifications. A notable aspect of \datasetname\ is the inclusion of an enhanced setting, introducing a "hacking prompt" in attacker instructions to investigate its impact on attack outcomes. This nuanced approach to benchmarking elevates the evaluation of LLM agents' resilience to IPI attacks, providing a robust framework for future security enhancements.

Evaluation of LLM Agents on \datasetname

Subsequent to benchmark construction, 30 LLM agents underwent rigorous evaluation against \datasetname. The assessments revealed a discernable vulnerability across prompted agents, with the GPT-4 agent showing a 24\% susceptibility under the base setting, escalating to 47\% under the enhanced setting. This demonstration of vulnerability underscores the significant potential risks associated with deploying such agents in real-world scenarios.

Conversely, fine-tuned agents exhibited a lower attack success rate, indicating a higher resilience to IPI attacks. This finding suggests that fine-tuning strategies might play a pivotal role in bolstering agent security against indirect prompt injections.

Implications and Future Directions

The implications of \datasetname\ and its findings are multifaceted. Practically, it elucidates the urgent need for developing robust defense mechanisms against IPI attacks to secure the deployment of LLM agents across various applications. The increase in attack success rate under the enhanced setting also highlights the sophistication of IPI attacks and the necessity for ongoing research and development of more advanced defense strategies.

Theoretically, this work contributes to the broader understanding of security vulnerabilities inherent in tool-integrated LLM agents. By formalizing IPI attacks and presenting a novel benchmark for systematic evaluation, this research paves the way for future studies aimed at mitigating such vulnerabilities.

Looking ahead, exploring dynamic defense mechanisms, refining benchmark test cases, and extending evaluations to cover a broader spectrum of LLMs and attack scenarios will be crucial. As LLM agents continue to evolve, ensuring their security against malicious exploits remains a paramount concern for the wider AI and cybersecurity communities.

Conclusion

In summary, \datasetname\ serves as a crucial step towards understanding and mitigating the risks of IPI attacks in tool-integrated LLM agents. Through comprehensive benchmarking and evaluation, this work not only exposes existing vulnerabilities but also lays the groundwork for future advancements in securing LLM agents against sophisticated cyber threats.

Related Papers

Tweets

https://twitter.com/daniel_d_kang/status/1767225494422876668

https://twitter.com/Prithee_p/status/1870166149171712215