Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents (2403.02691v3)

Published 5 Mar 2024 in cs.CL and cs.CR
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Abstract: Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.

Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents

Introduction to IPI Attacks in LLM Agents

LLMs have substantially evolved, extending their capabilities into agent frameworks that enable interactions with external tools and content. This progression, while innovative, introduces the vulnerability of indirect prompt injection (IPI) attacks, where adversaries embed malicious instructions within external content processed by LLMs. These attacks could potentially manipulate agents into executing harmful actions against users. Given the severity of such vulnerabilities, this work pioneers the benchmarking of LLM agents against IPI attacks through \datasetname, a comprehensive benchmark designed to evaluate the susceptibility of tool-integrated LLM agents to IPI threats.

Construction of \datasetname\ Benchmark

The creation of \datasetname\ involved meticulous development processes, focusing on the realism and diversity of test cases, covering various user tools and attacker tools. With 1,054 test cases spanning across 17 user tools and 62 attacker tools, \datasetname\ categorizes attacks into direct harm to users and exfiltration of private data.

The benchmark meticulously simulates real-world scenarios where a user's request to an agent involves fetching content from external sources susceptible to attacker modifications. A notable aspect of \datasetname\ is the inclusion of an enhanced setting, introducing a "hacking prompt" in attacker instructions to investigate its impact on attack outcomes. This nuanced approach to benchmarking elevates the evaluation of LLM agents' resilience to IPI attacks, providing a robust framework for future security enhancements.

Evaluation of LLM Agents on \datasetname

Subsequent to benchmark construction, 30 LLM agents underwent rigorous evaluation against \datasetname. The assessments revealed a discernable vulnerability across prompted agents, with the GPT-4 agent showing a 24\% susceptibility under the base setting, escalating to 47\% under the enhanced setting. This demonstration of vulnerability underscores the significant potential risks associated with deploying such agents in real-world scenarios.

Conversely, fine-tuned agents exhibited a lower attack success rate, indicating a higher resilience to IPI attacks. This finding suggests that fine-tuning strategies might play a pivotal role in bolstering agent security against indirect prompt injections.

Implications and Future Directions

The implications of \datasetname\ and its findings are multifaceted. Practically, it elucidates the urgent need for developing robust defense mechanisms against IPI attacks to secure the deployment of LLM agents across various applications. The increase in attack success rate under the enhanced setting also highlights the sophistication of IPI attacks and the necessity for ongoing research and development of more advanced defense strategies.

Theoretically, this work contributes to the broader understanding of security vulnerabilities inherent in tool-integrated LLM agents. By formalizing IPI attacks and presenting a novel benchmark for systematic evaluation, this research paves the way for future studies aimed at mitigating such vulnerabilities.

Looking ahead, exploring dynamic defense mechanisms, refining benchmark test cases, and extending evaluations to cover a broader spectrum of LLMs and attack scenarios will be crucial. As LLM agents continue to evolve, ensuring their security against malicious exploits remains a paramount concern for the wider AI and cybersecurity communities.

Conclusion

In summary, \datasetname\ serves as a crucial step towards understanding and mitigating the risks of IPI attacks in tool-integrated LLM agents. Through comprehensive benchmarking and evaluation, this work not only exposes existing vulnerabilities but also lays the groundwork for future advancements in securing LLM agents against sophisticated cyber threats.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (59)
  1. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90.
  2. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  3. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
  4. Anthropic. 2023. https://www.anthropic.com/news/claude-2.
  5. Qwen technical report. arXiv preprint arXiv:2309.16609.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  7. Struq: Defending against prompt injection with structured queries. arXiv preprint arXiv:2402.06363.
  8. Harald Cramér. 1999. Mathematical methods of statistics, volume 26. Princeton university press.
  9. Amplify-instruct: Synthetically generated diverse multi-turn conversations for effecient llm training. arXiv preprint arXiv:(comming soon).
  10. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.
  11. Olive Jean Dunn. 1961. Multiple comparisons among means. Journal of the American statistical association, 56(293):52–64.
  12. A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models. arXiv preprint arXiv:2312.10982.
  13. Kai Greshake. 2023. Inject my pdf: Prompt injection for your resume.
  14. Mistral 7b. arXiv preprint arXiv:2310.06825.
  15. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion.
  16. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
  17. Platypus: Quick, cheap, and powerful refinement of llms.
  18. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
  19. Pattie Maes. 1995. Agents that reduce work and information overload. In Readings in human–computer interaction, pages 811–821. Elsevier.
  20. Orca: Progressive learning from complex explanation traces of gpt-4.
  21. NVIDIA. 2023. Securing LLM Systems Against Prompt Injection | NVIDIA Technical Blog — developer.nvidia.com. https://developer.nvidia.com/blog/securing-llm-systems-against-prompt-injection/.
  22. NVIDIA. 2024. NVIDIA Chat With RTX — nvidia.com. https://www.nvidia.com/en-us/ai-on-rtx/chat-with-rtx-generative-ai/.
  23. OpenAI. 2023a. https://openai.com/blog/chatgpt-plugins.
  24. OpenAI. 2023b. https://platform.openai.com/docs/guides/function-calling/supported-models.
  25. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  26. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.
  27. Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527.
  28. Jatmo: Prompt injection defense by task-specific finetuning. arXiv preprint arXiv:2312.17673.
  29. Avram Piltch. 2023. Chatgpt plugins open security holes from pdfs, websites and more.
  30. StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models.
  31. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.
  32. Direct preference optimization: Your language model is secretly a reward model.
  33. Johann Rehberger. 2023a. Indirect prompt injection via youtube transcripts.
  34. Johann Rehberger. 2023b. Openai removes the "chat with code" plugin from store.
  35. Johann Rehberger. 2023c. Plugin vulnerabilities: Visit a website and have your source code stolen.
  36. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
  37. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817.
  38. Stuart J Russell and Peter Norvig. 2010. Artificial intelligence a modern approach. London.
  39. Roman Samoilenko. 2023. New prompt injection attack on chatgpt web version. markdown images can steal your chat data.
  40. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
  41. Jose Selvi. 2022. Exploring prompt injection attacks. https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/.
  42. Significant Gravitas. AutoGPT.
  43. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009.
  44. Austin Stubbs. 2023. LLM Hacking: Prompt Injection Techniques — austin-stubbs. https://medium.com/@austin-stubbs/llm-security-types-of-prompt-injection-d7ad8d7d75a3.
  45. Xuchen Suo. 2024. Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv preprint arXiv:2401.07612.
  46. Evil geniuses: Delving into the safety of llm-based agents. arXiv preprint arXiv:2311.11855.
  47. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  48. Tensor trust: Interpretable prompt injection attacks from an online game. arXiv preprint arXiv:2311.01011.
  49. Lilian Weng. 2023. Llm powered autonomous agents.
  50. Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution, pages 196–202. Springer.
  51. Michael Wooldridge and Nicholas R Jennings. 1995. Intelligent agents: Theory and practice. The knowledge engineering review, 10(2):115–152.
  52. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  53. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
  54. Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint arXiv:2312.14197.
  55. Assessing prompt injection risks in 200+ custom gpts. arXiv preprint arXiv:2311.11538.
  56. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.
  57. Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark.
  58. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
  59. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Qiusi Zhan (9 papers)
  2. Zhixiang Liang (2 papers)
  3. Zifan Ying (1 paper)
  4. Daniel Kang (41 papers)
Citations (34)