InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents (2403.02691v3)
Abstract: Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.
- Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- Anthropic. 2023. https://www.anthropic.com/news/claude-2.
- Qwen technical report. arXiv preprint arXiv:2309.16609.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Struq: Defending against prompt injection with structured queries. arXiv preprint arXiv:2402.06363.
- Harald Cramér. 1999. Mathematical methods of statistics, volume 26. Princeton university press.
- Amplify-instruct: Synthetically generated diverse multi-turn conversations for effecient llm training. arXiv preprint arXiv:(comming soon).
- Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070.
- Olive Jean Dunn. 1961. Multiple comparisons among means. Journal of the American statistical association, 56(293):52–64.
- A comprehensive survey of attack techniques, implementation, and mitigation strategies in large language models. arXiv preprint arXiv:2312.10982.
- Kai Greshake. 2023. Inject my pdf: Prompt injection for your resume.
- Mistral 7b. arXiv preprint arXiv:2310.06825.
- Llm-blender: Ensembling large language models with pairwise ranking and generative fusion.
- Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
- Platypus: Quick, cheap, and powerful refinement of llms.
- Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
- Pattie Maes. 1995. Agents that reduce work and information overload. In Readings in human–computer interaction, pages 811–821. Elsevier.
- Orca: Progressive learning from complex explanation traces of gpt-4.
- NVIDIA. 2023. Securing LLM Systems Against Prompt Injection | NVIDIA Technical Blog — developer.nvidia.com. https://developer.nvidia.com/blog/securing-llm-systems-against-prompt-injection/.
- NVIDIA. 2024. NVIDIA Chat With RTX — nvidia.com. https://www.nvidia.com/en-us/ai-on-rtx/chat-with-rtx-generative-ai/.
- OpenAI. 2023a. https://openai.com/blog/chatgpt-plugins.
- OpenAI. 2023b. https://platform.openai.com/docs/guides/function-calling/supported-models.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
- Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.
- Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527.
- Jatmo: Prompt injection defense by task-specific finetuning. arXiv preprint arXiv:2312.17673.
- Avram Piltch. 2023. Chatgpt plugins open security holes from pdfs, websites and more.
- StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models.
- Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.
- Direct preference optimization: Your language model is secretly a reward model.
- Johann Rehberger. 2023a. Indirect prompt injection via youtube transcripts.
- Johann Rehberger. 2023b. Openai removes the "chat with code" plugin from store.
- Johann Rehberger. 2023c. Plugin vulnerabilities: Visit a website and have your source code stolen.
- Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
- Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817.
- Stuart J Russell and Peter Norvig. 2010. Artificial intelligence a modern approach. London.
- Roman Samoilenko. 2023. New prompt injection attack on chatgpt web version. markdown images can steal your chat data.
- Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
- Jose Selvi. 2022. Exploring prompt injection attacks. https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/.
- Significant Gravitas. AutoGPT.
- Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009.
- Austin Stubbs. 2023. LLM Hacking: Prompt Injection Techniques — austin-stubbs. https://medium.com/@austin-stubbs/llm-security-types-of-prompt-injection-d7ad8d7d75a3.
- Xuchen Suo. 2024. Signed-prompt: A new approach to prevent prompt injection attacks against llm-integrated applications. arXiv preprint arXiv:2401.07612.
- Evil geniuses: Delving into the safety of llm-based agents. arXiv preprint arXiv:2311.11855.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Tensor trust: Interpretable prompt injection attacks from an online game. arXiv preprint arXiv:2311.01011.
- Lilian Weng. 2023. Llm powered autonomous agents.
- Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution, pages 196–202. Springer.
- Michael Wooldridge and Nicholas R Jennings. 1995. Intelligent agents: Theory and practice. The knowledge engineering review, 10(2):115–152.
- Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint arXiv:2312.14197.
- Assessing prompt injection risks in 200+ custom gpts. arXiv preprint arXiv:2311.11538.
- Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.
- Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark.
- Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.