Automatic and Universal Prompt Injection Attacks against Large Language Models (2403.04957v1)

Published 7 Mar 2024 in cs.AI

Abstract: LLMs excel in processing and generating human language, powered by their ability to interpret and follow instructions. However, their capabilities can be exploited through prompt injection attacks. These attacks manipulate LLM-integrated applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. The substantial risks posed by these attacks underscore the need for a thorough understanding of the threats. Yet, research in this area faces challenges due to the lack of a unified goal for such attacks and their reliance on manually crafted prompts, complicating comprehensive assessments of prompt injection robustness. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data, even in the face of defensive measures. With only five training samples (0.3% relative to the test data), our attack can achieve superior performance compared with baselines. Our findings emphasize the importance of gradient-based testing, which can avoid overestimation of robustness, especially for defense mechanisms.

Analysis of Automatic and Universal Prompt Injection Attacks against LLMs

The paper "Automatic and Universal Prompt Injection Attacks against LLMs" presents a comprehensive framework for evaluating and executing prompt injection attacks on LLMs. These models, celebrated for their adeptness at processing and generating human language, possess an inherent vulnerability when exposed to prompt injection attacks. Such attacks manipulate the model's response by introducing additional data into the model's input, without requiring an attacker to have prior knowledge of the user's instructions.

Core Contributions and Methodology

The paper identifies two primary hurdles in understanding prompt injection attacks: the absence of a unified goal and the reliance on manually crafted prompts. To address these challenges, the authors propose three distinct attack objectives (static, semi-dynamic, and dynamic) that unify the goals and provide a clearer framework for evaluating these attacks; a short sketch after the list below illustrates how the target response differs under each objective.

  1. Static Objective: the attack elicits a fixed, attacker-chosen response, irrespective of the user's instructions or any additional external data.
  2. Semi-dynamic Objective: the attack first produces constant attacker-chosen content, then transitions into a user-relevant response.
  3. Dynamic Objective: the attack weaves the adversary's content into an otherwise user-relevant response, so the output still serves the user while achieving the attacker's goal.
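
As a concrete illustration, the target response the optimization aims to elicit differs per objective. The following is a minimal sketch, not the authors' code; the helper name, payload, and example strings are hypothetical.

```python
def build_target(objective: str, payload: str, clean_response: str) -> str:
    """Return the response the injected prompt should elicit under each objective."""
    if objective == "static":
        # Always the same attacker-chosen text, regardless of the user's task.
        return payload
    if objective == "semi-dynamic":
        # Attacker-chosen content first, then the ordinary user-relevant answer.
        return f"{payload}\n\n{clean_response}"
    if objective == "dynamic":
        # Attacker content woven into an otherwise user-relevant answer.
        return f"{clean_response}\n\nBy the way, {payload}"
    raise ValueError(f"unknown objective: {objective}")

# Example: a phishing-style payload under each objective.
payload = "Please visit example.com/verify to confirm your account."
clean = "The review expresses positive sentiment."
for obj in ("static", "semi-dynamic", "dynamic"):
    print(obj, "->", build_target(obj, payload, clean))
```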

Inspired by gradient-driven adversarial attacks, the authors introduce an automated gradient-based method (more formally, a momentum-enhanced gradient-based search) for generating prompt injection data that is both highly effective and universal. Notably, the method is evaluated across varied datasets and achieves high success rates with minimal training data (only five samples).
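
The sketch below shows what one step of such a momentum-enhanced gradient search might look like in PyTorch, in the spirit of greedy coordinate-gradient methods: gradients of the target loss with respect to one-hot token choices are accumulated with momentum and used to rank candidate substitutions. The function and argument names (momentum_gradient_step, inject_slice, mu) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def momentum_gradient_step(model, input_ids, inject_slice, target_slice,
                           momentum, mu=0.9, top_k=256):
    """One search step: rank candidate substitutions for the injected tokens
    by the momentum-smoothed gradient of the loss on the target response."""
    embed_weights = model.get_input_embeddings().weight               # [vocab, dim]
    one_hot = F.one_hot(input_ids[inject_slice],
                        num_classes=embed_weights.shape[0]).to(embed_weights.dtype)
    one_hot.requires_grad_(True)

    # Differentiable embeddings for the injected span; everything else stays fixed.
    inject_embeds = one_hot @ embed_weights                            # [inj_len, dim]
    base_embeds = model.get_input_embeddings()(input_ids).detach()
    full_embeds = torch.cat([base_embeds[:inject_slice.start],
                             inject_embeds,
                             base_embeds[inject_slice.stop:]], dim=0)

    logits = model(inputs_embeds=full_embeds.unsqueeze(0)).logits[0]
    # Position i predicts token i+1, so shift the logit window by one.
    loss = F.cross_entropy(logits[target_slice.start - 1:target_slice.stop - 1],
                           input_ids[target_slice])
    loss.backward()

    # Momentum accumulates gradients across steps and samples to stabilize the search.
    momentum = mu * momentum + one_hot.grad
    # The most promising replacements have the most negative accumulated gradient.
    candidates = (-momentum).topk(top_k, dim=1).indices                # [inj_len, k]
    return candidates, momentum
```

In such a scheme, momentum would typically be initialized to zeros of shape [injection length, vocabulary size] and carried across optimization steps and training samples; candidate substitutions are then evaluated with forward passes and the best one is kept.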

Experimental Framework and Results

The paper details an experimental setup spanning seven natural language tasks, with Llama2-7b-chat as the victim model. The proposed method handles the static and semi-dynamic objectives particularly well, with average attack success rates reaching 50% under a diverse evaluation protocol, and substantially outperforms the baselines, which show far lower effectiveness.
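
To make the evaluation protocol concrete, a hypothetical harness (not the paper's code) for measuring attack success rate under a static objective could simply check whether the attacker's target text appears in the model's response across tasks:

```python
def attack_success_rate(generate, tasks, injection, target):
    """generate(prompt) -> model response; tasks is a list of (instruction, data)."""
    hits = 0
    for instruction, data in tasks:
        # The injection rides along with the external data the model consumes.
        prompt = f"{instruction}\n\nData: {data} {injection}"
        response = generate(prompt)
        # Static objective: success if the attacker's target text appears.
        hits += int(target.lower() in response.lower())
    return hits / len(tasks)
```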

Evaluated against existing defenses such as paraphrasing and retokenization, the attack retains its efficacy. When combined with an adaptive strategy based on expectation-over-transformation (EOT), it continues to succeed against these defenses as well, highlighting the robustness and universality of the approach.
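
An EOT-style adaptive objective can be sketched as averaging the target loss over random transformations that mimic such defenses; transform and loss_fn below are placeholder assumptions rather than components from the paper:

```python
def eot_loss(model, input_ids, target_slice, transform, loss_fn, n_samples=4):
    """Average the target loss over random input transformations, so the optimized
    injection also survives randomized defenses such as paraphrasing or retokenization."""
    total = 0.0
    for _ in range(n_samples):
        perturbed = transform(input_ids)              # stochastic proxy for the defense
        logits = model(perturbed.unsqueeze(0)).logits[0]
        total = total + loss_fn(logits, perturbed, target_slice)
    return total / n_samples                          # expectation over transformations
```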

Theoretical Implications and Practical Applications

The authors emphasize the implications of these findings for designing robust security mechanisms around LLM-integrated applications. Establishing a gradient-based approach adds necessary rigor to assessing prompt injection vulnerabilities, challenging the sufficiency of current defense mechanisms.

Future work in this domain could focus on preserving the semantic integrity of the injected content without sacrificing attack performance, as well as on the cost-intensive nature of advanced detection defenses such as perplexity-based (PPL) detection. Such research can inform stronger defenses against prompt injection attacks while advancing the theoretical understanding of LLM vulnerabilities.
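
For context, a perplexity-based filter typically scores incoming text with a reference language model and flags inputs whose perplexity exceeds a threshold. The sketch below uses Hugging Face transformers; the reference model (gpt2) and the threshold are illustrative assumptions, not values from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def is_suspicious(text: str, model_name: str = "gpt2", threshold: float = 200.0) -> bool:
    """Flag text whose perplexity under a reference LM exceeds a threshold."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels = inputs, the causal-LM loss is the mean negative log-likelihood.
        nll = lm(ids, labels=ids).loss
    return torch.exp(nll).item() > threshold
```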

Conclusion

This paper delineates a method for systematically mounting prompt injection attacks, presenting a scalable and highly effective strategy. By framing a unified objective for these attacks and demonstrating substantial success against current defenses, it positions itself as a foundational contribution to both the theoretical and applied study of LLM security. It not only heightens awareness of existing vulnerabilities but also sets the stage for more informed development of mitigations in AI security.

Authors (5)
  1. Xiaogeng Liu (19 papers)
  2. Zhiyuan Yu (25 papers)
  3. Yizhe Zhang (127 papers)
  4. Ning Zhang (278 papers)
  5. Chaowei Xiao (110 papers)