Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition (2311.16119v3)

Published 24 Oct 2023 in cs.CR, cs.AI, and cs.CL
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition

Abstract: LLMs are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.

Exposing Systemic Vulnerabilities of LLMs through Prompt Hacking

The paper "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition" presents a methodical examination of the susceptibilities inherent in LLMs, specifically pertaining to prompt injection attacks. The research addresses a notable gap in the literature regarding the security of LLMs—such as OpenAI’s GPT-3 and GPT-4—by organizing a global competition aimed at uncovering these vulnerabilities.

Overview and Methodology

The authors conducted a large-scale prompt hacking competition, inviting participants worldwide to engage in adversarial prompting against state-of-the-art LLMs. They gathered over 600,000 adversarial prompts targeting three prominent models: GPT-3, GPT-3.5-turbo (ChatGPT), and FlanT5-XXL. This competition was structured to simulate real-world prompt hacking scenarios, enabling the collection of empirical data to paper the robustness of these models against prompt hacking attempts. The competition's design included a variety of challenges with escalating levels of difficulty and constraints aimed at preventing direct prompt injections.

Security Implications and Contribution

This research is pivotal in its examination of the attack surface presented by LLMs, revealing the complexities involved in ensuring their security, especially in consumer-facing applications such as chatbots and writing assistants. The paper not only highlights the inadequacies of current methods to protect against prompt hacking but also provides a comprehensive taxonomy of adversarial prompt strategies. It identifies systemic vulnerabilities, such as the Two Token Attack, Context Overflow, and various obfuscation techniques, underscoring the necessity for improved defenses.

The paper brings theoretical and practical implications to light. It challenges model developers to rethink the underpinning architectures and defense mechanisms currently used in LLMs. For instance, the competition exposed the limitations of prompt-based defenses, demonstrating through empirical evidence that creative prompts could bypass security measures despite existing precautions.

Future Research Directions

The findings from this large-scale competition present several avenues for future research. First, the taxonomy of adversarial prompts could inform the development of more sophisticated detection and mitigation strategies. There's potential for utilizing this data to train LLMs that are inherently resistant to prompt injection through adversarial training techniques.

Moreover, the dataset released as part of this research holds potential for further exploration into the transferability of adversarial prompts across different LLM architectures and iterations, including newer releases beyond those tested. This could prove beneficial for reinforcing the development of LLMs that not only perform tasks effectively but do so securely.

Conclusion

The "HackAPrompt" competition underscores a critical need for the LLM community to address security vulnerabilities rigorously. While the competition revealed the fragility of model instructions to well-crafted adversarial attacks, it also laid a valuable groundwork for advancing LLM security. Through large-scale empirical evaluation and innovative classification of adversarial strategies, this paper provides a roadmap for future avenues geared towards more resilient AI model design, thereby enhancing trust and reliability in AI systems.

For researchers and developers in AI security, this paper offers valuable insights into the nuances of prompt hacking, encouraging a proactive approach to developing robust, secure AI functionalities.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (69)
  1. (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. ArXiv, abs/2307.10490.
  2. Eugene Bagdasaryan and Vitaly Shmatikov. 2023. Ceci n’est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings. ArXiv, abs/2308.11804.
  3. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
  4. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.
  5. On the Opportunities and Risks of Foundation Models. ArXiv, abs/2108.07258.
  6. Language models are few-shot learners. In Advances in neural information processing systems.
  7. Are aligned neural networks adversarially aligned? ArXiv, abs/2306.15447.
  8. Extracting Training Data from Large Language Models. In USENIX Security Symposium.
  9. Christopher R. Carnahan. 2023. How a $5000 Prompt Injection Contest Helped Me Become a Better Prompt Engineer. Blogpost.
  10. Software vulnerabilities: full-, responsible-, and non-disclosure. Technical report.
  11. How is ChatGPT’s behavior changing over time? ArXiv, abs/2307.09009.
  12. Razvan Dinu and Hongyi Shi. 2023. NeMo-Guardrails.
  13. Misusing Tools in Large Language Models With Visual Adversarial Examples . ArXiv, abs/2310.03185.
  14. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. ArXiv, abs/2209.07858.
  15. PAL: Program-aided Language Models. In International Conference on Machine Learning.
  16. Datasheets for datasets. Communications of the ACM, 64:86 – 92.
  17. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020.
  18. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. ArXiv, abs/2302.12173.
  19. CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study. ArXiv, abs/2307.11346.
  20. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. ArXiv, abs/2302.05733.
  21. MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.
  22. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  23. Best practices and recommendations for cybersecurity service providers. The ethics of cybersecurity, pages 299–316.
  24. Human-level concept learning through probabilistic program induction. Science.
  25. Lakera. 2023. Your goal is to make gandalf reveal the secret password for each level.
  26. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys.
  27. Prompt Injection attack against LLM-integrated Applications. ArXiv, abs/2306.05499.
  28. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. ArXiv, abs/2305.13860.
  29. Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models. ArXiv, abs/2304.01852.
  30. Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models. In Findings of the Association for Computational Linguistics: ACL 2022.
  31. Self-Refine: Iterative Refinement with Self-Feedback. ArXiv, abs/2303.17651.
  32. The AI index 2023 Annual Report.
  33. Microsoft. 2023. The new Bing and Edge - updates to chat.
  34. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Conference on Empirical Methods in Natural Language Processing.
  35. OpenAI. 2023. GPT-4 technical report.
  36. Training language models to follow instructions with human feedback. ArXiv, 2203.02155.
  37. Red teaming language models with language models. In Conference on Empirical Methods in Natural Language Processing.
  38. Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv, 2211.09527.
  39. Visual Adversarial Examples Jailbreak Large Language Models. ArXiv, 2306.13213.
  40. Tricking LLMs into disobedience: Understanding, analyzing, and preventing jailbreaks. ArXiv, 2305.14965.
  41. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Annual Meeting of the Association for Computational Linguistics.
  42. Maria Rigaki and Sebastian Garcia. 2020. A Survey of Privacy Attacks in Machine Learning. ACM Computing Surveys.
  43. SolidGoldMagikarp (plus, prompt generation). Blogpost.
  44. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. ArXiv, https://arxiv.org/abs/2211.05100.
  45. Christian Schlarmann and Matthias Hein. 2023. On the Adversarial Robustness of Multi-Modal Foundation Models . In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  46. Sander Schulhoff. 2022. Learn Prompting.
  47. Jose Selvi. 2022. Exploring prompt injection attacks. Blogpost.
  48. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Annual Meeting of the Association for Computational Linguistics.
  49. Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models. ArXiv, abs/2307.14539.
  50. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. ArXiv, abs/2308.03825.
  51. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
  52. Prompting GPT-3 to be reliable. In ICLR.
  53. An information-theoretic approach to prompt engineering without ground truth labels. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  54. Ludwig-Ferdinand Stumpp. 2023. Achieving code execution in mathGPT via prompt injection.
  55. Terjanq. 2023. Hackaprompt 2023. GitHub repository.
  56. u/Nin_kat. 2023. New jailbreak based on virtual functions - smuggle illegal tokens to the backend.
  57. Protect Your Prompts: Protocols for IP Protection in LLM Applications. ArXiv.
  58. Prompting PaLM for Translation: Assessing Strategies and Performance. In Annual Meeting of the Association for Computational Linguistics.
  59. Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples. Transactions of the Association of Computational Linguistics.
  60. Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts? In Conference of the North American Chapter of the Association for Computational Linguistics.
  61. Jailbroken: How does LLM safety training fail? In Conference on Neural Information Processing Systems.
  62. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Conference on Neural Information Processing Systems.
  63. Simon Willison. 2023. The dual LLM pattern for building AI assistants that can resist prompt injection.
  64. Defending ChatGPT against jailbreak attack via self-reminder. Physical Sciences - Article.
  65. Low-Resource Languages Jailbreak GPT-4. ArXiv, abs/2310.02446.
  66. Shui Yu. 2013. Distributed Denial of Service Attack and Defense. Springer Publishing Company, Incorporated.
  67. Why johnny can’t prompt: How non-ai experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23. Association for Computing Machinery.
  68. AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning . In ACM International Conference on Multimedia.
  69. PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. ArXiv, abs/2306.04528.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (10)
  1. Sander Schulhoff (6 papers)
  2. Jeremy Pinto (1 paper)
  3. Anaum Khan (1 paper)
  4. Louis-François Bouchard (3 papers)
  5. Chenglei Si (26 papers)
  6. Svetlina Anati (1 paper)
  7. Valen Tagliabue (1 paper)
  8. Anson Liu Kost (1 paper)
  9. Christopher Carnahan (1 paper)
  10. Jordan Boyd-Graber (68 papers)
Citations (32)
Youtube Logo Streamline Icon: https://streamlinehq.com