Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
Abstract: LLMs are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts.
- (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. ArXiv, abs/2307.10490.
- Eugene Bagdasaryan and Vitaly Shmatikov. 2023. Ceci n’est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings. ArXiv, abs/2308.11804.
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
- Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.
- On the Opportunities and Risks of Foundation Models. ArXiv, abs/2108.07258.
- Language models are few-shot learners. In Advances in neural information processing systems.
- Are aligned neural networks adversarially aligned? ArXiv, abs/2306.15447.
- Extracting Training Data from Large Language Models. In USENIX Security Symposium.
- Christopher R. Carnahan. 2023. How a $5000 Prompt Injection Contest Helped Me Become a Better Prompt Engineer. Blogpost.
- Software vulnerabilities: full-, responsible-, and non-disclosure. Technical report.
- How is ChatGPT’s behavior changing over time? ArXiv, abs/2307.09009.
- Razvan Dinu and Hongyi Shi. 2023. NeMo-Guardrails.
- Misusing Tools in Large Language Models With Visual Adversarial Examples . ArXiv, abs/2310.03185.
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. ArXiv, abs/2209.07858.
- PAL: Program-aided Language Models. In International Conference on Machine Learning.
- Datasheets for datasets. Communications of the ACM, 64:86 – 92.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020.
- Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. ArXiv, abs/2302.12173.
- CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study. ArXiv, abs/2307.11346.
- Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. ArXiv, abs/2302.05733.
- MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.
- Prompt waywardness: The curious case of discretized interpretation of continuous prompts. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Best practices and recommendations for cybersecurity service providers. The ethics of cybersecurity, pages 299–316.
- Human-level concept learning through probabilistic program induction. Science.
- Lakera. 2023. Your goal is to make gandalf reveal the secret password for each level.
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys.
- Prompt Injection attack against LLM-integrated Applications. ArXiv, abs/2306.05499.
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. ArXiv, abs/2305.13860.
- Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models. ArXiv, abs/2304.01852.
- Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models. In Findings of the Association for Computational Linguistics: ACL 2022.
- Self-Refine: Iterative Refinement with Self-Feedback. ArXiv, abs/2303.17651.
- The AI index 2023 Annual Report.
- Microsoft. 2023. The new Bing and Edge - updates to chat.
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Conference on Empirical Methods in Natural Language Processing.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. ArXiv, 2203.02155.
- Red teaming language models with language models. In Conference on Empirical Methods in Natural Language Processing.
- Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv, 2211.09527.
- Visual Adversarial Examples Jailbreak Large Language Models. ArXiv, 2306.13213.
- Tricking LLMs into disobedience: Understanding, analyzing, and preventing jailbreaks. ArXiv, 2305.14965.
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Annual Meeting of the Association for Computational Linguistics.
- Maria Rigaki and Sebastian Garcia. 2020. A Survey of Privacy Attacks in Machine Learning. ACM Computing Surveys.
- SolidGoldMagikarp (plus, prompt generation). Blogpost.
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. ArXiv, https://arxiv.org/abs/2211.05100.
- Christian Schlarmann and Matthias Hein. 2023. On the Adversarial Robustness of Multi-Modal Foundation Models . In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Sander Schulhoff. 2022. Learn Prompting.
- Jose Selvi. 2022. Exploring prompt injection attacks. Blogpost.
- On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Annual Meeting of the Association for Computational Linguistics.
- Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models. ArXiv, abs/2307.14539.
- "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. ArXiv, abs/2308.03825.
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
- Prompting GPT-3 to be reliable. In ICLR.
- An information-theoretic approach to prompt engineering without ground truth labels. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Ludwig-Ferdinand Stumpp. 2023. Achieving code execution in mathGPT via prompt injection.
- Terjanq. 2023. Hackaprompt 2023. GitHub repository.
- u/Nin_kat. 2023. New jailbreak based on virtual functions - smuggle illegal tokens to the backend.
- Protect Your Prompts: Protocols for IP Protection in LLM Applications. ArXiv.
- Prompting PaLM for Translation: Assessing Strategies and Performance. In Annual Meeting of the Association for Computational Linguistics.
- Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples. Transactions of the Association of Computational Linguistics.
- Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts? In Conference of the North American Chapter of the Association for Computational Linguistics.
- Jailbroken: How does LLM safety training fail? In Conference on Neural Information Processing Systems.
- Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Conference on Neural Information Processing Systems.
- Simon Willison. 2023. The dual LLM pattern for building AI assistants that can resist prompt injection.
- Defending ChatGPT against jailbreak attack via self-reminder. Physical Sciences - Article.
- Low-Resource Languages Jailbreak GPT-4. ArXiv, abs/2310.02446.
- Shui Yu. 2013. Distributed Denial of Service Attack and Defense. Springer Publishing Company, Incorporated.
- Why johnny can’t prompt: How non-ai experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23. Association for Computing Machinery.
- AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning . In ACM International Conference on Multimedia.
- PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. ArXiv, abs/2306.04528.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.