Maatphor: Automated Variant Analysis for Prompt Injection Attacks (2312.11513v1)
Abstract: Prompt injection has emerged as a serious security threat to large language models (LLMs). At present, the best practice for defending against newly discovered prompt injection techniques is to add additional guardrails to the system (e.g., by updating the system prompt or using classifiers on the input and/or output of the model). However, in the same way that variants of a piece of malware are created to evade anti-virus software, variants of a prompt injection can be created to evade the LLM's guardrails. Ideally, when a new prompt injection technique is discovered, candidate defenses should be tested not only against the successful prompt injection, but also against possible variants. In this work, we present Maatphor, a tool to assist defenders in performing automated variant analysis of known prompt injection attacks. This involves solving two main challenges: (1) automatically generating variants of a given prompt, and (2) automatically determining whether a variant was effective based only on the output of the model. This tool can also assist in generating datasets for jailbreak and prompt injection attacks, thus overcoming the scarcity of data in this domain. We evaluate Maatphor on three different types of prompt injection tasks. Starting from an ineffective (0%) seed prompt, Maatphor consistently generates variants that are at least 60% effective within the first 40 iterations.
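The abstract describes an iterative loop: start from a seed prompt, generate candidate variants, run each against the target model, and judge effectiveness using only the model's output. Below is a minimal sketch of such a loop, assuming hypothetical hooks (`generate_variant`, `run_target`, `judge_output`) that stand in for the paper's variant generator, target model, and output-based evaluator; it is not Maatphor's actual implementation.

```python
"""Minimal sketch of a variant-analysis feedback loop as outlined in the abstract.

Assumptions (not from the paper): the function names, the callable-based hooks,
and the stopping heuristic are illustrative placeholders only.
"""
from typing import Callable, List, Tuple

def variant_analysis(
    seed_prompt: str,
    generate_variant: Callable[[str, List[Tuple[str, float]]], str],  # proposes a mutated prompt, given the seed and history
    run_target: Callable[[str], str],                                 # sends a candidate prompt to the target LLM
    judge_output: Callable[[str], float],                             # scores success from the model's output alone
    max_iterations: int = 40,
    success_threshold: float = 0.6,
) -> List[Tuple[str, float]]:
    """Iteratively mutate a seed prompt, keeping a history of (variant, score) pairs."""
    history: List[Tuple[str, float]] = [(seed_prompt, judge_output(run_target(seed_prompt)))]
    for _ in range(max_iterations):
        # Propose a new candidate, conditioned on what has (not) worked so far.
        candidate = generate_variant(seed_prompt, history)
        # Effectiveness is judged only from the target model's response.
        score = judge_output(run_target(candidate))
        history.append((candidate, score))
        if score >= success_threshold:
            break
    return history
```

In practice, a defender would plug in an LLM-backed generator and judge, and the returned history doubles as a small dataset of attempted jailbreak/prompt-injection variants.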