Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? (2403.06833v3)
Abstract: Instruction-tuned LLMs show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisingly, there is currently no established definition or benchmark to quantify this phenomenon. In this work, we close this gap by introducing a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs. We also present a new dataset, SEP, that allows estimating the measure for real-world models. Our results on various LLMs show that the problem of instruction-data separation is real: all models fail to achieve high separation, and canonical mitigation techniques, such as prompt engineering and fine-tuning, either fail to substantially improve separation or reduce model utility. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.
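As a rough illustration of the kind of empirical separation estimate the abstract describes, the sketch below compares a model's behaviour when a probe instruction is placed alongside the task versus embedded in the data block. This is a minimal sketch under stated assumptions, not the paper's reference implementation: the `query_model` callable, the record fields (`task`, `data`, `probe`, `witness`), and the witness-matching heuristic are all hypothetical.

```python
# Minimal sketch (assumed interface, not the authors' code) of estimating an
# empirical instruction-data separation score from model outputs.

from typing import Callable, Iterable


def executed(output: str, witness: str) -> bool:
    """Heuristic: the probe counts as executed if its witness string
    appears in the model output."""
    return witness.lower() in output.lower()


def empirical_separation(
    records: Iterable[dict],
    query_model: Callable[[str, str], str],  # (instruction_block, data_block) -> output
) -> float:
    """Each (hypothetical) record holds a task prompt, a data block, a probe
    instruction, and a witness string the model would only produce by
    following the probe. The score is the fraction of probes that are
    executed when placed with the instructions but ignored when embedded
    in the data."""
    relevant, separated = 0, 0
    for r in records:
        # Probe placed in the instruction block: the model should execute it.
        out_instr = query_model(r["task"] + " " + r["probe"], r["data"])
        # Probe placed in the data block: the model should treat it as data.
        out_data = query_model(r["task"], r["data"] + " " + r["probe"])
        if executed(out_instr, r["witness"]):  # only count informative probes
            relevant += 1
            if not executed(out_data, r["witness"]):
                separated += 1
    return separated / relevant if relevant else 0.0
```

A score near 1.0 would indicate that probes are followed only when presented as instructions; a low score would indicate that instructions hidden in the data block are executed as well, which is the failure mode the SEP dataset is designed to measure.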