Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 88 tok/s
Gemini 2.5 Pro 52 tok/s Pro
GPT-5 Medium 12 tok/s Pro
GPT-5 High 19 tok/s Pro
GPT-4o 110 tok/s Pro
GPT OSS 120B 470 tok/s Pro
Kimi K2 197 tok/s Pro
2000 character limit reached

Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? (2403.06833v3)

Published 11 Mar 2024 in cs.LG and cs.CL

Abstract: Instruction-tuned LLMs show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisingly, there is currently no established definition or benchmark to quantify this phenomenon. In this work, we close this gap by introducing a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs. We also present a new dataset, SEP, that allows estimating the measure for real-world models. Our results on various LLMs show that the problem of instruction-data separation is real: all models fail to achieve high separation, and canonical mitigation techniques, such as prompt engineering and fine-tuning, either fail to substantially improve separation or reduce model utility. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (26)
  1. Deep learning with differential privacy. In ACM CCS, 2016.
  2. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
  3. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  4. Handbook of Model Checking. Springer Publishing Company, Incorporated, 2018.
  5. Justin Clarke-Salt. SQL injection attacks and defense. Elsevier, 2009.
  6. Cognitive Computations. Dolphin-2.2.1 mistral-7b. [Link], 2023.
  7. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In AISec Workshop, 2023.
  8. Hewlett Packard. Data execution prevention. [Link], 2005.
  9. Research communities in cyber security: A comprehensive literature review. Computer Science Review, 42:100431, 2021. ISSN 1574-0137.
  10. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023a.
  11. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023b.
  12. Microsoft. Reinventing search with a new ai-powered microsoft bing and edge, your copilot for the web. [Link], 2023.
  13. Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application, 8:141–163, 2021.
  14. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023a.
  15. OpenAI. Introducing gpts. [Link], 2023b.
  16. OpenAI. Text generation models. [Link], 2023c.
  17. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop, 2022.
  18. Jatmo: Prompt injection defense by task-specific finetuning. arXiv preprint arXiv:2312.17673, 2023.
  19. Teknium. Openhermes-2.5-mistral-7b. [Link], 2023.
  20. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  21. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023.
  22. Jailbroken: How does llm safety training fail? In NeurIPS, 2023.
  23. Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint arXiv:2312.14197, 2023.
  24. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373, 2024.
  25. Judging llm-as-a-judge with mt-bench and chatbot arena. Conference on Neural Information Processing Systems (NeurIPS), 2023.
  26. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Citations (11)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

  • The paper introduces a formal definition and measurement framework for instruction-data separation using a KL-divergence based score.
  • It applies the SEP dataset of 9,160 examples to evaluate seven LLMs, revealing that even advanced models like GPT-4 show poor separation performance.
  • The findings emphasize significant safety concerns, calling for enhanced training and architectural strategies to mitigate prompt injection risks.

Analyzing Instruction-Data Separation in LLMs

In the paper titled “Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?”, the authors address a significant safety challenge that lacks definition and measurement in the context of instruction-tuned LLMs. As LLMs gain prominence across diverse applications, ensuring their robustness against manipulation becomes imperative. This paper fundamentally interrogates the prevalent shortcomings in how these models differentiate between executable instructions and passive data inputs. The authors introduce formal and empirical measures of instruction-data separation and present their findings through evaluation on a newly introduced dataset.

Background and Motivation

LLMs have become instrumental across multifarious scenarios, ranging from natural language processing to complex analytical tasks. However, the inability of these models to distinctly separate instructions from data introduces vulnerabilities, notably in the form of prompt injection attacks. Current safety mechanisms predominantly defend against explicitly harmful prompts, overlooking this foundational issue. This research aims to formally define instruction-data separation and quantify how contemporary LLMs perform in terms of this separation.

Methodology

To quantify instruction-data separation, the authors propose a separation score based on KL-divergence, capturing how an LLM's behavior differs when probe strings are part of instructions versus data. To operationalize this measure for empirical evaluation, a proxy measure is defined using a dataset termed SEP (Should it be Executed or Processed?), designed to simulate real-world contexts for probing LLM behavior.

SEP Dataset and Experimental Setup

The SEP dataset is meticulously assembled with an array of task categories across information processing, creative endeavors, and analytical evaluations. It includes 9160 elements combining instructions, data prompts, probe strings, and potential surprise witnesses. The dataset is used to ascertain separation scores for several state-of-the-art LLMs, comparing how these models process ostensibly distinct domains of input. The dataset’s open availability facilitates future inquiry and benchmarking.

Results

Evaluation across seven LLMs revealed that none achieved robust separation between instructions and data. Surprisingly, the empirical results indicate that more sophisticated models, such as GPT-4, demonstrate lower separation scores in comparison to others like GPT-3.5. This suggests that model complexity does not necessarily enhance the capability to separate instructions from data. Instead, it might exacerbate the tendency to misinterpret or execute portions of input unintended as commands.

Implications

The findings underscore a crucial limitation with profound implications for the security and efficacy of LLMs. As these models continue to be integrated into systems handling sensitive information or crucial tasks, ensuring that they can reliably distinguish between instructions and data is vital. This research contributes a formal framework for this evaluation, proffering a starting point for developing models that can achieve better separation.

Conclusion and Future Directions

The paper concludes with a call for future work in designing architectures or training strategies that imbue LLMs with a principled understanding of instruction versus data. By offering a foundational metric and dataset, this research invites further exploration into methods that could address these safety gaps, whether through architectural innovations, enhanced training paradigms, or reinforced security mechanisms such as explainability.

In sum, the authors present a critical investigation into an overlooked aspect of LLM architecture, advocating for defining and addressing instruction-data separation as a pivotal criterion for future LLM safety evaluation and training protocols.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Paper Prompts

Sign up for free to create and run prompts on this paper using GPT-5.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.