Root cause of indirect prompt injection attacks

Prove that the root cause of indirect prompt injection attacks against large language models is twofold: (i) the inability of large language models to distinguish between external content and user instructions, and (ii) the absence of awareness in large language models to refrain from executing instructions embedded within external content.

Background

The paper studies indirect prompt injection attacks in applications where LLMs ingest external content (e.g., web pages, emails, tables, code snippets). The authors construct a benchmark (BIPIA) and empirically show that a range of LLMs are vulnerable to such attacks, with higher-capability models often exhibiting higher attack success rates on text tasks.

To explain why these attacks succeed, the authors articulate a conjecture asserting that two specific factors underlie the vulnerability: LLMs struggle to distinguish external content from user instructions, and LLMs lack awareness to avoid executing instructions embedded in the external content. They develop black-box and white-box defenses motivated by this conjecture, but the conjecture itself remains to be formally validated.

References

To explain the success of indirect prompt injection attacks, we propose the following conjecture: The root cause of indirect prompt injection attacks is twofold: firstly, the LLMs' inability to distinguish between external content and user instructions; and secondly, the absence of LLMs' awareness to not execute instructions embedded within external content.

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models  (2312.14197 - Yi et al., 2023) in Methods, Defenses Against Indirect Prompt Injection, Conjecture 1