Whether to use non-instructional content from a compromised source after detecting prompt injection

Determine whether a large language model agent should continue to use any non-instructional content from the same external document or data source after the agent has detected malicious instructions embedded in that source during retrieval or tool use.

Background

The paper argues that even if model-level defenses enable a model to ignore malicious instructions, a crucial system-security question persists about handling other content from the same potentially compromised source. If an external document contains injected instructions, the remainder of its content may also be untrustworthy, raising uncertainty about whether it should be used at all.

This uncertainty motivates the paper's proposed system-level approach of decoupling instruction recognition from instruction-following decisions and constraining model inputs to structured artifacts, but the authors explicitly note that the underlying question remains unresolved by current model-robustness work.

References

From a system-security perspective, however, an important question remains: even if a model can reliably ignore malicious instructions, should it still rely on other content from the same source once prompt injection is detected? For example, if the model correctly identifies malicious instructions on an external document, should it continue using other content from that potentially compromised document? Probably not in many settings. We view this as an important security question that is not yet fully resolved by current model-robustness work.

— Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks (2603.30016 - Xiang et al., 31 Mar 2026) in Section 3.2 (Position 2), Notes on existing pure model-level defenses takeaways box

Whether to use non-instructional content from a compromised source after detecting prompt injection

Background

References

Related Problems