Whether to use non-instructional content from a compromised source after detecting prompt injection
Determine whether a large language model agent should continue to use any non-instructional content from the same external document or data source after the agent has detected malicious instructions embedded in that source during retrieval or tool use.
References
From a system-security perspective, however, an important question remains: even if a model can reliably ignore malicious instructions, should it still rely on other content from the same source once prompt injection is detected? For example, if the model correctly identifies malicious instructions on an external document, should it continue using other content from that potentially compromised document? Probably not in many settings. We view this as an important security question that is not yet fully resolved by current model-robustness work.