Prompt injection risk in always-trusted-edit workflows

Investigate the vulnerability of AI control protocols that always apply Trusted Editing (TE), where every action from the untrusted model is edited by a trusted LLM, to prompt injection attacks that target the trusted editor model, and determine whether such injections can subvert the editing mechanism and bypass safety guarantees.

Background

The paper demonstrates that current AI control protocols relying on trusted monitors are subverted by adaptive prompt injection attacks. While the authors find that Always-Defer policies (including always deferring to trusted models for editing) avoid dependence on a vulnerable monitor, they caution that workflows that always apply Trusted Editing (TE) may themselves be susceptible to injections against the trusted editor.

Because the trusted editor is an LLM subject to similar instruction-data mixing vulnerabilities as monitors, establishing whether and how prompt injections can corrupt the editor’s behavior is essential for understanding the safety of always-edit protocols and for designing robust defenses.

References

However, always-trusted-edit workflows may be vulnerable to prompt injections against the trusted editor. We leave investigation of these attack vectors to future work.

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols (2510.09462 - Terekhov et al., 10 Oct 2025) in Discussion, Limitations