Prompt injection risk in always-trusted-edit workflows
Investigate the vulnerability of AI control protocols that always apply Trusted Editing (TE), where every action from the untrusted model is edited by a trusted LLM, to prompt injection attacks that target the trusted editor model, and determine whether such injections can subvert the editing mechanism and bypass safety guarantees.
Sponsor
References
However, always-trusted-edit workflows may be vulnerable to prompt injections against the trusted editor. We leave investigation of these attack vectors to future work.
— Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
(2510.09462 - Terekhov et al., 10 Oct 2025) in Discussion, Limitations