Detecting Concealed Indirect Prompt Injections via CoT or Activation Monitoring
Investigate whether monitoring complete chain-of-thought (CoT) traces, or probing a model's internal activations with activation probes, can detect concealed indirect prompt injection compromises that are not evident from user-facing outputs in the Indirect Prompt Injection Arena scenarios.
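To make the activation-probe idea concrete, the following is a minimal sketch of a linear probe: a logistic-regression classifier trained on hidden-state vectors to separate "compromised" from "clean" runs. Everything here is an illustrative assumption rather than the method of the cited work: the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_scores`) are hypothetical, and the activations are synthetic Gaussian vectors shifted along a fixed direction, standing in for real model hidden states, which this sketch does not extract.

```python
import numpy as np


def make_synthetic_activations(n, direction, rng, shift=4.0):
    """Hypothetical stand-in for real hidden states (an assumption of this sketch).

    Label 1 ("compromised") examples are Gaussian noise shifted along
    `direction`; label 0 ("clean") examples are unshifted noise.
    """
    d = direction.shape[0]
    labels = rng.integers(0, 2, size=n).astype(float)
    acts = rng.normal(size=(n, d)) + shift * labels[:, None] * direction
    return acts, labels


def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent.

    Returns (w, b, mu), where mu is the training mean used to center inputs.
    """
    mu = acts.mean(axis=0)
    x = acts - mu
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        err = probs - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * (x.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b, mu


def probe_scores(acts, probe):
    """Return the probe's estimated probability that each example is compromised."""
    w, b, mu = probe
    return 1.0 / (1.0 + np.exp(-((acts - mu) @ w + b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    direction = np.zeros(16)
    direction[0] = 1.0  # assumed "compromise" direction in activation space
    train_x, train_y = make_synthetic_activations(400, direction, rng)
    test_x, test_y = make_synthetic_activations(200, direction, rng)
    probe = train_linear_probe(train_x, train_y)
    accuracy = ((probe_scores(test_x, probe) > 0.5) == (test_y > 0.5)).mean()
    print(f"held-out probe accuracy: {accuracy:.3f}")
```

In a real study, the synthetic vectors would be replaced by activations recorded from the agent at a chosen layer, with labels taken from known compromised and clean episodes. Whether such a probe generalizes to concealed attacks is exactly the open question posed here.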
References
Since our evaluation focused on user-facing outputs, an open question is whether monitoring full CoT traces or internal representations via activation probes~\citep{kramar2026building} could detect concealed attacks that evade output-level scrutiny.
— How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition
(2603.15714 - Dziemian et al., 16 Mar 2026) in Discussion, Limitations, and Future Work — Call for system- or architecture-level defense in deployment