Detecting Concealed Indirect Prompt Injections via CoT or Activation Monitoring

Investigate whether monitoring complete chain-of-thought (CoT) traces, or applying activation probes to a model's internal representations, can detect concealed indirect prompt injection compromises that are not evident from user-facing outputs in the Indirect Prompt Injection Arena scenarios.

Background

The study evaluates indirect prompt injection attacks that both execute a harmful action and conceal the compromise in the model’s final, user-facing response, across tool-use, coding, and computer-use agents. Because users typically see only the final response, attacks that leave no clues there are particularly dangerous.

The authors note that their evaluation focused solely on user-facing outputs, which leaves open whether deeper signals, such as chain-of-thought traces or internal activations, might reveal concealed compromises. They pose this explicitly as an open question: can CoT monitoring or activation probing detect attacks that evade output-level scrutiny?
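To make the idea of an activation probe concrete, the following is a minimal sketch of a linear probe: a logistic-regression classifier trained on hidden-state vectors labeled as clean or compromised. It is not the authors' method or the probe design from the cited work. The activations here are synthetic Gaussian vectors, the assumption that compromised runs shift activations along a single direction is purely illustrative, and all function names (`make_synthetic_activations`, `train_linear_probe`, `probe_scores`, `accuracy`) are hypothetical.

```python
import numpy as np


def make_synthetic_activations(n_per_class=200, dim=32, shift=4.0, seed=0):
    """Build stand-in 'activations' (ASSUMPTION: synthetic data, not real model states).

    Clean samples are standard Gaussian vectors; compromised samples are the
    same distribution shifted along one random unit direction.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    clean = rng.normal(size=(n_per_class, dim))
    compromised = rng.normal(size=(n_per_class, dim)) + shift * direction
    X = np.vstack([clean, compromised])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit logistic regression with plain gradient descent and L2 weight decay."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y) + l2 * w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def probe_scores(X, w, b):
    """Return the probe's estimated probability that each activation is compromised."""
    return sigmoid(X @ w + b)


def accuracy(scores, y, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    return float(np.mean((scores >= threshold) == (y == 1)))


if __name__ == "__main__":
    X, y = make_synthetic_activations()
    # Shuffle, then hold out a quarter of the samples for evaluation.
    idx = np.random.default_rng(1).permutation(len(y))
    train, test = idx[:300], idx[300:]
    w, b = train_linear_probe(X[train], y[train])
    print("held-out accuracy:", accuracy(probe_scores(X[test], w, b), y[test]))
```

In a real study, `X` would hold activations extracted from a chosen layer of the agent model during clean and attacked episodes, and whether concealed attacks produce a linearly separable signal at all is exactly the open question.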

References

Since our evaluation focused on user-facing outputs, an open question is whether monitoring full CoT traces or internal representations via activation probes~\citep{kramar2026building} could detect concealed attacks that evade output-level scrutiny.

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition (2603.15714 - Dziemian et al., 16 Mar 2026) in Discussion, Limitations, and Future Work: Call for system- or architecture-level defense in deployment