Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents

Published 7 Dec 2025 in cs.AI, cs.CL, and cs.CR | (2512.06716v1)

Abstract: Autonomous LLM agents exhibit significant vulnerability to Indirect Prompt Injection (IPI) attacks. These attacks hijack agent behavior by polluting external information sources, exploiting fundamental trade-offs between security and functionality in existing defense mechanisms. This leads to malicious and unauthorized tool invocations, diverting agents from their original objectives. The success of complex IPIs reveals a deeper systemic fragility: while current defenses demonstrate some effectiveness, most defense architectures are inherently fragmented. Consequently, they fail to provide full integrity assurance across the entire task execution pipeline, forcing unacceptable multi-dimensional compromises among security, functionality, and efficiency. Our method is predicated on a core insight: no matter how subtle an IPI attack, its pursuit of a malicious objective will ultimately manifest as a detectable deviation in the action trajectory, distinct from the expected legitimate plan. Based on this, we propose the Cognitive Control Architecture (CCA), a holistic framework achieving full-lifecycle cognitive supervision. CCA constructs an efficient, dual-layered defense system through two synergistic pillars: (i) proactive and preemptive control-flow and data-flow integrity enforcement via a pre-generated "Intent Graph"; and (ii) an innovative "Tiered Adjudicator" that, upon deviation detection, initiates deep reasoning based on multi-dimensional scoring, specifically designed to counter complex conditional attacks. Experiments on the AgentDojo benchmark substantiate that CCA not only effectively withstands sophisticated attacks that challenge other advanced defense methods but also achieves uncompromised security with notable efficiency and robustness, thereby reconciling the aforementioned multi-dimensional trade-off.


Authors (4)

Summary

The paper introduces a Cognitive Control Architecture that employs a dual-layer approach combining an Intent Graph and a Tiered Adjudicator to defend against indirect prompt injection attacks.
The method combines proactive planning with reactive deep reasoning and reduces the Attack Success Rate (ASR) on the AgentDojo benchmark by over 97% relative to an undefended baseline.
The results imply that robust lifecycle supervision can harmonize security, functionality, and efficiency for deploying autonomous AI agents in high-stakes environments.

Cognitive Control Architecture for Robustly Aligned AI Agents

Introduction

LLM agents have shown considerable potential for autonomously completing sophisticated tasks across diverse domains, but this autonomy brings significant security challenges. Among the foremost threats is Indirect Prompt Injection (IPI), in which external content laced with malicious instructions hijacks the agent's behavior, leading to unauthorized tool use and deviation from the user's objective. The paper "Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents" (2512.06716) introduces an architectural framework designed to keep agents aligned under such attacks, targeting the systemic fragility of existing, fragmented defenses.

Figure 1: An illustrative example of a multi-step Indirect Prompt Injection (IPI) attack.

Architectural Design

CCA achieves full-lifecycle cognitive supervision through two synergistic pillars, the Intent Graph and the Tiered Adjudicator, which together aim to enforce both control-flow and data-flow integrity throughout the agent's operation.

The Intent Graph is the proactive layer. Generated before execution from the user's goal, it defines the legitimate sequence of tool calls. Every action the agent proposes is checked against this template to confirm that it follows the planned sequence and obeys the data-origin rules (Figure 2).
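To make the control-flow and data-flow checks concrete, here is a minimal Python sketch. It assumes the Intent Graph can be reduced to a set of allowed tool-to-tool transitions plus, for each tool, a set of permitted data origins. The class name `IntentGraph`, the `START` node, and the tool and source names in the usage lines are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class IntentGraph:
    """Hypothetical Intent Graph: allowed tool-call transitions plus data-origin rules.

    An edge (a, b) means tool b may legitimately follow tool a.
    `allowed_sources` maps each tool to the data origins its arguments may come from.
    """
    edges: set[tuple[str, str]]
    allowed_sources: dict[str, set[str]]
    start: str = "START"

    def check(self, trajectory: list[tuple[str, str]]) -> list[str]:
        """Return violations for a trajectory of (tool, argument_source) pairs."""
        violations = []
        prev = self.start
        for tool, source in trajectory:
            # Control-flow integrity: the transition must exist in the pre-generated plan.
            if (prev, tool) not in self.edges:
                violations.append(f"control-flow: {prev} -> {tool} not in plan")
            # Data-flow integrity: arguments must come from a permitted origin.
            if source not in self.allowed_sources.get(tool, set()):
                violations.append(f"data-flow: {tool} fed by untrusted source '{source}'")
            prev = tool
        return violations


# Hypothetical plan: read the inbox, then summarize what was read.
graph = IntentGraph(
    edges={("START", "read_inbox"), ("read_inbox", "summarize")},
    allowed_sources={"read_inbox": {"user"}, "summarize": {"user", "read_inbox"}},
)
print(graph.check([("read_inbox", "user"), ("summarize", "read_inbox")]))  # → []
# An injected transfer triggered by email content violates both integrity checks.
print(graph.check([("read_inbox", "user"), ("send_money", "email_body")]))
```

Representing the plan as an edge set keeps each check a constant-time lookup, which suits a lightweight first layer; plans with branches, loops, or argument-level constraints would need a more expressive structure.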

Figure 2: The Cognitive Control Architecture (CCA) operates in two layers, providing structured supervision through Intent Graphs and deep reasoning via Tiered Adjudicators.

The Tiered Adjudicator is the reactive layer and is invoked when an action deviates from the Intent Graph. It then performs deeper reasoning using a multi-dimensional Intent Alignment Score that combines four assessments: semantic alignment, causal contribution, source provenance, and inherent action risk. This mechanism is specifically designed to counter complex conditional attacks by judging how closely a deviating action still serves the user's goal.
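The adjudication step can be sketched as follows, assuming each of the four components is normalized to [0, 1] and combined by a weighted sum in which action risk counts against the action. The weights, thresholds, tier layout, and the `ESCALATE` outcome are hypothetical choices for illustration; the summary does not state how CCA actually aggregates or thresholds the score.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"  # hypothetical: hand off to a slower, deeper reasoning step


@dataclass
class ActionEvidence:
    """Per-action component scores, each assumed to lie in [0, 1]."""
    semantic_alignment: float   # how well the action matches the user's stated goal
    causal_contribution: float  # how much the action advances the plan
    source_provenance: float    # trust in the origin of the data that triggered it
    action_risk: float          # inherent risk of the tool call (higher = riskier)


# Hypothetical weights (they sum to 1); not taken from the paper.
WEIGHTS = {"semantic": 0.3, "causal": 0.3, "provenance": 0.2, "risk": 0.2}


def intent_alignment_score(e: ActionEvidence) -> float:
    """Weighted sum of the four components; risk counts against the action."""
    return (
        WEIGHTS["semantic"] * e.semantic_alignment
        + WEIGHTS["causal"] * e.causal_contribution
        + WEIGHTS["provenance"] * e.source_provenance
        + WEIGHTS["risk"] * (1.0 - e.action_risk)
    )


def adjudicate(deviates_from_graph: bool, e: ActionEvidence,
               allow_at: float = 0.75, block_below: float = 0.4) -> Verdict:
    # Tier 0: actions that conform to the Intent Graph pass on the fast path.
    if not deviates_from_graph:
        return Verdict.ALLOW
    score = intent_alignment_score(e)
    # Tier 1: clear-cut scores are decided immediately.
    if score >= allow_at:
        return Verdict.ALLOW
    if score < block_below:
        return Verdict.BLOCK
    # Tier 2: ambiguous deviations receive deeper (slower) scrutiny.
    return Verdict.ESCALATE


# A deviating, high-risk action driven by untrusted data scores about 0.07.
print(adjudicate(True, ActionEvidence(0.1, 0.0, 0.1, 0.9)))  # → Verdict.BLOCK
```

Only deviating actions pay for scoring, which mirrors the division of labor described above: the graph check handles the common case cheaply, and the score separates clear rejections from cases that merit further reasoning.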

Evaluation and Results

CCA was evaluated on the AgentDojo benchmark, a testbed of 97 multi-step tasks that emulate real-world operational complexity. In these experiments, CCA withstood sophisticated IPI attacks that challenge other advanced defenses, reducing the Attack Success Rate (ASR) by over 97% compared to an undefended baseline. It also maintained high Benign Utility (BU) and Utility Under Attack (UA), indicating that the added security does not come at the cost of functionality.
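The three reported metrics can be made precise with a short Python sketch. It assumes AgentDojo-style definitions (BU measured on runs without injections; UA and ASR measured on runs with injections) and reads "reducing ASR by over 97%" as a relative reduction against the undefended baseline. The `Episode` record and its field names are hypothetical, and the rates in the usage line are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluation run (field names are illustrative, not AgentDojo's API)."""
    under_attack: bool       # was an injection present in this run?
    user_task_solved: bool   # did the agent complete the user's task?
    attacker_goal_met: bool  # did the injected objective succeed?


def rate(flags: list[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def metrics(episodes: list[Episode]) -> dict[str, float]:
    benign = [e for e in episodes if not e.under_attack]
    attacked = [e for e in episodes if e.under_attack]
    return {
        "BU": rate([e.user_task_solved for e in benign]),      # Benign Utility
        "UA": rate([e.user_task_solved for e in attacked]),    # Utility Under Attack
        "ASR": rate([e.attacker_goal_met for e in attacked]),  # Attack Success Rate
    }


def relative_reduction(asr_baseline: float, asr_defended: float) -> float:
    """Fraction of the baseline ASR eliminated by the defense."""
    if asr_baseline <= 0:
        raise ValueError("baseline ASR must be positive")
    return (asr_baseline - asr_defended) / asr_baseline


# Hypothetical rates: a baseline ASR of 50% cut to 1% is a 98% relative reduction.
print(round(relative_reduction(0.50, 0.01), 2))  # → 0.98
```

Note that a relative reduction says nothing about the absolute residual ASR, which is why it is worth reporting alongside BU and UA.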

Figure 3: Trade-off analysis between DSR (defense success rate) and TSR (task success rate) in CCA.

Implications and Future Developments

The implications of CCA are multifaceted. Practically, it offers a blueprint for deploying AI agents in high-stakes environments, where security and reliable task completion are paramount. Theoretically, it challenges the prevailing view that security, functionality, and efficiency must be traded against one another, showing that a well-architected supervision system can reconcile these dimensions. Future work could explore dynamic graph refinement and context-aware risk modeling to improve adaptability in scenarios where plans cannot be fully anticipated upfront.

Conclusion

The Cognitive Control Architecture offers a robust approach to aligning autonomous AI agents, addressing IPI vulnerabilities while minimizing functional compromises. Its layered design, which pairs proactive integrity enforcement with reactive alignment adjudication, provides a strong reference point for agent security frameworks. As AI agents grow more capable, architectures of this kind may play an important role in deploying them safely and efficiently across increasingly complex applications.
