CKA-Agent: Exploiting Correlated Knowledge
- Correlated Knowledge Attack Agent is a dynamic framework that exploits the correlated-knowledge loophole by decomposing harmful intents into benign sub-queries.
- It employs adaptive tree search and knowledge aggregation to synthesize forbidden outputs from individually innocuous fragments.
- Experimental evaluations show 95–98% success rates against state-of-the-art LLMs, exposing critical weaknesses in current safety protocols.
The Correlated Knowledge Attack Agent (CKA-Agent) is a dynamic attack framework that systematically exploits a structural vulnerability in commercially deployed LLMs: the correlated-knowledge loophole. By strategically decomposing a harmful intent into harmless sub-queries, CKA-Agent demonstrates that even strong guardrails fail to detect a malicious objective distributed across a sequence of individually benign queries. Through adaptive tree search and knowledge aggregation, CKA-Agent reconstructs forbidden outputs by weaving together seemingly innocuous fragments, achieving high success rates against state-of-the-art LLMs with established safety protections (Wei et al., 1 Dec 2025).
1. Correlated-Knowledge Vulnerability
LLMs internally encode dense knowledge graphs in which high-level facts are implicitly decomposable into “micro-facts” connected by dense inferential relations. Attempts to elicit disallowed outputs via “prompt-optimization” typically fail because direct queries retain detectable malicious semantics. However, individual sub-facts relevant to a harmful request—such as chemical names, procedures, or component properties—often appear benign in isolation and thus bypass prompt-level guardrails.
The correlated-knowledge loophole arises because the LLM can be queried for these subcomponents, and an attacker can algorithmically aggregate the responses into the desired (but forbidden) global output. This “knowledge weaving” reframes model access as an exploration of the model’s internal knowledge DAG, where distributed intent is never locally visible at any single turn. CKA-Agent is explicitly designed to exploit this loophole.
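This “knowledge weaving” can be sketched schematically: a goal is represented as a tiny DAG of benign micro-queries, and the aggregation that produces the global output happens only on the attacker's side, never inside any single model turn. The graph, node names, and answer strings below are invented for illustration and do not come from the paper:

```python
# Toy knowledge DAG: each node is a benign micro-query; edges give the
# aggregation order. The global output exists only after weaving.
dag = {
    "goal": ["part_1", "part_2"],
    "part_1": [],
    "part_2": ["part_2a"],
    "part_2a": [],
}

def weave(node, answers):
    """Post-order aggregation: combine sub-answers first, then append
    this node's own answer fragment."""
    pieces = [weave(child, answers) for child in dag[node]]
    return " ".join(pieces + [answers[node]])

# Stand-ins for the target model's per-query responses.
answers = {"goal": "G", "part_1": "A", "part_2": "B", "part_2a": "C"}
```

Here `weave("goal", answers)` assembles the fragments bottom-up, mirroring how no individual query in the DAG carries the distributed intent.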
2. Architecture and Workflow of CKA-Agent
CKA-Agent operationalizes the attack as adaptive tree search over the implicit knowledge structure of the target model. The framework comprises the following modules:
- Attack Agent (Decomposer & Synthesizer): Powered by an open-source LLM, the agent dynamically generates 1–B locally safe sub-queries at each node, conditional on historical context. Upon collecting sufficient fragments, it synthesizes a global output.
- Target Model (Environment): The closed-source LLM (e.g., Gemini-2.5-Flash/Pro, GPT-oss-120B, Claude-Haiku-4.5) serves as a knowledge oracle, responding to sub-queries with factual outputs.
- Evaluator (Node Critic): A separate LLM-based judge that computes a hybrid reward for each new node $v$ as $R(v) = \lambda\,S_{\text{coh}}(v) + (1-\lambda)\,S_{\text{gain}}(v)$, where $S_{\text{coh}}$ gauges logical coherence and relevance, $S_{\text{gain}}$ scores information gain or assigns a penalty for refusals, and the hyperparameter $\lambda$ balances the two.
- Online Judge: A classifier that verifies whether the synthesized output fulfills the predefined harmful objective $g$.
The state of the attack is maintained as a search tree $\mathcal{T}$. Each node $v$ encodes a benign query $q_v$, its response $r_v$, a visit count $N(v)$, and a value estimate $V(v)$. Edges connect nodes via candidate expansion actions.
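The node state and hybrid reward above can be expressed compactly in code. This is a minimal sketch under the stated definitions; the field names and the `hybrid_reward` helper are assumptions for illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of the search tree: benign query q_v, response r_v,
    visit count N(v), and value estimate V(v)."""
    query: str
    response: str
    visits: int = 0
    value: float = 0.0
    children: list = field(default_factory=list)
    terminal: bool = False

def hybrid_reward(s_coh: float, s_gain: float, lam: float = 0.5) -> float:
    """Hybrid node reward: lam trades off coherence/relevance (S_coh)
    against information gain or refusal penalty (S_gain)."""
    return lam * s_coh + (1.0 - lam) * s_gain
```

In the full framework, `s_coh` and `s_gain` would be produced by the LLM-based evaluator rather than passed in directly.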
3. Adaptive Tree Search Algorithm
The search proceeds in three primary stages, iterating until the attack succeeds or a computational budget is exhausted:
- Global Selection (UCT): Using the Upper Confidence Bound for Trees (UCT), CKA-Agent selects the leaf $v$ that maximizes $\mathrm{UCT}(v) = V(v) + c\sqrt{\ln N(\mathrm{parent}(v)) / N(v)}$, with $c$ as the exploration-exploitation tradeoff parameter.
- Depth-First Expansion:
- If a terminal state (sufficient information or maximum depth) is reached, mark the node as terminal.
- Otherwise, expand by generating up to $B$ new benign sub-queries $q_1, \dots, q_B$, obtain the corresponding responses $r_1, \dots, r_B$, and score each with the hybrid evaluator.
- Select the highest-scoring child by reward $R$ for further expansion (“greedy DFS descent”).
- Synthesis and Backpropagation:
- The synthesizer aggregates all fragments along the search trajectory into a candidate global output.
- If the online judge recognizes the harmful goal as achieved, the attack terminates successfully.
- Otherwise, assign a negative penalty $R_{\text{fail}}$ to the terminal node and update every node $v$ on the failed path via the running-mean rule $V(v) \leftarrow \big(V(v)\,N(v) + R_{\text{fail}}\big)/\big(N(v)+1\big)$, $N(v) \leftarrow N(v) + 1$.
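The selection and backpropagation stages above can be sketched as plain Python. This is a minimal, self-contained illustration of the search mechanics only: the `Node` class, exploration constant, and reward values are assumptions, and the LLM-backed expansion and synthesis steps are omitted:

```python
import math

class Node:
    def __init__(self, query, response, parent=None):
        self.query, self.response = query, response
        self.parent, self.children = parent, []
        self.visits, self.value = 0, 0.0

def uct_select(root, c=1.4):
    """Descend from the root, at each level taking the child that
    maximizes V(v) + c * sqrt(ln N(parent) / N(v)); return the leaf."""
    node = root
    while node.children:
        node = max(
            node.children,
            key=lambda ch: ch.value
            + c * math.sqrt(math.log(max(node.visits, 1)) / max(ch.visits, 1)),
        )
    return node

def backpropagate(leaf, reward):
    """Fold a terminal reward into the running-mean value estimate of
    every node on the path back to the root."""
    node = leaf
    while node is not None:
        node.value = (node.value * node.visits + reward) / (node.visits + 1)
        node.visits += 1
        node = node.parent
```

A full iteration would call `uct_select`, expand the leaf with fresh sub-queries, synthesize, and on failure call `backpropagate(leaf, R_fail)` with a negative reward.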
A high-level summary of the CKA-Agent algorithm is as follows:
| Step | Action |
|---|---|
| Global Selection (UCT) | Select promising non-terminal leaf node |
| DFS Expansion | Generate benign sub-queries, score, descend greedily |
| Synthesis & Backpropagation | Aggregate fragments, judge success, update scores if failed |
4. Experimental Evaluation
CKA-Agent was tested on Gemini-2.5-Flash/Pro, GPT-oss-120B, and Claude-Haiku-4.5, using the HarmBench (126 prompts) and StrongREJECT (162 prompts) benchmarks. The attack budget was fixed in terms of outer iterations and maximum search depth, with a dynamic branch factor $B$ and UCT exploration parameter $c$.
Performance was compared with prompt-optimization (Vanilla, AutoDAN, PAIR, PAP, ActorBreaker, X-Teaming) and static multi-agent decomposition (Multi-Agent Jailbreak) baselines, under equalized query budgets. The main success metric was Full Success (FS), as adjudicated by Gemini-2.5-Flash and validated on other target models.
Results demonstrated that CKA-Agent obtains FS rates of $95$–$98$% on all models, outperforming Multi-Agent Jailbreak by $15$–$21$ percentage points and classic prompt-optimization baselines by a wide margin on Claude-Haiku-4.5 (PAIR baseline). Adaptive branching substantially increased reliability, with a large share of successful attacks completing by the second outer iteration.
5. Robustness and Existing Defense Mechanisms
Existing LLM guardrails—prompt-level filters (e.g., Llama Guard-3), mutation- and lexical-rephrasing defenses, and representation-level (Circuit Breaker) detection—are ineffective against the correlated-knowledge attack paradigm. Prompt-level guards do not flag individually mild sub-queries; mutation defenses only marginally decrease success; Circuit Breaker reduces FS by $30$–$50$ percentage points against prompt-based attacks but still permits high FS for CKA-Agent.
In the intent aggregation evaluation (CKA-Agent-Branch), wherein the target LLM received all sub-queries along with the full session history, attack success rates remained high regardless of model memory capabilities. This indicates a structural deficiency: target models lack mechanisms for multi-turn intent aggregation, restricting content moderation to local prompt-level evaluations.
6. Implications and Proposed Mitigations
CKA-Agent underscores a fundamental LLM vulnerability: the inability of commercial guardrails to aggregate session-spanning intent and detect globally harmful objectives distributed among benign interactions. The framework’s adaptive tree search exposes the limits of local, prompt-centric moderation when confronted with knowledge-decomposition attacks.
To mitigate such threats, multiple defense lines are proposed:
- Design context-aware guardrails that accumulate semantic evidence across multi-turn sessions.
- Train multi-turn safety classifiers capable of detecting distributed or latent harmful intent.
- Combine machine judgment with human-in-the-loop oversight for detecting and intercepting long-horizon attacks.
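The first of these defense lines can be illustrated with a toy session-level guard that accumulates per-turn risk evidence instead of judging each prompt in isolation. The class name, threshold, and scoring interface below are illustrative assumptions, not an existing API:

```python
class SessionGuard:
    """Toy context-aware guardrail: aggregates risk evidence across a
    multi-turn session rather than screening each prompt locally."""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold
        self.evidence = 0.0

    def observe(self, turn_risk: float) -> bool:
        """Add one turn's risk score (e.g., a safety-classifier output
        in [0, 1]); return True once accumulated session evidence
        crosses the threshold."""
        self.evidence += turn_risk
        return self.evidence >= self.threshold
```

The point of the sketch: each sub-query can score well below any per-prompt cutoff, yet the session total still trips the guard—exactly the aggregation that current prompt-level moderation lacks.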
A plausible implication is that effective countermeasures must move beyond static, prompt-level screening toward trajectory-level semantic aggregation and intent inference, with new detection tools that reason over entire conversational paths (Wei et al., 1 Dec 2025).