
CKA-Agent: Exploiting Correlated Knowledge

Updated 8 December 2025
  • Correlated Knowledge Attack Agent is a dynamic framework that exploits the correlated-knowledge loophole by decomposing harmful intents into benign sub-queries.
  • It employs adaptive tree search and knowledge aggregation to synthesize forbidden outputs from individually innocuous fragments.
  • Experimental evaluations show 95–98% success rates against state-of-the-art LLMs, exposing critical weaknesses in current safety protocols.

The Correlated Knowledge Attack Agent (CKA-Agent) is a dynamic attack framework that systematically exploits a structural vulnerability in commercially deployed LLMs: the correlated-knowledge loophole. By strategically decomposing a harmful intent into harmless sub-queries, CKA-Agent demonstrates that even strong guardrails fail to detect a malicious objective distributed across a sequence of individually benign queries. Through adaptive tree search and knowledge aggregation, it reconstructs forbidden outputs by weaving together seemingly innocuous fragments, achieving high success rates against state-of-the-art LLMs with established safety protections (Wei et al., 1 Dec 2025).

1. Correlated-Knowledge Vulnerability

LLMs internally encode dense knowledge graphs in which high-level facts are implicitly decomposable into “micro-facts” connected by dense inferential relations. Attempts to elicit disallowed outputs via “prompt-optimization” typically fail because direct queries retain detectable malicious semantics. However, individual sub-facts relevant to a harmful request—such as chemical names, procedures, or component properties—often appear benign in isolation and thus bypass prompt-level guardrails.

The correlated-knowledge loophole arises because the LLM can be queried for these subcomponents, and an attacker can algorithmically aggregate the responses into the desired (but forbidden) global output. This “knowledge weaving” reframes model access as an exploration of the model’s internal knowledge DAG, where distributed intent is never locally visible at any single turn. CKA-Agent is explicitly designed to exploit this loophole.

2. Architecture and Workflow of CKA-Agent

CKA-Agent operationalizes the attack as adaptive tree search over the implicit knowledge structure of the target model. The framework comprises the following modules:

  • Attack Agent (Decomposer & Synthesizer): Powered by an open-source LLM, the agent dynamically generates between 1 and B locally safe sub-queries at each node, conditioned on the historical context. Once sufficient fragments have been collected, it synthesizes the global output.
  • Target Model (Environment): The closed-source LLM (e.g., Gemini-2.5-Flash/Pro, GPT-oss-120B, Claude-Haiku-4.5) serves as a knowledge oracle, responding to sub-queries with factual outputs.
  • Evaluator (Node Critic): A separate LLM-based judge that computes a hybrid reward for each new node using

f_v = \alpha f_{\text{introspect}} + (1 - \alpha)\, f_{\text{feedback}}

where f_{\text{introspect}} gauges logical coherence and relevance, f_{\text{feedback}} scores information gain or assigns a penalty for refusals, and the hyperparameter \alpha balances the two.

  • Online Judge: A classifier J(\hat y, h) that verifies whether the synthesized output \hat y fulfills the predefined harmful objective h.

The state of the attack is maintained as a search tree \mathcal{T}. Each node v encodes a benign query x_v, its response y_v, a visit count N_v, and a value estimate f_v. Edges connect nodes via candidate expansion actions.
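The node state and hybrid reward described above can be sketched as a small data structure. This is an illustrative reconstruction, not the paper's code; the field names and the `hybrid_reward` helper are assumptions that mirror the f_v formula from the evaluator description:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node v of the search tree (field names are illustrative)."""
    query: str                 # benign sub-query x_v
    response: str = ""         # target model's answer y_v
    visits: int = 0            # visit count N_v
    value: float = 0.0         # hybrid value estimate f_v
    terminal: bool = False
    children: list = field(default_factory=list)

def hybrid_reward(f_introspect: float, f_feedback: float, alpha: float = 0.5) -> float:
    """Hybrid node reward: f_v = alpha * f_introspect + (1 - alpha) * f_feedback."""
    return alpha * f_introspect + (1 - alpha) * f_feedback
```

The penalty-for-refusal behavior enters through `f_feedback`, which the evaluator would set negative when the target model refuses a sub-query.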

3. Adaptive Tree Search Algorithm

The search proceeds in three primary stages, iterating until the attack succeeds or a computational budget is exhausted:

  1. Global Selection (UCT): Using the Upper Confidence Bound for Trees (UCT), CKA-Agent selects the leaf v_L that maximizes

v_L = \arg\max_{v \in \mathcal{V}_{\text{leaf}} \setminus \mathcal{V}_{\text{terminal}}} \Bigl[ f_v + c \sqrt{\frac{\ln N_{\text{parent}(v)}}{N_v}} \Bigr]

with c (typically \sqrt{2}) as the exploration–exploitation tradeoff parameter.
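The selection rule above is the standard UCT formula, which can be written out directly. The sketch below is generic (not the paper's implementation) and uses the stated default c = √2; unvisited leaves get infinite priority so each is tried at least once:

```python
import math

def uct_score(value: float, visits: int, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """UCT score f_v + c * sqrt(ln N_parent / N_v) for one leaf."""
    if visits == 0:
        return float("inf")   # force at least one visit per candidate
    return value + c * math.sqrt(math.log(parent_visits) / visits)

def select_leaf(leaves, c: float = math.sqrt(2)):
    """Pick the non-terminal leaf maximizing the UCT score.

    `leaves` is an iterable of (value f_v, visits N_v, parent visits) triples."""
    return max(leaves, key=lambda n: uct_score(n[0], n[1], n[2], c))
```

Note how the rule trades off exploitation (a high f_v) against exploration (a low visit count N_v): a rarely visited leaf with a modest value can outrank a well-explored one.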

  2. Depth-First Expansion:
    • If a terminal state is reached (sufficient information collected, or maximum depth D_{\max} = 5), mark the node as terminal.
    • Otherwise, expand by generating B new benign sub-queries x^{(1)}, \dots, x^{(B)}, obtain the corresponding responses y^{(j)}, and score each with the hybrid evaluator.
    • Select the highest-scoring child by f_v for further expansion ("greedy DFS descent").
  3. Synthesis and Backpropagation:
    • The synthesizer aggregates all fragments along the search trajectory into \hat y.
    • If the online judge J(\hat y, h) recognizes the harmful goal as achieved, the attack terminates successfully.
    • Otherwise, assign a negative penalty f_{\text{pen}} < 0 to the terminal node and update all nodes on the failed path via:

    N_v \leftarrow N_v + 1, \quad f_v \leftarrow \frac{(N_v - 1)\, f_v + f_{\text{pen}}}{N_v}
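The backpropagation update is an incremental running mean: the penalty f_pen is averaged into each node's existing value estimate. A minimal sketch, with an illustrative Node class that is not from the paper:

```python
class Node:
    """Minimal stand-in for a search-tree node (illustrative)."""
    def __init__(self, value: float = 0.0, visits: int = 0):
        self.value = value
        self.visits = visits

def backpropagate(path, f_pen: float) -> None:
    """Apply the failed-path update to every node on the trajectory:

    N_v <- N_v + 1
    f_v <- ((N_v - 1) * f_v + f_pen) / N_v   (running mean including f_pen)
    """
    for node in path:
        node.visits += 1
        node.value = ((node.visits - 1) * node.value + f_pen) / node.visits
```

Because the update is a mean rather than an overwrite, a single failed synthesis attempt lowers a promising node's score without erasing the evidence accumulated on earlier visits.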

A high-level summary of the CKA-Agent algorithm is as follows:

| Step | Action |
| --- | --- |
| Global Selection (UCT) | Select the most promising non-terminal leaf node v_L |
| DFS Expansion | Generate B benign sub-queries, score them, descend greedily |
| Synthesis & Backpropagation | Aggregate fragments, judge success, update scores on failure |
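The three stages can be tied together in a single outer iteration. The sketch below is a hedged reconstruction under stated assumptions: the `expand`, `evaluate`, `synthesize`, and `judge` callables stand in for the paper's LLM-backed modules, and nodes are plain dicts for brevity.

```python
import math

def uct(child: dict, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Standard UCT score; unvisited children have infinite priority."""
    if child["visits"] == 0:
        return float("inf")
    return child["value"] + c * math.sqrt(math.log(parent_visits) / child["visits"])

def cka_iteration(root, expand, evaluate, synthesize, judge, f_pen: float = -1.0):
    """One outer iteration: select -> expand -> synthesize -> backpropagate.

    Returns the synthesized output on a judged success, else None."""
    # Stage 1 -- global selection: descend by UCT to a leaf.
    path = [root]
    while path[-1]["children"]:
        parent = path[-1]
        path.append(max(parent["children"],
                        key=lambda ch: uct(ch, max(parent["visits"], 1))))
    leaf = path[-1]
    # Stage 2 -- expansion: attach scored candidate sub-query nodes.
    for child in expand(leaf):
        child.setdefault("children", [])
        child.setdefault("visits", 0)
        child["value"] = evaluate(child)
        leaf["children"].append(child)
    # Stage 3 -- synthesis and backpropagation.
    answer = synthesize(path)
    if judge(answer):
        return answer                  # objective judged as achieved
    for node in path:                  # penalize the failed trajectory
        node["visits"] += 1
        node["value"] = ((node["visits"] - 1) * node["value"] + f_pen) / node["visits"]
    return None
```

In the real framework the outer loop repeats this iteration up to the budget T_max; here a caller would simply invoke `cka_iteration` in a loop until it returns a non-None answer.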

4. Experimental Evaluation

CKA-Agent was tested on Gemini-2.5-Flash/Pro, GPT-oss-120B, and Claude-Haiku-4.5, using the HarmBench (126 prompts) and StrongREJECT (162 prompts) benchmarks. The attack budget was set to T_{\max} = 5 outer iterations and maximum depth D_{\max} = 5, with a dynamic branch factor B \in \{1, 2, 3\} and UCT parameter c = 1.414.

Performance was compared with prompt-optimization (Vanilla, AutoDAN, PAIR, PAP, ActorBreaker, X-Teaming) and static multi-agent decomposition (Multi-Agent Jailbreak) baselines, under equalized query budgets. The main success metric was Full Success (FS), as adjudicated by Gemini-2.5-Flash and validated on other target models.

Results showed that CKA-Agent obtains FS rates of 95–98% on all models, outperforming Multi-Agent Jailbreak by 15–21 percentage points and classic prompt-optimization baselines by more than 20× on Claude-Haiku-4.5 (relative to the PAIR baseline). Adaptive branching substantially increased reliability, with roughly 90% of successful attacks completing by the second outer iteration.

5. Robustness and Existing Defense Mechanisms

Existing LLM guardrails—prompt-level filters (e.g., Llama Guard-3), mutation- and lexical-rephrasing defenses, and representation-level detection (Circuit Breaker)—are ineffective against the correlated-knowledge attack paradigm. Prompt-level guards do not flag individually mild sub-queries; mutation defenses only marginally decrease success; and Circuit Breaker, which reduces FS by 30–50 percentage points against prompt-based attacks, still allows FS of roughly 80% for CKA-Agent.

In the intent aggregation evaluation (CKA-Agent-Branch), wherein the target LLM received all sub-queries along with the full session history, attack success rates remained above 90%, regardless of model memory capabilities. This indicates a structural deficiency: target models lack mechanisms for multi-turn intent aggregation, restricting content moderation to local prompt-level evaluations.

6. Implications and Proposed Mitigations

CKA-Agent underscores a fundamental LLM vulnerability: the inability of commercial guardrails to aggregate session-spanning intent and detect globally harmful objectives distributed among benign interactions. The framework’s adaptive tree search exposes the limits of local, prompt-centric moderation when confronted with knowledge-decomposition attacks.

To mitigate such threats, multiple defense lines are proposed:

  • Design context-aware guardrails that accumulate semantic evidence across multi-turn sessions.
  • Train multi-turn safety classifiers capable of detecting distributed or latent harmful intent.
  • Combine machine judgment with human-in-the-loop oversight for detecting and intercepting long-horizon attacks.
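As a toy illustration of the first two mitigation ideas, a session-level moderator could accumulate per-turn risk rather than judging each prompt in isolation. The scores and thresholds below are invented for illustration and do not come from the paper:

```python
def session_flagged(turn_risks, turn_threshold: float = 0.5,
                    session_threshold: float = 1.2) -> bool:
    """Flag a session if any single turn is risky OR cumulative risk is high.

    Per-turn scores that each stay below the prompt-level threshold can
    still trip the session-level threshold once aggregated across turns,
    which is exactly the distributed-intent case prompt-only guards miss.
    """
    running = 0.0
    for r in turn_risks:
        if r >= turn_threshold:        # a prompt-level guard would fire here
            return True
        running += r                   # accumulate semantic evidence
        if running >= session_threshold:
            return True                # distributed intent detected
    return False
```

A real trajectory-level classifier would of course score semantic content rather than sum scalars, but the structure is the same: the decision variable spans the whole conversational path, not one turn.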

A plausible implication is that effective countermeasures must move beyond static, prompt-level screening toward trajectory-level semantic aggregation and intent inference, with new detection tools that reason over entire conversational paths (Wei et al., 1 Dec 2025).
