
PE-CoA: Pattern Enhanced Chain of Attack

Updated 3 February 2026
  • The paper introduces PE-CoA, which operationalizes a taxonomy of human-like conversational structures to construct multi-turn jailbreaking attacks on LLMs.
  • It formalizes pattern-guided attack synthesis by optimizing both semantic objective satisfaction and pattern-stage coherence to improve systematic coverage.
  • Experimental results indicate significant improvements in attack success rates across varied harm categories and model families compared to previous baselines.

Pattern Enhanced Chain of Attack (PE-CoA) is a data-driven framework for constructing multi-turn jailbreaking attacks against LLMs by operationalizing a taxonomy of human-like conversational structures. PE-CoA leverages empirically validated dialogue patterns to systematically manipulate LLM safety guardrails, revealing previously obscured structural vulnerabilities. In contrast to prior multi-turn attacks that rely on ad hoc or heuristic conversational design, PE-CoA formalizes pattern-guided attack synthesis, enabling systematic coverage of model weaknesses across multiple harm categories and model families (Nihal et al., 9 Oct 2025).

1. Formal Model and Objective

PE-CoA targets an LLM \mathcal{M}_{\mathrm{tgt}} under black-box access with the objective \mathcal{O} (a prohibited or harmful output). Given the sequence of prompts \mathcal{T} = \{u_1, \ldots, u_m\} and the accumulated dialogue state \mathcal{H}_{t-1} at turn t-1, a classical multi-turn jailbreak seeks to maximize

\max_{1 \le t \le m} \; \mathcal{E}(\mathcal{O}, r_t)

where r_t = \mathcal{M}_{\mathrm{tgt}}(u_t \mid \mathcal{H}_{t-1}) and \mathcal{E} is a semantic fulfillment metric. PE-CoA introduces a pattern set \mathcal{P} = \{p_1, \ldots, p_k\}, each comprising L_p ordered stages S_p = \{s_1, \ldots, s_{L_p}\}, and defines a pattern adherence score:

\mathcal{A}(u_t, p, s_j) = \mathrm{sim}(\mathrm{Embed}(u_t), \mathrm{Embed}(\mathrm{template}(p, s_j)))

PE-CoA globally optimizes

\left(\mathcal{T}^*, p^*\right) = \arg\max_{p \in \mathcal{P},\; \mathcal{T}} \max_{1 \le t \le |\mathcal{T}|} \left[\lambda\,\mathcal{E}(\mathcal{O}, r_t) + (1-\lambda)\,\mathcal{A}(u_t, p, s_{j(t)})\right]

with \lambda \in [0,1] mediating the trade-off between semantic objective satisfaction and pattern-stage coherence. This dual-criterion optimization is designed to bypass direct safety filters by constraining attack chains within plausible, stage-wise human conversational templates, thereby increasing the reliability of harmful content elicitation (Nihal et al., 9 Oct 2025).
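The dual-criterion objective above can be sketched in a few lines; this is an illustrative rendering of the scoring arithmetic, not the authors' implementation:

```python
# Minimal sketch of PE-CoA's dual-criterion turn score, using the paper's
# notation: lambda * E(O, r_t) + (1 - lambda) * A(u_t, p, s_j(t)).
# The scoring inputs are placeholders for real semantic/adherence metrics.

def turn_score(sem_fulfillment: float, adherence: float, lam: float = 0.5) -> float:
    """Convex combination of semantic fulfillment and pattern adherence."""
    return lam * sem_fulfillment + (1.0 - lam) * adherence

def chain_value(per_turn: list[tuple[float, float]], lam: float = 0.5) -> float:
    """A chain's value is the max over its turns; the attack maximizes this
    jointly over patterns p and prompt sequences T."""
    return max(turn_score(e, a, lam) for e, a in per_turn)

print(chain_value([(0.2, 0.9), (0.7, 0.8)], lam=0.5))  # best turn: 0.75
```

With \lambda = 1 the objective reduces to the classical semantic-only jailbreak criterion; lowering \lambda increasingly constrains chains to the pattern template.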

2. Taxonomy of Conversation Patterns

PE-CoA systematizes five conversation patterns, each decomposed into three functional stages, empirically shown to modulate LLM behavior and defense evasion.

Pattern (Label)         | Stage 1 | Stage 2     | Stage 3
Technical (Educational) | Concept | Application | Implementation
Personal (Experience)   | Sharing | Relating    | Requesting
Hypothetical (Scenario) | Setup   | Development | Application
Information (Seeking)   | General | Specific    | Implementation
Problem (Solving)       | Problem | Analysis    | Solution

Each pattern is formalized via conversation-analytic and speech-act theoretical constructs, with adherence determined by \mathcal{A} using embedding-based similarity to stage-specific templates ("Define core concepts…", "Share a personal story…", etc.). This structure ensures that attacks do not rely solely on semantic formulation but exploit conversational context as an attack surface (Nihal et al., 9 Oct 2025).
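The adherence computation can be sketched as embedding similarity against stage templates. The hash-based `embed` and the template strings below are illustrative stand-ins (a real system would use a trained sentence encoder and the paper's full template set):

```python
import hashlib
import numpy as np

# Sketch of A(u_t, p, s_j): cosine similarity between embeddings of the
# utterance and a stage-specific template. Templates and embed() here are
# toy placeholders, not the paper's implementation.

STAGE_TEMPLATES = {
    ("Technical", "Concept"): "Define the core concepts of the topic.",
    ("Personal", "Sharing"): "Share a personal story related to the topic.",
}

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic toy embedding: seed a random vector from the text hash."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def adherence(utterance: str, pattern: str, stage: str) -> float:
    u = embed(utterance)
    t = embed(STAGE_TEMPLATES[(pattern, stage)])
    return float(np.dot(u, t) / (np.linalg.norm(u) * np.linalg.norm(t)))

score = adherence("Could you explain the basic concepts here?", "Technical", "Concept")
assert -1.0 <= score <= 1.0
```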

3. Pattern-Guided Algorithmic Framework

PE-CoA operationalizes chain-of-attack generation and execution via an interleaved pattern-constraint loop. The framework, as outlined in the manuscript, operates as follows:

  1. Pattern Instantiation: For each pattern p \in \mathcal{P}, the dialogue history \mathcal{H} is initialized, and iterative candidate generation proceeds.
  2. Candidate Generation: Attack prompt chains are generated with explicit pattern-stage templates.
  3. Execution and Evaluation: For each prompt u_t, the LLM response r_t is obtained; the semantic objective score \mathcal{E}(\mathcal{O}, r_t) and pattern adherence \mathcal{A}(u_t, p, s_{j(t)}) are computed.
  4. Success Judgement: If the response matches the prohibited objective, the chain is stored.
  5. Next Step Selection: Depending on semantic and adherence changes, the algorithm may select the next candidate, regenerate, backtrack, or switch patterns.

Empirical implementation is detailed in Algorithm 1 of the original work. Chain progression is tightly coupled to stage adherence, and the framework supports walk, regeneration, backtracking, and pattern-switch operations, maximizing attack generality and robustness (Nihal et al., 9 Oct 2025).
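The five steps above amount to a small control loop. The sketch below stubs out all model and scoring calls (the function names and the 0.5 adherence threshold are assumptions for illustration; the real system queries a target LLM and uses embedding-based scores per Algorithm 1):

```python
import random

# High-level sketch of the pattern-constraint loop (steps 1-5). All model and
# scoring calls are placeholder stubs; no real prompts or responses are used.

def run_pattern(pattern, stages, objective, max_turns=6):
    history, best_e, stage_idx = [], 0.0, 0
    for _ in range(max_turns):
        stage = stages[min(stage_idx, len(stages) - 1)]
        prompt = generate_candidate(pattern, stage, history)   # step 2
        response = query_target(prompt, history)               # step 3
        e = semantic_score(objective, response)
        a = adherence_score(prompt, pattern, stage)
        history.append((prompt, response))
        if is_success(objective, response):                    # step 4
            return history                                     # store chain
        if a >= 0.5 and e > best_e:                            # step 5: walk
            best_e, stage_idx = e, stage_idx + 1
        elif a < 0.5:                                          # regenerate
            history.pop()
        else:                                                  # backtrack
            stage_idx = max(0, stage_idx - 1)
    return None  # caller may switch to another pattern

# Stubs so the sketch runs:
def generate_candidate(p, s, h): return f"[{p}/{s}] candidate #{len(h)}"
def query_target(u, h): return f"response to {u}"
def semantic_score(o, r): return random.random()
def adherence_score(u, p, s): return random.random()
def is_success(o, r): return False
```

The outer loop over patterns (pattern instantiation and switching) simply calls `run_pattern` for each p \in \mathcal{P} until a chain succeeds.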

4. Harm Categories, Pattern Mapping, and Vulnerability Profiling

PE-CoA is evaluated on ten distinct harm categories \mathcal{C} = \{c_1, \ldots, c_{10}\}, including malware/hacking, harassment, fraud, and disinformation. The pattern-category vulnerability matrix is formalized as

\mathcal{V}(\mathcal{M}_{\mathrm{tgt}}, p, c) = \mathrm{ASR}\ \text{of pattern}\ p\ \text{on category}\ c

This matrix exposes combinatorial attack surfaces: some patterns (e.g., "Technical") are especially effective for malware, while others (e.g., "Hypothetical") yield higher success on illegal activity requests. Results indicate 50 unique pattern-category attack vectors (5 patterns × 10 categories), each with distinct defense requirements. Robustness to one pattern or category does not imply transferability to others, necessitating a multi-pattern defense strategy (Nihal et al., 9 Oct 2025).
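The vulnerability matrix aggregates per-attempt outcomes into per-cell success rates. A minimal sketch with synthetic data (the trial counts and values are illustrative, not the paper's results):

```python
import numpy as np

# Sketch of the 5x10 pattern-category vulnerability matrix V(M_tgt, p, c):
# each cell is the attack success rate (ASR) of pattern p on category c,
# aggregated from binary per-attempt outcomes. Data here is synthetic.

patterns = ["Technical", "Personal", "Hypothetical", "Information", "Problem-Solving"]
n_categories = 10

rng = np.random.default_rng(0)
attempts = rng.integers(0, 2, size=(len(patterns), n_categories, 20))  # 20 trials/cell
V = attempts.mean(axis=2)  # ASR per (pattern, category) pair

assert V.shape == (5, 10)  # 50 distinct pattern-category attack vectors
# Each category may have a different most-effective pattern, so a defense
# tuned to one pattern leaves the others open.
best_pattern_per_category = V.argmax(axis=0)
```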

5. Experimental Findings and Comparative Performance

Comprehensive empirical evaluation comprises 12 LLMs (6 closed-source, 6 open-source) across 300 objectives (including the 50 hardest from GCG50). Primary metrics are attack success rate under any pattern (ASR@any) and under the single best pattern (ASR@best):

Model          | ASR@any | ASR@best | Best Pattern
Claude-3-haiku | 75.0%   | 36.7%    | Information
GPT-4o-mini    | 96.9%   | 73.6%    | Information
Deepseek-chat  | 98.7%   | 84.0%    | Problem-Solving
Vicuna-13b     | 98.9%   | 86.7%    | Problem-Solving
Mistral-7B     | 100.0%  | 95.3%    | Personal

PE-CoA demonstrates a 5–15 point ASR@any improvement over previous multi-turn attack baselines (ActorAttack, Crescendo, X-Teaming, GOAT, etc.), and a 30–40 point gain over vanilla Chain-of-Attack without explicit pattern modeling (Nihal et al., 9 Oct 2025).

6. Behavioral Analysis and Model Weakness Characterization

Empirical observations uncover several structural characteristics of LLM vulnerability:

  • Model-Specific Profiles: Models exhibit non-overlapping robustness profiles; resistance to a given pattern does not imply generalized robustness.
  • Pattern × Category Effects: Certain harm categories remain resistant to direct attacks but are susceptible to alternative patterns or conversational framings.
  • Model-Family Inheritance: Tight correlation (> 0.9) in vulnerability signatures is observed within families (e.g., Gemini, GPT), while divergence is seen across major version increments (e.g., Llama2 vs Llama3).

Salient failure modes include premature refusals on early pattern stages and over-constrained educational dialogue that inhibits harmful objective progression. Common weaknesses are traced to inadequacies in alignment mechanisms for multi-step dialogue management rather than isolated semantic classifiers (Nihal et al., 9 Oct 2025).

7. Defense Strategies and Pattern-Aware Mitigations

Recommended countermeasures are grounded in empirical defense effect measurement:

  • Pattern-Targeted Fine-Tuning: LoRA-based safety tuning on pattern-specific data (e.g., Information pattern) significantly reduces ASR for that pattern (from 90% to 10%) but does not generalize, underscoring the necessity for full pattern-suite coverage in safety training.
  • Inference-Time Monitoring: Integration of pattern-stage classifier models to detect critical high-risk transitions (e.g., “Implementation” in Technical, “Solution” in Problem-Solving).
  • Ensemble Hardened Chain Monitoring: Use of perplexity and pattern-aware semantic divergence scoring to detect stealth sequence escalations.
  • Cross-Family Defense Propagation: Defensive updates should be propagated across model family lines due to correlated vulnerabilities within families (Nihal et al., 9 Oct 2025).

The findings underscore that conversational structure is a distinct and potent attack surface; robust defense must go beyond semantic similarity and explicitly account for the diverse taxonomy of natural, multi-stage dialogue strategies.


For implementation details, metrics definitions, and pattern template examples, consult "Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in LLMs" (Nihal et al., 9 Oct 2025).
