Construction-Based Jailbreaks

Updated 25 September 2025
  • Construction-based jailbreaks are adversarial prompt engineering techniques that modify input prompts to bypass LLM safety filters and induce unstable outputs.
  • They manipulate prompt structure, context sequencing, and encoding methods to exploit vulnerabilities at both representation and circuit levels.
  • Empirical benchmarks reveal high attack success rates and rapid convergence, emphasizing the need for dynamic, multi-turn mitigation strategies in LLM deployments.

Construction-based jailbreaks are adversarial prompt engineering techniques that intentionally modify the input prompt to LLMs in order to explicitly bypass safety mechanisms, induce output instability, or elicit unsafe, unethical, or policy-violating responses. Unlike token-level optimization or pure instruction role-playing, construction-based jailbreaks are characterized by their deliberate manipulation of the prompt structure, linguistic encoding, context sequencing, or conversational context in order to exploit model vulnerabilities at both the representation and behavioral levels. This article systematically surveys the principles, mechanisms, empirical findings, and practical implications of construction-based jailbreaks, drawing on recent advances and benchmarking efforts across academic and commercial LLMs.

1. Fundamental Principles and Taxonomy

Construction-based jailbreaks operate by perturbing the prompt structure, semantic content, or conversational context so as to circumvent refusal behaviors and aligned output generation. Key characteristics include:

  • Intentional Construction: The adversary explicitly crafts the prompt—using role-playing instructions, encoding schemes, context embeddings, or adversarial templates—to evoke an unsafe response.
  • Universality: As formalized in AutoBreach, a mapping rule is universal if, once optimized for a harmful input $x_1$, it generalizes to diverse goals and models: $S(x_j, T(\mathcal{F}(x_j))) = 10$ across goals $x_j$ and target models $T$ (Chen et al., 30 May 2024); a minimal check of this criterion is sketched after this list.
  • Adaptability: Effective jailbreaks must evolve with the target's defenses, producing updated transformation rules $\mathcal{F}'$ that defeat even newly aligned models (Chen et al., 30 May 2024).
  • Efficiency: Modern construction-based methods minimize queries and optimize attacks rapidly, often converging within 10 or fewer queries (Chen et al., 30 May 2024).
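
A minimal sketch of the universality criterion above, assuming hypothetical callables for the target models ($T$), an aligned judge ($S$, scoring 1–10), and an optimized mapping rule ($\mathcal{F}$); none of these names come from AutoBreach itself.

```python
from typing import Callable, Iterable

def is_universal_mapping(
    mapping_rule: Callable[[str], str],             # F: rewrites a harmful goal x_j
    target_models: Iterable[Callable[[str], str]],  # each T: returns the model's response
    judge_score: Callable[[str, str], int],         # S: rates harmfulness of a response, 1-10
    goals: Iterable[str],
    threshold: int = 10,
) -> bool:
    """True iff S(x_j, T(F(x_j))) == threshold for every goal x_j and every target model T."""
    models = list(target_models)
    for goal in goals:
        prompt = mapping_rule(goal)
        for target in models:
            if judge_score(goal, target(prompt)) < threshold:
                return False
    return True
```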

Jailbreaks may employ diverse prompt manipulations—ranging from mild perturbations (formatting, rewording, polite wrappers) (Salinas et al., 8 Jan 2024) to advanced encoding (bijection learning (Huang et al., 2 Oct 2024)), universal template composition (Li et al., 20 Dec 2024), iterative graph-based reasoning (Akbar-Tajari et al., 26 Apr 2025), and multi-turn dialog orchestration (Tang et al., 22 Jun 2025, Yang et al., 9 Aug 2025).

2. Mechanisms of Instability and Evasion

LLMs exhibit marked sensitivity to construction variations, which adversaries leverage to trigger instability and bypass safety filters.

  • Prompt Perturbation: Even minimal formatting changes (e.g., adding a trailing space, "Hello …" prefix) can cause hundreds of label flips in production models such as ChatGPT (Salinas et al., 8 Jan 2024). Output format changes (Python List, JSON, XML) reliably induce >10% prediction variance.
  • Jailbreak Templates: Constructing prompts like "act as AIM" or "simulate Developer Mode" causes high invalid-output rates in ChatGPT (AIM: 6.3% accuracy, Dev Mode v2: 4.1% accuracy, compared to roughly 80% at baseline), demonstrating extreme fragility under template-based attack (Salinas et al., 8 Jan 2024). The Evil Confidant and Refusal Suppression variations further degrade output reliability.
  • Encoding and Bijection: Adversaries encode malicious queries with bijective string mappings, teach the model the translation via in-context learning (ICL), and reverse the mapping post hoc (Huang et al., 2 Oct 2024); a minimal encoding sketch follows this list. Attack efficacy (ASR up to 86.3% on leading models) is tunable by adjusting encoding complexity and dispersion.
  • Adaptive Strategies: Attacks adapt to the semantic processing capability of the target: simple mutation $Fu+En_1$ (binary tree encryption) for weaker models; $Fu+En_1+En_2$ (adding an output-side Caesar cipher) for advanced models (ASR up to 98.9% on GPT-4o) (Yu et al., 29 May 2025).
  • Graph and Multi-Turn Construction: Instead of linear chains or trees, graph-based frameworks (e.g., GoAT) simultaneously refine multiple prompt candidates, sharing progress across branches to increase coverage and succeed with the fewest queries (Akbar-Tajari et al., 26 Apr 2025). Multi-turn jailbreaking orchestrates conversational paths $Q = \{q_1, q_2, \ldots, q_N\}$ with global refinement of future queries after each turn, defeating context-aware filters (Tang et al., 22 Jun 2025, Yang et al., 9 Aug 2025).
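
A minimal sketch of the bijection-learning idea, assuming a fixed-seed character-level substitution; in the published attack the mapping is sampled per query, its complexity and dispersion are tuned to the target, and the model's encoded reply is decoded post hoc.

```python
import random
import string

def make_bijection(seed: int = 0) -> dict:
    """Random bijective mapping over lowercase letters; other characters map to themselves."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encode(text: str, mapping: dict) -> str:
    return "".join(mapping.get(c, c) for c in text.lower())

def decode(text: str, mapping: dict) -> str:
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(c, c) for c in text)

def build_icl_prompt(query: str, mapping: dict, teaching_examples: list) -> str:
    """Teach the cipher with benign in-context examples, then pose the encoded query
    and request an encoded answer (the attacker decodes it afterwards)."""
    lines = ["You will communicate only in the following substitution cipher. Examples:"]
    for example in teaching_examples:
        lines.append(f"plain: {example} -> cipher: {encode(example, mapping)}")
    lines.append(f"Answer this ciphered request in the same cipher: {encode(query, mapping)}")
    return "\n".join(lines)
```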

3. Representation, Circuitry, and Mechanistic Insights

Construction-based jailbreaks achieve their effects by manipulating both representation-level and circuit-level dynamics within LLMs.

  • Representation Shifts: Jailbreak inputs shift latent representations toward "safe" clusters, deceiving well-trained safety probes into misclassifying harmful content as benign (He et al., 17 Nov 2024). The learned "safety direction" $v_d$ is systematically circumvented, causing internal confusion between harmful and safe intent.
  • Suppression of Refusal Features: At the circuit level, key components (refusal signal S₋ and affirmation signal S₊) exhibit activation shifts; jailbreaks amplify affirmation while suppressing refusal, measured via the refusal score $rs = F^c(x)\, W_U[:, w_-] - F^c(x)\, W_U[:, w_+]$ (He et al., 17 Nov 2024).
  • Latent Space Steering: Jailbreak success correlates with activation shifts $\Delta a_j^\ell = a^\ell_{\mathrm{jail}} - a^\ell_{\mathrm{base}}$, aggregated as the "jailbreak vector" $v_j^\ell$. Subtracting these vectors from the residual stream can mitigate other, semantically dissimilar jailbreaks; a shared mechanism is suppression of the harmfulness feature direction (Ball et al., 13 Jun 2024). A steering sketch follows this list.
  • Dependency Graphs: DAG-based analysis identifies causal relationships between attack (or defense) submodules (e.g., seed initialization, mutation, selection), optimizing ensemble attacks and defenses (AutoAttack/AutoDefense) in a modular and interpretable fashion (Lu et al., 6 Jun 2024).
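
A minimal sketch of the jailbreak-vector computation and steering described above, using NumPy arrays as stand-ins for per-layer residual-stream activations; in practice these would be captured with forward hooks on the model, and the projection-based subtraction here is one simple way to apply the vector.

```python
import numpy as np

def jailbreak_vector(jail_acts: np.ndarray, base_acts: np.ndarray) -> np.ndarray:
    """v_j^l: mean activation shift between jailbroken and baseline prompts at one layer.
    Both arrays have shape (num_prompts, hidden_dim)."""
    return (jail_acts - base_acts).mean(axis=0)

def steer_residual(residual: np.ndarray, v_j: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove the component of a single residual-stream activation along the jailbreak
    direction, attenuating the shift associated with jailbreak success."""
    direction = v_j / (np.linalg.norm(v_j) + 1e-8)
    return residual - alpha * float(residual @ direction) * direction
```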

4. Methodological Advances and Benchmarking

Recent work has formalized and scaled construction-based jailbreak attacks across diverse settings and evaluation protocols.

  • Preference Optimization and Automated Template Induction: JailPO employs dual attack models—Question Enhanced Model (QEM; covert rephrasing) and Template Enhanced Model (TEM; automatic scenario/context templates)—trained via supervised fine-tuning and preference optimization, then ranked by success as judged by an aligned LLM (Li et al., 20 Dec 2024). This achieves higher ASR using very few queries relative to handcrafted or static approaches.
  • Experience Reuse: JailExpert builds and dynamically updates a formal experience pool, grouping past successful attacks by semantic drift $\Delta = \Phi(\mathcal{J}) - \Phi(\mathcal{I})$ and leveraging representative prompts to maximize ASR and minimize query cost (Wang et al., 25 Aug 2025); a grouping sketch follows the method table below.
  • Benchmark Expansion: Datasets like HarmBench have been extended to multi-turn settings (MTJ-Bench), systematically varying follow-up question relevance and categorizing queries by linguistic style. Evaluation employs both LLM-based judges and hand-designed multi-dimensional frameworks, measuring harmfulness, compliance, and content quality (Yang et al., 9 Aug 2025, Zheng et al., 5 Sep 2025).
| Method | Key Feature | Notable Success Rates |
|---|---|---|
| AutoBreach (Chen et al., 30 May 2024) | Wordplay-guided universal mapping | ≥80% average ASR (up to 96% on Claude-3) |
| RLbreaker (Chen et al., 13 Jun 2024) | Deep RL agent guided search | Surpasses GA/in-context methods on six SOTA LLMs |
| Bijection Learning (Huang et al., 2 Oct 2024) | Random encoding + ICL | 86.3% ASR (Claude 3.5 Sonnet); scale-adaptive |
| JailPO (Li et al., 20 Dec 2024) | Automated template + SimPO | 6–8x ASR over templates, generalizes across LLMs |
| JailExpert (Wang et al., 25 Aug 2025) | Dynamically updated experience pool | 17% higher ASR, 2.7x higher ASR-Efficiency |
| GoAT (Akbar-Tajari et al., 26 Apr 2025) | Graph-of-thoughts black-box attack | Up to 98% ASR, 5x baseline query efficiency |
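
A rough sketch of the experience-reuse step in JailExpert, assuming a sentence-embedding callable `embed` and a pool of dicts with `"drift"` and `"prompt"` keys; the actual pool structure and retrieval policy in the paper are richer than this.

```python
import numpy as np

def semantic_drift(embed, original_instruction: str, jailbreak_prompt: str) -> np.ndarray:
    """Delta = Phi(J) - Phi(I): embedding shift induced by rewriting the instruction."""
    return np.asarray(embed(jailbreak_prompt)) - np.asarray(embed(original_instruction))

def most_similar_experience(embed, experience_pool: list, instruction: str, candidate: str) -> dict:
    """Retrieve the past successful attack whose drift is closest (cosine similarity)
    to the drift of the new candidate, so its prompt pattern can be reused."""
    query = semantic_drift(embed, instruction, candidate)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    return max(experience_pool, key=lambda exp: cosine(query, np.asarray(exp["drift"])))
```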

5. Multi-Turn and Camouflaged Jailbreaking

The risk profile escalates in multi-turn and camouflaged scenarios:

  • Multi-Turn Jailbreaking: The Multi-Turn Jailbreak Benchmark (MTJ-Bench) demonstrates that, after a single successful jailbreak, LLMs show elevated vulnerability to both relevant and irrelevant follow-up queries (Yang et al., 9 Aug 2025). Attackers exploit long context memory, constructing conversations in which subsequent questions elicit detailed, unsafe outputs even in the presence of safety mechanisms; a minimal orchestration sketch follows the table below.
  • Camouflaged Jailbreaks: Camouflaged attacks hide malicious intent in technical, contextually plausible requests or via multi-turn escalation. The Camouflaged Jailbreak Prompts dataset exposes models' inability to detect malicious dual-intent challenges, with compliance rates of approximately 94% despite high technical feasibility and content quality (Zheng et al., 5 Sep 2025).
| Benchmark | Setting | Primary Risk Identified |
|---|---|---|
| MTJ-Bench (Yang et al., 9 Aug 2025) | Multi-turn | Increased harmful output in relevant and irrelevant follow-up turns |
| Camouflaged Jailbreak Prompts (Zheng et al., 5 Sep 2025) | Stealth prompts | High full-obedience to camouflaged harmful tasks |
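
A minimal sketch of the multi-turn orchestration pattern discussed in Sections 2 and 5: queries $q_1, \ldots, q_N$ are played in order and the remaining queries are globally rewritten after each turn. `target_chat`, `refine_remaining`, and `is_jailbroken` are hypothetical stand-ins for the target model's chat interface, the attacker-side refiner, and a success judge.

```python
def multi_turn_attack(initial_queries, target_chat, refine_remaining, is_jailbroken, max_turns=10):
    """Play queries turn by turn; after each response, globally rewrite the remaining
    queries in light of the full conversation so far. Returns the dialog history."""
    queries = list(initial_queries)
    history = []  # list of (query, response) pairs
    for turn in range(min(max_turns, len(queries))):
        response = target_chat(history, queries[turn])
        history.append((queries[turn], response))
        if is_jailbroken(response):
            break
        # Global refinement: future queries are updated using the dialog context.
        queries[turn + 1:] = refine_remaining(history, queries[turn + 1:])
    return history
```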

6. Theoretical Limits and Transferability

Recent theoretical work has revealed structural and transferability constraints:

  • Inherent Limits: No perfect jailbreak classifier $F_{jb}$ exists for foundation models; composing $G' = G \odot F_{jb}$ would always yield a strictly safer model $G'$, contradicting the assumption that the original model $G$ was already maximally safe (Rao et al., 18 Jun 2024). Additionally, only strictly Pareto-dominant LLMs can robustly detect jailbreaks in their weaker counterparts, not vice versa.
  • Degree of Transferability: Empirical results show that strong jailbreaks (as measured by efficacy on the source) and the contextual representation similarity between models predict transfer success (Angell et al., 15 Jun 2025). Surrogate models distilled from the target using only benign queries can be more effective sources for transferable jailbreaks, suggesting that vulnerabilities are tied to deep representational flaws rather than superficial safety misalignments.
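
A rough sketch of a representation-similarity predictor in the spirit of the transferability finding above, using linear centered kernel alignment (CKA) between two models' activations on shared benign prompts; the specific similarity measure used in the cited work may differ, and activation extraction is assumed to happen elsewhere.

```python
import numpy as np

def linear_cka(source_acts: np.ndarray, target_acts: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (num_prompts, hidden_dim).
    Higher similarity is expected to correlate with better jailbreak transfer."""
    X = source_acts - source_acts.mean(axis=0)
    Y = target_acts - target_acts.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(numerator / (denominator + 1e-12))
```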

7. Practical Implications and Mitigation Recommendations

  • Prompt Consistency: For data labeling or operational deployments, rigorous control of prompt structure—including output format—is critical, as small differences are magnified by model instability (Salinas et al., 8 Jan 2024); a small audit sketch follows this list.
  • Defense Limitations: Defense paradigms relying solely on input/output filtering, static refusal vectors, or weak ensemble models are inadequate, especially under scale-adaptive, camouflaged, or multi-turn attacks.
  • Holistic Security: Defense strategies should integrate representation-level and circuit-level monitoring (e.g., tracking “refusal scores” and representation clusters) (He et al., 17 Nov 2024), adaptive activation steering (e.g., ASTRA (Wang et al., 23 Nov 2024)), and context-aware detection spanning multiple conversation turns.
  • Continuous Evaluation: As construction-based jailbreaks can be rapidly adapted and transferred across systems via reusable or auto-generated patterns (e.g., J₂ red teamers (Kritz et al., 9 Feb 2025), experience-driven attackers (Wang et al., 25 Aug 2025)), a cyclical, continually updated red teaming and defense paradigm is required.
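
A small sketch of the prompt-consistency audit suggested above: run the same labeling task under a few surface perturbations (trailing space, greeting prefix, output-format instruction) and count label flips. `label_with_model` is a hypothetical wrapper around the production model.

```python
PERTURBATIONS = [
    lambda p: p,                          # canonical prompt
    lambda p: p + " ",                    # trailing space
    lambda p: "Hello. " + p,              # greeting prefix
    lambda p: p + "\nRespond in JSON.",   # output-format change
]

def count_label_flips(label_with_model, prompt: str, items: list) -> int:
    """Number of items whose predicted label changes under any surface perturbation."""
    flips = 0
    for item in items:
        labels = {label_with_model(perturb(prompt), item) for perturb in PERTURBATIONS}
        flips += int(len(labels) > 1)
    return flips
```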

In summary, construction-based jailbreaks leverage deliberate prompt engineering, encoding, and context manipulation to systematically evade alignment and safety checks in LLMs. As demonstrated across a diverse body of recent research, these attacks expose both fundamental limitations and practical vulnerabilities in state-of-the-art LLMs and defense mechanisms. Effective mitigation requires moving beyond static safeguards to dynamic, context-sensitive, and multi-level interpretability-aware security architectures.
