Jailbreak Attack Taxonomy in LLMs

Updated 26 October 2025

Jailbreak attack taxonomy is a systematic framework classifying adversarial manipulations of LLM inputs along technique (e.g., orthographic, lexical) and intent (e.g., information leakage, misalignment).
The framework formalizes input manipulation by modeling LLM behavior and categorizes methods like direct instruction, few-shot hacking, and cognitive hacking.
Empirical insights reveal varied vulnerabilities across models and underscore the need for multi-layered detection strategies to counter composite attack designs.

Jailbreak attacks are deliberate strategies aimed at subverting the safety and alignment mechanisms of LLMs, inducing models to produce outputs that violate developer intent, regulatory policies, or ethical norms. The taxonomy of jailbreak attacks has evolved to reflect both the increasing diversity of adversarial methods and improving understandings of underlying model vulnerabilities. This article synthesizes contemporary taxonomies, with a particular emphasis on the formalism, multi-dimensional categorization, and practical implications as articulated in recent foundational literature.

1. Formalism and Dimensional Taxonomy

A rigorous formalism of jailbreak attacks begins by modeling an LLM-driven application as a system in which a model $M$ receives an initial developer prompt $p$ and a user input $x$ , yielding an output $y_T = M(p . x)$ , where $.$ denotes prompt concatenation with input. Jailbreaks are defined as input manipulations that deliberately cause the model output to deviate from the intended behavior for a task $T$ ; specifically, a malicious user input $x_m$ yields a misaligned response $y'_T$ such that $y'_T \notin \text{Align}(T)$ .

The taxonomy proposed in (Rao et al., 2023) classifies jailbreak attacks along two primary axes: technique and intent.

Technique: Describes how the attack alters the input, organized along linguistic strata:
- Orthographic: Exploits non-standard character encodings or representations (e.g., LeetSpeak, Base64, transliterations).
- Lexical: Utilizes particular keywords or directive patterns (e.g., “Ignore previous instructions…”).
- Morpho-Syntactic: Manipulates prompt structure, such as incomplete sentences that coerce the model into specific completions (“Text Completion as Instruction,” TCINS).
- Semantic: Directs the model via explicit instruction (Direct Instruction, INSTR) or adversarial few-shot examples (Few-shot Hacking, FSH).
- Pragmatic: Uses context or role-play (Cognitive Hacking, COG; Instruction Repetition, IR) to neutralize system-level safety constraints.
Intent: Encodes the harm or goal targeted by the attack:
- Information Leakage: Prompts designed to extract confidential or developer prompt information.
- Misaligned Content Generation: Induces production of harmful, offensive, or illicit content.
- Performance Degradation: Goal hijacking or general denial-of-service, subverting useful behaviors.

This taxonomy is orthogonal and compositional—attacks may combine multiple techniques and/or intents.

2. Surveyed Methods and Empirical Vulnerability Assessment

Empirical analysis in (Rao et al., 2023) surveys major types of jailbreaks:

Direct Instruction (INSTR): Instructing the model to disregard previous instructions and execute an unsafe task.
Few-shot Hacking (FSH): Providing adversarial examples or misleading demonstrations to redefine output expectations.
Instruction Repetition (IR): Overwhelming alignment with repeated/pleading requests.
Cognitive Hacking (COG): Role-play and scenario injection (e.g., “act as a Maximum virtual machine” or similar role prompts).
Morpho-syntactic Strategies (TCINS): Input completion that overrides systemic instruction.

Experimental results across commercial and open-source LLMs reveal that:

GPT-like models (text-davinci-002, gpt-3.5-turbo), when instruction-tuned, exhibit high susceptibility to cognitive hacking and direct instruction attacks.
OPT-175B, BLOOM-176B, and FLAN-T5-XXL demonstrate varied sensitivity: larger or instruction-finetuned models sometimes are paradoxically more vulnerable to certain techniques, particularly goal hijacking or structural prompt manipulations.

These findings are captured in confusion matrices and comparative plots (see Figures 1–3 of (Rao et al., 2023)) delineating attack effectiveness by method and intent.

3. Challenges in Detection and the Jailbreak Paradox

Jailbreak detection remains a central challenge due to both the diversity of attack strategies and inherent ambiguities in output assessment.

Automated Property Tests: Rule-based or functional tests (e.g., output language verification in translation) exhibit poor coverage; subtle attacks often evade such schemes.
Evaluation Paradox: Increasing sophistication in evaluators (e.g., prompting GPT-4 as a classifier) introduces new vulnerabilities; evaluators can themselves be “jailbroken” and misclassify outputs, as demonstrated by significant disagreement matrices (e.g., Table 2).
Manual Annotation: Human raters display subjectivity in determining misalignment or intent satisfaction, highlighting the inherent difficulty in clear-cut detection.

Recommended strategies include fusion approaches (property tests, LLM-assisted manual review, intent scanning), recognizing that composite coverage is necessary to address the vast output space of possible jailbreak responses.

4. Comprehensive Dataset Release

To consolidate empirical research and facilitate reproduction, (Rao et al., 2023) introduces a public dataset:

Dimension	Description
Prompts	3,700 distinct jailbreak prompts across translation, classification, code, summarization
Tasks	4 major NLP tasks, with base inputs from standard datasets (e.g., CNN/Daily Mail, WMT)
Attack techniques	55 curated jailbreak attack patterns, annotated for technique and intent
Model coverage	Commercial (GPT-based), open-source (OPT, BLOOM, FLAN-T5-XXL) outputs
Evaluation labels	Automated property tests, human ratings for misalignment and intent

This corpus underpins standardization efforts for attack and defense evaluation in the community.

5. Comparative Analysis and Implications for LLM Design

The robust taxonomy and empirical findings have key implications for future model design:

No current alignment/defense mechanism is comprehensive across all techniques and intents: RLHF or instruction tuning can inadvertently increase vulnerability to cognitive hacking or compositional attacks.
Composite attack design: The decoupled taxonomy supports understanding of hybrid attacks (e.g., semantic-pragmatic hybrids combining direct instruction with role-play).
Jailbreak Paradox: Defensive evaluators and detection models can themselves learn to recognize and execute subversive instructions. Defensive architectures must anticipate recursive vulnerabilities.

These points underscore the need for multi-layered detection strategies, improved prompt sanitization, defensive prompt engineering, and explainable alignment systems.

6. Future Directions and Need for Ongoing Taxonomic Evolution

The paper highlights several avenues for advancement:

Detection: Developing robust, multi-modal, and ensemble-based detection strategies that combine intent analysis, property checks, and human-in-the-loop assessment.
Defensive Prompting and Training: Research into preprocessing and model-side defensive strategies—prompt sanitization, dynamic alignment boundaries, and hardening against linguistic manipulation.
Robustness to Compositional/Vision-based Attacks: Continued updating of the taxonomy to encompass multimodal, cross-lingual, or contextually embedded attacks as new vulnerabilities surface.
Explainability and Forensics: Improved interpretability mechanisms to trace internal state misalignments and explain failure modes to developers and evaluators.

Sustained taxonomic refinement will be necessary as adversarial strategies and model architectures continue to co-evolve.

7. Formalization and Illustrative Schema

The central formal equation is

$y_T = M(p . x)$

encapsulating the full augmented prompt and output relationship across benign and malicious input regimes. Figures, such as the attack pipeline schematic in (Rao et al., 2023), clarify the roles of developer, benign user, and malicious user, and how each interacts with the system and its vulnerabilities.

Schematic diagrams of the taxonomy (e.g., “technique” vs. “intent” axes) provide a clear visual summary of the compositional structure for attack classification as adopted in the literature.

The taxonomy outlined in (Rao et al., 2023) frames jailbreak attacks as a dual-axis composition of input manipulation techniques and adversarial intents, supported by empirical data across a spectrum of modern LLMs, detailed case studies, and a substantial public dataset. This structure underpins both present understanding and future developments in LLM adversarial alignment research.

PDF Markdown Chat (Pro)

References (1)

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks (2023)

Follow Topic

Get notified by email when new papers are published related to Jailbreak Attack Taxonomy.