- The paper automates the generation of narrative-based jailbreak prompts using fine-tuned Mistral-7B, achieving an 81.0% attack success rate against GPT-OSS-20B.
- It systematically maps cross-model vulnerabilities, revealing both universal weaknesses and model-specific defensive anomalies across various LLMs.
- The study introduces a hybrid evaluation framework combining automated and expert assessments to validate AI safety flaws in large language models.
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for LLMs
Introduction
The paper "Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for LLMs" (arXiv:2510.22085) addresses the persistent vulnerability of LLMs to contextual prompt-engineering attacks. By turning adversarial prompt discovery into a systematic, repeatable process, the research highlights how fragile prevalent AI safety mechanisms remain against creative contextual manipulation. The proposed methodology automates the generation of narrative-based jailbreak prompts through parameter-efficient fine-tuning of a smaller LLM (Mistral-7B) on a curated dataset, achieving notable success rates against several prominent LLMs.
Key Contributions
Automated Jailbreak Generation: The paper demonstrates that adversarial prompt crafting can be automated via supervised fine-tuning, significantly outperforming manual and zero-shot baselines with a 54× improvement over direct prompting. The method achieved an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B and remained effective across various LLMs, including GPT-4 and Llama-3.
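To make the headline metric concrete, here is a minimal sketch of how an Attack Success Rate could be tallied from judged attack attempts; the record schema and field names are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    """One adversarial prompt sent to one target model (hypothetical schema)."""
    target_model: str
    category: str
    succeeded: bool  # verdict from the harmfulness judge

def attack_success_rate(attempts: list[Attempt]) -> float:
    """ASR = fraction of attempts judged successful."""
    if not attempts:
        return 0.0
    return sum(a.succeeded for a in attempts) / len(attempts)

# Example: 81 judged successes out of 100 attempts -> 0.81,
# matching the 81.0% reported against GPT-OSS-20B.
```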
Cross-Model Vulnerability Mapping: The research provides a comprehensive analysis of vulnerabilities across major LLM families, identifying both universal weaknesses and model-specific defensive anomalies. Notably, Gemini 2.5 Flash exhibited greater overall robustness, with only a 33% ASR, while still showing systematic vulnerabilities in cybersecurity contexts (93% ASR in technical domains) and in deception-based attacks.
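A per-model, per-category breakdown like the one reported (e.g., 33% overall versus 93% in cybersecurity contexts) can be produced by aggregating judged attempts into an ASR matrix. The sketch below assumes a simple (model, category, succeeded) record format that is not taken from the paper.

```python
from collections import defaultdict

def asr_by_model_and_category(records):
    """Aggregate judged attempts into an ASR value per (model, category) pair.

    `records` is assumed to be an iterable of (model, category, succeeded)
    tuples; the schema is an illustration, not the paper's.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, category, succeeded in records:
        totals[(model, category)] += 1
        hits[(model, category)] += int(succeeded)
    return {key: hits[key] / totals[key] for key in totals}

# e.g. {("gemini-2.5-flash", "cybersecurity"): 0.93, ...}
```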
Rigorous Evaluation Methodology: The paper introduces a hybrid evaluation process that combines automated harmfulness assessment with human expert cross-validation, yielding reliable and scalable jailbreak evaluation. This approach underscores the inadequacy of current content-based safety mechanisms against sophisticated contextual misdirection.
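One common way to cross-validate an automated judge against human experts is chance-corrected agreement on a shared annotation subset. The sketch below computes Cohen's kappa for binary harmfulness verdicts; the paper does not necessarily use this exact statistic, so treat it as an illustrative choice.

```python
def cohen_kappa(auto_labels, human_labels):
    """Chance-corrected agreement between automated and human binary verdicts.

    Labels are assumed to be booleans (or 0/1) over the same set of responses.
    """
    assert len(auto_labels) == len(human_labels) and auto_labels
    n = len(auto_labels)
    observed = sum(a == h for a, h in zip(auto_labels, human_labels)) / n
    p_auto = sum(auto_labels) / n
    p_human = sum(human_labels) / n
    expected = p_auto * p_human + (1 - p_auto) * (1 - p_human)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```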
Methodology
The methodology involves three core components:
- Dataset Curation: Constructing a high-quality dataset of harmful goals and narrative reframings, systematically validated and refined against diverse LLMs to expose a broad spectrum of adversarial behaviors.
- Attacker Model Training: Mistral-7B was fine-tuned with parameter-efficient techniques such as LoRA, keeping adaptation computationally lightweight while training the model to generate prompt variants that bypass safety protocols.
- Automated Evaluation Framework: Generated prompts are systematically tested against target models and success rates are measured, quantifying both target-model robustness and the attacker model's efficacy (a minimal sketch of this loop follows the list).
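The evaluation loop described above can be sketched as follows; `query_target` and `judge_harmful` are hypothetical stand-ins for the target-model API call and the automated harmfulness judge, not functions from the paper.

```python
from typing import Callable

def evaluate_attacker(
    prompts: list[str],
    query_target: Callable[[str], str],         # hypothetical: sends a prompt to the target model
    judge_harmful: Callable[[str, str], bool],  # hypothetical: automated harmfulness judge
) -> float:
    """Run each generated prompt against a target model, judge the response,
    and return the Attack Success Rate (successes / total attempts)."""
    successes = 0
    for prompt in prompts:
        response = query_target(prompt)
        if judge_harmful(prompt, response):
            successes += 1
    return successes / len(prompts) if prompts else 0.0
```

In practice the judge's verdicts would themselves be cross-validated against human annotations, as in the hybrid evaluation sketch above.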
Experimental Results
The experiments reveal stark differences in model vulnerability under narrative-based reframing. Technical domains and creative contexts consistently expose significant weaknesses, whereas physical-harm categories prove more resilient. The paper also argues that defenses must move beyond simple reactive measures, since automation now allows adversarial prompt generation to operate at industrial scale.
Implications for AI Safety
These findings have profound implications for AI safety, challenging the effectiveness of static content-detection paradigms. The paper calls for context-aware safety architectures capable of dynamic contextual analysis and for integrating safety into core model objectives rather than relying on after-the-fact filtering.
Conclusion
The paper "Jailbreak Mimicry" offers an authoritative approach to understanding and automating vulnerability discovery in LLMs. By revealing systemic vulnerabilities across major model families, it underscores the need for advanced defensive strategies in AI safety. The research advocates for holistic, scalable defense mechanisms and reinforces the urgency of proactive model safety alignment in the context of growing adversarial capabilities.