Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Published 25 Aug 2025 in cs.CR and cs.AI | (2508.19292v1)

Abstract: LLMs generate human-aligned content under certain safety constraints. However, the current known technique "jailbreak prompt" can circumvent safety-aligned measures and induce LLMs to output malicious content. Research on jailbreaking can help identify vulnerabilities in LLMs and guide the development of robust security frameworks. To circumvent the issue of attack templates becoming obsolete as models evolve, existing methods adopt iterative mutation and dynamic optimization to facilitate more automated jailbreak attacks. However, these methods face two challenges: inefficiency and repetitive optimization, as they overlook the value of past attack experiences. To better integrate past attack experiences to assist current jailbreak attempts, we propose JailExpert, an automated jailbreak framework, which is the first to achieve a formal representation of experience structure, group experiences based on semantic drift, and support the dynamic updating of the experience pool. Extensive experiments demonstrate that JailExpert significantly improves both attack effectiveness and efficiency. Compared to the current state-of-the-art black-box jailbreak methods, JailExpert achieves an average increase of 17% in attack success rate and 2.7 times improvement in attack efficiency. Our implementation is available at https://github.com/xiZAIzai/JailExpert (XiZaiZai/JailExpert).

Authors (12)

Summary

The paper introduces JailExpert, a framework that automates jailbreak attacks on LLMs by formalizing historical attack experiences.
JailExpert combines experience formalization, semantic drift analysis, and a target-preference strategy to achieve an average 17% increase in attack success rate over state-of-the-art black-box baselines.
Experimental results show that JailExpert improves attack efficiency by 2.7 times on average and reaches attack success rates of up to 90%.

Building JailExpert from Previous Attack Experience

Introduction

The study "Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience" (2508.19292) investigates the automation of jailbreak attacks on LLMs. It introduces JailExpert, a framework that reuses past attack experience to improve the effectiveness and efficiency of these attacks. Existing methods rely on iterative mutation and dynamic optimization, which the authors describe as inefficient and prone to repeating optimization work because they ignore earlier attacks. JailExpert aims to remove that redundancy by drawing on a pool of past experiences.

JailExpert formally represents the structure of each experience and supports dynamic updates to the experience pool, which is intended to address the problem of jailbreak templates becoming obsolete as models evolve. In the reported experiments, JailExpert outperforms existing black-box jailbreak methods, with an average 17% increase in attack success rate and a 2.7-fold improvement in attack efficiency.

Methodology

JailExpert is constructed upon three core components: experience formalization, jailbreak pattern summarization, and experience attack and update.

Experience Formalization: Inspired by Case-Based Reasoning (CBR), JailExpert stores each past jailbreak experience as a structured tuple $e = (\mathcal{I}, \mathcal{J}, A, s, f)$. The tuple holds the initial instruction ($\mathcal{I}$), the complete jailbreak prompt ($\mathcal{J}$), the mutation strategies ($A$), and historical success and failure data ($s$, $f$), so that experiences can be compared, reused, and updated over time.
Jailbreak Pattern Summarization: To alleviate the inefficiency arising from vast experience pools, JailExpert employs semantic drift analysis. By computing the semantic difference between initial instructions and complete prompts, JailExpert clusters experiences into groups that encapsulate core attack patterns.
Figure 1: The demonstration of jailbreak semantic drift used for efficient experience grouping.
Experience Attack and Update: JailExpert uses a target-preference guide strategy to select optimal experiences for new attacks. It sequentially adjusts and updates experiences based on real-time results, ensuring adaptability and efficiency in evolving environments.
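The experience tuple and the drift-based grouping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Experience` field names, the reading of `s` and `f` as counts, the toy 2-D vectors standing in for embeddings, and the greedy cosine-threshold grouping (swapped in here for whatever clustering the paper uses) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Experience:
    """One stored record, mirroring e = (I, J, A, s, f). Field names are hypothetical."""
    instruction: str      # I: the initial instruction
    full_prompt: str      # J: the complete prompt built from it
    strategies: tuple     # A: names of the mutation strategies applied
    successes: int = 0    # s: assumed here to be a success count
    failures: int = 0     # f: assumed here to be a failure count


def drift(instr_vec, prompt_vec):
    """Semantic drift as the element-wise difference of two embeddings."""
    return [p - i for i, p in zip(instr_vec, prompt_vec)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_drift(drifts, threshold=0.9):
    """Greedy grouping: join the first group whose first member is similar enough."""
    groups = []  # each group is a list of indices into `drifts`
    for idx, d in enumerate(drifts):
        for g in groups:
            if cosine(drifts[g[0]], d) >= threshold:
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups


# A placeholder record; the strings carry no real prompt content.
record = Experience("placeholder instruction", "placeholder prompt", ("strategy_a",))
record.successes += 1

# Toy 2-D embeddings (placeholders, not real model outputs).
pairs = [
    ([1.0, 0.0], [1.0, 1.0]),  # drift (0, 1)
    ([0.0, 1.0], [0.0, 3.0]),  # drift (0, 2), same direction as the first
    ([0.0, 0.0], [2.0, 0.0]),  # drift (2, 0), a different direction
]
drifts = [drift(i, p) for i, p in pairs]
groups = group_by_drift(drifts)
print(groups)  # [[0, 1], [2]]
```

The first two drift vectors point in the same direction and land in one group even though their source texts differ, which is the intuition behind grouping by drift rather than by the raw prompt.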

Experimental Results

The experiments evaluate JailExpert against prominent LLMs, including the LLaMA series and various GPT models, among others. JailExpert consistently shows higher attack success rates and better efficiency than the compared methods.

Figure 2: The comprehensive overview of the JailExpert framework.

Key findings include:

Attack Success Rate (ASR): JailExpert achieves an ASR of up to 90%, significantly outperforming baselines, which hover around 70%.
Attack Efficiency: JailExpert displays remarkable efficiency improvement, achieving results up to 67 times better than methods like GPTFuzzer.
Transferability and Flexibility: Even without target-specific experiences, JailExpert transfers knowledge effectively across different LLMs, maintaining high performance.
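As a rough illustration of how the two headline metrics can be computed, the sketch below assumes that ASR is the fraction of attempts judged successful and that efficiency is compared as a ratio of average queries per attempt. The summary does not define the efficiency metric, so that ratio is an assumption, and all numbers are hypothetical values chosen only to show the arithmetic.

```python
def attack_success_rate(outcomes):
    """Fraction of attempts judged successful; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def efficiency_ratio(baseline_queries, method_queries):
    """How many times fewer queries the method needs than the baseline (assumed metric)."""
    return baseline_queries / method_queries


# Hypothetical numbers, not results from the paper.
outcomes = [True] * 9 + [False]
asr = attack_success_rate(outcomes)
ratio = efficiency_ratio(baseline_queries=27.0, method_queries=10.0)
print(f"ASR = {asr:.0%}, efficiency ratio = {ratio:.1f}x")
```

Under this reading, a "2.7 times" efficiency gain means the baseline needs 2.7 queries for every one the method needs.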

Implications and Future Directions

JailExpert's ability to dynamically incorporate and iterate upon past experiences sets a new standard for automated jailbreak frameworks. This research highlights critical vulnerabilities in LLMs, emphasizing the urgent need for improved security measures.

The automation and enhanced efficiency of JailExpert could influence both offensive strategies and defensive measures in AI safety research. Future developments may focus on extending JailExpert's adaptability to different LLM architectures and expanding the scope of experiences that can be integrated.

Conclusion

JailExpert presents a significant advancement by formalizing an experience-based approach to jailbreak attacks on LLMs. Through structured experiences, semantic drift analysis, and dynamic updates, JailExpert not only strengthens attack efficiency but also poses a substantial challenge to existing defense mechanisms. Its success paves the way for exploring more robust and adaptable security frameworks in AI systems.