AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Published 4 Nov 2025 in cs.CL, cs.AI, cs.CR, and cs.LG | (2511.02376v2)

Abstract: LLMs remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

Summary

  • The paper presents a multi-turn adversarial prompting framework that improves LLM jailbreaking success rates, achieving up to a 95% attack success rate (ASR) on Llama-3.1-8B.
  • It employs adaptive prompt refinement and dynamic temperature adjustments to optimize adversarial strategies and exploit alignment vulnerabilities.
  • The study reveals that single-turn safety measures are inadequate, highlighting the urgent need for robust, multi-turn defenses in future LLMs.

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of LLMs

Introduction

The paper "AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of LLMs" examines the vulnerabilities of LLMs to adversarial attacks that manipulate models into producing harmful or restricted outputs. Centered on the AutoAdv framework, this research highlights the critical gap between single-turn evaluations and the more realistic multi-turn adversarial exchanges that occur in practice. The study underscores the urgent need for models that effectively handle extended interactions to protect against evolving techniques in adversarial prompting.

AutoAdv Framework and Methodology

AutoAdv is a training-free framework that automates multi-turn adversarial interactions with LLMs. It distinguishes itself by combining adaptive prompt refinement with dynamic management strategies. The attack unfolds in two phases: an initial prompt rewrite that recasts the harmful request as a seemingly innocuous inquiry, followed by adaptive follow-up turns that learn from prior exchanges. Key components include a pattern manager, which refines future prompts based on previously successful strategies, and a temperature manager, which adjusts the attacker's generative sampling parameters. Together these elements raise the attack success rate (ASR) across multiple turns, substantially outperforming traditional single-turn methods.
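The two-phase loop described above can be sketched as follows. The function names and interfaces (`rewrite`, `followup`, `judge`) are illustrative assumptions for exposition, not the paper's actual API:

```python
# Minimal sketch of AutoAdv's two-phase multi-turn attack loop.
# All callables here are hypothetical stand-ins for the paper's components.
def multi_turn_attack(goal, target, rewrite, followup, judge, max_turns=6):
    """Run up to max_turns adversarial turns against a target model.

    Phase 1: rewrite() disguises the harmful goal as an innocuous request.
    Phase 2: followup() adapts each subsequent prompt to the conversation so far.
    Returns (turn of first success or None, conversation history).
    """
    history = []
    prompt = rewrite(goal)                  # phase 1: initial disguise
    for turn in range(1, max_turns + 1):
        reply = target(prompt, history)     # query the target with context
        history.append((prompt, reply))
        if judge(reply):                    # judged a successful jailbreak
            return turn, history
        prompt = followup(goal, history)    # phase 2: adaptive refinement
    return None, history
```

In this sketch, `judge` plays the role of the refusal/quality scorer, and `followup` is where the pattern and temperature managers would influence the next prompt.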

Key Findings

Experiments revealed that AutoAdv achieved up to a 95% ASR on Llama-3.1-8B in six turns, marking a notable 24% increase over single-turn baselines. The effectiveness was consistent across other commercial and open-source LLMs, indicating a pervasive weakness in alignment strategies currently employed. These results point to the inadequacy of defenses optimized for single interactions when faced with protracted adversarial dialogues.
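For reference, ASR as reported above is simply the fraction of attempted harmful goals for which a jailbreak is judged successful within the turn budget; a minimal computation (illustrative, not the paper's code):

```python
# ASR = successful jailbreaks / total goals attempted.
def attack_success_rate(outcomes):
    """outcomes: iterable of booleans, one per goal attempted."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)
```

For example, 19 successes out of 20 goals yields an ASR of 0.95.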

The study further shows that susceptibility diverges across LLMs, with Qwen3-235B exhibiting the greatest vulnerability over multiple turns. A detailed analysis of the adaptive mechanisms also illustrates the advantage multi-turn attacks gain by refining adversarial strategies iteratively, exploiting knowledge accumulated over consecutive interactions.

Methodological Components

AutoAdv's framework is organized into four core modules:

  1. Pattern Manager: A learning module that maintains a repository of successful past attacks and uses it to strengthen future jailbreak prompts. It dynamically augments the attacker's system prompt with proven techniques, such as role-playing and educational framings, raising the adaptability and effectiveness of subsequent attempts.
  2. Temperature Manager: Iteratively adjusts the attacker LLM's sampling temperature based on recent interaction outcomes, switching among adaptive, exploratory, and exploitative adjustment strategies depending on how recent turns failed.
  3. Scoring Framework: Employs the StrongREJECT framework to score each attack for refusal detection and response quality. This continuous feedback both decides whether a jailbreak attempt succeeded and informs the refinement of future prompts and strategy selection.
  4. Prompt Generation Guidelines: Comprehensive guidelines for generating initial and follow-up adversarial prompts, covering how to obscure malicious intent at the outset and how to adapt in a structured way after encountering model defenses.
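As a concrete illustration of the temperature manager's exploration/exploitation behavior, consider the sketch below. The step size, bounds, and the mapping of failure modes to adjustments are assumptions for illustration; the paper's exact rules are not reproduced here.

```python
# Hypothetical sketch of a failure-mode-driven temperature manager.
class TemperatureManager:
    def __init__(self, base=0.7, lo=0.1, hi=1.2, step=0.15):
        self.temp, self.lo, self.hi, self.step = base, lo, hi, step

    def update(self, refused, partial_compliance):
        """Adjust sampling temperature based on the last turn's outcome."""
        if refused and not partial_compliance:
            # Hard refusal: explore -- raise temperature for more varied rewrites.
            self.temp = min(self.hi, self.temp + self.step)
        elif partial_compliance:
            # Near miss: exploit -- lower temperature to refine what almost worked.
            self.temp = max(self.lo, self.temp - self.step)
        # Otherwise leave the temperature unchanged.
        return self.temp
```

The clamping to `[lo, hi]` keeps generation from degenerating into incoherence at high temperatures or repetition at very low ones.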

Implications and Future Work

AutoAdv's results carry significant implications for LLM security. The research exposes vulnerabilities in existing safety measures and demonstrates the value of multi-turn evaluation in uncovering them. Going forward, integrating robust multi-turn strategies into alignment pipelines could substantially improve model resilience against adversarial manipulation. Beyond illuminating current weaknesses, AutoAdv lays groundwork for sophisticated, multi-turn-aware defenses that can evolve in tandem with adversarial innovations.

Future research may further explore the broader application of AutoAdv's principles to varied model architectures and how integrating multimodal and cross-lingual dimensions could bolster its effectiveness. Additionally, ethical considerations and secure applications remain paramount to ensure the benefits of these advanced techniques align with responsible AI development.

Conclusion

AutoAdv establishes a comprehensive methodology for probing and understanding LLM vulnerabilities in adversarial settings. By prioritizing multi-turn interactions, this study challenges the current approach to LLM safety, emphasizing the critical need for defenses that adapt dynamically over sustained engagements. This framework not only highlights existing deficiencies but also paves the way for next-generation alignment strategies that may augment model integrity in an increasingly adversarial AI landscape.
