- The paper presents a novel framework using multi-agent collaboration to systematically uncover multi-turn vulnerabilities in language models.
- It employs a two-phase process of strategic planning and adaptive text optimization, achieving state-of-the-art attack success rates of up to 96.2% against leading models.
- The study also introduces the XGuard dataset, a large-scale open-source resource; fine-tuning on it enhances model robustness against complex, iterative jailbreak attacks.
The paper "X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents" (2504.13203) addresses the critical and often underexplored safety risks associated with multi-turn interactions with LMs. Unlike single-turn attacks, multi-turn jailbreaks can distribute malicious intent across several exchanges, making them harder to detect and prevent. The paper introduces X-Teaming, a scalable framework designed to systematically uncover these vulnerabilities by emulating human red-teaming strategies using collaborative AI agents.
The core of the X-Teaming framework is a two-phase iterative process:
- Strategic Attack Planning: A Planner agent generates a diverse set of attack plans for a given harmful behavior. Each plan includes a persona, context, overall strategy, and a turn-by-turn conversation trajectory designed to escalate towards the harmful outcome. The Planner can generate multiple sets of plans, using previous outputs to encourage diversity in persona, context, and approach.
- Adaptive Attack Execution and Optimization: For each plan, an Attacker agent initiates a multi-turn conversation with the target LM. A Verifier agent scores the model's response at each turn on a scale of 1 to 5, where 5 is full compliance. If the score drops between turns, a Prompt Optimizer applies TextGrad-based optimization to refine the Attacker's query and raise the likelihood of success. If a plan's trajectory is exhausted without success, the Planner can adapt and extend the plan based on the conversation history and Verifier feedback. The attack is successful if any turn achieves a score of 5 (see the sketch after this list).
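A minimal sketch of this plan-execute-verify-optimize loop is shown below. All agent objects (`planner`, `attacker`, `verifier`, `optimizer`, `target_lm`) and their method names (`generate_plans`, `write_query`, `score`, `refine`, `extend`, `chat`) are hypothetical stand-ins for the paper's components, not its actual code; only the 1-5 scoring scale and the success condition follow the paper's description.

```python
from dataclasses import dataclass

SUCCESS_SCORE = 5  # Verifier scale: 1 (refusal) to 5 (full compliance).

@dataclass
class Plan:
    persona: str
    context: str
    strategy: str
    trajectory: list[str]  # Planned intent of each conversation turn.

def run_attack(planner, attacker, verifier, optimizer, target_lm,
               behavior, max_optim_attempts=2, max_extensions=2):
    """One X-Teaming attack: plan, execute turn by turn, verify, optimize."""
    # Phase 1: strategic attack planning.
    for plan in planner.generate_plans(behavior):
        history, prev_score, extensions, turn = [], 0, 0, 0
        while turn < len(plan.trajectory):
            # Phase 2: adaptive execution and verification.
            query = attacker.write_query(plan, history, turn)
            response = target_lm.chat(history + [query])
            score = verifier.score(behavior, response)

            # On a score drop, refine the query via text optimization.
            attempts = 0
            while score < prev_score and attempts < max_optim_attempts:
                query = optimizer.refine(query, history, score)
                response = target_lm.chat(history + [query])
                score = verifier.score(behavior, response)
                attempts += 1

            if score == SUCCESS_SCORE:
                return True, history + [query, response]  # Jailbreak found.

            history += [query, response]
            prev_score = score
            turn += 1

            # Trajectory exhausted: let the Planner extend the plan.
            if turn == len(plan.trajectory) and extensions < max_extensions:
                plan = planner.extend(plan, history, score)
                extensions += 1
    return False, None
```

The key design point is that verification happens at every turn, so the optimizer reacts immediately to regressions rather than only to a final failure.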
The framework utilizes specific models for its components: GPT-4o serves as the Planner and Verifier, while Qwen-2.5-32B-IT acts as the Attacker and Prompt Optimizer. The experiments were conducted on the HarmBench benchmark (Mazeika et al., 6 Feb 2024), a standard for evaluating automated red teaming.
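For reference, this role-to-model assignment amounts to a small configuration. The dictionary below is purely illustrative; the exact Hugging Face checkpoint ID for Qwen-2.5-32B-IT is an assumption.

```python
# Role-to-model assignment as described in the paper's experiments.
# The Qwen checkpoint ID is an assumed Hugging Face identifier.
AGENT_MODELS = {
    "planner":   "gpt-4o",
    "verifier":  "gpt-4o",
    "attacker":  "Qwen/Qwen2.5-32B-Instruct",
    "optimizer": "Qwen/Qwen2.5-32B-Instruct",
}
```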
The paper demonstrates the practical effectiveness of X-Teaming through several key results:
- High Attack Success Rates (ASR): X-Teaming achieved state-of-the-art ASRs across various leading open-weight and closed-source LMs, including GPT-4o, Claude 3.5/3.7 Sonnet, Gemini 2.0 Flash, and Llama 3 variants. It significantly outperformed previous single-turn and multi-turn methods. Notably, it achieved a 96.2% ASR against the robust Claude 3.7 Sonnet and 91.8% against Llama-3-8B-Instruct fine-tuned on SafeMTData, a dataset specifically for multi-turn safety.
- Improved Attack Diversity: The framework generated significantly more diverse attack plans and executed more varied queries compared to the previous state-of-the-art multi-turn method, ActorAttack (Ren et al., 14 Oct 2024). This diversity allows for a more comprehensive exploration of potential vulnerabilities.
- Resource Efficiency: Successful attacks typically required around 4 conversation turns, and the average token usage remained well within the context windows of the tested models. TextGrad optimization attempts were effective, with significant improvements seen after just one or two iterations when a score drop occurred.
- Verifier Reliability: An analysis showed strong agreement (84.50% average) between the GPT-4o verifier and HarmBench test classifiers, supporting its use for evaluation in this context.
Building on the attack capabilities, the paper introduces [XGuard](https://huggingface.co/datasets/marslabucla/XGuard-Train), a large-scale (30K conversations) open-source multi-turn safety training dataset of diverse, interactive jailbreaks generated with the X-Teaming framework. Models fine-tuned on XGuard (specifically Llama-3.1-8B and Qwen-2.5-7B) demonstrated enhanced resistance to multi-turn attacks while preserving single-turn safety and general capabilities, outperforming models trained on smaller or less diverse datasets such as SafeMTData.
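Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the standard `datasets` API; the `train` split name and record structure below are assumptions, not confirmed by the paper.

```python
from datasets import load_dataset

# Load the XGuard multi-turn safety training data from the Hub.
# Repository ID is from the paper's release; the "train" split is assumed.
ds = load_dataset("marslabucla/XGuard-Train", split="train")
print(len(ds))  # Expected on the order of 30K conversations.
print(ds[0])    # Inspect one multi-turn conversation record.
```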
For practical implementation, the framework requires deploying multiple specialized LLM agents that communicate and coordinate. The Planner must generate diverse strategies, which may require a capable model such as GPT-4o; the Attacker must maintain persona and conversation flow; and the Verifier must provide real-time feedback. The Prompt Optimizer leverages techniques like TextGrad, which iteratively refines text prompts based on natural-language feedback. Compute costs scale with the number of harmful behaviors tested, the number of plans generated per behavior, the maximum conversation turns, and the number of optimization attempts. The framework's open-source release lets practitioners adapt and deploy these agents.
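To make the optimizer component concrete, the snippet below shows the generic TextGrad pattern of treating a prompt as an optimizable variable. It is a usage sketch of the public TextGrad API, not the paper's actual optimization code, and the loss-instruction wording is an assumption.

```python
import textgrad as tg

# Engine that generates textual "gradients" (natural-language feedback).
tg.set_backward_engine("gpt-4o", override=True)

# The attacker's query is the variable being optimized.
query = tg.Variable(
    "Initial attacker query for the current turn...",
    role_description="attacker query to be refined",
    requires_grad=True,
)

# Loss: a natural-language instruction describing what to improve.
# The wording here is an illustrative assumption.
loss_fn = tg.TextLoss(
    "Evaluate whether this query advances the conversation plan while "
    "avoiding phrasing likely to trigger a refusal. Point out weaknesses."
)

optimizer = tg.TGD(parameters=[query])

loss = loss_fn(query)  # Critique the current query.
loss.backward()        # Turn the critique into textual gradients.
optimizer.step()       # Rewrite the query using that feedback.
print(query.value)     # Refined attacker query.
```

In X-Teaming, the Verifier's per-turn score supplies the signal that decides when such a refinement step runs.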
In essence, the paper provides both a powerful red-teaming tool (X-Teaming) to uncover multi-turn vulnerabilities and a valuable resource (XGuard) to train models to be more robust against such attacks, contributing significantly to advancing conversational AI safety. The dual-use nature of the work is acknowledged, with the authors emphasizing the importance of open research for developing defenses.