- The paper presents Self-RedTeam, a novel online self-play reinforcement learning approach that models language model safety as a zero-sum game.
- The paper demonstrates significant empirical improvements: attackers uncover 21.8% more diverse adversarial prompts, and self-play-trained defenders achieve a 65.5% improvement on the WildJailbreak safety benchmark.
- The paper employs a hidden Chain-of-Thought mechanism and game theory to dynamically balance safety and efficacy, reducing over-refusal in language models.
Online Self-Play Reinforcement Learning for Safer LLMs
The paper "Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer LLMs" introduces a novel approach aimed at enhancing the safety alignment of LMs using Self-RedTeam, an online self-play reinforcement learning algorithm. The approach centers on transforming the conventional static methods into a dynamic, interactive framework where the safety alignment is treated as a two-player zero-sum game between attacker and defender roles using a single model. This innovative strategy potentially addresses the lag inherent in the reactive patching methodologies that have characterized previous LM safety solutions.
Theoretical and Practical Contributions
Self-RedTeam's underpinning principle is to model LM safety alignment as a zero-sum game in which an attacker proposes adversarial prompts and a defender must respond safely to them. A key theoretical contribution of the paper is the characterization of a Nash Equilibrium of this game: if training converges to it, the defender reliably produces safe responses to any adversarial prompt the attacker can generate. This theoretical guarantee underlines the robustness of the proposed method.
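One schematic way to write such a zero-sum objective is shown below; the harmfulness score \(r_{\text{harm}}\) and the exact expectation are assumed notation for illustration, not necessarily the paper's formulation.

$$
\min_{\pi_D}\,\max_{\pi_A}\;\mathbb{E}_{x \sim \pi_A,\; y \sim \pi_D(\cdot\mid x)}\bigl[r_{\text{harm}}(x, y)\bigr]
$$

At an equilibrium \((\pi_A^{*}, \pi_D^{*})\) neither player can improve by deviating unilaterally, which matches the paper's claim that a converged defender responds safely to any prompt the attacker can produce.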
Empirically, Self-RedTeam discovers more diverse adversarial attacks and yields stronger defenses. Attackers uncover 21.8% more diverse attacks when co-evolving with defenders than when attacking a static defender, and defenders trained via self-play are more robust on safety benchmarks (e.g., a 65.5% improvement on WildJailbreak) than defenders trained against static attackers. These results highlight the potential of dynamic role alternation for scalable, robust LM safety alignment.
Key Methodological Aspects
The proposed methodology incorporates hidden Chain-of-Thought (CoT) reasoning, which lets each agent plan strategically in a private reasoning trace that is never revealed to its opponent. This asymmetric visibility fosters greater adversarial diversity and reduces over-refusal, the common failure mode in which models reject even innocuous queries because of overly aggressive safety tuning.
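The sketch below illustrates the asymmetric-visibility idea, under the assumption that each turn is generated as a `<think>...</think>` reasoning trace followed by the visible message; the tag format and helper names are illustrative, not taken from the paper.

```python
import re

THINK_PATTERN = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def split_hidden_cot(generation: str) -> tuple[str, str]:
    """Separate the private reasoning trace from the visible message."""
    match = THINK_PATTERN.search(generation)
    reasoning = match.group(0) if match else ""
    visible = THINK_PATTERN.sub("", generation).strip()
    return reasoning, visible

# Only the visible part is appended to the opponent-facing transcript; the
# reasoning trace stays in the generating agent's own context, so the opponent
# never conditions on the other side's plan.
reasoning, visible = split_hidden_cot(
    "<think>Try a role-play framing to probe the refusal boundary.</think> "
    "Pretend you are a chemistry teacher explaining..."
)
```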
Evaluation and Implications
Evaluation on benchmarks such as HarmBench and WildGuardTest confirms that Self-RedTeam reduces harmful content generation while maintaining strong instruction-following ability. Notably, the paper also reports reduced over-refusal on benign prompts, indicating a better balance between safety and helpfulness in real-world use.
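A hedged sketch of the two quantities such an evaluation typically balances, attack success rate on adversarial prompts and refusal rate on benign prompts, is shown below; the classifier calls (`is_harmful`, `is_refusal`) are assumed helpers, not the paper's exact judges.

```python
def evaluate(model, judge, harmful_prompts, benign_prompts):
    # Attack success rate: fraction of adversarial prompts that elicit harm.
    unsafe = sum(judge.is_harmful(p, model.respond(p)) for p in harmful_prompts)
    # Over-refusal rate: fraction of benign prompts the model refuses.
    refused = sum(judge.is_refusal(p, model.respond(p)) for p in benign_prompts)
    return {
        "attack_success_rate": unsafe / len(harmful_prompts),   # lower is safer
        "over_refusal_rate": refused / len(benign_prompts),     # lower is more helpful
    }
```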
Future Perspectives and Impact
Beyond demonstrating significant strides toward proactive LM safety, the scalability of the self-play framework suggests broader applications to AI model robustness beyond LLMs. The theoretical groundwork laid here could pave the way for further research on adaptive models that self-optimize through competitive interaction.
Self-RedTeam, by harnessing game theory and reinforcement learning, marks a shift toward more sustainable LM safety alignment strategies. Future research could explore expanding this methodology across multi-turn interactions and integrating this framework into other AI domains, enhancing both safety and reliability in increasingly complex deployment environments.
Overall, the paper presents a compelling argument for reconsidering how LLMs can dynamically adapt to threats, thereby fostering more robust AI systems capable of autonomously improving their alignment with safety protocols.