
WaltzRL Framework: Dual-Agent RL for Safety

Updated 16 October 2025
  • WaltzRL is a multi-agent reinforcement learning framework designed for LLM safety alignment by dynamically balancing harmful content minimization and practical informativeness.
  • It employs a two-agent architecture with a conversation agent and a feedback agent, using a novel Dynamic Improvement Reward (DIR) mechanism to guide corrective revisions.
  • Empirical results demonstrate significant reductions in attack success and over-refusal rates, ensuring enhanced safety without compromising general language performance.

The WaltzRL framework constitutes a multi-agent reinforcement learning (RL) paradigm for LLM safety alignment, explicitly designed to balance the dual imperatives of minimizing harmful content (harmlessness) and maximizing practical informativeness (helpfulness). Unlike traditional safeguard systems that enforce unconditional rejection policies when unsafe content is detected, WaltzRL adopts a cooperative, dynamic feedback-based approach: a conversation agent generates initial responses to prompts, and a feedback agent, incentivized by a Dynamic Improvement Reward (DIR), adaptively guides the conversation agent to revise outputs deemed either unsafe or excessively cautious. This process achieves nuanced behavioral alignment without sacrificing general capabilities or efficiency.

1. Two-Agent Architecture and Cooperative Objectives

WaltzRL's foundational structure is a partnership between two agents with shared alignment objectives:

  • Conversation agent: Generates the initial response to a user prompt, employing general language modeling strategies.
  • Feedback agent: Evaluates the conversation agent's output for safety (avoiding harmful or policy-violating content) and for over-refusal (unnecessarily rejecting benign but sensitive queries). When issues are identified, the feedback agent produces actionable suggestions to correct the response.

Distinct from adversarial approaches, both agents are trained to maximize a joint objective—outputs that pass both safety and helpfulness criteria. The feedback agent is rewarded for both detection of problematic responses and, crucially, the degree to which its guidance improves subsequent conversation agent outputs.
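The cooperative loop described above can be sketched as follows. The function names and trigger heuristics are illustrative stand-ins, not the paper's actual interfaces; real agents would be LLM policies rather than string functions:

```python
from typing import Optional

def converse(prompt: str) -> str:
    """Stand-in conversation agent: produces an initial response."""
    return f"response to: {prompt}"

def give_feedback(prompt: str, response: str) -> Optional[str]:
    """Stand-in feedback agent: returns a suggestion, or None when the
    response is already safe and not over-refusing (heuristic check)."""
    if "unsafe" in response or response.startswith("I cannot"):
        return "revise: remove harmful content / answer the benign query"
    return None

def revise(prompt: str, response: str, feedback: str) -> str:
    """Conversation agent incorporates the feedback into a revision."""
    return f"revised response to: {prompt} (per: {feedback})"

def waltz_step(prompt: str) -> str:
    """One cooperative step: respond, get feedback, revise if needed."""
    response = converse(prompt)
    feedback = give_feedback(prompt, response)
    if feedback is None:  # compliant response: no intervention
        return response
    return revise(prompt, response, feedback)
```

Note that the feedback agent only intervenes on flagged responses, which is what keeps overhead low for compliant queries at inference time.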

2. Dynamic Improvement Reward (DIR) Mechanism

Central to WaltzRL is the DIR mechanism, which governs the feedback agent's reinforcement signal. The DIR quantifies improvement as the difference in the conversation agent's reward before and after feedback is incorporated. Let $R_c(\text{new})$ and $R_c(\text{old})$ denote the reward of the revised and initial responses, respectively; then

$$\text{DIR} = R_c(\text{new}) - R_c(\text{old})$$

This dynamic, time-evolving reward formulation ensures that as the conversation agent’s base policy improves, the feedback agent’s incentives and expected value evolve in kind. The DIR not only rewards identification of safety violations or overrefusal but also calibrates the feedback agent to generate more effective, actionable guidance.

The reward function $R_c$ is typically defined as an indicator function:

$$R_c(\text{state}, \text{response}) = \mathbb{1}\{\text{response is safe and not over-refusing}\}$$

Thus, feedback is recognized as valuable only to the extent that it transforms initially problematic responses into compliant ones.
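With the indicator reward above, the DIR can only take values in {-1, 0, +1}. A minimal sketch, assuming binary safety and over-refusal labels supplied by an external judge (function names are ours):

```python
def r_c(is_safe: bool, is_overrefusal: bool) -> int:
    """Indicator reward: 1 iff the response is safe and not over-refusing."""
    return int(is_safe and not is_overrefusal)

def dir_reward(old_safe: bool, old_overrefuse: bool,
               new_safe: bool, new_overrefuse: bool) -> int:
    """DIR = R_c(new) - R_c(old):
       +1 if feedback fixed a problematic response,
        0 if nothing changed,
       -1 if feedback degraded a compliant response."""
    return r_c(new_safe, new_overrefuse) - r_c(old_safe, old_overrefuse)
```

For example, an unsafe initial response corrected into a compliant revision yields `dir_reward(False, False, True, False) == 1`, so the feedback agent is rewarded for guidance that actually lands, not merely for flagging issues.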

3. Training and Inference Protocols

Training

WaltzRL employs a multi-agent reinforcement learning regimen, generally following a two-stage or collaborative rollout protocol:

  1. Feedback agent pre-training: The conversation agent’s policy is held fixed; the feedback agent learns to assign safety and overrefusal labels and to produce corrective feedback.
  2. Joint collaborative training: Both agents are updated simultaneously (typically via policy-gradient methods such as extensions of REINFORCE++). For each prompt, the conversation agent generates a response, the feedback agent intervenes if necessary, and the conversation agent revises its output. DIR is computed per trajectory, shaping both agents' learning.

Trajectories consist of sequences of prompts, responses, feedback suggestions, and revisions, with associated rewards determined by safety, helpfulness, and fine-grained improvement via DIR. This process encourages both detection and remediation, enforcing inter-agent cooperation.
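One joint rollout of this protocol can be sketched as follows. The policy and judge interfaces are simplified assumptions (the judge returns the binary $R_c$ directly), and the actual policy-gradient update is omitted:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trajectory:
    prompt: str
    response: str
    feedback: Optional[str]
    revision: Optional[str]
    conv_reward: int  # R_c on the final output
    fb_reward: int    # DIR credited to the feedback agent

def rollout(prompt, conv_policy, fb_policy, judge) -> Trajectory:
    """Collect one trajectory: respond, intervene if needed, revise."""
    response = conv_policy(prompt)
    old_r = judge(prompt, response)  # 1 iff safe and not over-refusing
    if old_r == 1:
        # Compliant response: no feedback, DIR contributes nothing.
        return Trajectory(prompt, response, None, None, old_r, 0)
    feedback = fb_policy(prompt, response)
    revision = conv_policy(prompt, feedback=feedback)
    new_r = judge(prompt, revision)
    # DIR = R_c(new) - R_c(old) shapes the feedback agent's update.
    return Trajectory(prompt, response, feedback, revision, new_r, new_r - old_r)
```

In stage 1, only `fb_policy` would be updated from such trajectories; in stage 2, both policies are updated, e.g. with a REINFORCE++-style estimator over the collected rewards.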

Inference

At deployment, the system operates adaptively. For a given prompt, the conversation agent produces its response, and the feedback agent evaluates the output for both policy violations and unnecessary refusals. Feedback is triggered only when issues are detected, so compliant queries incur minimal latency; on responses that are already safe and helpful, the feedback agent remains silent.

4. Safety and Helpfulness Assessment Metrics

WaltzRL’s evaluation regime employs quantifiable metrics across safety and informativeness dimensions:

| Metric | Definition | Desired Outcome |
| --- | --- | --- |
| Attack Success Rate (ASR) | Frequency of unsafe outputs on adversarial prompts | Lower (better safety) |
| Over-Refusal Rate (ORR) | Frequency of refusals on benign, sensitive queries | Lower (better helpfulness) |
| Feedback Trigger Rate (FTR) | Proportion of queries on which feedback is issued | Moderate, for efficiency |

These are computed on standardized and diverse datasets, measuring both robustness to adversarial attacks and responsiveness to challenging but legitimate information requests. Additional validation occurs via instruction-following and language understanding benchmarks to confirm that safety alignment does not degrade general performance.
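Given an evaluation log with per-query judge labels, these metrics reduce to simple counts. A sketch with illustrative field names (the record schema is our assumption, not the paper's):

```python
def compute_metrics(records):
    """records: list of dicts with boolean fields
       'adversarial', 'unsafe_output', 'benign_sensitive',
       'refused', and 'feedback_triggered'."""
    adv = [r for r in records if r["adversarial"]]
    ben = [r for r in records if r["benign_sensitive"]]
    # ASR: fraction of adversarial prompts that elicit unsafe output.
    asr = sum(r["unsafe_output"] for r in adv) / max(len(adv), 1)
    # ORR: fraction of benign, sensitive prompts that are refused.
    orr = sum(r["refused"] for r in ben) / max(len(ben), 1)
    # FTR: fraction of all queries on which feedback was issued.
    ftr = sum(r["feedback_triggered"] for r in records) / max(len(records), 1)
    return {"ASR": asr, "ORR": orr, "FTR": ftr}
```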

5. Empirical Results and Capabilities Preservation

Empirical analysis over five heterogeneous datasets demonstrates substantial gains relative to baseline systems:

  • On adversarial datasets (e.g., WildJailbreak), ASR was reduced from 39.0% (baseline) to 4.6% (WaltzRL), indicating marked improvement in avoiding unsafe outputs.
  • On overrefusal benchmarks (e.g., OR-Bench), ORR dropped from 45.3% (baseline) to 9.9% (WaltzRL), substantiating enhanced informativeness on benign queries.

Notably, instruction-following and general language comprehension benchmarks confirmed negligible degradation in general capabilities. The feedback agent’s adaptive engagement ensures low overhead: feedback is only issued when necessary, preserving latency and throughput for safe queries.

6. Broader Significance and Future Directions

WaltzRL advances the field of LLM safety alignment by recasting the safety-refusal tradeoff as a positive-sum collaborative RL game. The co-evolution of conversation and feedback agents enables a dynamic advancement of the Pareto frontier between helpfulness and harmlessness. The DIR structure motivates the feedback agent not only to identify issues but to actively optimize the quality of its guidance, shaping conversation agent policy improvement in situ.

Plausible future directions, suggested by the framework’s adaptive capacity, include:

  • Extending the framework to multi-round, iterative feedback, potentially boosting correction coverage.
  • Developing generalist feedback agents interoperable with a broad array of conversation agents.
  • Refining reward structures and exploration strategies to enhance convergence and optimize sample efficiency in RL training.
  • Scaling deployment to large, industrial models and exploring cross-modal feedback integration.

This approach signifies a shift from static rejection paradigms to dynamic feedback orchestration in LLM safety alignment, with experimental evidence for simultaneous safety enhancement and preservation of broad capabilities.
