MUA-RL: Multi-turn Agent RL
- MUA-RL is a reinforcement learning framework that integrates LLM-based user simulators in multi-turn dialogues to enhance agentic tool use.
- It models interactions as an MDP, capturing dialogue history and database responses to iteratively refine queries and tool calls under dynamic user intents.
- Empirical benchmarks show that incorporating GRPO and cold-start pretraining significantly improves task success rates across diverse multi-turn scenarios.
MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning) is a reinforcement learning framework for improving agentic tool use by LLMs in dynamic multi-turn dialogues with simulated users. Its principal innovation is integrating LLM-based user simulators directly into the RL training loop. This lets agents learn communication and tool-use strategies autonomously in environments where user intents evolve stochastically and task resolution requires iterative clarification and tool invocation (Zhao et al., 26 Aug 2025).
1. Motivation and Problem Formulation
The need for MUA-RL arises from fundamental limitations of existing supervised fine-tuning (SFT) and static-script RL approaches to tool-using LLMs. Traditional SFT exposes models only to static single-turn or fixed multi-turn trajectories. These lack the feedback-driven dynamics of real-world task-solving dialogue, such as retail customer support or flight booking, where user requirements continually co-evolve with agent responses.
Reinforcement learning lets the agent explore tool-calling and communication policies beyond those seen in demonstrations. However, prior RL approaches typically do not integrate responsive users: their rollout environments are limited to pre-scripted, one-directional user-agent interactions, or they rely on single-turn reward signals. MUA-RL addresses this by embedding a strong LLM-based user simulator within the RL rollout, so that user intent updates autonomously and the agent can refine its understanding iteratively (Zhao et al., 26 Aug 2025).
2. RL Framework Structure
The MUA-RL framework casts the multi-turn agentic tool use problem as a Markov Decision Process (MDP) characterized by:
State Space $\mathcal{S}$: At turn $t$, the state $s_t$ comprises the full dialogue history up to $t$ (user utterances $u_{1:t}$) and database (DB) observations from past tool calls $o_{1:t}$.
Action Space $\mathcal{A}$: At each turn, the agent may either:
- Emit a textual message to the user (for clarifying questions or updates)
- Invoke a DB tool with JSON arguments (for task-executing function calls)
Transitions: When the agent acts:
- If acting via tool call: output is the DB response ($o_{t+1}$)
- If acting via message: next user utterance is generated by the LLM user simulator ($u_{t+1}$)
Rewards: A sparse, outcome-only reward $r_T = 1$ if, at the final dialogue turn $T$, the agent's output meets all task specifications; $r_T = 0$ otherwise.
Objective: Optimize the discounted expected return
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$$
The stochastic policy $\pi_\theta(a_t \mid s_t)$ parameterizes agent output, and the value function $V^{\pi_\theta}(s_t)$ supports policy improvement (Zhao et al., 26 Aug 2025).
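To make the MDP components above concrete, here is a minimal Python sketch of the state, the two action types, the sparse outcome reward, and the discounted return. All class and function names (`State`, `Message`, `ToolCall`, `outcome_reward`, `discounted_return`) are hypothetical and chosen for illustration; the source does not specify a data layout or a value for the discount factor.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Message:
    """Free-form text the agent sends to the user (hypothetical type)."""
    text: str


@dataclass
class ToolCall:
    """A DB tool invocation with JSON-style arguments (hypothetical type)."""
    name: str
    arguments: dict


# An action is either a message to the user or a tool call.
Action = Union[Message, ToolCall]


@dataclass
class State:
    """Dialogue history plus DB observations accumulated so far."""
    user_utterances: List[str] = field(default_factory=list)
    db_observations: List[str] = field(default_factory=list)


def outcome_reward(task_satisfied: bool) -> float:
    """Sparse reward: 1 only if the final output meets all task specifications."""
    return 1.0 if task_satisfied else 0.0


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the turns of one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# With an outcome-only reward, every turn but the last contributes zero.
episode_rewards = [0.0, 0.0, outcome_reward(True)]
episode_return = discounted_return(episode_rewards, gamma=0.9)  # gamma is illustrative
```

Because the reward is nonzero only at the final turn, the return reduces to the discount factor raised to the final turn index, times the terminal reward.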
3. LLM-Simulated User Components and Rollout Design
The LLM-simulated user, implemented with models such as GPT-4o-2024-11-20, receives agent queries and emits contextually appropriate user responses drawn from a fixed persona and goal. In domains like TAU2 Telecom, user simulators can autonomously call tools and perform database queries, leading to true joint dynamism between user and agent (Zhao et al., 26 Aug 2025).
The RL rollout proceeds as follows:
Pseudocode Outline:
- Initialize $s_0 \leftarrow$ user's initial query.
- For $t = 0, 1, \dots, T-1$:
  - Sample $a_t \sim \pi_\theta(\cdot \mid s_t)$.
  - If $a_t$ is a tool call:
    - Execute on real DB, store $o_{t+1}$, update state.
  - If $a_t$ is a message:
    - Pass to user simulator, receive $u_{t+1}$, update state.
- Compute terminal reward $r_T$.
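The outline above can be sketched as a small Python loop. This is an illustrative toy, not the authors' implementation: the policy, user simulator, DB, and judge are hypothetical stand-in functions, and the convention that the simulated user ends the dialogue by returning `None` is an assumption, since the source does not describe the termination rule.

```python
import json
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (role, content) with role in {"user", "agent", "tool"}


def rollout(
    initial_query: str,
    policy: Callable[[List[Turn]], Tuple[str, str]],
    user_sim: Callable[[List[Turn]], Optional[str]],
    run_tool: Callable[[str], str],
    judge: Callable[[List[Turn]], bool],
    max_turns: int = 8,
) -> Tuple[List[Turn], float]:
    """Run one episode and return (history, terminal reward)."""
    history: List[Turn] = [("user", initial_query)]
    for _ in range(max_turns):
        kind, payload = policy(history)  # sample an action from the policy
        history.append(("agent", payload))
        if kind == "tool":
            # Tool call: the environment answers with a DB observation.
            history.append(("tool", run_tool(payload)))
        else:
            # Message: the simulated user answers (None = dialogue over, assumed).
            reply = user_sim(history)
            if reply is None:
                break
            history.append(("user", reply))
    reward = 1.0 if judge(history) else 0.0  # sparse, outcome-only reward
    return history, reward


# --- Hypothetical stand-ins for the real policy, simulator, DB, and judge ---
def toy_policy(history: List[Turn]) -> Tuple[str, str]:
    if not any(role == "tool" for role, _ in history):
        call = {"name": "lookup_order", "arguments": {"order_id": "A1"}}
        return "tool", json.dumps(call)
    return "message", "Your order A1 has shipped."


def toy_user(history: List[Turn]) -> Optional[str]:
    return None  # a satisfied user ends the conversation


def toy_db(call_json: str) -> str:
    call = json.loads(call_json)
    return json.dumps({"order_id": call["arguments"]["order_id"], "status": "shipped"})


def toy_judge(history: List[Turn]) -> bool:
    return history[-1] == ("agent", "Your order A1 has shipped.")


history, reward = rollout("Where is my order A1?", toy_policy, toy_user, toy_db, toy_judge)
```

In a real setup the policy would be the LLM being trained and the user simulator a separate LLM; the loop structure stays the same.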
Iterative Intent Refinement: The agent may ask for clarification (e.g., "Which color do you prefer?"), and simulated users evolve their prompts and constraints in response, exposing the agent to high-variance, intent-shifting task environments essential for robust tool use policy learning.
4. Agent Architecture and Optimization Routine
4.1 Model and Pretraining
- Policy Networks: Qwen3-8B/14B/32B LLMs form the backbone, with flattened dialogue history and function-calling tool schemas as inputs. Output heads can emit structured function-call JSON or free-form agent utterances.
- Cold-Start Supervision: To initialize tool fluency, 32K multi-turn trajectories are synthesized using LLM-simulated tools and real-world MCP server calls. SFT is performed for 2 epochs (batch size 128, 4, AdamW).
4.2 Reinforcement Learning: Group Relative Policy Optimization
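- Algorithm: Group Relative Policy Optimization (GRPO) is employed for policy improvement. For a user query $q$:
  - Sample $G$ candidate responses $\{y_i\}_{i=1}^{G}$ under the old policy $\pi_{\theta_{\text{old}}}$.
  - Compute advantage $A_i = \dfrac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}$.
  - Optimize the clipped surrogate objective with KL regularization to prevent policy collapse and maintain exploration (shown here in the standard GRPO form):
$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{y_i\} \sim \pi_{\theta_{\text{old}}}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\Big(\rho_{i,t} A_i,\ \operatorname{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big) A_i\Big) - \beta\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\right]$$
where $\rho_{i,t} = \dfrac{\pi_\theta(y_{i,t} \mid q, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid q, y_{i,<t})}$ is the token-level probability ratio, $\epsilon$ is the clipping range, $\beta$ is the KL coefficient, and $G$ is the group size.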
- Loss Masking: The loss is not backpropagated through tokens emitted by the simulator or DB tools, ensuring learning occurs only from agent outputs.
- Exploration: Rollout temperature 1.0.
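The group-relative advantage, the clipped ratio term, and the loss masking described in this list can be illustrated with a short pure-Python sketch. It is a simplified illustration under stated assumptions rather than the training code: the function names are hypothetical, the KL penalty is omitted, the population standard deviation with a small stabilizing constant is one common choice among several, and the clipping value 0.2 in the usage lines is purely illustrative.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new: float, logp_old: float, advantage: float, clip_eps: float) -> float:
    """PPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


def masked_surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    agent_mask: Sequence[bool],
    advantage: float,
    clip_eps: float,
) -> float:
    """Average the clipped terms over agent-generated tokens only.

    Tokens produced by the user simulator or by DB tools carry mask=False
    and therefore contribute nothing to the objective.
    """
    terms = [
        clipped_term(n, o, advantage, clip_eps)
        for n, o, keep in zip(logp_new, logp_old, agent_mask)
        if keep
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Four rollouts for one query, two of which solved the task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Three tokens: two from the agent, one from the simulator (masked out).
objective = masked_surrogate(
    logp_new=[0.0, 0.5, 0.0],
    logp_old=[0.0, 0.0, 0.0],
    agent_mask=[True, True, False],
    advantage=1.0,
    clip_eps=0.2,
)
```

Note that when every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one practical consequence of a sparse outcome-only reward.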
5. Empirical Results and Benchmarking
MUA-RL is validated on multi-turn tool-use benchmarks, reported below across seven evaluation settings:
| Benchmark | MUA-RL-32B | Comparator Models | Metric |
|---|---|---|---|
| TAU1 Retail | 72.6% | DeepSeek-V3 (70.4%) | Task Success Rate |
| TAU1 Airline | 46.5% | GPT-4.1 (42.5%) | Task Success Rate |
| TAU2 Retail | 67.3% | GPT-4.1 (70.2%) | Task Success Rate |
| TAU2 Airline | 45.4% | GPT-4o (46.9%) | Task Success Rate |
| TAU2 Telecom | 28.3% | GPT-4o (24.1%), TCR 45.1% | Task Success Rate, TCR |
| BFCL-V3 Multi Turn | 28.4% | DeepSeek-V3 (29.8%) | Executable Function Accuracy |
| ACEBench Agent | 82.5% | GPT-4.1 (86.7%) | Overall Agent Score |
Ablation studies indicate that removing cold-start pretraining results in 9–12 point drops across TAU2 and BFCL, while removing RL entirely reduces performance to near-supervised baselines (Zhao et al., 26 Aug 2025).
Training Dynamics Insights:
- KL divergence increases with RL phases, with larger models drifting more smoothly from their initialization.
- Entropy drops and then stabilizes, reflecting the transition from exploratory to exploitative policy regimes.
- Dialogue rollout turns per episode initially increase and then stabilize, indicating improved multi-turn task structure.
- Use of “generic” tools diminishes over RL phases, with a shift toward more precise, context-relevant tool invocation.
- Declining unique 4-gram ratios in larger models suggest a reliance on precise, repetitive tool usage over linguistic diversity.
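As a rough illustration of the last diagnostic, a unique n-gram ratio can be computed as the number of distinct n-grams divided by the total number of n-grams in a sequence. The sketch below assumes whitespace tokenization and a single sequence; the source does not state how the ratio was tokenized or aggregated, and `unique_ngram_ratio` is a hypothetical helper name.

```python
from typing import Sequence


def unique_ngram_ratio(tokens: Sequence[str], n: int = 4) -> float:
    """Distinct n-grams divided by total n-grams; 0.0 if the sequence is too short."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0


# A repetitive sequence repeats one 4-gram, so the ratio falls below 1.
repetitive = unique_ngram_ratio("a b c d a b c d".split())
# A sequence with no repeated 4-gram has ratio 1.
varied = unique_ngram_ratio("a b c d e".split())
```

A lower ratio means more repeated phrasing, which is consistent with the reading above that larger models converge on precise, repetitive tool usage.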
6. Limitations and Prospects
- Sparse, outcome-only reward slows credit assignment; exploration of turn-level or shaped rewards may improve convergence without inducing reward hacking.
- User simulators currently represent a fixed persona; future work must address robustness to real human diversity and fluctuating profiles.
- The framework is presently limited to text-based interaction and fixed tool suites; expansion to multimodal signals and evolving tool sets is suggested.
- Integrating online RL from live users with safety interventions is a proposed route for increased real-world applicability (Zhao et al., 26 Aug 2025).
7. Significance and Impact
MUA-RL establishes a paradigm for agentic LLM training that actively incorporates the dynamism of genuine interactive dialogue and evolving user needs. By leveraging LLM-based user simulation and end-to-end multi-turn RL with state-of-the-art optimization (GRPO), it achieves competitive or superior performance relative to much larger-scale open-source models across multiple complex tool-using benchmarks. This architecture demonstrates the necessity and tractability of dynamic user modeling in agentic tool use and provides an extensible basis for further work on robust, general-purpose agent frameworks in real-world, user-facing environments (Zhao et al., 26 Aug 2025).