
MUA-RL: Multi-turn Agent RL

Updated 22 April 2026
  • MUA-RL is a reinforcement learning framework that integrates LLM-based user simulators in multi-turn dialogues to enhance agentic tool use.
  • It models interactions as an MDP, capturing dialogue history and database responses to iteratively refine queries and tool calls under dynamic user intents.
  • Empirical benchmarks show that incorporating GRPO and cold-start pretraining significantly improves task success rates across diverse multi-turn scenarios.

MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning) is a reinforcement learning framework for agentic tool use by LLMs in dynamic multi-turn dialogues with simulated users. Its principal innovation is integrating LLM-based user simulators directly into the RL training loop, so that agents learn communication and tool-use strategies autonomously in environments where user intents evolve stochastically and resolving a task requires iterative clarification and tool invocation (Zhao et al., 26 Aug 2025).

1. Motivation and Problem Formulation

MUA-RL is motivated by limitations of existing supervised fine-tuning (SFT) and static-script RL approaches to tool-using LLMs. Traditional SFT exposes models only to static single-turn or fixed multi-turn trajectories, which lack the feedback-driven dynamics of real-world task-solving dialogue. In retail customer support or flight booking, for example, user requirements keep changing in response to what the agent says.

Reinforcement learning lets the agent explore tool-calling and communication policies beyond those found in demonstrations. However, prior RL approaches typically leave out a responsive user: their rollout environments are limited to pre-scripted, one-directional user-agent interactions, or they rely on single-turn reward signals. MUA-RL addresses this by embedding a strong LLM-based user simulator in the RL rollout, so that the user's intent updates autonomously and the agent can refine its queries iteratively (Zhao et al., 26 Aug 2025).

2. RL Framework Structure

The MUA-RL framework casts the multi-turn agentic tool use problem as a Markov Decision Process (MDP) characterized by:

State Space $\mathcal{S}$: At turn $t$, the state $s_t$ comprises the full dialogue history up to $t$ (user utterances $o_{1:t}^{\text{user}}$) and database (DB) observations from past tool calls $o_{1:t'}^{\text{db}}$.

Action Space $\mathcal{A}$: At each turn, the agent may either:

  • Emit a textual message $m \in \mathcal{M}$ to the user (for clarifying questions or updates)
  • Invoke a DB tool $t_i \in \mathcal{T}$ with JSON arguments (for task-executing function calls)

Transitions: When the agent acts:

  • If acting via tool call: output is the DB response ($o_t^{\text{db}}$)
  • If acting via message: next user utterance is generated by the LLM user simulator ($o_{t+1}^{\text{user}}$)

Rewards: A sparse, outcome-only reward $r = 1$ if, at the final dialogue turn, the agent's output meets all task specifications; $r = 0$ otherwise.

Objective: Optimize the discounted expected return

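$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$$

where $\gamma \in (0, 1]$ is the discount factor, $T$ is the final turn, and $r_t$ is the reward at turn $t$ (nonzero only at $t = T$ under the sparse outcome reward).

The stochastic policy $\pi_\theta(a_t \mid s_t)$ parameterizes agent output, and the value function $V^{\pi_\theta}(s_t)$ supports policy improvement (Zhao et al., 26 Aug 2025).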
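
The MDP ingredients described in this section can be written down as plain data types. The following is a minimal sketch, not the authors' implementation: the class names `Message`, `ToolCall`, and `State`, the helper names, and the example tool name `get_order` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Message:
    """Free-form text sent to the user (hypothetical type)."""
    text: str


@dataclass
class ToolCall:
    """Invocation of a DB tool with JSON-style arguments (hypothetical type)."""
    name: str
    arguments: dict


# An action is either a message to the user or a tool call.
Action = Union[Message, ToolCall]


@dataclass
class State:
    """Dialogue state: user utterances and DB observations seen so far."""
    user_utterances: list = field(default_factory=list)
    db_observations: list = field(default_factory=list)


def sparse_reward(task_satisfied: bool) -> float:
    """Outcome-only reward: 1 if every task specification is met, else 0."""
    return 1.0 if task_satisfied else 0.0


def discounted_return(rewards: list, gamma: float) -> float:
    """Sum of gamma**k * r_k, with the first turn (k = 0) undiscounted."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# A four-turn episode that succeeds only at the final turn.
episode_rewards = [0.0, 0.0, 0.0, sparse_reward(True)]
episode_return = discounted_return(episode_rewards, gamma=0.9)

# One action of each kind (the tool name is made up for this example).
example_actions = [
    Message("Which color do you prefer?"),
    ToolCall("get_order", {"order_id": "A1"}),
]
```

Because the reward is nonzero only at the last turn, the return of a successful episode reduces to a single discounted term, which is why credit assignment across earlier turns is difficult.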

3. LLM-Simulated User Components and Rollout Design

The LLM-simulated user, implemented with models such as GPT-4o-2024-11-20, receives agent queries and emits contextually appropriate user responses drawn from a fixed persona and goal. In domains like TAU2 Telecom, user simulators can autonomously call tools and perform database queries, leading to true joint dynamism between user and agent (Zhao et al., 26 Aug 2025).

The RL rollout proceeds as follows:

Pseudocode Outline:

  • Initialize $s_1 \leftarrow$ user's initial query.
  • For $t = 1, \dots, T$:
    • Sample $a_t \sim \pi_\theta(\cdot \mid s_t)$.
    • If $a_t$ is a tool call:
      • Execute on real DB, store $o_t^{\text{db}}$, update state.
    • If $a_t$ is a message:
      • Pass to user simulator, receive $o_{t+1}^{\text{user}}$, update state.
  • Compute terminal reward $r$.
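
The outline above can be sketched as a short Python loop. This is an illustrative toy under stated assumptions, not the paper's code: the function `rollout`, the dictionary-based action format, the stub policy, the stub user simulator, the in-memory "database", and the `order_item` tool are all hypothetical stand-ins for the LLM policy, the LLM user simulator, and the real DB tools.

```python
def rollout(policy, user_sim, db_execute, check_success, initial_query, max_turns=8):
    """Run one episode and return (history, terminal_reward)."""
    history = [("user", initial_query)]
    for _ in range(max_turns):
        action = policy(history)  # a_t sampled from the policy given the state
        history.append(("agent", action))
        if action["type"] == "tool_call":
            # Tool call: execute against the DB and store the observation.
            observation = db_execute(action["name"], action["arguments"])
            history.append(("db", observation))
        else:
            # Message: the user simulator produces the next user utterance.
            reply = user_sim(history)
            if reply is None:  # the simulated user ends the conversation
                break
            history.append(("user", reply))
    reward = 1.0 if check_success(history) else 0.0  # sparse outcome reward
    return history, reward


# --- Toy stand-ins (all hypothetical) ---
orders = []


def toy_policy(history):
    user_turns = [text for role, text in history if role == "user"]
    called_tool = any(role == "db" for role, _ in history)
    if len(user_turns) == 1:
        return {"type": "message", "text": "Which color do you prefer?"}
    if not called_tool:
        return {"type": "tool_call", "name": "order_item",
                "arguments": {"color": user_turns[-1]}}
    return {"type": "message", "text": "Your order is placed."}


def toy_user_sim(history):
    last_agent_action = history[-1][1]
    # Answer clarifying questions; otherwise end the dialogue.
    return "blue" if last_agent_action["text"].endswith("?") else None


def toy_db_execute(name, arguments):
    orders.append((name, arguments))
    return {"status": "ok"}


def toy_check_success(history):
    return orders == [("order_item", {"color": "blue"})]


history, reward = rollout(toy_policy, toy_user_sim, toy_db_execute,
                          toy_check_success, "I want to buy a shirt.")
roles = [role for role, _ in history]
```

In this toy run the agent asks one clarifying question, issues one tool call, and confirms, so the episode alternates between user-simulator turns and DB turns exactly as in the outline.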

Iterative Intent Refinement: The agent may ask for clarification (e.g., "Which color do you prefer?"), and simulated users evolve their prompts and constraints in response, exposing the agent to high-variance, intent-shifting task environments essential for robust tool use policy learning.

4. Agent Architecture and Optimization Routine

4.1 Model and Pretraining

  • Policy Networks: Qwen3-8B/14B/32B LLMs form the backbone, with flattened dialogue history and function-calling tool schemas as inputs. Output heads can emit structured function-call JSON or free-form agent utterances.
  • Cold-Start Supervision: To initialize tool fluency, approximately 2K multi-turn trajectories are synthesized using LLM-simulated tools and real-world MCP server calls. SFT is performed for 2 epochs (batch size 128, AdamW).

4.2 Reinforcement Learning: Group Relative Policy Optimization

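  • Algorithm: Group Relative Policy Optimization (GRPO) is employed for policy improvement. For a user query $q$:

    • Sample $G$ candidate responses $\{y_i\}_{i=1}^{G}$ under the old policy $\pi_{\theta_{\text{old}}}$.
    • Compute advantage $A_i = \dfrac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}$, where $r_i$ is the outcome reward of response $y_i$.
    • Optimize the clipped surrogate objective with KL regularization to prevent policy collapse and maintain exploration:

    $$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{k=1}^{|y_i|}\min\left(\rho_{i,k} A_i,\ \operatorname{clip}(\rho_{i,k},\, 1-\epsilon,\, 1+\epsilon)\, A_i\right)\right] - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

    with importance ratio $\rho_{i,k} = \pi_\theta(y_{i,k} \mid q, y_{i,<k}) / \pi_{\theta_{\text{old}}}(y_{i,k} \mid q, y_{i,<k})$, clipping range $\epsilon$, and KL coefficient $\beta$.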

  • Loss Masking: The loss is not backpropagated through tokens emitted by the simulator or DB tools, ensuring learning occurs only from agent outputs.
  • Exploration: Rollout temperature 1.0.
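
To make the group-relative advantage, the clipping, and the loss masking concrete, here is a small pure-Python sketch. It is an illustration under assumptions rather than the training code: the function names are hypothetical, the population standard deviation and the small `eps` stabilizer are choices made here, the clip value 0.2 in the example is an arbitrary illustrative number, and the KL penalty term is omitted for brevity.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, clip_eps):
    """PPO-style clipped surrogate for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def masked_sequence_objective(logps_new, logps_old, agent_mask, advantage, clip_eps):
    """Average the clipped objective over agent-generated tokens only.

    Tokens produced by the user simulator or DB tools carry mask 0 and
    therefore contribute nothing to the objective.
    """
    terms = [
        clipped_token_objective(ln, lo, advantage, clip_eps)
        for ln, lo, keep in zip(logps_new, logps_old, agent_mask)
        if keep
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Four rollouts for one query: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# The middle token came from the simulator, so it is masked out.
objective = masked_sequence_objective(
    logps_new=[0.0, 5.0, 0.0],
    logps_old=[0.0, 0.0, 0.0],
    agent_mask=[1, 0, 1],
    advantage=1.0,
    clip_eps=0.2,
)
```

Note that when every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one practical consequence of combining a sparse binary reward with group normalization.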

5. Empirical Results and Benchmarking

MUA-RL is validated on multi-turn tool-use benchmarks, covering seven evaluation settings from TAU1, TAU2, BFCL-V3, and ACEBench:

| Benchmark          | MUA-RL-32B | Comparator Models        | Metric                       |
|--------------------|------------|--------------------------|------------------------------|
| TAU1 Retail        | 72.6%      | DeepSeek-V3 (70.4%)      | Task Success Rate            |
| TAU1 Airline       | 46.5%      | GPT-4.1 (42.5%)          | Task Success Rate            |
| TAU2 Retail        | 67.3%      | GPT-4.1 (70.2%)          | Task Success Rate            |
| TAU2 Airline       | 45.4%      | GPT-4o (46.9%)           | Task Success Rate            |
| TAU2 Telecom       | 28.3%      | GPT-4o (24.1%), TCR 45.1% | Task Success Rate, TCR       |
| BFCL-V3 Multi Turn | 28.4%      | DeepSeek-V3 (29.8%)      | Executable Function Accuracy |
| ACEBench Agent     | 82.5%      | GPT-4.1 (86.7%)          | Overall Agent Score          |
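
The per-benchmark margins implied by the table can be checked with a few lines of arithmetic. The snippet below only restates the primary scores listed above (the TCR figure is left out); the variable names are chosen here for illustration, and each margin is in percentage points of that row's own metric, so margins are not comparable across rows.

```python
# (MUA-RL-32B score, comparator name, comparator score), copied from the table.
results = {
    "TAU1 Retail": (72.6, "DeepSeek-V3", 70.4),
    "TAU1 Airline": (46.5, "GPT-4.1", 42.5),
    "TAU2 Retail": (67.3, "GPT-4.1", 70.2),
    "TAU2 Airline": (45.4, "GPT-4o", 46.9),
    "TAU2 Telecom": (28.3, "GPT-4o", 24.1),
    "BFCL-V3 Multi Turn": (28.4, "DeepSeek-V3", 29.8),
    "ACEBench Agent": (82.5, "GPT-4.1", 86.7),
}

# Margin in percentage points: positive means MUA-RL-32B scores higher.
margins = {
    name: round(ours - theirs, 1)
    for name, (ours, _, theirs) in results.items()
}

ahead = [name for name, margin in margins.items() if margin > 0]
behind = [name for name, margin in margins.items() if margin < 0]

for name, margin in margins.items():
    comparator = results[name][1]
    print(f"{name}: {margin:+.1f} points vs {comparator}")
```

On these numbers MUA-RL-32B is ahead of the listed comparator in three settings (TAU1 Retail, TAU1 Airline, TAU2 Telecom) and behind in four, with every margin within about four points.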

Ablation studies indicate that removing cold-start pretraining results in 9–12 point drops across TAU2 and BFCL, while removing RL entirely reduces performance to near-supervised baselines (Zhao et al., 26 Aug 2025).

Training Dynamics Insights:

  • KL divergence increases with RL phases, with larger models drifting more smoothly from their initialization.
  • Entropy drops and then stabilizes, reflecting the transition from exploratory to exploitative policy regimes.
  • Dialogue rollout turns per episode initially increase and then stabilize, indicating improved multi-turn task structure.
  • Use of “generic” tools diminishes over RL phases, with a shift toward more precise, context-relevant tool invocation.
  • Declining unique 4-gram ratios in larger models suggest a reliance on precise, repetitive tool usage over linguistic diversity.
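
A unique n-gram ratio of the kind mentioned in the last point can be computed as the number of distinct n-grams divided by the total number of n-grams. The sketch below assumes that definition and simple whitespace tokenization; the source does not specify how the statistic was computed, so both are assumptions, as is the function name.

```python
def unique_ngram_ratio(tokens, n=4):
    """Fraction of n-grams in `tokens` that are distinct (0.0 if none exist)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# A repetitive output reuses the same 4-gram, lowering the ratio.
repetitive = "call tool with args call tool with args".split()
varied = "please tell me which color and size you want".split()

repetitive_ratio = unique_ngram_ratio(repetitive)
varied_ratio = unique_ngram_ratio(varied)
```

A ratio near 1 indicates little repeated phrasing, while lower values indicate that the same token sequences recur, as would be expected when outputs are dominated by similarly structured tool calls.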

6. Limitations and Prospects

  • Sparse, outcome-only reward slows credit assignment; exploration of turn-level or shaped rewards may improve convergence without inducing reward hacking.
  • User simulators currently represent a fixed persona; future work must address robustness to real human diversity and fluctuating profiles.
  • The framework is presently limited to text-based interaction and fixed tool suites; expansion to multimodal signals and evolving tool sets is suggested.
  • Integrating online RL from live users with safety interventions is a proposed route for increased real-world applicability (Zhao et al., 26 Aug 2025).

7. Significance and Impact

MUA-RL establishes a paradigm for agentic LLM training that incorporates the dynamics of interactive dialogue and evolving user needs. By combining LLM-based user simulation with end-to-end multi-turn RL optimized via GRPO, it achieves competitive or superior performance relative to much larger-scale open-source models across multiple complex tool-using benchmarks. The results indicate that dynamic user modeling is both important and tractable for agentic tool use, and the framework provides an extensible basis for further work on robust, general-purpose agents in real-world, user-facing environments (Zhao et al., 26 Aug 2025).
