Multi-Turn Reinforcement Learning

Updated 15 September 2025
  • Multi-Turn Reinforcement Learning is an iterative framework where agents optimize actions across extended, sequential interactions using delayed rewards and advanced credit assignment techniques.
  • It employs hierarchical and token-level approaches to decompose decision processes, significantly improving sample efficiency and stability in long-horizon tasks.
  • This paradigm underpins applications in dialogue systems, software engineering, and tool use by effectively managing partial observability and gradient optimization challenges.

Multi-Turn Reinforcement Learning (RL) refers to the application and development of reinforcement learning methodologies in environments and tasks where agents repeatedly interact across multiple rounds, turns, or conversational cycles, updating their strategies and leveraging sequential feedback rather than operating in a single, isolated decision step. In the context of LLMs, multi-turn RL enables agents to engage in goal-directed, temporally extended decision-making, manage delayed rewards, and assign credit across long horizons—capabilities essential for real-world applications spanning dialogue systems, tool use, software engineering, web interaction, and complex information retrieval.

1. Foundations and Challenges of Multi-Turn RL

Multi-turn RL extends the standard RL paradigm—traditionally focused on single-step or episode-length credit assignment—by introducing new complexities:

  • Long-Horizon Decision Sequences: Agents must make a series of dependent decisions, incurring sparse, delayed, or trajectory-level rewards—not just immediate feedback.
  • Credit Assignment: Identifying which individual actions in a long sequence contribute to eventual success.
  • Partial Observability and Non-Markovian States: States are typically concatenations of rich histories (e.g., entire conversation traces or action–observation sequences) rather than stateless or fully observed Markov processes.
  • Stability and Sample Efficiency: On-policy algorithms (e.g., PPO) often face instability and sample inefficiency when applied to highly compositional or interactive text environments (Abdulhai et al., 2023).
  • Distributional Drift: Especially when reasoning over external feedback (e.g., tool outputs), distribution shift can cause instability—manifesting, for example, as gradient explosions in policy optimization (Xue et al., 2 Sep 2025).

These challenges are reflected in environments such as LMRL-Gym, which provides a suite of eight multi-turn language tasks including text-based games, question-answering with information gathering, and negotiation (Abdulhai et al., 2023). Empirical observations across benchmarks demonstrate that trajectory stitching, partial observability, and credit assignment are bottlenecks for effective multi-turn RL.

2. Core Methodological Approaches

2.1 Hierarchical and Utterance-Level RL

Hierarchical RL decomposes the multi-turn process into levels operating at different granularity:

  • High-Level (Utterance/Turn-Level): Each action corresponds to a full utterance or a dialogue turn. The value function (critic) is trained off-policy (e.g., via temporal-difference learning) over utterances, enabling more efficient credit assignment across turns.
  • Low-Level (Token-Level): Token generation is treated as a lower-level MDP embedded within each utterance; the actor is optimized on-policy using policy gradients with utterance-level rewards as terminal signals (Zhou et al., 29 Feb 2024).

This architecture, exemplified by frameworks such as ArCHer, reduces effective horizon, allows for dramatic (∼100x) improvements in sample efficiency, and enables tractable learning and planning over extended sequences.
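
Below is a minimal sketch of this two-level update in a PyTorch-style setup. The module interfaces (`critic(history, utterance)`, `policy.log_probs(history, tokens)`), the batch fields, and the loss shapes are illustrative assumptions rather than the ArCHer implementation: the critic is regressed off-policy onto TD targets over utterance transitions, and the token-level actor is updated on-policy using the critic's utterance value as a shared advantage signal.

```python
import torch
import torch.nn.functional as F

def td_critic_loss(critic, target_critic, batch, gamma=0.99):
    # Off-policy TD(0) regression over utterance-level transitions sampled
    # from a replay buffer: (history, utterance, reward, next_history,
    # next_utterance, done). `critic(...)` returns Q-values of shape (B,).
    q = critic(batch["history"], batch["utterance"])
    with torch.no_grad():
        q_next = target_critic(batch["next_history"], batch["next_utterance"])
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * q_next
    return F.mse_loss(q, target)

def token_level_actor_loss(policy, critic, history, utterance_tokens):
    # On-policy policy gradient for the token-level actor: the critic's
    # utterance-level value serves as a terminal advantage shared by every
    # token of the sampled utterance. `log_probs` returns shape (B, T).
    logps = policy.log_probs(history, utterance_tokens)
    with torch.no_grad():
        adv = critic(history, utterance_tokens)
        adv = adv - adv.mean()          # simple batch baseline
    return -(logps.sum(dim=-1) * adv).mean()
```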

2.2 Group-Relative and Turn-Level Advantage Estimation

Turn-level credit assignment improves agent learning by attributing reward and advantage more precisely at each interaction step:

  • Group Relative Policy Optimization (GRPO): Advantages at each turn are computed by normalizing rewards against a batch or group of parallel trajectories (a code sketch appears at the end of this subsection), e.g.,

A_{i,t} = \frac{R_i - \operatorname{mean}(\{R_k\})}{\operatorname{std}(\{R_k\})}

  • Multi-Turn GRPO (MT-GRPO): Distinguishes turn-level and outcome-level advantages; for two-turn tool use tasks,

A^{(\text{MT-GRPO})}_{i,1} = A^{(T)}_i + \lambda A^{(O)}_i, \qquad A^{(\text{MT-GRPO})}_{i,2} = A^{(O)}_i

Empirical results indicate that MT-GRPO achieves 100% success in tool invocation and 50% exact match accuracy, significantly outperforming baselines that ignore intermediate step rewards (Zeng et al., 17 May 2025).
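
The sketch below implements both advantage schemes exactly as defined above. It uses NumPy; the function names, the small-epsilon stabilizer in the denominator, and the two-turn packaging are illustrative choices, not the reference MT-GRPO code.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Normalize each trajectory's reward against its group of parallel
    # rollouts for the same prompt (eps guards against zero std).
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def mt_grpo_advantages(turn_rewards, outcome_rewards, lam=1.0):
    # Two-turn tool-use case: A_1 = A^(T) + lambda * A^(O), A_2 = A^(O),
    # where A^(T) and A^(O) are group-normalized turn-level and
    # outcome-level rewards, respectively.
    a_turn = grpo_advantages(turn_rewards)
    a_outcome = grpo_advantages(outcome_rewards)
    return a_turn + lam * a_outcome, a_outcome

# Example: four parallel rollouts for one prompt.
a1, a2 = mt_grpo_advantages([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0])
```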

2.3 Preference-Based and Trajectory-Level Optimization

RL from Human Feedback (RLHF) is extended to multi-turn settings by:

  • Soliciting feedback and optimizing utility over entire trajectories (full conversations), not isolated actions.
  • Using mirror-descent–based policy optimization and preference-based Q-functions, with regularization toward an anchor (reference) policy (sketched in code below):

\pi_{t+1}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)^{\alpha\eta_t} \cdot \pi_t(y \mid x)^{1-\alpha\eta_t} \cdot \exp\!\left(\eta_t\, Q_a^{(\pi_t, \pi_t)}(x, y)\right)

This design, showcased in MTPO and its variants, yields provable convergence to Nash equilibria and empirical gains in task scenarios requiring holistic, multi-turn alignment (Shani et al., 23 May 2024).
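
As a concrete reading of the update rule above, the sketch below applies it in log space over an enumerated set of candidate responses; in practice the update is realized through regularized training of an LLM policy, so the explicit softmax normalization here is an illustrative simplification, and the function name and tensor shapes are assumptions.

```python
import torch.nn.functional as F

def mirror_descent_step(logp_ref, logp_t, q_values, eta, alpha):
    # One mirror-descent-style update over a discrete candidate set:
    #   pi_{t+1}(y|x)  ∝  pi_ref(y|x)^(alpha*eta)
    #                   * pi_t(y|x)^(1 - alpha*eta)
    #                   * exp(eta * Q(x, y))
    # logp_ref, logp_t, q_values: tensors of shape (num_candidates,).
    logits = alpha * eta * logp_ref + (1.0 - alpha * eta) * logp_t + eta * q_values
    return F.log_softmax(logits, dim=-1)   # normalized log pi_{t+1}(.|x)
```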

2.4 Stochastic Rollout Management and Stability

Stabilizing multi-turn RL requires careful handling of pathological rollouts, especially those containing low-probability “void” turns (e.g., no code block or final answer produced):

  • Trajectory Filtering: Algorithms like SimpleTIR detect and filter out void-turn trajectories from gradient updates, directly mitigating gradient explosion from distributional drift caused by tool feedback (Xue et al., 2 Sep 2025).
  • Gated Reward Accumulation: G-RA accumulates immediate stepwise rewards only when a higher-priority (e.g., outcome-level) reward meets a specified threshold, preventing reward hacking and aligning optimization with long-term objectives (Sun et al., 14 Aug 2025). Both mechanisms are sketched below.
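
A combined sketch of both stabilizers, assuming a simple dictionary representation of rollouts; the turn flags (`has_code`, `has_answer`), the thresholding rule, and the way rewards are summed are illustrative assumptions rather than the exact SimpleTIR or G-RA procedures.

```python
def filter_void_trajectories(trajectories):
    # Drop any rollout containing a "void" turn, i.e., a turn that produced
    # neither a code block nor a final answer, before policy-gradient updates.
    def has_void_turn(traj):
        return any(not (turn.get("has_code") or turn.get("has_answer"))
                   for turn in traj["turns"])
    return [t for t in trajectories if not has_void_turn(t)]

def gated_return(step_rewards, outcome_reward, threshold=0.5):
    # Count intermediate step rewards only when the higher-priority
    # outcome-level reward clears the threshold, so stepwise shaping cannot
    # be exploited independently of the final objective.
    gate = 1.0 if outcome_reward >= threshold else 0.0
    return outcome_reward + gate * sum(step_rewards)
```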

3. Benchmarking, Evaluation, and Practical Systems

Numerous benchmarks and frameworks formalize, evaluate, and facilitate multi-turn RL in LLMs and agentic settings:

| Benchmark/Framework | Category | Key Features |
|---|---|---|
| LMRL-Gym (Abdulhai et al., 2023) | Multi-Turn LLM RL | 8 tasks, offline + online evaluation |
| ColBench (Zhou et al., 19 Mar 2025) | Collaborative RL | Backend/frontend design, SOTA |
| SWE-bench (Golubev et al., 5 Aug 2025) | Software Engineering | Long-context, sparse rewards |
| WebArena-Lite (Wei et al., 22 May 2025) | Web Agent RL | Binary rewards, open web tasks |
| TAU2-Bench (Zhao et al., 26 Aug 2025) | Tool Use RL | LLM-user simulation, tool calls |
| MTMedDialog (Feng et al., 26 May 2025) | Clinical Dialogue | Doctor/patient, multi-agent |
| BrowseComp (Lu et al., 12 Sep 2025) | Deep Search RL | LLM + search/click ops |

Evaluation typically includes success rates, Mean Reciprocal Rank (MRR), task completion or modification rate, and specialized exact-match or formatting metrics. User simulation frameworks employing policies such as epsilon-greedy selection provide consistency and control for both training and evaluation (e.g., in CIRCLE (Erbacher et al., 2023)).
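
For concreteness, two of the simpler metrics can be computed as below; the handling of queries with no relevant result (rank `None`) is a common convention assumed here, not a prescription of any particular benchmark.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    # `first_relevant_ranks` holds the 1-indexed rank of the first relevant
    # item per query, or None if nothing relevant was retrieved.
    rr = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)

def success_rate(episode_outcomes):
    # Fraction of episodes flagged as successful task completions.
    return sum(bool(o) for o in episode_outcomes) / len(episode_outcomes)

print(mean_reciprocal_rank([1, 3, None]))   # (1 + 1/3 + 0) / 3 ≈ 0.444
print(success_rate([True, False, True]))    # 2/3 ≈ 0.667
```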

4. Applications Across Domains

Multi-turn RL underpins a series of LLM-driven systems with distinct capabilities:

  • Conversational Query Clarification: The CIRCLE model generates diverse search clarifications using multi-turn RL with rewards that combine relevance, diversity (e.g., Rank-Biased Overlap), and fidelity to reference behavior (Erbacher et al., 2023).
  • Software Engineering Agents: RL-fine-tuned models on SWE-bench Verified nearly double the success rate over supervised fine-tuned baselines by leveraging token-level reward assignment, dynamic trajectory sampling, and long-context management (Golubev et al., 5 Aug 2025).
  • Web and Tool-Use Agents: Systems such as WebAgent-R1, MUA-RL, and DeepDive demonstrate robust scaling via asynchronous rollout, reasoning-based prompt templates, agentic tool use, and deep search using web interfaces (Wei et al., 22 May 2025, Zhao et al., 26 Aug 2025, Lu et al., 12 Sep 2025).
  • Visual Reasoning: Multi-turn RL with grounding-based policy optimization (MGPO) enables large multimodal models to focus on key visual regions in high-resolution images through sequential cropping decisions and cumulative dialogue rounds, outperforming even larger commercial models (Huang et al., 8 Jul 2025).
  • Iterative Code/Kernel Generation: Kevin leverages multi-turn RL for CUDA kernel optimization, showing substantial gains in correctness and speedup through discounted per-turn reward attribution across refinement steps (Baronio et al., 16 Jul 2025).
  • Clinical Dialogue: DoctorAgent-RL frames multi-agent clinical consultation as a dynamic multi-turn decision process, incorporating domain-specific reward aggregation and adaptive questioning (Feng et al., 26 May 2025).

5. RL Objective Functions and Optimization Techniques

Multi-turn RL systems utilize a variety of mathematical formulations, including but not limited to:

  • Expected Discounted Return:

\pi^* = \underset{\pi}{\arg\max}\; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \gamma^t\, r(s_t, a_t)\right]

  • Value Estimation (e.g., Monte Carlo, TD):

R_t = \sum_{i=t}^{T-1} \gamma^{\,i-t}\, r_i

  • Policy Optimization (PPO/GRPO/MTPO), with the clipped surrogate sketched in code after this list:

L_\pi = \mathbb{E}_\pi\left[\min\left(A(w_t, s_t)\,\frac{\pi(w_t \mid s_t)}{\pi_{\text{old}}(w_t \mid s_t)},\; A(w_t, s_t)\cdot \operatorname{clip}\!\left(\frac{\pi(w_t \mid s_t)}{\pi_{\text{old}}(w_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right)\right)\right]

  • Turn-Level Advantage (MT-GRPO):

A^{(\text{MT-GRPO})}_{i,1} = A^{(T)}_i + \lambda A^{(O)}_i
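
A minimal PyTorch sketch of the return computation and the clipped surrogate above; the tensor shapes and the sign convention (returning a loss to minimize) are assumptions of this illustration.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    # Monte Carlo return-to-go: R_t = sum_{i >= t} gamma^(i - t) * r_i.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Clipped surrogate objective (negated so it can be minimized);
    # all inputs share shape (batch, seq_len) for token-level updates.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```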

Theoretical results establish convergence properties and stability for specific algorithms, e.g., mirror descent–based policy optimization shown to contract KL divergence toward Nash equilibria (Shani et al., 23 May 2024).

6. Research Directions and Open Problems

Multi-turn RL for LLMs and agents remains an active research area with several open issues:

  • Trajectory-Level Preference and Human Feedback: Exploiting full-conversation preferences rather than turn-level feedback improves alignment, but translation into large-scale, online human–agent interaction remains a challenge (Shani et al., 23 May 2024).
  • Credit Assignment Beyond Bandit Settings: Pure bandit (single-turn) RLHF is insufficient for multi-step planning; exploiting MDP structure and turn-level reward signals is critical for tool-use, agentic action, and strategic interaction (Zeng et al., 17 May 2025, Zhou et al., 19 Mar 2025).
  • Training Stability and Rollout Management: Techniques such as trajectory filtering (SimpleTIR) and Gated Reward Accumulation (G-RA) are being developed to mitigate distributional drift and reward misalignment (Xue et al., 2 Sep 2025, Sun et al., 14 Aug 2025).
  • Scaling Laws and Sample Efficiency: Hierarchical approaches and off-policy TD learning improve scaling with both environment complexity and model size (Zhou et al., 29 Feb 2024).
  • Benchmarking and Reproducibility: Open-source frameworks (LMRL-Gym, AgentGym-RL, DeepDive) accelerate rigorous algorithmic comparison and task-specific innovation (Abdulhai et al., 2023, Xi et al., 10 Sep 2025, Lu et al., 12 Sep 2025).

7. Implications, Limitations, and Outlook

Recent advances substantiate that multi-turn RL methods enable LLMs and multimodal models to act as coherent, strategic agents over extended horizons. Empirical results consistently demonstrate absolute gains in accuracy, success rates, and reasoning depth across diverse domains—often with open-source models rivaling or surpassing frontier proprietary systems (Feng et al., 26 May 2025, Wei et al., 22 May 2025, Lu et al., 12 Sep 2025).

Persistent challenges include long-term credit assignment in very long dialogues, reward design that avoids reward hacking and misalignment, and instabilities from compounding distributional drift, especially in tool-integrated or open-ended environments. Furthermore, integrating preference-based feedback at scale and generalizing to unseen task structures remain open lines of inquiry.

The field is trending toward unified, extensible RL frameworks supporting modular environments, open-sourcing of training and evaluation codebases, and deeper integration of curriculum learning, hierarchical control, and reasoning-aware rewards. These directions suggest multi-turn RL will continue to be a key enabler for the next generation of interactive, autonomous, and reliable intelligent agents.
