Multi-Turn GRPO for Efficient Sequential RL
- The paper introduces mtGRPO, a method that provides dense per-turn, group-relative feedback to address sparse and delayed rewards in multi-turn tasks.
- It employs group-based normalization at each turn to improve credit assignment, leading to enhanced sample efficiency and convergence in sequential decision environments.
- The approach is validated across domains such as autonomous driving, tool-integrated reasoning, and multi-agent collaboration, demonstrating significant empirical gains.
Multi-Turn Group Relative Policy Optimization (mtGRPO) generalizes group-based policy optimization to multi-turn sequential decision problems involving complex credit assignment. It provides dense, turn-level, group-relative feedback for each interaction step, supporting sample-efficient and stable reinforcement learning (RL) in scenarios where rewards are sparse or delayed. mtGRPO is a central innovation for high-performance multi-turn reasoning and tool-use with LLMs and multimodal agents, as demonstrated in domains such as autonomous driving, tool-integrated reasoning, and multi-agent collaboration (Li et al., 30 Jan 2026, Zhong et al., 3 Feb 2026, Ding et al., 18 Nov 2025, Hong et al., 17 Nov 2025, Hu et al., 24 Sep 2025).
1. Foundations and Motivation
Classical RL algorithms such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) perform policy gradient updates using trajectory-level or episode-level rewards. In multi-turn interaction settings—where an agent must reason and act iteratively over several turns—rewards are typically sparse, delayed, and not attributable to individual decisions within the sequence. Standard approaches suffer from two key issues: (1) feedback sparsity, in which most actions receive no learning signal, and (2) poor credit assignment, where reward cannot distinguish which turns contributed positively or negatively to the final outcome (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025).
mtGRPO addresses these limitations by redefining the RL signal at each interaction step: it introduces per-turn reward collection and computes the group-relative advantage for each turn by contrasting the agent’s performance against a batch-level baseline within the same turn. This structure increases the density and specificity of learning signals, substantially enhancing convergence and generalization in long-horizon, multi-turn regimes (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025).
2. Multi-Turn Decision Process and Algorithmic Formulation
An mtGRPO episode is represented as a $T$-turn Markov decision process (MDP), where at each turn $t \in \{1, \dots, T\}$, the agent observes a state $s_t$, takes an action $a_t$, and immediately receives a reward $r_t$. The state comprises all relevant information, including environmental context, interaction history, and turn-wise feedback. The objective is to maximize cumulative discounted rewards $J = \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$, where $\gamma \in (0, 1]$ is the discount factor (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025).
mtGRPO proceeds as follows:
- For each batch of $N$ rollouts and at each turn $t$, the algorithm collects per-turn rewards $\{r_{i,t}\}_{i=1}^{N}$.
- It computes the turn-specific batch mean $b_t = \frac{1}{N}\sum_{i=1}^{N} r_{i,t}$ and standard deviation $\sigma_t$.
- The group-relative advantage for each rollout $i$ at turn $t$ is $\hat{A}_{i,t} = \frac{r_{i,t} - b_t}{\sigma_t}$.
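The per-turn normalization above can be sketched in a few lines of NumPy. This is an illustrative implementation (not code from any of the cited papers), assuming rewards have been collected into an N x T array:

```python
import numpy as np

def turn_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages, normalized within each turn.

    rewards: (N, T) array of per-turn rewards for N rollouts over T turns.
    Returns an (N, T) array of advantages with zero mean at every turn.
    """
    b_t = rewards.mean(axis=0, keepdims=True)      # turn-wise batch mean
    sigma_t = rewards.std(axis=0, keepdims=True)   # turn-wise batch std
    return (rewards - b_t) / (sigma_t + eps)       # broadcast over rollouts

# Example: 4 rollouts, 3 turns of binary per-turn rewards
r = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
A = turn_advantages(r)
```

Because normalization is per column (per turn), a rollout can receive positive advantage at one turn and negative at another, which is exactly the turn-level credit assignment mtGRPO targets.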
The per-token surrogate objective extends the PPO/GRPO loss to sum over both tokens and turns, with the group-relative advantage applied to each action token generated at the corresponding turn:

$$J_{\text{mtGRPO}}(\theta) = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|o_{i,t}|}\sum_{k=1}^{|o_{i,t}|} \min\left(\rho_{i,t,k}\,\hat{A}_{i,t},\ \operatorname{clip}\left(\rho_{i,t,k},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}\right) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right],$$

where $\rho_{i,t,k} = \frac{\pi_\theta(o_{i,t,k} \mid \cdot)}{\pi_{\theta_{\text{old}}}(o_{i,t,k} \mid \cdot)}$ is the importance weight, $\epsilon$ is the clipping parameter, $\beta$ is the KL penalty coefficient, and $\pi_{\text{ref}}$ is a reference (e.g., SFT) policy (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025, Hong et al., 17 Nov 2025).
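A minimal sketch of the clipped per-token surrogate, assuming per-token log-probabilities are available and each turn's group-relative advantage has already been broadcast to its token positions (function and variable names are illustrative, not from the papers):

```python
import numpy as np

def clipped_surrogate(logp_new: np.ndarray,
                      logp_old: np.ndarray,
                      adv: np.ndarray,
                      eps: float = 0.2) -> float:
    """PPO/GRPO-style clipped objective over a batch of action tokens.

    logp_new, logp_old: per-token log-probs under current and old policy.
    adv: per-token advantages; each token inherits the group-relative
         advantage of the turn in which it was generated.
    Returns the scalar surrogate to be maximized (KL penalty omitted).
    """
    ratio = np.exp(logp_new - logp_old)               # importance weight
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.minimum(unclipped, clipped).mean())
```

When the two policies coincide the ratio is 1 everywhere and the objective reduces to the mean advantage, as expected.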
Pseudocode for mtGRPO (single-agent case) is as follows (Li et al., 30 Jan 2026):
```
for each RL iteration:
    # Sample N rollouts under old policy
    for each rollout i in N:
        for each turn t in T:
            a_i_t = agent.act(s_i_t)
            r_i_t = reward(a_i_t)
    # Turn-wise group statistics and advantages
    for each turn t:
        b_t = mean([r_i_t for i in 1..N])
        sigma_t = std([r_i_t for i in 1..N])
        for each rollout i:
            A_hat_i_t = (r_i_t - b_t) / sigma_t
    Optimize J_mtGRPO using collected (states, actions, advantages)
    Periodically update reference policy
```
3. Extensions: Multi-Turn Reasoning, Tool Use, and Multi-Agent Systems
mtGRPO has been extended across several domains, each adapting the group-relative framework to domain-specific challenges.
- Multi-Turn Trajectory Refinement in Autonomous Driving: In MTDrive, mtGRPO supports an MLLM-based agent performing iterative trajectory refinement, where each turn’s reward is derived from perceptual driving model (PDM) metrics (e.g., collision, drivable area, time-to-collision) and is directly assigned to the tokens generated at that turn, providing efficient and targeted credit assignment (Li et al., 30 Jan 2026).
- Multi-Turn Tool-Calling Agents: mtGRPO is integrated as Reward-Conditioned GRPO (RC-GRPO), which injects reward-mode tokens (e.g., <|high_reward|>/<|low_reward|>) at each rollout to promote within-group diversity and combat reward-variance collapse. This yields robust policy updates even in environments with sparse or bimodal reward distributions (Zhong et al., 3 Feb 2026).
- Tool-Integrated Reasoning with LLMs: Referred to as Group Turn Policy Optimization (GTPO), the method includes per-turn reward shaping (e.g., partial rewards for code-similar negative trajectories), turn-level returns, and discounting, which leads to superior question-answering and reasoning performance on mathematical benchmarks. Here, the group-relative advantage is computed at each reasoning step (Ding et al., 18 Nov 2025).
- Hierarchical Multi-Agent Systems: In M-GRPO, the multi-turn framework is extended to handle a main agent (planner) and subordinate tool agents with different frequencies and response times. Group-relative advantages and PPO-style loss are computed separately for each agent, with batch-wide trajectory alignment to maintain synchronization despite stochastic agent invocation counts (Hong et al., 17 Nov 2025).
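As a small illustration of the turn-level discounting used in GTPO-style returns, the sketch below computes G_t = r_t + gamma * G_{t+1} by a backward pass over per-turn rewards (the recursion is standard; the helper name is ours, not from the paper):

```python
from typing import List

def turn_returns(rewards: List[float], gamma: float = 0.95) -> List[float]:
    """Discounted turn-level returns: G_t = r_t + gamma * G_{t+1}.

    rewards: per-turn rewards for one rollout, ordered by turn.
    Returns the same-length list of returns, ordered by turn.
    """
    G, out = 0.0, []
    for r in reversed(rewards):   # accumulate from the final turn backward
        G = r + gamma * G
        out.append(G)
    return out[::-1]
```

These per-turn returns, rather than a single terminal reward, are what get group-normalized, so early turns that set up a later success still receive a discounted share of the credit.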
4. Reward Assignment, Advantage Estimation, and Credit Sharpening Strategies
A central property of mtGRPO is the densification and sharpening of credit assignment via group-relative normalization at each turn.
- Per-Turn Reward Collection: mtGRPO collects individual step-wise rewards, often aggregating multiple informational channels (e.g., environmental signals, output formatting, tool-usage metrics) into a composite per-turn reward, as in driving (Li et al., 30 Jan 2026) and tool-integrated reasoning (Ding et al., 18 Nov 2025).
- Relative Advantage Computation: By normalizing each rollout’s return by the contemporaneous batch mean and standard deviation at the same turn, mtGRPO mitigates low-variance scenarios and ensures sustained gradient updates, even when the policy becomes peaked after extensive SFT (Zhong et al., 3 Feb 2026).
- Reward Shaping: In tool-based reasoning, partial rewards are awarded for code similarity between failed and successful runs, based on embedding similarity (e.g., Titan Text Embeddings), further densifying the learning signal (Ding et al., 18 Nov 2025).
- Reward-Conditioned Sampling: In RC-GRPO, explicit conditioning on reward tokens structurally introduces reward variance within each batch group, guaranteeing non-degenerate group normalization and improved advantage spread (Zhong et al., 3 Feb 2026).
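A minimal sketch of reward-conditioned sampling, assuming the reward-mode tokens from RC-GRPO are simply prepended to rollout prompts; the alternating assignment below is an illustrative policy, not the paper's exact scheme:

```python
from typing import List

HIGH, LOW = "<|high_reward|>", "<|low_reward|>"

def condition_group(prompts: List[str]) -> List[str]:
    """Prepend a reward-mode token to each rollout prompt in a group.

    Alternating the two modes guarantees every group contains both
    conditions, so within-group reward variance cannot collapse to zero
    and the group-relative normalization stays non-degenerate.
    """
    tokens = [HIGH if i % 2 == 0 else LOW for i in range(len(prompts))]
    return [f"{tok}\n{p}" for tok, p in zip(tokens, prompts)]
```

In a full RC-GRPO setup the policy learns mode-dependent behavior, so the two conditions generate systematically different rollouts of the same prompt.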
5. System-Level Optimizations and Implementation
mtGRPO’s adoption in high-throughput multimodal RL settings necessitates system-level engineering (Li et al., 30 Jan 2026).
- Inter-Process Streaming Serialization (IPSS) enables immediate tensor serialization and streaming to training workers as soon as a rollout is completed, minimizing idle time and maximizing device utilization.
- Intra-Process Tensor Cache (IPTC) consolidates multimodal embeddings and tokenization across co-located modules (actor, reference, log-prob computation), reducing redundant deserialization and memory copies.
- Together, these optimizations yield a substantial speedup in wall-clock training throughput, which is critical for large-scale RL with high-dimensional inputs and multi-turn sequences.
In multi-agent distributed training, M-GRPO adopts a decoupled pipeline in which each agent operates independently, sharing only scalar reward statistics and trajectory identifiers via a lightweight database, ensuring scalable deployment without cross-server backpropagation or parameter sharing (Hong et al., 17 Nov 2025).
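The batch-wide trajectory alignment step can be illustrated as follows. This is a hypothetical sketch: the helper and its truncate-or-resample policy are our assumptions for exposition, not M-GRPO's exact procedure:

```python
import random
from typing import List

def align_subagent_trajectories(groups: List[list], k: int) -> List[list]:
    """Force every main-agent rollout to carry exactly k sub-agent
    trajectories so batch tensors stay rectangular despite stochastic
    invocation counts.

    groups: one non-empty list of sub-agent trajectories per rollout.
    Rollouts with more than k are truncated; those with fewer are padded
    by resampling their own trajectories with replacement.
    """
    aligned = []
    for trajs in groups:
        if len(trajs) >= k:
            aligned.append(trajs[:k])
        else:
            aligned.append(trajs + random.choices(trajs, k=k - len(trajs)))
    return aligned
```

Only scalar statistics and identifiers of these aligned trajectories need to cross server boundaries, matching the decoupled pipeline described above.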
6. Empirical Results and Comparative Performance
mtGRPO and its extensions have demonstrated significant empirical gains across diverse benchmarks:
| Domain | Baseline (SFT/GRPO) | mtGRPO Variant | Reported Gain |
|---|---|---|---|
| Autonomous Driving | SFT: 88.1, GRPO: 94.2 | mtGRPO: 96.2 (oracle) | Exceeds VLM-driving/human |
| Tool Calling | SFT+GRPO: 48.75 | mtGRPO: 85.00 (Qwen) | Surpasses closed-API models |
| Reasoning (GTPO) | GRPO: 49.78 | GTPO: 51.26 | +3.0% average |
| Multi-Agent | Single: 54-58 | M-GRPO: 68-72 | +7% over main-only/frozen |
| Task Planning | Larger model: 0% SR | mtGRPO 1.5B: 70% SR | Outperforms 14B models |
In ablation studies, per-turn advantage normalization is identified as critical for convergence and for maintaining non-vanishing policy gradients. Qualitative analyses show that error correction (e.g., in trajectory rollouts) is localized, with successive turns targeting previously problematic steps (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025, Hu et al., 24 Sep 2025).
7. Theoretical Guarantees and Application Implications
Under appropriate assumptions (unique minimal-turn expert trajectories, dense verifiable rewards), mtGRPO provides formal guarantees: improvements in the group-based single-turn objective provably translate to higher multi-turn success probabilities and more sample-efficient policies, as established by backward induction arguments and explicit bounds on success probability (Hu et al., 24 Sep 2025). Empirical evidence supports strong cross-task generalization, minimal task completion times, and stable learning curves.
A plausible implication is that future research extending mtGRPO to more open-ended, partially observable, or hierarchical tasks will need to further refine group assignment, baseline computation, and advantage normalization techniques to accommodate the increased complexity of credit assignment.
Key References
- MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving (Li et al., 30 Jan 2026)
- RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents (Zhong et al., 3 Feb 2026)
- Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization (Ding et al., 18 Nov 2025)
- Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO (Hong et al., 17 Nov 2025)
- Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning (Hu et al., 24 Sep 2025)