MT-GRPO: Multi-Turn RL Optimization
- MT-GRPO is a reinforcement learning framework that extends GRPO with turn-level rewards for fine-grained credit assignment in multi-turn scenarios.
- It reformulates the agent's task as a multi-step MDP, allowing precise evaluation of intermediate decisions during long-horizon reasoning.
- The approach improves training stability and performance in tasks such as tool-augmented search, although its branching rollouts scale exponentially with the number of turns.
Multi-Turn GRPO (MT-GRPO) refers to a reinforcement learning framework that systematically extends Group Relative Policy Optimization (GRPO) to multi-turn, long-horizon reasoning with LLMs. MT-GRPO is characterized by fine-grained turn-level reward design, enabling precise credit assignment at each interaction step, in contrast to conventional GRPO’s trajectory-level rewards. This approach is highly relevant for agentic LLM applications—such as reasoning-augmented search agents, tool-using agents, and multi-turn task planning—where sparse outcome rewards and undifferentiated credit limit learning efficiency and stability.
1. GRPO Foundations and the Challenge of Multi-Turn Reasoning
GRPO is an on-policy RL algorithm designed to align LLMs using a preference-based objective. For a given input $q$, GRPO samples a group of $G$ candidate responses $\{o_1, \dots, o_G\}$, evaluates each using a reward function $r(q, o_i)$, and computes a group-relative advantage for policy gradient updates via normalization:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}.$$

This formalism treats each sample as a full trajectory (i.e., an entire dialog or reasoning chain). While effective for single-turn or episodically bounded RLHF scenarios, this procedure is fundamentally limited in multi-turn agent settings: each token in every turn receives the same advantage, impeding nuanced learning across intermediate reasoning, tool invocation, and outcome production.
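A minimal sketch of this group-relative normalization is shown below (illustrative only; function and variable names are assumptions, not a reference implementation):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's reward against
    the mean and std of its group of G rollouts for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 4 rollouts for one prompt, scored by a trajectory-level reward.
rewards = np.array([0.0, 1.0, 1.0, 0.5])
advantages = grpo_advantages(rewards)
# In vanilla GRPO, every token of rollout i is trained with the single
# scalar advantages[i], regardless of which turn produced it.
```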
2. Turn-Level MDP Formulation and Credit Assignment
MT-GRPO reformulates the agent's task as a multi-step Markov Decision Process (MDP) in which each turn $t$ has an associated state $s_t$, action $a_t$, transition $P(s_{t+1} \mid s_t, a_t)$, and reward $r_t = R(s_t, a_t)$. The learning objective is:

$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T} \gamma^{\,t-1}\, r_t\right],$$

where $\gamma \in (0, 1]$ discounts future rewards.
The innovation in MT-GRPO is explicit turn-level credit: at each time step $t$, the agent receives not only a final trajectory reward but also dense, verifiable feedback specific to its decision at that turn. This facilitates precise updates and encourages robust, stepwise reasoning behavior.
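As an illustration of such dense, verifiable turn-level feedback, the sketch below scores a tool-invocation turn and a final-answer turn separately. The specific checks and tags are hypothetical, chosen to mirror the search-agent setting discussed later, not the paper's exact reward functions:

```python
def turn_reward_tool(turn_output: str) -> float:
    """Verifiable reward for the intermediate turn: did the agent emit
    a well-formed tool call? (Hypothetical format check.)"""
    return 1.0 if "<search>" in turn_output and "</search>" in turn_output else 0.0

def turn_reward_answer(final_answer: str, gold_answer: str) -> float:
    """Verifiable reward for the final turn: exact-match outcome reward."""
    return 1.0 if final_answer.strip().lower() == gold_answer.strip().lower() else 0.0
```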
3. MT-GRPO Algorithm and Mathematical Formulation
The key mechanism is decomposition of the groupwise advantage at the granularity of individual turns, with each group of $G$ samples per turn yielding normalized turn-level advantages.
For $T = 2$ (a reasoning/tool turn followed by an answer turn), the tokens of each turn receive

$$A_{i,1} = \hat{A}^{\mathrm{turn}}_{i} + \lambda\, \hat{A}^{\mathrm{out}}_{i}, \qquad A_{i,2} = \hat{A}^{\mathrm{out}}_{i},$$

where

$$\hat{A}^{\mathrm{turn}}_{i} = \frac{r^{\mathrm{turn}}_{i} - \operatorname{mean}(\{r^{\mathrm{turn}}_{j}\}_{j=1}^{G})}{\operatorname{std}(\{r^{\mathrm{turn}}_{j}\}_{j=1}^{G})}, \qquad \hat{A}^{\mathrm{out}}_{i} = \frac{r^{\mathrm{out}}_{i} - \operatorname{mean}(\{r^{\mathrm{out}}_{j}\}_{j=1}^{G})}{\operatorname{std}(\{r^{\mathrm{out}}_{j}\}_{j=1}^{G})},$$

and $r^{\mathrm{turn}}_{i}$, $r^{\mathrm{out}}_{i}$ are the intermediate (turn-level) and outcome rewards for group member $i$, respectively; $\lambda$ is a tunable discount.

For general $T$, each turn's tokens receive a discounted sum of the group-normalized rewards of the current and all subsequent turns:

$$A_{i,t} = \sum_{t'=t}^{T} \lambda^{\,t'-t}\, \hat{A}_{i,t'}, \qquad \hat{A}_{i,t'} = \frac{r_{i,t'} - \operatorname{mean}(\{r_{j,t'}\}_{j=1}^{G})}{\operatorname{std}(\{r_{j,t'}\}_{j=1}^{G})}.$$
This construction propagates both local and future information to each turn’s tokens, allowing fine-grained assignment of positive or negative credit based on intermediate actions and the long-term trajectory outcome.
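A compact sketch of the two-turn advantage decomposition follows, assuming turn-level and outcome rewards are normalized independently within the group as in the formulas above; this is an illustrative sketch, not the authors' reference code:

```python
import numpy as np

def normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group normalization, as in GRPO's advantage estimate."""
    return (x - x.mean()) / (x.std() + eps)

def mt_grpo_advantages(turn_rewards: np.ndarray,
                       outcome_rewards: np.ndarray,
                       lam: float = 1.0):
    """Turn-level advantages for a T = 2 rollout group (tool/reasoning turn,
    then answer turn). Turn-1 tokens get the turn-level advantage plus a
    discounted outcome advantage; turn-2 tokens get the outcome advantage."""
    a_turn = normalize(turn_rewards)     # \hat{A}^{turn}_i
    a_out = normalize(outcome_rewards)   # \hat{A}^{out}_i
    adv_turn1 = a_turn + lam * a_out     # applied to every token of turn 1
    adv_turn2 = a_out                    # applied to every token of turn 2
    return adv_turn1, adv_turn2

# Example with a group of G = 4 rollouts:
adv1, adv2 = mt_grpo_advantages(np.array([1.0, 1.0, 0.0, 1.0]),
                                np.array([1.0, 0.0, 0.0, 1.0]),
                                lam=1.0)
```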
4. Computational Structure and Practical Limitations
MT-GRPO’s group sampling introduces a branching structure for rollouts over the multi-turn action space. For horizon $T$ and group size $G$, the required number of rollouts is $G^{T}$ (exponential in $T$), sharply contrasting with vanilla GRPO’s linear scaling in $G$. This presents computational bottlenecks for long-horizon tasks and restricts practical MT-GRPO applications to scenarios with a small number of decision steps (e.g., $T = 2$).
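For concreteness, the following one-liner (an illustrative calculation, not tied to any particular implementation) compares the branching rollout count against vanilla GRPO's fixed group size for an assumed $G = 8$:

```python
G = 8  # assumed group size, for illustration
for T in (1, 2, 3, 4):
    # MT-GRPO branches G ways at each of T turns; vanilla GRPO samples G trajectories total.
    print(f"T={T}: MT-GRPO branching rollouts = {G**T}, vanilla GRPO rollouts = {G}")
# Output: 8 vs 8, 64 vs 8, 512 vs 8, 4096 vs 8
```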
Another constraint is ensuring all rollouts within a group have identical numbers of turns, often necessitating prompt engineering or environment controls that can affect agent flexibility.
5. Empirical Evaluation and Results
Empirical results on multi-turn, reasoning-augmented search tasks (e.g., TriviaQA with Wikipedia search tool invocation) reveal the advantages of MT-GRPO over trajectory-level variants:
| Model | Tool Exec | Search Answer | Format | Exact Match |
|---|---|---|---|---|
| GRPO-OR | 0 | 0 | 0.04 | 0 |
| GRPO-MR | 0.2 | 0.3724 | 0.1994 | 0.3346 |
| MT-GRPO | 0.2 | 0.3926 | 0.1996 | 0.5010 |
- MT-GRPO achieves perfect tool execution, the highest correct answer rate, and 100% format correctness.
- Learning curves display increased stability and faster convergence, with lower variance across seeds.
- MT-GRPO-trained agents reliably issue tool calls to gather relevant intermediate evidence, while trajectory-level agents (GRPO-OR) frequently omit these crucial reasoning stages.
- GRPO-MR (merged reward signal but single trajectory-level advantage) improves over GRPO-OR but is outperformed by explicit turn-level assignment, confirming the necessity of precise credit localization.
6. Broader Implications and Future Directions
The MT-GRPO paradigm demonstrates that turn-level, process-aligned reward modeling in LLM agent RL supports robust performance on composite, long-horizon reasoning tasks. It enables agents to isolate and reinforce critical intermediate decisions, notably supporting structured pipelines (e.g., tool use, multi-step search, code execution, or chain-of-thought in math).
However, MT-GRPO’s exponential rollout cost is a significant limitation for tasks with more than a few turns. Scalable alternatives, such as actor-critic methods with dense turn-level rewards (e.g., PPO variants), are suggested for longer-horizon settings.
A plausible implication is that future LLM RL frameworks will need to systematically incorporate dense, structured reward signals beyond final outcomes—including hybrid reward models, turn-level evaluators, and modular credit assignment architectures—to fully exploit agentic reasoning capacity in interactive, multi-stage environments.
7. Summary and Outlook
MT-GRPO extends the GRPO algorithm to multi-turn MDPs by introducing turn-level advantage computation and explicit intermediate reward integration. This design achieves significantly improved training stability, convergence, and downstream task accuracy in multi-turn agentic scenarios. Despite the computational costs for long-horizon applications, MT-GRPO provides a principled foundation for RL with LLMs where fine-grained credit assignment is essential for robust, interpretable, and generalizable reasoning behaviors. Future research is expected to address the rollout complexity gap and explore scalable reward decomposition for increasingly sophisticated multi-agent and multi-stage reasoning systems.