MT-GRPO: Multi-Turn RL Optimization

Updated 1 November 2025
  • MT-GRPO is a reinforcement learning framework that extends GRPO with turn-level rewards for fine-grained credit assignment in multi-turn scenarios.
  • It reformulates the agent's task as a multi-step MDP, allowing precise evaluation of intermediate decisions during long-horizon reasoning.
  • The approach enhances training stability and performance in tasks such as tool-augmented search, despite rollout costs that grow exponentially with the number of turns.

Multi-Turn GRPO (MT-GRPO) refers to a reinforcement learning framework that systematically extends Group Relative Policy Optimization (GRPO) to multi-turn, long-horizon reasoning with LLMs. MT-GRPO is characterized by fine-grained turn-level reward design, enabling precise credit assignment at each interaction step, in contrast to conventional GRPO’s trajectory-level rewards. This approach is highly relevant for agentic LLM applications—such as reasoning-augmented search agents, tool-using agents, and multi-turn task planning—where sparse outcome rewards and undifferentiated credit limit learning efficiency and stability.

1. GRPO Foundations and the Challenge of Multi-Turn Reasoning

GRPO is an on-policy RL algorithm designed to align LLMs using a preference-based objective. For a given input $x$, GRPO samples a group of $G$ candidate responses $\{y_1, \dots, y_G\}$, evaluates each using a reward function $R(\cdot)$, and computes a group-relative advantage for policy gradient updates via normalization:

$$A_{i,t} = \frac{R^{\mathrm{traj}}_i - \mathrm{mean}\big(\{R^{\mathrm{traj}}_j\}_{j=1}^G\big)}{\mathrm{std}\big(\{R^{\mathrm{traj}}_j\}_{j=1}^G\big)}$$

This formalism treats each sample as a full trajectory (i.e., an entire dialogue or reasoning chain). While effective for single-turn or episodically bounded RLHF scenarios, the procedure is fundamentally limited in multi-turn agent settings: every token in every turn receives the same advantage, impeding nuanced learning across intermediate reasoning, tool invocation, and outcome production.
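
As a concrete illustration, the following Python sketch (the helper name and shapes are assumptions, not from the original work) computes the trajectory-level group-relative advantage exactly as in the normalization above; every token of trajectory $i$ then inherits the same scalar $A_{i,t}$:

```python
import numpy as np

def grpo_advantages(trajectory_rewards, eps=1e-8):
    """Group-relative advantages for vanilla GRPO.

    trajectory_rewards: array of shape (G,), one scalar reward per sampled
    trajectory in the group. Every token of trajectory i receives A[i].
    """
    r = np.asarray(trajectory_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)  # eps guards against zero variance

# Example: a group of G = 4 sampled trajectories for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # approximately [ 1. -1. -1.  1.]
```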

2. Turn-Level MDP Formulation and Credit Assignment

MT-GRPO rigorously reformulates the agent's task as a multi-step Markov Decision Process (MDP) in which each turn $k$ has an associated state $s_k$, action $a_k$, transition kernel $P$, and reward $R(s_k, a_k)$. The learning objective is

$$\max_{\pi_\theta} \; \mathbb{E}\!\left[ \sum_{k=1}^{K} \gamma^{k} R(s_k, a_k) \right],$$

where $\gamma$ discounts future rewards.

The innovation in MT-GRPO is explicit turn-level credit: at each time step kk, the agent receives not only a final trajectory reward but also dense, verifiable feedback specific to its decision at that turn. This facilitates precise updates and encourages robust, stepwise reasoning behavior.
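
To make this concrete, here is a minimal sketch of verifiable per-turn rewards for a two-turn search agent; the specific checks and reward magnitudes are illustrative assumptions rather than the paper's exact specification:

```python
def turn_level_rewards(turn1, turn2, gold_answer):
    """Illustrative verifiable rewards for a two-turn search agent.

    turn1: dict describing the intermediate turn (tool call + retrieved text).
    turn2: dict describing the final turn (formatted answer string).
    Returns (intermediate_reward, outcome_reward).
    """
    # Intermediate turn: did the tool call execute and retrieve useful evidence?
    r_intermediate = 0.0
    if turn1.get("tool_executed", False):
        r_intermediate += 0.2                     # tool invocation succeeded (assumed weight)
    if gold_answer.lower() in turn1.get("search_result", "").lower():
        r_intermediate += 0.4                     # retrieved text contains the answer (assumed weight)

    # Outcome turn: format compliance and exact-match correctness.
    r_outcome = 0.0
    if turn2.get("well_formatted", False):
        r_outcome += 0.2                          # answer wrapped in the required format (assumed weight)
    if turn2.get("answer", "").strip().lower() == gold_answer.strip().lower():
        r_outcome += 1.0                          # exact match with the reference (assumed weight)
    return r_intermediate, r_outcome
```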

3. MT-GRPO Algorithm and Mathematical Formulation

The key mechanism is decomposition of the groupwise advantage at the granularity of individual turns, with each group of $G$ samples per turn $k$ yielding normalized turn-level advantages.

For $K=2$ (reasoning + answer),

$$\begin{aligned} A^{\mathrm{MT\text{-}GRPO}}_{i,1} &= A^{I}_{i} + \alpha\, A^{O}_{i} \\ A^{\mathrm{MT\text{-}GRPO}}_{i,2} &= A^{O}_{i} \end{aligned}$$

where

$$A^{I}_{i} = \frac{R^{I}_{i} - \mathrm{mean}\big(\{R^{I}_{j}\}_{j=1}^G\big)}{\mathrm{std}\big(\{R^{I}_{j}\}_{j=1}^G\big)}, \qquad A^{O}_{i} = \frac{R^{O}_{i} - \mathrm{mean}\big(\{R^{O}_{j}\}_{j=1}^G\big)}{\mathrm{std}\big(\{R^{O}_{j}\}_{j=1}^G\big)}$$

and $R^{I}_{i}$, $R^{O}_{i}$ are the intermediate and outcome rewards for group member $i$, respectively; $\alpha$ is a tunable discount factor.

For general $K$,

$$A^{\mathrm{MT\text{-}GRPO}}_{i,(k)} = \sum_{l=k}^{K-1} \alpha^{\,l-k}\, A^{I}_{i,(l)} + \alpha^{K-k}\, A^{O}_{i}$$

This construction propagates both local and future information to each turn’s tokens, allowing fine-grained assignment of positive or negative credit based on intermediate actions and the long-term trajectory outcome.
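
The sketch below implements this turn-level advantage computation for arbitrary $K$, assuming one scalar intermediate reward per non-final turn and a single outcome reward; the function and argument names are hypothetical, not from the original work:

```python
import numpy as np

def _normalize(r, eps=1e-8):
    r = np.asarray(r, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def mt_grpo_advantages(intermediate_rewards, outcome_rewards, alpha=1.0):
    """Turn-level advantages following the general-K formula above.

    intermediate_rewards: array of shape (G, K-1) with R^I_{i,(l)} for turns 1..K-1.
    outcome_rewards:      array of shape (G,)      with R^O_i for the final turn K.
    Returns an array of shape (G, K): the advantage assigned to the tokens of
    turn k for group member i.
    """
    R_I = np.asarray(intermediate_rewards, dtype=np.float64)
    if R_I.ndim == 1:                 # allow a (G,) input for the K = 2 case
        R_I = R_I[:, None]
    R_O = np.asarray(outcome_rewards, dtype=np.float64)
    G, K_minus_1 = R_I.shape
    K = K_minus_1 + 1

    # Groupwise normalization per intermediate turn and for the outcome.
    A_I = np.stack([_normalize(R_I[:, l]) for l in range(K_minus_1)], axis=1)  # (G, K-1)
    A_O = _normalize(R_O)                                                      # (G,)

    A = np.zeros((G, K))
    for k in range(1, K + 1):
        # Discounted future intermediate advantages plus the discounted outcome term.
        future_I = sum(alpha ** (l - k) * A_I[:, l - 1] for l in range(k, K))
        A[:, k - 1] = future_I + alpha ** (K - k) * A_O
    return A

# For K = 2 this reduces to A[:, 0] = A_I[:, 0] + alpha * A_O and A[:, 1] = A_O,
# e.g. mt_grpo_advantages([0.6, 0.2, 0.6, 0.0], [1.2, 0.0, 0.2, 0.0], alpha=1.0).
```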

4. Computational Structure and Practical Limitations

MT-GRPO’s group sampling introduces a branching structure for rollouts over the multi-turn action space. For horizon $K$ and group size $G$, the required number of rollouts is $G^{K-1}$ (exponential in $K$), sharply contrasting with vanilla GRPO’s linear scaling in $G$. This presents computational bottlenecks for long-horizon tasks and restricts practical MT-GRPO applications to scenarios with a small number of decision steps (e.g., $K=2$).
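
For a sense of how quickly this branching grows, a small illustrative computation (the group size $G=8$ is an arbitrary choice, not taken from the paper):

```python
# Rollouts per prompt under the G^(K-1) count given above, versus vanilla GRPO's G.
def mt_grpo_rollouts(G, K):
    return G ** (K - 1)   # exponential in the horizon K

def grpo_rollouts(G, K):
    return G              # vanilla GRPO: one group of G full trajectories

for K in (2, 3, 4, 5):
    print(K, grpo_rollouts(8, K), mt_grpo_rollouts(8, K))
# K=2: 8 vs 8, K=3: 8 vs 64, K=4: 8 vs 512, K=5: 8 vs 4096
```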

Another constraint is ensuring all rollouts within a group have identical numbers of turns, often necessitating prompt engineering or environment controls that can affect agent flexibility.

5. Empirical Evaluation and Results

Empirical results on multi-turn, reasoning-augmented search tasks (e.g., TriviaQA with Wikipedia search tool invocation) reveal the advantages of MT-GRPO over trajectory-level variants:

| Model | Tool Exec | Search | Answer Format | Exact Match |
|---|---|---|---|---|
| GRPO-OR | 0 | 0 | 0.04 | 0 |
| GRPO-MR | 0.2 | 0.3724 | 0.1994 | 0.3346 |
| MT-GRPO | 0.2 | 0.3926 | 0.1996 | 0.5010 |
  • MT-GRPO achieves perfect tool execution, the highest correct answer rate, and 100% format correctness.
  • Learning curves display increased stability and faster convergence, with lower variance across seeds.
  • MT-GRPO-trained agents reliably employ tool calls for relevant intermediate input, while trajectory-level agents (GRPO-OR) frequently omit these crucial reasoning stages.
  • GRPO-MR (merged reward signal but single trajectory-level advantage) improves over GRPO-OR but is outperformed by explicit turn-level assignment, confirming the necessity of precise credit localization.

6. Broader Implications and Future Directions

The MT-GRPO paradigm demonstrates that turn-level, process-aligned reward modeling in LLM agent RL supports robust performance on composite, long-horizon reasoning tasks. It enables agents to isolate and reinforce critical intermediate decisions, notably supporting structured pipelines (e.g., tool use, multi-step search, code execution, or chain-of-thought in math).

However, MT-GRPO’s exponential rollout cost is a significant limitation for tasks with more than a few turns. Scalable alternatives, such as actor-critic methods with dense turn-level rewards (e.g., PPO variants), are suggested for longer-horizon settings.

A plausible implication is that future LLM RL frameworks will need to systematically incorporate dense, structured reward signals beyond final outcomes—including hybrid reward models, turn-level evaluators, and modular credit assignment architectures—to fully exploit agentic reasoning capacity in interactive, multi-stage environments.

7. Summary and Outlook

MT-GRPO extends the GRPO algorithm to multi-turn MDPs by introducing turn-level advantage computation and explicit intermediate reward integration. This design achieves significantly improved training stability, convergence, and downstream task accuracy in multi-turn agentic scenarios. Despite the computational costs for long-horizon applications, MT-GRPO provides a principled foundation for RL with LLMs where fine-grained credit assignment is essential for robust, interpretable, and generalizable reasoning behaviors. Future research is expected to address the rollout complexity gap and explore scalable reward decomposition for increasingly sophisticated multi-agent and multi-stage reasoning systems.
