Multi-Turn GRPO: Policy Optimization Framework

Updated 2 July 2026

Multi-Turn GRPO is a reinforcement learning framework that optimizes policies over sequential turns using group-relative advantage estimation.
It compares multiple candidate solutions per state to normalize rewards, reducing variance and mitigating reward noise.
The framework enables stable credit assignment and improved performance in complex multi-agent and multi-stage tasks such as dialogue, planning, and coordinated control.

Multi-Turn Group Relative Policy Optimization (GRPO) is a reinforcement learning framework designed to stabilize and improve policy optimization for multi-agent, multi-stage, or multi-turn tasks, particularly in settings involving LLMs and complex reasoning environments. GRPO generalizes single-step, value-critic-free group advantage estimation to the multi-turn regime, enabling fine-grained credit assignment, variance reduction, and stable learning in high-complexity domains. It achieves this by comparing multiple candidate solutions (or policies) per state or query and normalizing advantage estimates with respect to the group, thus insulating the optimization process from reward noise and difficulty variation. In multi-turn extensions, the group-relative principle is propagated or recomputed at each dialogue turn, planning stage, or iteration, adapting to the evolving decision process and supporting fine-tuned optimization of sequential agentic behaviors (Cang et al., 3 Mar 2026, Feng et al., 21 Apr 2026, Ekbote et al., 11 Nov 2025, Ding et al., 5 Jun 2025, Zhou et al., 1 Mar 2026).

1. Core Principles and Problem Formulation

Multi-Turn GRPO addresses sequential optimization problems where the optimal decision or communication structure must be learned iteratively or across multiple stages or turns. In the multi-agent domain, this often involves learning communication topologies, coordinated joint actions, or planning policies over a series of steps. The state at each turn may encode the full interaction history, environment observations, role embeddings, or other contextual information, while the action can be either a discrete sampled structure (e.g., a communication graph) or a sequence of agent actions (Cang et al., 3 Mar 2026, Feng et al., 21 Apr 2026).

For a given turn $t$ in a setting with $T$ total turns, the policy $\pi_\theta^t$ generates actions or structures based on the turn-specific state, with per-turn or final rewards provided via verifiable criteria. The multi-turn objective is to maximize the expected cumulative reward (or, in critic-free RLVR, the final/episodic reward) over the configuration trajectories produced during the interaction (Ekbote et al., 11 Nov 2025, Zhou et al., 1 Mar 2026).
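
As a compact restatement of the objective just described, one common way to write it is shown below. The symbols $\tau$ (trajectory), $s_t$ (turn-$t$ state), $a_t$ (turn-$t$ action or structure), and $r_t$ (turn-$t$ reward) are notation assumed here for illustration, not taken from the cited papers:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right], \qquad a_t \sim \pi_\theta^t(\cdot \mid s_t)$

In the critic-free RLVR case with only a final verifiable reward, this reduces to $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[r(\tau)\right]$, i.e., $r_t = 0$ for all $t < T$.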

2. Group-Relative Advantage and Multi-Turn Normalization

At the heart of the framework is group-relative advantage estimation. Rather than assigning credit solely based on the absolute reward of a single trajectory or action, GRPO samples a set of $K$ candidate solutions (trajectories, communication graphs, completions, etc.) per environment, prompt, or agent per turn. For each group:

Empirical Success Rate: For each sampled decision variable (e.g., agent edge $e_{ij}$ or action sequence), its empirical marginal success rate over the group,

$S_{ij} = \frac{\sum_{k=1}^K I[(i,j)\in G_k]\, r_k}{\sum_{k=1}^K I[(i,j)\in G_k] + \epsilon}$

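is computed, where $G_k$ is the $k$-th sampled candidate (e.g., a communication graph), $I[\cdot]$ is the indicator function, $r_k$ denotes the verifiable reward for candidate $k$, and $\epsilon$ is a small constant added for numerical stability.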

Group-Normalized Advantage: The mean $\mu_S$ and standard deviation $\sigma_S$ across the group are used to derive a group-centered z-scored advantage,

$A_{ij} = \frac{S_{ij} - \mu_S}{\sigma_S + \epsilon}$

The advantage serves as a fine-grained, variance-reduced teaching signal, assigning positive credit only to those components whose inclusion measurably improves success over the current group context. This mechanism is naturally extended in multi-turn settings by recomputing advantages at each turn $t$ and accumulating gradient or loss contributions accordingly (Cang et al., 3 Mar 2026, Feng et al., 21 Apr 2026, Ekbote et al., 11 Nov 2025).
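
The two steps in this section (per-edge success rate, then z-scoring across the group) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the cited papers: candidates are represented as sets of directed edges, the function names `edge_success_rates` and `group_advantages` are hypothetical, and the population standard deviation over the edges observed in the group is assumed as the normalizer.

```python
import math

def edge_success_rates(graphs, rewards, eps=1e-8):
    """S_ij = sum_k 1[(i,j) in G_k] * r_k / (sum_k 1[(i,j) in G_k] + eps)."""
    num, den = {}, {}
    for edges, r in zip(graphs, rewards):
        for e in edges:
            num[e] = num.get(e, 0.0) + r      # reward mass of candidates containing e
            den[e] = den.get(e, 0.0) + 1.0    # number of candidates containing e
    return {e: num[e] / (den[e] + eps) for e in num}

def group_advantages(success, eps=1e-8):
    """Z-score the success rates across the edges observed in the group.

    Assumption: population standard deviation, with eps added to the denominator.
    """
    vals = list(success.values())
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return {e: (s - mu) / (sigma + eps) for e, s in success.items()}

# Toy group of K = 4 sampled communication graphs with binary verifiable rewards.
graphs = [{(0, 1), (1, 2)}, {(0, 1)}, {(1, 2), (0, 2)}, {(0, 2)}]
rewards = [1.0, 1.0, 0.0, 0.0]

S = edge_success_rates(graphs, rewards)   # (0,1) ~ 1.0, (1,2) ~ 0.5, (0,2) = 0.0
A = group_advantages(S)                   # positive for (0,1), ~0 for (1,2), negative for (0,2)
```

In a multi-turn run, the same two calls would be repeated on the group sampled at each turn, with the resulting per-turn advantages feeding the accumulated loss.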

3. Policy Update and Optimization Mechanics

The GRPO policy update is constructed using a PPO-style clipped surrogate loss over group-normalized advantages, with a trust-region KL penalty anchoring the update toward a reference policy:

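$\mathcal{L}(\theta) = -\,\mathbb{E}\left[\min\left(\rho_\theta\, A,\ \operatorname{clip}\left(\rho_\theta,\, 1-\epsilon_{\mathrm{clip}},\, 1+\epsilon_{\mathrm{clip}}\right) A\right)\right] + \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right), \qquad \rho_\theta = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}$

where $a$ represents an action/decision variable, and $s$ encodes the turn-specific state, query, or input; $A$ is the group-normalized advantage, $\epsilon_{\mathrm{clip}}$ is the clipping range, $\beta$ is the KL coefficient, and $\pi_{\mathrm{ref}}$ is the reference policy. Iteratively, the optimizer (typically Adam) performs gradient descent on this loss. In multi-turn regimes, summed or accumulated advantages from all turns are backpropagated, either with separate per-turn policy heads or parameter sharing across turns (Cang et al., 3 Mar 2026, Ekbote et al., 11 Nov 2025).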

Notably, in multi-agent systems, the per-agent or per-edge advantage is computed over parallel environment rollouts and used to update the corresponding agent's (or action's) policy network (Feng et al., 21 Apr 2026). The group baseline subtraction and normalization mechanisms allow GRPO to circumvent saddle points and reward collapse associated with uniform or homogeneous group rewards, a recognized problem in sparse- or binary-reward domains (Zhong et al., 3 Feb 2026, Salmani-Zarchi et al., 4 Jun 2026).
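
A PPO-style clipped surrogate with a KL penalty, as described in this section, might look like the following scalar sketch. Several details are assumptions made for illustration rather than facts from the cited papers: the function name `grpo_loss`, the default values of `clip_eps` and `beta`, and the choice of the nonnegative per-sample estimator $e^{x} - x - 1$ with $x = \log \pi_{\mathrm{ref}} - \log \pi_\theta$ for the KL term. Inputs are per-sample log-probabilities, so no autodiff framework is needed to follow the arithmetic.

```python
import math

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.01):
    """Mean clipped-surrogate loss (to be minimized) plus a KL penalty.

    logp_new / logp_old / logp_ref: log-probabilities of each sampled action under
    the current, behavior, and reference policies. advantages: group-normalized
    advantages for the same samples. clip_eps and beta are illustrative defaults.
    """
    total = 0.0
    for ln, lo, lr, a in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)                              # importance ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * a, clipped * a)                # pessimistic bound
        x = lr - ln
        kl = math.exp(x) - x - 1.0                             # >= 0, zero iff x == 0
        total += -surrogate + beta * kl
    return total / len(advantages)

# With identical policies the ratio is 1 and the KL term is 0,
# so the loss is just the negative mean advantage.
on_policy = grpo_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.5])

# A ratio of 2 with a positive advantage is clipped at 1 + clip_eps.
clipped_up = grpo_loss([math.log(2.0)], [0.0], [0.0], [1.0], beta=0.0)
```

In a multi-turn setting this per-sample loss would be evaluated on every turn's group and summed or averaged before the optimizer step; whether the turns share parameters is a design choice the loss itself does not fix.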

4. Variance Reduction, Credit Assignment, and Theoretical Properties

GRPO's main statistical virtue is variance reduction. By anchoring the learning signal to within-group performance, fluctuations arising from query/task difficulty or reward stochasticity are sharply attenuated. In multi-turn settings, the group normalization:

Prevents gradient blowup on trivial/easy turns or queries where all samples are correct, because $A_{ij} \approx 0$ in these cases,
Avoids reward vanishing in difficult scenarios, since $A_{ij} = 0$ when all $r_k = 0$, thus skipping non-informative updates,
Supports edge-level or turn-level localization of credit, so that only positively contributing substructures, decisions, or turns are reinforced (Cang et al., 3 Mar 2026, Ekbote et al., 11 Nov 2025, Zhou et al., 1 Mar 2026).
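
The first two cases in the list above can be checked directly. The sketch below uses the simpler per-candidate form of group normalization (one scalar reward per sample rather than per-edge success rates); the helper names `normalized_advantages` and `is_informative` and the tolerance value are assumptions made for this example.

```python
import statistics

def normalized_advantages(rewards, eps=1e-8):
    """Z-score a group of scalar rewards; eps keeps the division finite."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def is_informative(rewards, tol=1e-12):
    """A group carries a learning signal only if its rewards are not all equal."""
    return statistics.pstdev(rewards) > tol

all_correct = normalized_advantages([1.0, 1.0, 1.0, 1.0])   # every advantage is 0.0
all_wrong = normalized_advantages([0.0, 0.0, 0.0, 0.0])     # every advantage is 0.0
mixed = normalized_advantages([1.0, 0.0, 0.0, 0.0])         # one positive, three negative
```

Because the numerator is exactly zero for a homogeneous group, the small `eps` in the denominator never amplifies anything; a training loop could additionally use `is_informative` to skip such groups and save compute.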

The policy gradient estimated by multi-turn GRPO is formally a U-statistic, i.e., an average over all unique pairwise differences of score-reward products, enabling tight mean-squared error, bias, and asymptotic optimality analyses. GRPO matches the oracle policy-gradient variance in the infinite-group limit and is strictly better than vanilla REINFORCE and other naive baseline approaches in expected mean-squared error (Zhou et al., 1 Mar 2026). A universal scaling law for the group size $K$ ensures robust performance; in practice, $K$ is determined empirically for multi-turn chains.
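
The pairwise structure behind the U-statistic view can be verified numerically in a simplified setting. The sketch below assumes a leave-one-out mean baseline and omits the standard-deviation scaling (so it illustrates the mean-centered part of the estimator only), and it uses scalar stand-ins for the score-function values; the function names are hypothetical. Under these assumptions, the baseline-subtracted estimator equals an average over unordered pairs of the kernel $\tfrac{1}{2}(r_k - r_j)(g_k - g_j)$. Centering with the full group mean instead gives the same quantity scaled by $(K-1)/K$.

```python
from itertools import combinations

def loo_gradient(rewards, scores):
    """(1/K) * sum_k (r_k - mean of the other K-1 rewards) * g_k."""
    K = len(rewards)
    total = sum(rewards)
    return sum((r - (total - r) / (K - 1)) * g for r, g in zip(rewards, scores)) / K

def pairwise_gradient(rewards, scores):
    """Average over unordered pairs of 0.5 * (r_k - r_j) * (g_k - g_j)."""
    pairs = list(combinations(range(len(rewards)), 2))
    return sum(0.5 * (rewards[k] - rewards[j]) * (scores[k] - scores[j])
               for k, j in pairs) / len(pairs)

rewards = [1.0, 0.0, 0.0, 1.0, 0.5]
scores = [0.3, -1.2, 0.7, 2.0, -0.4]   # scalar stand-ins for score-function values

g_loo = loo_gradient(rewards, scores)
g_pairs = pairwise_gradient(rewards, scores)   # matches g_loo up to rounding
```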

5. Multi-Turn Algorithmic Extensions and Applications

Multi-Turn GRPO has been instantiated in several algorithmic variants and task domains:

Topology Optimization in Multi-Agent Systems: In dynamic communication-graph learning, each turn can correspond to an interaction or subgoal, with iterative group-sampled graphs and per-edge relative advantages driving the policy update (Cang et al., 3 Mar 2026).
GRPO for Multi-Agent Continuous Control: In the multi-agent GRPO variant of Feng et al., underwater biomimetic pursuit policies leverage multi-agent group normalization for joint pursuit, using centralized training with decentralized execution and observation-history-driven policies for long-horizon scenarios (Feng et al., 21 Apr 2026).
Self-Correcting Code Generation and Reasoning: Algorithms such as Murphy and MGRPO extend GRPO to iterative self-correction, where each stage produces new rollouts from failed leaves, and rewards are propagated back along the decision tree via maximization or mean rules, with group normalization applied per turn. This yields significant gains in pass@1 and minimization of reward collapse in code and reasoning benchmarks (Ekbote et al., 11 Nov 2025, Ding et al., 5 Jun 2025).
Interactive Dialogue and Dynamically Evolving Environments: The GRPO variant of Song et al. uses independent group normalization for both turn-level and trajectory-level rewards, fusion for credit signal shaping, and hard vetoes for safety constraints, successfully stabilizing emotionally-sensitive multi-turn agent training (Song et al., 7 Jun 2026).
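
To make the self-correction item above more concrete, the following sketch shows one way rewards could be propagated from retries back to the attempt that spawned them, followed by per-turn group normalization. It is a hypothetical illustration and not the rule used by Murphy or MGRPO: the `Node` class, the `propagate` and `normalize` helpers, and the choice to credit a retried attempt with the max (or mean) of its retries' values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from statistics import fmean, pstdev

@dataclass
class Node:
    reward: float                                   # verifiable reward of this attempt
    children: list = field(default_factory=list)    # retries spawned from a failed attempt

def propagate(node, rule=max):
    """Credit an attempt with its own reward, or with rule(...) over its retries."""
    if not node.children:
        return node.reward
    return rule([propagate(child, rule) for child in node.children])

def normalize(values, eps=1e-8):
    """Group-normalize the credited rewards of one turn."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

# Turn 1: K = 4 attempts; each of the two failures spawns two retries (turn 2).
roots = [
    Node(1.0),
    Node(0.0, [Node(1.0), Node(0.0)]),
    Node(0.0, [Node(0.0), Node(0.0)]),
    Node(1.0),
]

credited_max = [propagate(n, max) for n in roots]      # [1.0, 1.0, 0.0, 1.0]
credited_mean = [propagate(n, fmean) for n in roots]   # [1.0, 0.5, 0.0, 1.0]
turn1_advantages = normalize(credited_mean)
```

The max rule credits a first attempt fully as soon as any retry succeeds, while the mean rule credits it in proportion to how often retries succeed; the retries themselves would be normalized within their own turn-2 groups.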

6. Limitations, Stability Considerations, and Empirical Performance

Despite its robustness, multi-turn GRPO may encounter degenerate gradients when group variance vanishes, e.g., after overfitting in supervised finetuning (SFT) or with highly homogeneous policies. Variance-preserving interventions, such as multi-temperature sampling, dual-anchor baselines, reward conditioning, or explicit entropy regularization, are necessary in such regimes (Salmani-Zarchi et al., 4 Jun 2026, Zhong et al., 3 Feb 2026). Compared to token-level and per-trajectory PPO advantage baselines, GRPO remains sample-efficient and stable even in low-reward or long-horizon cases, but recent extensions (e.g., Murphy, MGRPO, and the turn- and trajectory-normalized variant of Song et al.) further improve learning dynamics by exploiting self-correction, multi-scale normalization, and policy diversity maintenance.

Empirical results across multi-agent pursuit (Feng et al., 21 Apr 2026), complex reasoning (Cang et al., 3 Mar 2026), tool-use (Zhong et al., 3 Feb 2026), and code generation (Ekbote et al., 11 Nov 2025) show that Multi-Turn GRPO consistently yields higher or on-par final performance compared with state-of-the-art baselines, supports scalable training across team sizes and long horizons, and is robust to the variance and reward sparsity typical in RLVR for LLMs.

7. Future Directions and Open Problems

Open research areas for multi-turn GRPO include further theoretical characterization of cross-turn and cross-layer credit assignment, enhancement of group-induced variance with more adaptive or structured sampling, integration of learned or hybrid value critics for dense intermediate rewards, and extension to fully decentralized, off-policy, or partially observable multi-agent settings. The interplay between group-size selection, rollout reuse, staleness management, and system efficiency represents an active system-level optimization domain, with the GRPO variant of Tian et al. providing substantial speedups under high-staleness regimes (Tian et al., 17 May 2026). Additional practical questions concern the optimal composition of reward signals, scaling of group-based normalization under massive agent pools, and the application of multi-turn GRPO to new agentic and interactive domains (Ekbote et al., 11 Nov 2025, Cang et al., 3 Mar 2026, Feng et al., 21 Apr 2026).