Group Turn Policy Optimization (GTPO)

Updated 11 May 2026

Group Turn Policy Optimization (GTPO) is a family of RL algorithms that assigns rewards at token, turn, and trajectory levels for precise feedback in long-horizon tasks.
It overcomes the limitations of coarse credit assignment and gradient conflicts seen in standard methods like GRPO, enhancing reasoning and stability.
GTPO employs entropy-based and information gain techniques to shape rewards effectively, boosting performance in chain-of-thought and tool-integrated dialogue tasks.

Group Turn Policy Optimization (GTPO) is a family of reinforcement learning (RL) algorithms fundamentally designed to address fine-grained credit assignment for LLMs in long-horizon and multi-turn tasks. These algorithms represent an advance over standard group-based RL optimizers, particularly Group Relative Policy Optimization (GRPO), by resolving the coarse credit assignment and instability that limit reasoning performance in both single-turn and multi-turn environments. GTPO encapsulates several techniques spanning token-level, turn-level, and trajectory-level reward shaping, often incorporating model uncertainty or process signals, and is applicable to complex domains such as chain-of-thought reasoning and tool-integrated dialogues (Tan et al., 6 Aug 2025, Simoni et al., 5 Aug 2025, Ding et al., 18 Nov 2025, Chen et al., 7 May 2026).

1. Motivation and Problem Scope

Standard group-based RL algorithms, exemplified by GRPO, assign a uniform, sequence-level reward to all tokens or actions in a sampled group of model-generated completions or trajectories. This uniform credit assignment is fundamentally misaligned with the structure of long-chain reasoning or multi-turn tool-integrated reasoning tasks, where not all tokens or turns contribute equally to successful outcomes. Key deficiencies of GRPO include:

Coarse Credit Assignment: All tokens/turns in a trajectory—regardless of their decision criticality—receive the same reward, disregarding finer details of the agent's uncertainty or effort.
Gradient Conflicts: Tokens recurring across both positive and negative reward trajectories are subject to conflicting gradients, harming structural tokens' probability calibration and output formatting (Simoni et al., 5 Aug 2025).
Sparse Rewards for Complex Tasks: In multi-turn tool-integrated reasoning, binary outcome-based rewards ignore partially correct sub-decisions and useful intermediate steps, stagnating learning (Ding et al., 18 Nov 2025).
Policy Instability: Penalizing confident but incorrect outputs with negative advantage can trigger entropy drift, increasing the output distribution entropy and destabilizing policy learning (Simoni et al., 5 Aug 2025).
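The coarse credit assignment described above can be made concrete with a minimal sketch of the GRPO-style baseline: rewards are normalized within the group and the resulting scalar is copied to every token. The function name and the toy inputs are illustrative assumptions, not taken from any of the cited implementations.

```python
import numpy as np

def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Group-normalize sequence rewards, then broadcast one scalar per completion."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token of completion i receives the same advantage adv[i],
    # regardless of how much that token mattered for the outcome.
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Toy group of four completions with binary rewards and different lengths.
token_adv = grpo_token_advantages(rewards=[1.0, 0.0, 0.0, 1.0], lengths=[3, 5, 2, 4])
```

All tokens inside one completion share a single value here, which is exactly the uniformity that the token-level and turn-level variants replace.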

GTPO and its variants were developed to target these challenges by implementing fine-grained reward assignment at the level of individual tokens (token-level GTPO), “turns” in multi-turn reasoning tasks (turn-level GTPO), or with additional credit-shaping and filtering mechanisms (e.g., entropy-based or information-gain-based).

2. Algorithmic Foundations and Mathematical Formulations

GTPO encompasses a spectrum of related algorithms, each characterized by the granularity and type of reward signals used for policy optimization. The following table summarizes key GTPO variants and their central algorithmic features:

Algorithm	Reward Granularity	Shaping Signal	Distinctive Features
Token-level GTPO	Token	Policy entropy	Entropy-weighted per-token reward shaping, stop-gradient on entropy (Tan et al., 6 Aug 2025)
Turn-level GTPO	Turn	Trajectory returns, code	Discounted, normalized return, code-similarity shaping (Ding et al., 18 Nov 2025)
Trajectory GTPO	Trajectory/tokens	Conflict/entropy masks	Skipping negative updates for conflict tokens, entropy filters (Simoni et al., 5 Aug 2025)
A$^2$TGPO	Turn	Information Gain (IG)	Turn-group normalized IG, variance-rescaled accumulation, adaptive clipping (Chen et al., 7 May 2026)

2.1 Token-Level Entropy-Based Reward Shaping

Token-level GTPO (Tan et al., 6 Aug 2025) computes the per-token policy entropy and uses it to weight a token-specific reward, focusing learning on high-uncertainty (high-entropy) positions within correct responses, thus reflecting decision importance. Formally, for each batch:

Compute the policy entropy $H_{i,t}$ for token $t$ in sequence $i$.
For reward-1 sequences, form a normalized entropy bonus $\alpha \frac{H_{i,t}}{S_t} D_t$.
Assign token reward:

$\tilde r_{i,t} = \begin{cases} r_i + \alpha \frac{H_{i,t}}{S_t} D_t & \text{if } r_i = 1 \\ 0 & \text{if } r_i = 0 \end{cases}$

with $\alpha > 0$ a hyperparameter and $S_t = \sum_{i: r_i=1} H_{i,t}$ the summed entropy of the reward-1 sequences at position $t$.

Normalize $\tilde r_{i,t}$ across the batch for advantage computation; use PPO-style importance weighting and clipped objective (with stop-gradient on entropies).
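The shaping rule above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the reference implementation: the text does not define $D_t$, so it is taken here to be the number of reward-1 sequences still active at position $t$, and the function names, the padding mask, and the small stabilizing constant are all hypothetical. Entropies are plain arrays, which mirrors the stop-gradient treatment.

```python
import numpy as np

def token_entropy(probs, eps=1e-12):
    """Shannon entropy in nats of each next-token distribution; probs: (G, T, V)."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def shaped_token_rewards(H, r, mask, alpha=0.1, eps=1e-8):
    """H: (G, T) entropies, r: (G,) binary rewards, mask: (G, T) with 1 for real tokens."""
    pos = (r[:, None] == 1) * mask      # tokens belonging to reward-1 sequences
    S_t = (H * pos).sum(axis=0)         # S_t: summed entropy of reward-1 sequences at t
    D_t = pos.sum(axis=0)               # ASSUMPTION: count of reward-1 sequences alive at t
    bonus = alpha * H / (S_t + eps) * D_t
    return pos * (r[:, None] + bonus)   # reward-0 sequences and padding get 0

H = np.array([[0.2, 1.0, 0.5],
              [0.6, 1.0, 0.0],
              [0.9, 0.9, 0.9]])
r = np.array([1, 1, 0])
mask = np.array([[1, 1, 1],
                 [1, 1, 0],
                 [1, 1, 1]])
shaped = shaped_token_rewards(H, r, mask)
```

Under this reading of $D_t$, the bonus averages to $\alpha$ over the reward-1 sequences at each position, so the shaping redistributes credit toward high-entropy tokens without changing the mean reward of correct responses.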

2.2 Turn-Level Return-Based Advantage and Shaping

For multi-turn tool-integrated reasoning, GTPO reframes the process as a Markov Decision Process (MDP) where each action is a turn $(t_j, c_j)$. The algorithm assigns turn-level rewards:

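Turn reward: $r_{i,j} = r^{\text{out}}_{i,j} + r^{\text{fmt}}_{i,j}$, where $r^{\text{out}}_{i,j}$ is only nonzero for final (terminal) turns and $r^{\text{fmt}}_{i,j}$ penalizes syntax errors.
Discounted return, with discount factor $\gamma \in (0, 1]$ and $T_i$ the number of turns in trajectory $i$:

$R_{i,j} = \sum_{k=j}^{T_i} \gamma^{\,k-j} \, r_{i,k}$

Advantage is batch-normalized at each turn index for all trajectories in the sampled group:

$A_{i,j} = \frac{R_{i,j} - \operatorname{mean}_{i'}(R_{i',j})}{\operatorname{std}_{i'}(R_{i',j})}$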

Importantly, reward shaping employs self-supervised signals (e.g., code-embedding similarity to successful trajectories) to densify otherwise binary terminal rewards (Ding et al., 18 Nov 2025).
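A minimal sketch of the turn-level computation follows, assuming the standard discounted return $R_j = \sum_{k \ge j} \gamma^{k-j} r_k$ and mean/standard-deviation normalization across the group at each turn index. The function names, the NaN padding for trajectories of unequal length, and the toy rewards are illustrative assumptions; the code-similarity shaping term is omitted and would simply be added to the per-turn rewards before this step.

```python
import numpy as np

def discounted_returns(turn_rewards, gamma=0.9):
    """R_j = sum over k >= j of gamma^(k-j) * r_k for a single trajectory."""
    R = np.zeros(len(turn_rewards))
    running = 0.0
    for j in reversed(range(len(turn_rewards))):
        running = turn_rewards[j] + gamma * running
        R[j] = running
    return R

def turn_advantages(group_rewards, gamma=0.9, eps=1e-8):
    """Normalize returns across the group separately at each turn index.

    Trajectories may differ in length; NaN marks turns that do not exist.
    """
    T = max(len(r) for r in group_rewards)
    R = np.full((len(group_rewards), T), np.nan)
    for i, r in enumerate(group_rewards):
        R[i, :len(r)] = discounted_returns(r, gamma)
    mean = np.nanmean(R, axis=0)
    std = np.nanstd(R, axis=0)
    return (R - mean) / (std + eps)

# Toy group: a success in three turns, a syntax-penalized two-turn failure, a plain failure.
group = [[0.0, 0.0, 1.0], [-0.1, 0.0], [0.0, 0.0, 0.0]]
A = turn_advantages(group, gamma=0.5)
```

Because normalization is done per turn index, an early turn of a successful trajectory is compared only with other first turns, and discounting makes its credit smaller than that of the turn that actually produced the answer.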

2.3 Trajectory and Token Conflict Filtering

Trajectory-based GTPO (Simoni et al., 5 Aug 2025) introduces explicit conflict avoidance and entropy regularization mechanisms:

Conflict tokens (tokens occurring in the same position in both positive and negative trajectories) receive double positive weighting but no negative updates, protecting critical structure.
High-entropy completions (average token entropy above $\ln 2$) are filtered when the model's initial entropy is below $\ln 2$, leveraging a provable threshold: entropy below $\ln 2$ guarantees a majority token (probability above $1/2$) in the distribution.
The objective omits KL regularization to a reference model, instead directly penalizing average entropy and skipping masked tokens.
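The two mechanisms above can be sketched as follows. This is one hedged reading of the description, not the authors' code: "positive" completions are taken to be those with positive advantage, completions are assumed to be padded to equal length (padding handling is omitted), the entropy threshold is assumed to be $\ln 2$ in nats, and all function names are hypothetical.

```python
import math
import numpy as np

LN2 = math.log(2.0)  # ASSUMED entropy threshold, in nats

def conflict_mask(tokens, is_positive):
    """True where a token also appears at the same position in a completion of opposite sign."""
    tokens = np.asarray(tokens)
    pos = np.asarray(is_positive, dtype=bool)
    mask = np.zeros(tokens.shape, dtype=bool)
    for t in range(tokens.shape[1]):
        shared = np.intersect1d(tokens[pos, t], tokens[~pos, t])
        mask[:, t] = np.isin(tokens[:, t], shared)
    return mask

def conflict_aware_advantages(adv, tokens):
    """Double the weight of conflict tokens in positive completions, skip them in negative ones."""
    adv = np.asarray(adv, dtype=float)
    mask = conflict_mask(tokens, adv > 0)
    weights = np.ones(mask.shape)
    weights[mask & (adv[:, None] > 0)] = 2.0
    weights[mask & (adv[:, None] <= 0)] = 0.0
    return adv[:, None] * weights

def keep_completion(mean_entropy, initial_entropy, threshold=LN2):
    """Filter high-entropy completions only if the model started below the threshold."""
    if initial_entropy >= threshold:
        return True
    return mean_entropy <= threshold

tokens = [[5, 7, 9], [5, 8, 9], [5, 7, 3]]
token_adv = conflict_aware_advantages([1.0, -1.0, 1.0], tokens)
```

In the toy group, token 5 at position 0 and token 9 at position 2 occur in both a rewarded and a penalized completion, so the penalized completion contributes no gradient at those positions.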

2.4 Information Gain and Advanced Turn-Level Signals

A$^2$TGPO (Chen et al., 7 May 2026) extends turn-level credit by using per-turn information gain (log-ratio of new and old policy probabilities for the action). It normalizes IG within each prompt/turn “group,” accumulates it across the trajectory with variance rescaling, and applies adaptive clipping to PPO-style policy losses, widening or narrowing the trust region based on informativeness.
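The description above leaves the exact formulas open, so the following is only one possible reading, with every specific choice an assumption: IG values are z-scored across the group at each turn, accumulated as a suffix sum divided by the square root of the number of summed terms (a sum of $n$ unit-variance terms has variance $n$ if the terms are uncorrelated), and the clip range is widened with a hypothetical `tanh` rule. None of these function names or constants come from the cited work.

```python
import numpy as np

def turn_group_advantages(ig, eps=1e-8):
    """ig: (G, T) per-turn information gain for G trajectories of T turns each."""
    ig = np.asarray(ig, dtype=float)
    z = (ig - ig.mean(axis=0)) / (ig.std(axis=0) + eps)   # turn-group normalization
    suffix = np.cumsum(z[:, ::-1], axis=1)[:, ::-1]        # sum of z over turns k >= j
    n_terms = np.arange(ig.shape[1], 0, -1)                # number of terms in each suffix sum
    return suffix / np.sqrt(n_terms)                       # ASSUMED variance rescaling

def adaptive_clip_range(adv, base=0.2, beta=0.5):
    """HYPOTHETICAL rule: a wider trust region for more informative turns."""
    return base * (1.0 + beta * np.tanh(np.abs(adv)))

ig = [[0.2, 0.1],
      [0.0, 0.3]]
A = turn_group_advantages(ig)
clip = adaptive_clip_range(A)
```

The rescaling keeps early turns, which sum over many later terms, on the same scale as late turns, so a single clip schedule can be applied across the trajectory.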

3. Pseudocode and Training Workflow

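Each GTPO variant follows a structured sampling-update loop. For turn-level GTPO with code shaping (Ding et al., 18 Nov 2025), the core steps described in Section 2.2 are:

Sample a group of multi-turn trajectories for each prompt with the current policy.
Assign turn-level rewards, densifying the binary terminal reward with the code-similarity shaping signal.
Compute the discounted return for every turn.
Normalize returns across the group at each turn index to obtain advantages.
Update the policy on the resulting turn-level advantages.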

Salient steps in different GTPO variants include stop-gradient computation for entropy weights, conflict masking for token updates, and group-wise normalization for information gain.
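The update step shared by these variants can be illustrated with the standard PPO clipped surrogate. The sketch assumes that each token inherits the advantage of the turn it belongs to, which is a simplification made here for illustration; the helper names and the toy numbers are hypothetical.

```python
import numpy as np

def broadcast_turn_advantages(turn_adv, turn_ids):
    """Give every token the advantage of the turn it belongs to."""
    return np.asarray(turn_adv, dtype=float)[np.asarray(turn_ids)]

def clipped_surrogate(logp_new, logp_old, adv, mask, clip=0.2):
    """PPO-style clipped objective averaged over unmasked tokens (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    per_token = np.minimum(unclipped, clipped)   # pessimistic bound on the improvement
    return (per_token * mask).sum() / max(mask.sum(), 1)

# One trajectory with two turns: tokens 0-1 belong to turn 0, tokens 2-4 to turn 1.
adv = broadcast_turn_advantages([0.5, -1.0], [0, 0, 1, 1, 1])
logp_old = np.zeros(5)
logp_new = np.log(np.array([1.5, 1.0, 0.5, 1.0, 1.1]))
objective = clipped_surrogate(logp_new, logp_old, adv, mask=np.ones(5))
```

Token-level and conflict-masked variants fit the same function: they change only the `adv` and `mask` arrays passed in.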

4. Empirical Results and Benchmark Evaluations

Extensive experimentation demonstrates the empirical advantages of GTPO and its descendants:

Token-level GTPO and GRPO-S (sequence-level entropy shaping) both induce an “entropy rebound,” promoting longer and more exploratory chain-of-thought generations. GTPO consistently outperforms DAPO and outstrips GRPO-S in best-$k$ performance metrics on mathematical reasoning tasks, including Qwen2.5-32B experiments (Tan et al., 6 Aug 2025).
Turn-level GTPO achieves an average relative gain over GRPO on AIME 2024/25, AMC23, MATH500, and SVAMP, with ablations confirming the necessity of turn-level rewards, discounting, and shaped signals (Ding et al., 18 Nov 2025).
Trajectory-based GTPO eliminates the policy collapse evident in GRPO, achieves higher pass@$k$ and maj@$k$ across benchmarks (GSM8K, MATH, AIME 2024), and improves both formatting stability and accuracy (Simoni et al., 5 Aug 2025).
A$^2$TGPO demonstrates consistent gains in exact-match accuracy (+1.75 multi-hop, +1.69 single-hop) across several backbone LLMs, with turn-group normalization, variance rescaling, and adaptive clipping each contributing cumulatively to performance (Chen et al., 7 May 2026).

These results substantiate the central claim that fine-grained, context-aware credit assignment directly improves the reasoning capabilities, stability, and sample efficiency of RL-tuned LLMs.

5. Theoretical Properties, Analysis, and Limitations

The theoretical motivations and trade-offs of GTPO algorithms include:

Variance Reduction: Token-level reward averaging in GTPO yields lower variance than GRPO's sequence-then-token averaging while preserving total expected reward and convergence properties, as the expected gradient is a rescaling of the standard GRPO gradient (Tan et al., 6 Aug 2025).
Conflict Resolution: Skipping negative updates on conflict tokens ensures structural information is preserved, resolving a longstanding pathology in group-based policy optimization (Simoni et al., 5 Aug 2025).
Entropy Thresholding: The $\ln 2$ threshold for filtering is provably the point below which some token must have probability greater than $1/2$; at or above it, a distribution may have no majority token. This enforces a form of structured confidence in outputs, but may be overly conservative for high-entropy base models or domains valuing diversity (Simoni et al., 5 Aug 2025).
Computational Overheads: Most GTPO variants require computation or storage of per-token/turn entropies, group-normalized statistics, or code embeddings, incurring moderate additional memory and compute.
Scope of Applicability: While GTPO generalizes to a variety of LLM reasoning tasks, extensions to very large models, open-ended tasks without ground-truth, or more complex multi-modal agent environments remain open research directions (Chen et al., 7 May 2026, Ding et al., 18 Nov 2025).
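The majority-token property behind the entropy threshold can be verified in a few lines. The derivation below assumes that the threshold is $\ln 2$ with entropy measured in nats; $p$ denotes a next-token distribution over vocabulary items $v$.

```latex
\text{Suppose } p_v \le \tfrac{1}{2} \text{ for every token } v.
\text{ Then } -\ln p_v \ge \ln 2 \text{ for every } v \text{ with } p_v > 0, \text{ so}
\[
H(p) \;=\; -\sum_{v} p_v \ln p_v \;\ge\; \ln 2 \sum_{v} p_v \;=\; \ln 2 .
\]
\text{By contraposition, } H(p) < \ln 2 \;\Longrightarrow\; \exists\, v : \; p_v > \tfrac{1}{2}.
\]
\text{The bound is tight: } p = \left(\tfrac{1}{2}, \tfrac{1}{2}\right) \text{ has } H(p) = \ln 2 \text{ and no majority token.}
```

The tightness example is why the threshold cannot be raised without losing the guarantee.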

6. Future Directions

Several directions for further development and application of GTPO have been proposed:

Generalization to Preference Optimization: Combining dynamic entropy or information gain weighting with DPO-style preference optimization and learned token-level reward predictors (Tan et al., 6 Aug 2025).
Process Credit via New Intrinsic Signals: Beyond entropy and information gain, future work may harness task-specific process signals, per-turn value estimation, or uncertainty reduction for unsupervised credit assignment.
Scalability: Empirical studies have thus far focused on LLMs of up to 32B parameters; scaling to 70B+ or very long-horizon tasks poses computational and optimization challenges (Ding et al., 18 Nov 2025).
Enriched Tool Integration: Incorporating richer tool APIs, program semantics, or multi-modal grounding into GTPO frameworks to extend beyond code-centric TIR tasks.
Adaptive and Amortized Credit Models: Investigating learned credit assignment models, adaptive normalization, or amortized IG estimation for open-ended and multi-episode settings (Chen et al., 7 May 2026).

A plausible implication is that GTPO principles, especially when paired with robust process or uncertainty signals, are likely to become central to RL-based LLM alignment as the field moves toward more agentic, tool-augmented, and partially supervised deployment scenarios.

7. Summary

Group Turn Policy Optimization (GTPO) collectively refers to a set of algorithmic innovations that overcome major credit-assignment and stability challenges in RL for LLMs by distributing reward signals more granularly across tokens, turns, or process steps. By leveraging entropy-based and process-aware shaping, GTPO improves both accuracy and format consistency, directly addressing the limitations of coarse sequence-level RL. Recent instantiations (token-level entropy shaping, turn-level return-based shaping with self-supervision, and information-gain-guided process optimization) demonstrate systematic empirical benefits and theoretical robustness, and broaden the applicability of group-based RL methods for LLM alignment and reasoning (Tan et al., 6 Aug 2025, Ding et al., 18 Nov 2025, Chen et al., 7 May 2026, Simoni et al., 5 Aug 2025).