GTPO: Grouped Turn-wise PPO
- GTPO is a reinforcement learning algorithm designed for multi-turn tool-integrated tasks, treating each dialogue turn as a step in a Markov decision process.
- It utilizes turn-level reward assignment, groupwise normalized advantage estimation, and self-supervised reward shaping to enhance training stability and performance.
- Empirical results show GTPO’s superior performance over GRPO and token-level PPO in tasks such as code synthesis, mathematical reasoning, and multi-hop question answering.
Grouped Turn-wise Policy Optimization (GTPO), also referred to as Multi-Turn PPO (MT-PPO), is a class of reinforcement learning (RL) algorithms designed to enhance multi-turn reasoning and tool-integration capabilities in LLMs. GTPO reformulates long-horizon tool-augmented tasks as multi-step Markov decision processes (MDPs) at the granularity of dialogue or action turns, assigning fine-grained rewards, performing return-based advantage estimation, and leveraging structured credit assignment to resolve challenges associated with sparse, trajectory-level RL signals. It is widely deployed for agentic LLM training in environments necessitating multi-step reasoning, code synthesis, and dynamic execution (Ding et al., 18 Nov 2025, Wei et al., 17 May 2025).
1. Problem Setting: Multi-Turn Tool-Integrated Reasoning (TIR)
GTPO is motivated by the limitations of conventional algorithms such as Group Relative Policy Optimization (GRPO), which assign a single, trajectory-level outcome reward for an entire sequence of LLM–tool or LLM–environment interactions. In Tool-Integrated Reasoning (TIR), the LLM iteratively engages with an external tool (e.g., a code interpreter): at each turn it generates both natural language and executable code, receives tool feedback, and synthesizes an updated state from the interaction so far. Standard trajectory-level RL collapses potentially rich intermediate signals into a single scalar and loses temporal credit assignment, which impedes training stability and makes the learning signal less informative.
GTPO approaches the TIR task as a proper multi-turn MDP, where each turn forms a state–action–reward–transition tuple, and fine-grained feedback facilitates improved learning signals (e.g., code format correctness, intermediate retrieval quality, or partial answer validity) (Ding et al., 18 Nov 2025).
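The turn-level MDP view can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `Turn` and `Trajectory`, the field names, and the choice to build the next state by plain string concatenation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One state-action-reward step at the granularity of a dialogue turn."""
    state: str          # context visible to the model at the start of the turn
    action: str         # model output: natural language plus executable code
    tool_feedback: str  # output returned by the external tool for this action
    reward: float = 0.0 # turn-level reward


@dataclass
class Trajectory:
    prompt: str
    turns: List[Turn] = field(default_factory=list)

    def current_state(self) -> str:
        # Assumption: the next state is the previous state followed by the
        # action and the tool feedback (simple concatenation).
        if not self.turns:
            return self.prompt
        last = self.turns[-1]
        return last.state + last.action + last.tool_feedback

    def step(self, action: str, tool_feedback: str, reward: float = 0.0) -> Turn:
        # Record one transition and return it.
        turn = Turn(self.current_state(), action, tool_feedback, reward)
        self.turns.append(turn)
        return turn
```

Keeping the reward on each `Turn` rather than on the trajectory is what later allows per-turn returns and advantages to be computed.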
2. Algorithmic Formulation
The GTPO framework consists of several core components:
2.1 Turn-Level Reward Assignment
Every trajectory $i$ and turn $t$ is assigned a reward:
$$r_{i,t} = r^{\mathrm{acc}}_{i,t} + r^{\mathrm{fmt}}_{i,t}$$
where $r^{\mathrm{acc}}_{i,t}$ is an accuracy reward (1 if the final output is correct and $t$ is the final turn; 0 otherwise) and $r^{\mathrm{fmt}}_{i,t}$ is a penalty for code syntax/format errors (a negative value for errors, 0 otherwise) (Ding et al., 18 Nov 2025). In broader GTPO/MT-PPO instantiations, richer intermediate rewards (e.g., verifiable retrieval success, correct tag emission, or LLM-as-judge rubric scores) can be employed as well (Wei et al., 17 May 2025).
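As an illustration of the turn-level reward described above, the sketch below combines an accuracy term with a format penalty. The function name `turn_reward` and the penalty magnitude `FORMAT_PENALTY = -0.1` are hypothetical; the source does not state the penalty value here.

```python
# Hypothetical penalty magnitude, chosen only for illustration.
FORMAT_PENALTY = -0.1


def turn_reward(is_final_turn: bool,
                answer_correct: bool,
                has_format_error: bool,
                format_penalty: float = FORMAT_PENALTY) -> float:
    """Turn-level reward: accuracy term plus code format penalty."""
    # Accuracy reward is granted only at the final turn with a correct answer.
    accuracy = 1.0 if (is_final_turn and answer_correct) else 0.0
    # The penalty applies at any turn whose code has a syntax/format error.
    penalty = format_penalty if has_format_error else 0.0
    return accuracy + penalty
```

For example, an intermediate turn with malformed code receives only the penalty, while a clean final turn with a correct answer receives the full accuracy reward.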
2.2 Return-Based Advantage Estimation
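For each trajectory $i$ and turn $t$, GTPO computes a discounted return over the remaining turns:
$$R_{i,t} = \sum_{k=t}^{T_i} \gamma^{\,k-t}\, r_{i,k}$$
where $r_{i,k}$ is the reward at turn $k$ of trajectory $i$, $\gamma$ is the discount factor, and $T_i$ is the number of turns in trajectory $i$.
Advantages are then standardized across a group of $G$ parallel rollouts:
$$A_{i,t} = \frac{R_{i,t} - \operatorname{mean}\left(\{R_{j,t}\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_{j,t}\}_{j=1}^{G}\right)}$$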
This group-wise normalization provides low-variance, relative advantage signals and stabilizes gradient estimates (Ding et al., 18 Nov 2025).
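The two steps above (discounted returns, then standardization across the group) can be sketched in a few lines. This is a hedged illustration: the function names are invented, the small constant `eps` is an added numerical safeguard, and computing the mean and standard deviation separately for each turn index is an assumption about how the group statistics are pooled.

```python
from statistics import mean, pstdev


def discounted_returns(rewards, gamma=0.9):
    """Return-to-go for each turn of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last turn backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def group_normalized_advantages(group_rewards, gamma=0.9, eps=1e-8):
    """Standardize returns across the rollouts of one prompt, per turn index."""
    group_returns = [discounted_returns(r, gamma) for r in group_rewards]
    max_turns = max(len(r) for r in group_returns)
    advantages = [[0.0] * len(r) for r in group_returns]
    for t in range(max_turns):
        # Only rollouts that actually reached turn t contribute statistics.
        members = [i for i, r in enumerate(group_returns) if t < len(r)]
        values = [group_returns[i][t] for i in members]
        mu, sigma = mean(values), pstdev(values)
        for i in members:
            advantages[i][t] = (group_returns[i][t] - mu) / (sigma + eps)
    return advantages
```

With two rollouts whose per-turn rewards are `[0, 0, 1]` and `[0, 0, 0]`, the successful rollout gets a positive advantage at every turn and the failed one a matching negative advantage, even though only the last turn carried a nonzero reward.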
2.3 Self-Supervised Reward Shaping
To densify reward signals, particularly for failed terminal trajectories, GTPO introduces reward shaping via code similarity. For any trajectory $i \in \mathcal{F}$ (trajectories without a correct final answer), a reward proportional to the maximum embedding similarity to code from correct trajectories $j \in \mathcal{C}$ is assigned:
$$r^{\mathrm{shape}}_{i} = \alpha \cdot \max_{j \in \mathcal{C}} \operatorname{sim}\left(c_{i,T},\, c_{j,T}\right)$$
where $c_{i,T}$ is the concatenated code up to turn $T$, $\alpha$ is the proportionality coefficient, and $\operatorname{sim}$ is an off-the-shelf embedding similarity (e.g., Titan Embeddings V2) (Ding et al., 18 Nov 2025). This supplements sparse binary feedback with granular, self-supervised signals.
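A minimal sketch of this shaping step follows. It assumes the code of each rollout has already been embedded into a vector by some external model, uses cosine similarity as the similarity measure, and introduces a hypothetical `scale` coefficient; none of these specifics are confirmed by the source beyond the use of an embedding similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def shaping_rewards(code_embeddings, is_correct, scale=0.5):
    """Shaping bonus per rollout: failed rollouts are scored against correct ones.

    code_embeddings: one vector per rollout (embedding of its concatenated code).
    is_correct:      one bool per rollout (final answer correct or not).
    scale:           hypothetical proportionality coefficient.
    """
    correct = [e for e, ok in zip(code_embeddings, is_correct) if ok]
    bonuses = []
    for emb, ok in zip(code_embeddings, is_correct):
        if ok or not correct:
            # Correct rollouts get no bonus; with no correct rollout in the
            # group there is nothing to compare against.
            bonuses.append(0.0)
        else:
            best = max(cosine_similarity(emb, c) for c in correct)
            bonuses.append(scale * best)
    return bonuses
```

A failed rollout whose code closely resembles a successful one therefore receives a larger bonus than a failed rollout that diverged entirely.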
2.4 Surrogate Objective and Optimization
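GTPO extends the PPO-clipped surrogate objective over grouped turns and tokens:
$$J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{1}{|a_{i,t}|}\sum_{k=1}^{|a_{i,t}|}\min\left(\rho_{i,t,k}\,A_{i,t},\ \operatorname{clip}\left(\rho_{i,t,k},\, 1-\epsilon,\, 1+\epsilon\right)A_{i,t}\right)\right]$$
where $\rho_{i,t,k} = \pi_\theta(a_{i,t,k} \mid s_{i,t}, a_{i,t,<k}) \,/\, \pi_{\theta_{\mathrm{old}}}(a_{i,t,k} \mid s_{i,t}, a_{i,t,<k})$ is the importance ratio for the $k$-th token of the action at turn $t$ of trajectory $i$, $A_{i,t}$ is the turn-level advantage, $\epsilon$ is the clipping range, and policy/critic updates proceed via AdamW or similar (Ding et al., 18 Nov 2025). The group structure enables both per-turn and per-token optimization, supporting dense feedback assignment.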
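To show how a turn-level advantage interacts with token-level importance ratios, here is a small sketch of the standard PPO clipped term for the tokens of one turn. The function name, the default `clip_eps=0.2`, and averaging over the tokens of the turn are assumptions for illustration rather than details taken from the source.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Mean PPO-clipped objective over the tokens of a single turn.

    logp_new / logp_old: per-token log-probabilities under the current and
    the rollout (old) policy. All tokens share one turn-level advantage.
    """
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("need matching, non-empty log-probability lists")
    terms = []
    for new, old in zip(logp_new, logp_old):
        ratio = math.exp(new - old)  # importance ratio for this token
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two candidate terms.
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms)
```

When the policy has not moved (`logp_new == logp_old`), every ratio is 1 and the objective equals the advantage; a ratio far above `1 + clip_eps` with a positive advantage is capped, which limits the size of a single update.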
3. Relation to Alternative Multi-Turn RL Methods
GTPO is a direct response to the empirical and theoretical limitations of GRPO and standard token-level PPO in multi-turn agent settings:
- GRPO: Adopts a group-based normalization but assigns a sole scalar advantage per trajectory, disregarding turn-level heterogeneity. Lacks a learned critic and is prone to instability and high variance, especially in long-horizon settings (Wei et al., 17 May 2025, Li et al., 18 Dec 2025).
- Token-level PPO: Computes advantages and importance weights for every token. While more fine-grained than GRPO, it is misaligned with the natural turn structure of multi-turn environments and is susceptible to gradient instability, especially when agent and environmental tokens are intermixed (Li et al., 25 Nov 2025, Li et al., 18 Dec 2025).
- Turn-PPO/ST-PPO: Related approaches such as Turn-PPO and ST-PPO perform PPO updates at the turn level, with either turn-wise or stabilized advantage estimation and importance sampling, yielding improved stability, lower-variance updates, and better resistance to catastrophic training collapse in long-horizon tasks (Li et al., 25 Nov 2025, Li et al., 18 Dec 2025).
A key difference is that GTPO, especially in its tool-integrated instantiations, incorporates additional self-supervised shaping and groupwise normalization, specifically tailored to multi-turn code+tool-agentic settings (Ding et al., 18 Nov 2025).
4. Empirical Performance and Ablation Insights
GTPO has been empirically evaluated on various benchmarks emphasizing multi-turn mathematical and reasoning capabilities, including AIME 2024/2025, MATH 500, AMC 2023, and SVAMP. Key results include:
- Average pass rate: GTPO yielded 51.26% versus 49.78% for GRPO, representing a 3.0% relative improvement under identical conditions.
- Task-specific gains: Largest improvements observed on AIME 2024 (+8.0% over both prompting and GRPO baselines), MATH 500, and SVAMP.
- Ablation studies:
  - Removing turn-level rewards decreased average performance by 2–3 points.
  - Disabling discounting (γ=1.0) led to a 0.7 point drop.
  - Eliminating reward shaping reduced scores by ~0.8 points.
- Scalability: Performance increases monotonically as the turn count (T) is raised from 1 to 3, indicating benefits from leveraging longer reasoning chains.
- Best hyperparameters: Optimal discount factor γ=0.9. Embedding-based code similarity outperforms character-level or full-trajectory reward shaping (Ding et al., 18 Nov 2025).
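The headline comparison mixes an absolute gap with a relative one, so it helps to recompute both from the two reported average pass rates. The variable names below are illustrative only.

```python
# Reported average pass rates (percent) for the two methods.
gtpo_pass_rate = 51.26
grpo_pass_rate = 49.78

# Absolute gap in percentage points.
absolute_gain = gtpo_pass_rate - grpo_pass_rate

# Relative improvement over the GRPO baseline, in percent.
relative_gain = absolute_gain / grpo_pass_rate * 100.0

print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.1f}%")
```

The gap is about 1.48 percentage points, which corresponds to the roughly 3.0% relative improvement quoted in the results.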
Experiments on general QA and multi-hop QA tasks further substantiate that turn-level reward design, as instantiated in GTPO, promotes faster convergence, higher accuracy, and dramatically greater format correctness relative to both GRPO and step-level PPO (Wei et al., 17 May 2025).
5. Training and Implementation Protocol
A typical GTPO training cycle proceeds with:
- Rollout Collection: Generate G on-policy rollouts per initial prompt, track per-turn rewards and tool outputs.
- Reward Shaping: Compute code similarity-based self-supervised rewards for failed trajectories at the terminal turn.
- Advantage Computation: Calculate discounted returns and group-normalized advantages per turn.
- Policy Update: Optimize the PPO-style clipped surrogate with respect to all turns and tokens.
- Iteration: Repeat until convergence or desired evaluation metric is achieved (Ding et al., 18 Nov 2025, Wei et al., 17 May 2025).
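The cycle listed above can be summarized as a single function that wires the stages together. This is a structural sketch only: `gtpo_iteration` and its callable parameters (`rollout_fn`, `shape_fn`, `advantage_fn`, `update_fn`) are hypothetical placeholders, and the trajectory is assumed to be a dictionary with a `"rewards"` list holding one entry per turn.

```python
def gtpo_iteration(prompt, rollout_fn, shape_fn, advantage_fn, update_fn,
                   group_size=4):
    """One training iteration for a single prompt (data flow only).

    rollout_fn(prompt)        -> dict with a "rewards" list (one per turn)
    shape_fn(group)           -> list of shaping bonuses, one per rollout
    advantage_fn(rewards)     -> per-rollout, per-turn advantages
    update_fn(group, advs)    -> result of the policy update
    """
    # 1. Rollout collection: sample a group of on-policy rollouts.
    group = [rollout_fn(prompt) for _ in range(group_size)]
    rewards = [list(traj["rewards"]) for traj in group]

    # 2. Reward shaping: add each rollout's bonus at its terminal turn.
    bonuses = shape_fn(group)
    for turn_rewards, bonus in zip(rewards, bonuses):
        turn_rewards[-1] += bonus

    # 3. Advantage computation on the shaped turn-level rewards.
    advantages = advantage_fn(rewards)

    # 4. Policy update with the clipped surrogate (delegated to update_fn).
    return update_fn(group, advantages)
```

Passing the stages in as callables keeps the loop independent of any particular model, tool environment, or optimizer, which mirrors the claim that the method is agnostic to the backbone.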
The method is agnostic to backbone architectures and compatible with common LLM transformer families (e.g., Qwen2.5-7B). Practical implementation uses high-throughput distributed RL infrastructure (e.g., vLLM, FSDP, gradient checkpointing) and structures agent–environment interaction traces such that turn demarcations are clearly defined for grouping (Wei et al., 17 May 2025, Ding et al., 18 Nov 2025).
6. Scope, Generality, and Limitations
GTPO is a generic framework for multi-turn RL with per-turn credit assignment, groupwise advantage normalization, and flexible reward shaping. It does not require prompt engineering or data augmentation, generalizes to agentic LLM applications that expose intermediate signals, and can be extended to diverse domains, including web-shop navigation, mathematical problem solving, and knowledge-augmented retrieval (Wei et al., 17 May 2025, Ding et al., 18 Nov 2025, Li et al., 18 Dec 2025).
This suggests potential applicability to environments with more complex or noisy feedback structures, though most published results focus on domains with accessible verifiable or rubric-based rewards.
Limitations include scalability to extremely long-horizon settings (hundreds of turns), where variance management or hierarchical extensions are required, and the need for careful design of turn boundaries and feedback signals to avoid ambiguity or reward hacking (Li et al., 18 Dec 2025).
7. Comparative Summary
The table below summarizes methodology and empirical distinctions among leading multi-turn RL algorithms relevant to GTPO.
| Method | Credit Assignment | Advantage Estimation | Reward Granularity |
|---|---|---|---|
| GRPO | Trajectory-level | Final reward normalization | Binary |
| Token-PPO | Token-level | GAE, token | Sparse, noisy |
| Turn-PPO | Turn-level | GAE, turn | Turn |
| GTPO | Turn-level, grouped | Groupwise, normalized | Turn + shaped |
GTPO integrates the strengths of turn-level assignment, group-based normalization, and self-supervised shaping, achieving consistent gains in stability and final task performance over alternatives such as GRPO, token-level PPO, and Turn-PPO (Ding et al., 18 Nov 2025, Wei et al., 17 May 2025, Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).