GTPO: Grouped Turn-wise PPO

Updated 15 March 2026
  • GTPO is a reinforcement learning algorithm designed for multi-turn tool-integrated tasks, treating each dialogue turn as a step in a Markov decision process.
  • It utilizes turn-level reward assignment, groupwise normalized advantage estimation, and self-supervised reward shaping to enhance training stability and performance.
  • Empirical results show GTPO’s superior performance over GRPO and token-level PPO in tasks such as code synthesis, mathematical reasoning, and multi-hop question answering.

Grouped Turn-wise Policy Optimization (GTPO), also referred to as Multi-Turn PPO (MT-PPO), is a class of reinforcement learning (RL) algorithms designed to enhance multi-turn reasoning and tool-integration capabilities in LLMs. GTPO reformulates long-horizon tool-augmented tasks as multi-step Markov decision processes (MDPs) at the granularity of dialogue or action turns. It assigns fine-grained rewards, performs return-based advantage estimation, and uses structured credit assignment to address the sparse, trajectory-level signals typical of outcome-only RL. It is applied to agentic LLM training in environments that require multi-step reasoning, code synthesis, and dynamic execution (Ding et al., 18 Nov 2025, Wei et al., 17 May 2025).

1. Problem Setting: Multi-Turn Tool-Integrated Reasoning (TIR)

GTPO is motivated by the limitations of conventional algorithms such as Group Relative Policy Optimization (GRPO), which assign a single, trajectory-level outcome reward for an entire sequence of LLM–tool or LLM–environment interactions. In Tool-Integrated Reasoning (TIR), the LLM iteratively engages with an external tool (e.g., a code interpreter): at turn $j$ it generates both natural language and executable code $y_j$, receives tool feedback $b_j$, and conditions on the state $s_j = \{x_0, y_1, b_1, \ldots, y_{j-1}, b_{j-1}\}$, where $x_0$ is the initial prompt. Standard trajectory-level RL collapses potentially rich intermediate signals and loses temporal credit assignment, which impedes training stability and weakens the learning signal.

GTPO approaches the TIR task as a proper multi-turn MDP, where each turn forms a state–action–reward–transition tuple, and fine-grained feedback facilitates improved learning signals (e.g., code format correctness, intermediate retrieval quality, or partial answer validity) (Ding et al., 18 Nov 2025).
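
The state construction described above can be illustrated with a minimal sketch. The `TurnTrace` class and its method names are hypothetical, introduced only for illustration; they are not taken from the cited papers.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TurnTrace:
    """Hypothetical container for one multi-turn TIR episode."""
    prompt: str  # x_0, the initial prompt
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (y_j, b_j) pairs

    def step(self, y: str, b: str) -> None:
        """Record the model output y_j and the tool feedback b_j for one turn."""
        self.turns.append((y, b))

    def state(self, j: int) -> List[str]:
        """Return s_j = {x_0, y_1, b_1, ..., y_{j-1}, b_{j-1}} for 1-indexed turn j."""
        if not 1 <= j <= len(self.turns) + 1:
            raise ValueError("turn index out of range")
        history = [self.prompt]
        for y, b in self.turns[: j - 1]:
            history.extend([y, b])
        return history
```

Under this layout, `state(1)` contains only the prompt, and each later state appends the previous turn's output and tool feedback, which keeps turn boundaries explicit for later grouping.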

2. Algorithmic Formulation

The GTPO framework consists of several core components:

2.1 Turn-Level Reward Assignment

Every trajectory $i$ and turn $j$ is assigned a reward:

$$r_{i,j} = r_{\text{acc}_{i,j}} + r_{\text{format}_{i,j}}$$

where $r_{\text{acc}_{i,j}}$ is an accuracy reward (1 if the final output is correct and $j = T$, with $T$ the final turn; 0 otherwise) and $r_{\text{format}_{i,j}}$ is a penalty for code syntax/format errors ($-0.1$ for errors, 0 otherwise) (Ding et al., 18 Nov 2025). In broader GTPO/MT-PPO instantiations, richer intermediate rewards (e.g., verifiable retrieval success, correct tag emission, or LLM-as-judge rubric scores) can be employed as well (Wei et al., 17 May 2025).
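
A minimal sketch of this reward rule follows, using the values stated above (1 for a correct final answer at the last turn, -0.1 for a format error). The function names and the boolean inputs are assumptions made for illustration.

```python
from typing import List


def turn_reward(j: int, T: int, final_correct: bool, format_error: bool) -> float:
    """Reward for turn j (1-indexed) of a trajectory with T turns."""
    # Accuracy reward: only the final turn of a correct trajectory earns 1.
    r_acc = 1.0 if (j == T and final_correct) else 0.0
    # Format penalty: -0.1 when the code emitted at this turn has a syntax/format error.
    r_format = -0.1 if format_error else 0.0
    return r_acc + r_format


def trajectory_rewards(final_correct: bool, format_errors: List[bool]) -> List[float]:
    """Per-turn rewards; format_errors[j-1] flags a format error at turn j."""
    T = len(format_errors)
    return [turn_reward(j, T, final_correct, format_errors[j - 1])
            for j in range(1, T + 1)]
```

For example, a three-turn trajectory with a format error at the first turn and a correct final answer receives a penalty at turn 1, zero at turn 2, and the accuracy reward at turn 3.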

2.2 Return-Based Advantage Estimation

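For each trajectory $i$ and turn $j$, GTPO computes a discounted return over the remaining turns, which is then normalized within the group:

$$R_{i,j} = \sum_{k=j}^{T} \gamma^{\,k-j}\, r_{i,k}$$

where $\gamma \in (0, 1]$ is the discount factor. Advantages are then standardized across a group of $G$ parallel rollouts:

$$A_{i,j} = \frac{R_{i,j} - \operatorname{mean}\left(\{R_{m,j}\}_{m=1}^{G}\right)}{\operatorname{std}\left(\{R_{m,j}\}_{m=1}^{G}\right)}$$

This group-wise normalization provides low-variance, relative advantage signals and stabilizes gradient estimates (Ding et al., 18 Nov 2025).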
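
The two steps above (a discounted return per turn, then standardization across the group) can be sketched as follows. This is an illustrative sketch under stated assumptions: all rollouts in a group have the same number of turns, statistics are taken per turn index across the group, and a small `eps` guards against zero variance. None of these details is confirmed by the source.

```python
from statistics import mean, pstdev
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Return-to-go for each turn: R_j = sum_{k >= j} gamma^(k-j) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for j in reversed(range(len(rewards))):
        running = rewards[j] + gamma * running
        returns[j] = running
    return returns


def group_normalized_advantages(group_rewards: List[List[float]],
                                gamma: float = 0.9,
                                eps: float = 1e-8) -> List[List[float]]:
    """Standardize per-turn returns across the G rollouts of one prompt."""
    group_returns = [discounted_returns(r, gamma) for r in group_rewards]
    num_turns = len(group_returns[0])  # assumption: equal turn counts in the group
    advantages = [[0.0] * num_turns for _ in group_returns]
    for j in range(num_turns):
        column = [R[j] for R in group_returns]
        mu, sigma = mean(column), pstdev(column)
        for i, R in enumerate(group_returns):
            advantages[i][j] = (R[j] - mu) / (sigma + eps)
    return advantages
```

With two rollouts where only the first reaches a correct final answer, the first rollout receives positive advantages at every turn and the second receives the mirrored negative values, because the final reward is propagated backward through the discount.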

2.3 Self-Supervised Reward Shaping

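To densify reward signals, particularly for failed terminal trajectories, GTPO introduces reward shaping via code similarity. For any failed trajectory $i \in \mathcal{F}$ (trajectories without a correct final answer), a reward proportional to the maximum embedding similarity to code from the correct trajectories $\mathcal{C}$ is assigned. In a form consistent with this description:

$$r^{\text{shape}}_{i} = \alpha \cdot \max_{m \in \mathcal{C}} \operatorname{sim}(c_{i}, c_{m})$$

where $c_i$ is the concatenated code of trajectory $i$ up to the final turn $T$, $\alpha$ denotes the proportionality coefficient, and $\operatorname{sim}$ is an off-the-shelf embedding similarity (e.g., Titan Embeddings V2) (Ding et al., 18 Nov 2025). This supplements sparse binary feedback with granular, self-supervised signals.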
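
As a rough illustration of similarity-based shaping, the sketch below assigns each failed trajectory a reward proportional to its maximum cosine similarity with any correct trajectory in the same group. The embeddings are assumed to be precomputed vectors (the embedding model itself is out of scope here), and the function names, the use of cosine similarity, and the coefficient `alpha=0.5` are illustrative assumptions rather than values from the source.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def shaped_rewards(code_embeddings: List[Sequence[float]],
                   is_correct: List[bool],
                   alpha: float = 0.5) -> List[float]:
    """Shaping reward per trajectory; correct trajectories get no shaping."""
    correct = [e for e, ok in zip(code_embeddings, is_correct) if ok]
    shaped = []
    for emb, ok in zip(code_embeddings, is_correct):
        if ok or not correct:
            # Nothing to add: already correct, or no correct reference in the group.
            shaped.append(0.0)
        else:
            best = max(cosine_similarity(emb, ref) for ref in correct)
            shaped.append(alpha * best)
    return shaped
```

If no trajectory in the group is correct, there is no reference code to compare against, so this sketch leaves all shaping rewards at zero.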

2.4 Surrogate Objective and Optimization

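GTPO extends the PPO-clipped surrogate objective over grouped turns and tokens. In standard PPO-clip form, with $A_{i,j}$ the group-normalized turn-level advantage applied to every token of turn $j$:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{\sum_{j}|y_{i,j}|}\sum_{j=1}^{T}\sum_{t=1}^{|y_{i,j}|}\min\left(\rho_{i,j,t}\,A_{i,j},\ \operatorname{clip}\left(\rho_{i,j,t},\,1-\epsilon,\,1+\epsilon\right)A_{i,j}\right)\right]$$

where $\rho_{i,j,t} = \pi_\theta(y_{i,j,t} \mid s_{i,j}, y_{i,j,<t}) \,/\, \pi_{\theta_{\text{old}}}(y_{i,j,t} \mid s_{i,j}, y_{i,j,<t})$ is the token-level importance ratio, $\epsilon$ is the clipping range, and policy/critic updates proceed via AdamW or similar (Ding et al., 18 Nov 2025). The group structure enables both per-turn and per-token optimization, supporting dense feedback assignment.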
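
The clipped surrogate can be sketched for a single trajectory as follows. This is a plain-Python illustration, not a training-ready implementation: it assumes per-token log-probabilities are already available as lists, applies each turn's advantage to all tokens of that turn, averages over tokens, and uses a clipping range of 0.2. These choices are assumptions for illustration.

```python
import math
from typing import List


def clipped_surrogate(logp_new: List[List[float]],
                      logp_old: List[List[float]],
                      turn_advantages: List[float],
                      clip_eps: float = 0.2) -> float:
    """PPO-clip objective for one trajectory (to be maximized).

    logp_new[j][t] / logp_old[j][t]: log-prob of token t in turn j under the
    current / behavior policy. turn_advantages[j]: advantage shared by turn j.
    """
    total, n_tokens = 0.0, 0
    for turn_new, turn_old, adv in zip(logp_new, logp_old, turn_advantages):
        for lp_new, lp_old in zip(turn_new, turn_old):
            ratio = math.exp(lp_new - lp_old)  # importance ratio
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            # Pessimistic bound: take the smaller of the two surrogate terms.
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return total / max(n_tokens, 1)
```

When the new and old policies coincide, every ratio equals 1 and the objective reduces to the token-weighted mean of the turn advantages; clipping only takes effect once the policy has moved away from the behavior policy.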

3. Relation to Alternative Multi-Turn RL Methods

GTPO is a direct response to the empirical and theoretical limitations of GRPO and standard token-level PPO in multi-turn agent settings:

  • GRPO: Adopts a group-based normalization but assigns a sole scalar advantage per trajectory, disregarding turn-level heterogeneity. Lacks a learned critic and is prone to instability and high variance, especially in long-horizon settings (Wei et al., 17 May 2025, Li et al., 18 Dec 2025).
  • Token-level PPO: Computes advantages and importance weights for every token. While more fine-grained than GRPO, it is misaligned with the natural turn structure of multi-turn environments and is susceptible to gradient instability, especially when agent and environmental tokens are intermixed (Li et al., 25 Nov 2025, Li et al., 18 Dec 2025).
  • Turn-PPO/ST-PPO: Related approaches such as Turn-PPO and ST-PPO perform PPO updates at the turn level, with either turn-wise or stabilized advantage estimation and importance sampling, yielding improved stability, lower-variance updates, and better resistance to catastrophic training collapse in long-horizon tasks (Li et al., 25 Nov 2025, Li et al., 18 Dec 2025).

A key difference is that GTPO, especially in its tool-integrated instantiations, incorporates additional self-supervised shaping and groupwise normalization, specifically tailored to multi-turn code+tool-agentic settings (Ding et al., 18 Nov 2025).
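
To make the contrast with GRPO concrete, the sketch below shows trajectory-level advantage assignment: the outcome reward is standardized across the group and the resulting scalar is broadcast to every turn. The function name and inputs are hypothetical, and the snippet is a simplified illustration of the GRPO behavior described above rather than a reference implementation.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(outcome_rewards: List[float],
                    turns_per_rollout: List[int],
                    eps: float = 1e-8) -> List[List[float]]:
    """One scalar advantage per trajectory, repeated for each of its turns."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    advantages = []
    for reward, n_turns in zip(outcome_rewards, turns_per_rollout):
        scalar = (reward - mu) / (sigma + eps)
        advantages.append([scalar] * n_turns)  # identical value at every turn
    return advantages


# Two successful rollouts and one failure, three turns each.
adv = grpo_advantages([1.0, 1.0, 0.0], [3, 3, 3])
```

Here the two successful rollouts receive identical advantages at every turn, regardless of what happened at intermediate turns (for example, a format error at turn 1 in only one of them). Turn-level rewards and returns are what allow such differences to show up in the learning signal.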

4. Empirical Performance and Ablation Insights

GTPO has been empirically evaluated on various benchmarks emphasizing multi-turn mathematical and reasoning capabilities, including AIME 2024/2025, MATH 500, AMC 2023, and SVAMP. Key results include:

  • Average pass rate: GTPO yielded 51.26% versus 49.78% for GRPO, representing a 3.0% relative improvement under identical conditions.
  • Task-specific gains: Largest improvements observed on AIME 2024 (+8.0% over both prompting and GRPO baselines), MATH 500, and SVAMP.
  • Ablation studies:
    • Removing turn-level rewards decreased average performance by 2–3 points.
    • Disabling discounting (γ=1.0) led to a 0.7 point drop.
    • Eliminating reward shaping reduced scores by ~0.8 points.
  • Scalability: Performance increases monotonically as the turn count T is raised from 1 to 3, indicating benefits from longer reasoning chains.
  • Best hyperparameters: Optimal discount factor γ=0.9. Embedding-based code similarity outperforms character-level or full-trajectory reward shaping (Ding et al., 18 Nov 2025).
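
As a quick arithmetic check of the headline comparison, the relative improvement follows directly from the two reported average pass rates:

$$\frac{51.26 - 49.78}{49.78} = \frac{1.48}{49.78} \approx 0.0297 \approx 3.0\%$$

That is, the 3.0% figure is a relative gain; the absolute gain is 1.48 percentage points.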

Experiments on general QA and multi-hop QA tasks further substantiate that turn-level reward design, as instantiated in GTPO, promotes faster convergence, higher accuracy, and dramatically greater format correctness relative to both GRPO and step-level PPO (Wei et al., 17 May 2025).

5. Training and Implementation Protocol

A typical GTPO training cycle proceeds with:

  1. Rollout Collection: Generate G on-policy rollouts per initial prompt, track per-turn rewards and tool outputs.
  2. Reward Shaping: Compute code similarity-based self-supervised rewards for failed trajectories at the terminal turn.
  3. Advantage Computation: Calculate discounted returns and group-normalized advantages per turn.
  4. Policy Update: Optimize the PPO-style clipped surrogate with respect to all turns and tokens.
  5. Iteration: Repeat until convergence or desired evaluation metric is achieved (Ding et al., 18 Nov 2025, Wei et al., 17 May 2025).
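
The cycle above can be sketched as a small driver loop. All five callables (`collect_rollouts`, `shape_rewards`, `compute_advantages`, `update_policy`, `evaluate`) are hypothetical hooks supplied by the caller, and the stopping rule based on a target metric is an assumption; the sketch only shows the order in which the steps run.

```python
from typing import Any, Callable, Sequence


def train_gtpo(prompts: Sequence[str],
               collect_rollouts: Callable[[str, int], Any],
               shape_rewards: Callable[[Any], Any],
               compute_advantages: Callable[[Any], Any],
               update_policy: Callable[[Any, Any], None],
               evaluate: Callable[[], float],
               target_metric: float,
               group_size: int = 8,
               max_iters: int = 100) -> int:
    """Run the training cycle; return the number of iterations performed."""
    for iteration in range(1, max_iters + 1):
        for prompt in prompts:
            rollouts = collect_rollouts(prompt, group_size)  # 1. G rollouts per prompt
            rewards = shape_rewards(rollouts)                # 2. add shaping rewards
            advantages = compute_advantages(rewards)         # 3. returns + group norm
            update_policy(rollouts, advantages)              # 4. clipped PPO update
        if evaluate() >= target_metric:                      # 5. stop when target is met
            return iteration
    return max_iters
```

Keeping each step behind its own hook makes it straightforward to swap the shaping or advantage computation during ablations without touching the loop itself.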

The method is agnostic to backbone architectures and compatible with common LLM transformer families (e.g., Qwen2.5-7B). Practical implementation uses high-throughput distributed RL infrastructure (e.g., vLLM, FSDP, gradient checkpointing) and structures agent–environment interaction traces such that turn demarcations are clearly defined for grouping (Wei et al., 17 May 2025, Ding et al., 18 Nov 2025).

6. Scope, Generality, and Limitations

GTPO provides a general framework for multi-turn RL with per-turn credit assignment, groupwise advantage normalization, and flexible reward shaping. It does not require prompt engineering or data augmentation, applies to agentic LLM settings that expose intermediate signals, and can be extended to diverse domains, including web-shop navigation, mathematical problem solving, and knowledge-augmented retrieval (Wei et al., 17 May 2025, Ding et al., 18 Nov 2025, Li et al., 18 Dec 2025).

This generality suggests potential applicability to environments with more complex or noisy feedback structures, though most published results focus on domains with accessible verifiable or rubric-based rewards.

Limitations include scalability to extremely long-horizon settings (hundreds of turns), where variance management or hierarchical extensions are required, and the need for careful design of turn boundaries and feedback signals to avoid ambiguity or reward hacking (Li et al., 18 Dec 2025).

7. Comparative Summary

The table below summarizes the methodological distinctions among leading multi-turn RL algorithms relevant to GTPO.

| Method | Credit Assignment | Advantage Estimation | Reward Granularity |
|---|---|---|---|
| GRPO | Trajectory-level | Final reward normalization | Binary |
| Token-PPO | Token-level | GAE, token | Sparse, noisy |
| Turn-PPO | Turn-level | GAE, turn | Turn |
| GTPO | Turn-level, grouped | Groupwise, normalized | Turn + shaped |

GTPO integrates the strengths of turn-level assignment, group-based normalization, and self-supervised shaping, achieving consistent gains in stability and final task performance over alternatives such as GRPO, token-level PPO, and Turn-PPO (Ding et al., 18 Nov 2025, Wei et al., 17 May 2025, Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
