Trajectory-Wise Group Relative Policy Optimization
- The paper introduces TGRPO, a trajectory-level extension of GRPO that replaces value functions with group-normalized advantage signals for critic-free updates.
- It fuses step- and trajectory-level advantage estimates, reducing variance and boosting sample efficiency in long-horizon decision tasks.
- TGRPO has been applied to vision-language-action models, robotic control, and language-model reasoning, with reported gains over PPO and supervised fine-tuning baselines.
Trajectory-Wise Group Relative Policy Optimization (TGRPO) is a reinforcement learning framework that extends Group Relative Policy Optimization (GRPO) to operate at the trajectory level, thereby addressing challenges in long-horizon decision tasks, robotics, and LLM alignment. By leveraging group comparisons across entire trajectories and fusing step- and trajectory-level advantage estimates, TGRPO improves sample efficiency, reduces variance, and stabilizes policy updates. The approach has been applied to fine-tuning vision-language-action models, generalist robotic control, and LLM reasoning, among other applications.
1. Principles and Formulation of TGRPO
TGRPO is built on the fundamental concept of group relative policy optimization, which replaces the conventional critic (value function) in policy gradient methods with group-normalized advantage signals. In the trajectory-wise variant, the optimization grouping shifts from individual actions or tokens to entire trajectories, capturing both local and global outcome information.
The core policy loss adopts a PPO-like structure but aggregates group-wise advantage statistics over trajectories. For a group of trajectories $\{\tau_1, \dots, \tau_G\}$, each trajectory receives an advantage computed as its standardized deviation from the group mean:

$$\hat{A}^{\mathrm{traj}}_{i} = \frac{R(\tau_i) - \operatorname{mean}\!\big(\{R(\tau_j)\}_{j=1}^{G}\big)}{\operatorname{std}\!\big(\{R(\tau_j)\}_{j=1}^{G}\big)},$$

where $R(\tau_i)$ is the trajectory reward (e.g., cumulative or terminal) and the denominator is the group-wise reward standard deviation. This group normalization can be complemented by additional per-step advantage signals $\hat{A}^{\mathrm{step}}_{i,t}$, normalized analogously within the group.
The fused advantage used in TGRPO is then

$$\hat{A}_{i,t} = \alpha\,\hat{A}^{\mathrm{step}}_{i,t} + \beta\,\hat{A}^{\mathrm{traj}}_{i},$$

where $\alpha$ and $\beta$ are hyperparameters balancing the contribution of local and global outcome signals (Chen et al., 10 Jun 2025).
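As a concrete illustration, the following minimal NumPy sketch computes group-normalized trajectory- and step-level advantages and fuses them; the array shapes, the stabilizing `eps`, and the default weights `alpha`/`beta` are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def group_normalize(x, axis=0, eps=1e-8):
    """Standardize values against the group mean/std along `axis`."""
    return (x - x.mean(axis=axis, keepdims=True)) / (x.std(axis=axis, keepdims=True) + eps)

def fused_advantages(traj_rewards, step_rewards, alpha=0.5, beta=0.5):
    """Fuse step-level and trajectory-level group-normalized advantages.

    traj_rewards: shape (G,)   -- one scalar reward per trajectory
    step_rewards: shape (G, T) -- per-step rewards, padded to a common length T
    Returns fused advantages of shape (G, T).
    """
    adv_traj = group_normalize(traj_rewards)             # trajectory-level signal, shape (G,)
    adv_step = group_normalize(step_rewards, axis=0)     # step-level signal, normalized across the group
    return alpha * adv_step + beta * adv_traj[:, None]   # broadcast the trajectory signal over steps

# Example: a group of G = 4 trajectories of length T = 3
traj_R = np.array([1.0, 0.0, 0.5, 1.0])
step_R = np.random.rand(4, 3)
print(fused_advantages(traj_R, step_R).shape)  # (4, 3)
```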
An importance ratio—computed either per step/token or over the full trajectory—is typically used for off-policy correction:

$$\rho_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})}, \qquad \rho_{i}(\theta) = \frac{\pi_\theta(\tau_i)}{\pi_{\theta_{\mathrm{old}}}(\tau_i)} = \prod_{t=1}^{T_i} \rho_{i,t}(\theta).$$

This importance sampling enables unbiased, critic-free, stable updates, as in TIC-GRPO (Pang et al., 4 Aug 2025).
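A small sketch of this computation, assuming per-step log-probabilities of the sampled actions under the current and old policies are available as `(G, T)` arrays (the names are illustrative); the trajectory-level ratio is the product of per-step ratios, computed in log space for numerical stability.

```python
import numpy as np

def importance_ratios(logp_new, logp_old):
    """Per-step ratios rho_{i,t} and trajectory-level ratios rho_i from log-probs of shape (G, T)."""
    log_ratio = logp_new - logp_old
    step_ratios = np.exp(log_ratio)              # rho_{i,t}
    traj_ratios = np.exp(log_ratio.sum(axis=1))  # rho_i = prod_t rho_{i,t}
    return step_ratios, traj_ratios
```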
The PPO-inspired policy loss for a group of size $G$ is

$$\mathcal{L}_{\mathrm{TGRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \frac{1}{T_i}\sum_{t=1}^{T_i} \min\!\Big(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\; \operatorname{clip}\big(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t}\Big),$$

with optional KL regularization against a reference policy as a trust-region constraint.
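A hedged PyTorch sketch of this clipped surrogate with an optional KL penalty follows; the clipping range `eps`, the KL coefficient `kl_coef`, and the simple sampled-KL proxy are illustrative choices, not the papers' exact settings.

```python
import torch

def tgrpo_loss(ratios, advantages, logp_new=None, logp_ref=None, eps=0.2, kl_coef=0.0):
    """PPO-style clipped surrogate over fused group advantages (no critic involved).

    ratios:     (G, T) importance ratios rho_{i,t}, differentiable w.r.t. the current policy
    advantages: (G, T) fused advantages, treated as constants
    logp_new/logp_ref: optional (G, T) log-probs for a KL penalty toward a reference policy
    """
    advantages = advantages.detach()
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()

    if kl_coef > 0.0 and logp_new is not None and logp_ref is not None:
        # Sampled-KL proxy: E[log pi_theta - log pi_ref] over the taken actions.
        loss = loss + kl_coef * (logp_new - logp_ref).mean()
    return loss
```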
2. Variance Reduction and Sample Efficiency
One major benefit of TGRPO is optimal trajectory-wise variance reduction. The trajectory-wise control variate theory (Cheng et al., 2019) demonstrates that trajectory-based adjustment targets not only per-action variance but also the long-term covariance induced by environment dynamics. This global variance cancellation leads to faster and more robust convergence in long-horizon or deterministic domains. Integrating partial trajectory reuse via importance sampling and mixture likelihood ratios further lowers estimator variance, particularly in low-data or dynamic regimes (Zheng et al., 2022).
A representative trajectory-wise gradient estimator is

$$\hat{g}(\theta) = \frac{1}{|\mathcal{K}|}\sum_{\tau_i \in \mathcal{K}} w_i(\theta)\,\nabla_\theta \log \pi_\theta(\tau_i)\,R(\tau_i),$$

where $w_i(\theta)$ is the mixture likelihood ratio over the behavior policies that generated the reused trajectories and $\mathcal{K}$ selects trajectory groups with bounded gradient variance.
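The NumPy sketch below illustrates the mixture-likelihood-ratio idea in simplified form; the log-space mixture and the variance-threshold selection rule are illustrative assumptions, not the exact reuse criterion of Zheng et al. (2022).

```python
import numpy as np

def mixture_likelihood_ratios(logp_current, logp_behaviors):
    """Weight reused trajectories by pi_theta(tau) / mean_b pi_b(tau).

    logp_current:   (N,)   trajectory log-likelihoods under the current policy
    logp_behaviors: (B, N) trajectory log-likelihoods under each of B behavior policies
    """
    log_mixture = np.log(np.mean(np.exp(logp_behaviors), axis=0))  # log of the mixture likelihood
    return np.exp(logp_current - log_mixture)

def select_reusable_groups(group_ratios, max_ratio_var=1.0):
    """Keep only reused trajectory groups whose ratio variance stays bounded (illustrative rule)."""
    return [ratios for ratios in group_ratios if np.var(ratios) <= max_ratio_var]
```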
3. Practical Methodologies and Policy Update
TGRPO-based RL training involves the following steps (a minimal training-loop sketch follows the list):
- Sampling multiple trajectories per environment or prompt, grouping them per task or by user-defined structure.
- Computing both trajectory-level and, optionally, per-step advantages, normalized within each group.
- Evaluating policy updates via PPO-style clipped surrogate losses, optionally using trajectory-wise importance ratios for unbiased policy gradient estimation (Pang et al., 4 Aug 2025).
- Optionally constraining policy updates with KL regularization for trust region stability.
- Updating model parameters directly using fused trajectory-level signals—no value/critic network is required.
- In some implementations, especially for robotic control or generalist policies, partial trajectory reuse and experience replay are adopted for further sample efficiency (Zheng et al., 2022), and group sizes/trajectory lengths are conservatively set for computational robustness (Zhang et al., 18 Sep 2025).
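Putting these steps together, here is a minimal single-update sketch in PyTorch; `sample_trajectories` and `policy_logprobs` are hypothetical placeholders for an environment/model interface, and the fusion weights and clipping range are illustrative.

```python
import torch

def tgrpo_update(policy, old_policy, env, optimizer,
                 group_size=8, alpha=0.5, beta=0.5, eps=0.2):
    """One critic-free TGRPO update: sample a group, fuse advantages, take a clipped policy step."""
    # 1. Sample a group of trajectories from the behavior (old) policy.
    group = sample_trajectories(env, old_policy, n=group_size)       # hypothetical helper

    # 2. Group-normalized trajectory- and step-level advantages, fused with weights alpha/beta.
    traj_R = torch.tensor([t.total_reward for t in group])
    adv_traj = (traj_R - traj_R.mean()) / (traj_R.std() + 1e-8)
    step_R = torch.stack([t.step_rewards for t in group])            # (G, T), padded to equal length
    adv_step = (step_R - step_R.mean(0)) / (step_R.std(0) + 1e-8)
    advantages = alpha * adv_step + beta * adv_traj[:, None]

    # 3. Importance ratios and PPO-style clipped surrogate -- no value network anywhere.
    logp_new = policy_logprobs(policy, group)                        # hypothetical helper, (G, T)
    logp_old = policy_logprobs(old_policy, group).detach()
    ratios = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratios, 1 - eps, 1 + eps)
    loss = -torch.min(ratios * advantages, clipped * advantages).mean()

    # 4. Gradient step on the policy parameters only.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```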
4. Key Applications and Empirical Findings
TGRPO has demonstrated utility across diverse domains:
- Vision-Language-Action (VLA) Models: Online RL fine-tuning using TGRPO yields higher success rates in LIBERO manipulation tasks (91.0% average success vs 86.6% for PPO/SFT), with stable advantage estimation and more robust policy generalization (Chen et al., 10 Jun 2025).
- Robotics and Flow-Matching Policies: In simulated unicycle and humanoid locomotion, trajectory-wise GRPO eliminates the need for a separate value model and leverages learned reward surrogates for adaptive minimum-time control and neural policy planning, achieving 50%–85% cost reductions relative to naive imitation baselines (Pfrommer et al., 20 Jul 2025, Nguyen et al., 19 May 2025).
- LLM Reasoning and Alignment: TGRPO and trajectory-wise extensions of GRPO (including TIC-GRPO) provide stable, critic-free policy gradient updates and faster convergence for LLM alignment using verifiable or binary rewards (Pang et al., 4 Aug 2025, Mroueh, 9 Mar 2025). GTPO and GTPO-S further improve token/sequence-level credit assignment in reasoning tasks using entropy-aware signals (Tan et al., 6 Aug 2025).
- Wireless Systems Optimization: TGRPO underlies efficient optimization of fluid antenna systems, outperforming PPO in sum-rate with only half the computational cost, and showing that moderate group sizes and trajectory lengths suffice for robust learning (Zhang et al., 18 Sep 2025).
- Hyperparameter Optimization: GRPOformer's integration of trajectory construction and group-wise policy updates enables rapid, stable hyperparameter search using transformers, with Policy Churn Regularization boosting stability (Guo et al., 21 Sep 2025).
5. Theoretical Properties and Convergence Analysis
TGRPO generalizes the convergence dynamics of group relative updates to the trajectory level. Theoretical analysis confirms that the iterative policy update amplifies the probability of success for verifiable (binary) rewards at the trajectory level, converging to a fixed point whose success probability strictly exceeds that of the reference policy,

$$p_{\mathrm{succ}}(\pi^{*}) > p_{\mathrm{succ}}(\pi_{\mathrm{ref}}),$$

with explicit closed-form policy updates and proven amplification (Mroueh, 9 Mar 2025). For importance-corrected TGRPO, trajectory-level importance ratios yield unbiased gradient estimates and sample-efficient convergence guarantees (Pang et al., 4 Aug 2025).
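As a brief sketch of why trajectory-level importance ratios give unbiased gradient estimates (assuming the behavior policy covers the support of the current policy), the standard importance-sampling identity gives

$$\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right] = \mathbb{E}_{\tau \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\mathrm{old}}}(\tau)}\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right],$$

so the Monte Carlo average of the right-hand side over trajectories sampled from $\pi_{\theta_{\mathrm{old}}}$ is an unbiased estimate of the on-policy gradient.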
6. Impact, Limitations, and Future Development
TGRPO represents a general trajectory-wise framework for reinforcement learning in sequential tasks with complex temporal dependencies. Its impact is notable in:
- Enabling critic-free, sample-efficient RL optimization.
- Reducing variance and increasing stability, particularly in environments with sparse or delayed rewards.
- Simplifying the complexity of policy network architectures and update rules.
- Allowing direct groupwise preference alignment and relative comparison at the trajectory level, with successful empirical performance in both simulated and real-world applications.
Key limitations and future directions include:
- Sensitivity to, and manual calibration of, the step- vs. trajectory-level advantage fusion weights ($\alpha$, $\beta$).
- Potential improvements via adaptive or learned advantage weighting.
- Extension of grouping strategies beyond episode-level granularity, e.g., by sub-tasks or critical trajectory events.
- Further empirical validation in highly dynamic or open-ended settings, including online robotics and LLM reasoning (Chen et al., 10 Jun 2025, Khanda et al., 25 Jul 2025, Nguyen et al., 19 May 2025).
- Development of robust partial trajectory selection and replay mechanisms (Zheng et al., 2022).
7. Comparison Table: Key Variants of Trajectory-Wise GRPO
| Method | Advantage Signal(s) | Importance Ratio | Critic Network | KL Constraint | Application Area |
|---|---|---|---|---|---|
| TGRPO (Chen et al., 10 Jun 2025) | Step-level + trajectory-level | Token- or trajectory-level | Not used | Optional | VLA/RL, robotics, LLM alignment |
| TIC-GRPO (Pang et al., 4 Aug 2025) | Trajectory-level | Trajectory-level | Not used | Optional | LLM alignment |
| GTPO (Simoni et al., 5 Aug 2025) | Conflict-corrected trajectory/token | Token-level with masks | Not used | Not used | LLM reasoning, formatting stability |
| VRER (Zheng et al., 2022) | Group/partial trajectory-wise | Mixture likelihood ratio | Used | Optional | Actor-critic, adaptive process control |
A plausible implication is that trajectory-level optimization frameworks, including TGRPO and its variants, constitute robust, theoretically grounded, and computationally efficient alternatives to traditional critic-based RL algorithms, with significant impact in generalist robotics, LLM alignment, and hierarchical sequential decision settings.