Step Reward Group Policy Optimization (SRGPO)

Updated 29 March 2026

SRGPO is a reinforcement learning technique that extends group-based policy methods with dense, per-step reward shaping for finer credit assignment.
It leverages entropy-weighted advantages and group normalization to reduce gradient variance and accelerate learning in structured tasks.
Empirical results demonstrate enhanced performance in mathematical reasoning, vision-language navigation, and multimodal scenarios with improved stability.

Step Reward Group Policy Optimization (SRGPO) is a family of reinforcement learning (RL) algorithms that extends Group Relative Policy Optimization (GRPO) to address coarse-grained credit assignment in long-horizon, structured reasoning tasks, most notably for large language models (LLMs) and multimodal LLMs (MLLMs). SRGPO methods use dense, step-level reward shaping and group normalization to deliver fine-grained policy updates, improving both credit assignment and learning efficiency across domains such as mathematical reasoning and vision-language navigation (Tan et al., 6 Aug 2025, Wang et al., 2 Dec 2025, Zhang et al., 17 Mar 2025).

1. Definition and Core Principles

SRGPO generalizes GRPO by introducing dense, per-step reward signals in place of (or in addition to) sparse, sequence-level terminal feedback. The central innovation is the design of step-wise rewards that can be reliably aggregated and compared across sampled trajectories or across reasoning steps, enabling more precise credit assignment and lower-variance gradient estimates.

SRGPO frameworks inherit the group-based policy gradient structure of GRPO, but replace the uniform per-token or per-step advantage computation with step-modulated or entropy-weighted advantages. This enables the RL signal to focus on "critical" or "uncertain" steps, as opposed to distributing credit uniformly across all tokens or actions (Tan et al., 6 Aug 2025).

2. Mathematical Formalization

Let $\pi_\theta(a_t|s_t)$ denote the model’s stochastic policy at step $t$, parameterized by $\theta$. Given a group of $G$ trajectories $\{\tau_i\}$, each consisting of $(s_{i,1}, a_{i,1}, \dots, s_{i,T}, a_{i,T})$ and an original scalar return $r^{i}_{\mathrm{seq}}$ (typically based on task success), SRGPO replaces the scalar with per-step shaped rewards $r_{i,t}$.

A key mechanism in SRGPO is entropy-based reward shaping. For each correct trajectory ($r^{i}_{\mathrm{seq}}=1$), a step-level reward is defined as:

$r_{i,t} = \begin{cases} 1 + \alpha \left(\frac{H_{i,t}}{S_t}\right)\left(\frac{1}{d_t}\right) & \text{if } r^{i}_{\mathrm{seq}} = 1 \\ 0 & \text{otherwise} \end{cases}$

where $H_{i,t}$ is the normalized Shannon entropy of the model's conditional distribution at step $t$, $S_t$ is an entropy normalizer across successes, $d_t$ is the number of trajectories alive at $t$, and $\alpha$ is an entropy bonus hyperparameter (Tan et al., 6 Aug 2025).
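As a minimal sketch of the piecewise reward above, the following pure-Python function computes shaped rewards for a small group of variable-length trajectories. The text does not define the normalizer precisely, so the sketch assumes that $S_t$ is the sum of entropies over successful trajectories still alive at step $t$, and that "normalized" entropy means division by the log of the number of outcomes. The function names are illustrative and not taken from the cited papers.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy divided by log(len(probs)), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def shaped_step_rewards(entropies, seq_rewards, alpha=1.0):
    """entropies[i][t] holds H_{i,t}; trajectories may have different lengths.
    seq_rewards[i] is 1 for a correct trajectory and 0 otherwise.
    Returns r[i][t] following the piecewise rule in the text."""
    horizon = max(len(h) for h in entropies)
    rewards = [[0.0] * len(h) for h in entropies]
    for t in range(horizon):
        alive = [i for i, h in enumerate(entropies) if t < len(h)]
        d_t = len(alive)  # number of trajectories still alive at step t
        # ASSUMPTION: S_t sums the entropies of successful trajectories alive at t.
        s_t = sum(entropies[i][t] for i in alive if seq_rewards[i] == 1)
        for i in alive:
            if seq_rewards[i] != 1:
                continue  # failed trajectories keep a reward of 0
            bonus = alpha * (entropies[i][t] / s_t) / d_t if s_t > 0 else 0.0
            rewards[i][t] = 1.0 + bonus
    return rewards
```

Under these assumptions, a successful trajectory that is the only one alive at some step receives the full bonus $\alpha$ there, while steps shared by many trajectories split a smaller bonus in proportion to their entropy.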

The step-wise reward is then re-weighted by a dynamic entropy weight, which concentrates the policy gradient magnitude on high-uncertainty steps.

Finally, SRGPO applies a REINFORCE/PPO-style policy gradient loss over the batch of all $(s_{i,t}, a_{i,t})$ pairs, with clipped importance weighting and normalization for stability.
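The clipped importance weighting mentioned above can be illustrated with the standard PPO clipped surrogate, applied to a flat batch of step records. This is a generic sketch rather than the exact SRGPO objective: the default `clip_eps=0.2` and the plain mean over steps are assumptions, and the advantages are taken as already computed and standardized.

```python
import math

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO-clip loss over a flat batch of step records (to be minimized).
    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the sampling policy; advantages: per-step advantage values."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance weight
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the clipped and unclipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

With a ratio of 2 and a positive advantage, the objective is capped at `1 + clip_eps` times the advantage, so a single step cannot push the policy arbitrarily far from the sampling policy.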

In vision-language navigation (VLN), SRGPO generalizes this reward assignment further through verifiable process rewards (e.g., distance to goal, target-object visibility), and computes step-level standardized advantages by forming random groups of step records within a training batch (Wang et al., 2 Dec 2025).
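To make "verifiable process reward" concrete, here is a purely hypothetical example that combines the two signals named above. The functional form (progress toward the goal plus a fixed visibility bonus) and the parameter `visibility_bonus` are assumptions for illustration only; the source does not specify how these signals are combined.

```python
def process_reward(prev_dist, curr_dist, target_visible, visibility_bonus=0.5):
    """Hypothetical verifiable step reward for navigation.
    prev_dist / curr_dist: distance to the goal before and after the action.
    target_visible: whether the target object is in view after the action."""
    progress = prev_dist - curr_dist  # positive when the agent moved closer
    return progress + (visibility_bonus if target_visible else 0.0)
```

Both inputs can be checked directly against the environment state, which is what makes a reward of this kind verifiable without a learned reward model.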

3. Algorithmic Workflow and Pseudocode

The SRGPO algorithm proceeds via the following high-level steps:

Trajectory Sampling: For each prompt or task, a group of $G$ trajectories is sampled using the current policy.
Step-level Reward Computation: Compute per-step entropy or problem-specific dense process rewards.
Reward Normalization: Calculate group-wise entropy normalizers and standardize advantages across all tokens or steps.
Gradient Weighting: Compute dynamic entropy weights or combine episode- and step-level advantages (with a fixed balance parameter).
Policy Update: Aggregate policy gradients using clipped importance weights and update parameters via backpropagation.
Hyperparameter Control: Tune the entropy bonus $\alpha$, group sizes, and, if necessary, KL penalties or length penalties to ensure stable learning.
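The gradient-weighting step that combines episode- and step-level advantages can be sketched as follows. The additive form and the name `omega` for the balance parameter are assumptions made for illustration; the step-level advantages are taken as already standardized.

```python
import statistics

def group_normalize(values, eps=1e-8):
    """Standardize a list of scalars to zero mean and (roughly) unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def combine_advantages(episode_returns, step_advantages, omega=0.5):
    """episode_returns[i]: scalar return of trajectory i within one group.
    step_advantages[i][t]: step-level advantage, assumed already standardized.
    omega: hypothetical fixed balance weight on the step-level term."""
    episode_adv = group_normalize(episode_returns)
    return [
        [episode_adv[i] + omega * a for a in steps]
        for i, steps in enumerate(step_advantages)
    ]
```

Setting `omega=0` recovers a purely episode-level, GRPO-style advantage, which makes the parameter a convenient knob for ablations.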

Typical pseudocode for SRGPO includes:

Sampling groups of trajectories,
Computing token/step-wise entropies and rewards,
Batch-wise standardization of advantages,
PPO-style surrogate loss with clipping,
Gradient step on policy parameters (Tan et al., 6 Aug 2025, Wang et al., 2 Dec 2025).

Random grouping of step-level records (instead of requiring matched states or trajectories) enables scalable, low-variance estimates in stochastic environments (Wang et al., 2 Dec 2025).
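A minimal sketch of random grouping, assuming each step record has already been reduced to a scalar process reward: the records in a batch are shuffled, partitioned into groups of a fixed size, and standardized within each group. The function name, the seeding, and the handling of a smaller final group are illustrative choices, not details from the cited work.

```python
import random
import statistics

def random_group_advantages(step_rewards, group_size=4, seed=0, eps=1e-8):
    """Standardize each step reward against a randomly formed group of records.
    No matching of states or trajectories is required."""
    rng = random.Random(seed)
    order = list(range(len(step_rewards)))
    rng.shuffle(order)
    advantages = [0.0] * len(step_rewards)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]  # last group may be smaller
        values = [step_rewards[j] for j in group]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        for j in group:
            advantages[j] = (step_rewards[j] - mean) / (std + eps)
    return advantages
```

Because the groups are drawn from the whole batch, every record gets a comparison set even when no two trajectories visit the same state.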

4. Theoretical and Empirical Properties

SRGPO preserves the update direction of the underlying policy gradient objective, with the expectation of the SRGPO gradient aligning with the standard GRPO estimator but with improved variance properties due to finer credit assignment and global token-level normalization (Tan et al., 6 Aug 2025). Empirically, SRGPO demonstrates:

Lower gradient variance and accelerated convergence, since the learning signal is concentrated on informationally rich steps.
Substantial improvements on mathematical reasoning and long-horizon tasks, with mean reward increases of 5–8 percentage points over DAPO baselines, and higher success rates and generalization capabilities in navigation tasks (Tan et al., 6 Aug 2025, Wang et al., 2 Dec 2025).
Robustness of performance to hyperparameter choices (notably the entropy bonus $\alpha$), but potential instability if the bonus dominates the base reward.
Quantitative gains in both in-distribution and out-of-distribution settings, e.g., in VLN, SRGPO outperforms GRPO and GiGPO by 10–15 percentage points in zero-shot generalization (Wang et al., 2 Dec 2025).
In multimodal reasoning, dense SRGPO-style methods that use step-wise, rule-based reward shaping (e.g., StepGRPO) outperform outcome-level RL and supervised fine-tuning (Zhang et al., 17 Mar 2025).

A summary of experimental highlights:

Domain	Baseline	SRGPO Variant	Notable Result
Math Reasoning	DAPO	GTPO/SRGPO	+5–8 pts mean reward
VLN (Navigation)	GRPO, GiGPO	SRGPO + verifiable steps	+10–15 pp o.o.d. generalization
Multimodal Bench	SFT, GRPO	StepGRPO	+3–5 pts accuracy on hard tasks

5. Variants, Related Approaches, and Benchmark Results

Multiple instantiations of SRGPO have emerged across modalities:

GTPO (Group Token Policy Optimization) and GRPO-S: Token- and sequence-level entropy-weighted variants, emphasizing different axes of reward granularity (Tan et al., 6 Aug 2025).
StepGRPO: Step-level, rule-based reward assignment for multimodal language reasoning, combining pattern matching and logic evaluation (Zhang et al., 17 Mar 2025).
Navigation SRGPO: Incorporates verifiable, step-specific metrics such as position-to-goal, enabling credit assignment in settings with process verifiability rather than sparse success/failure (Wang et al., 2 Dec 2025).

In the comparisons reported in the cited papers, SRGPO variants outperform classical PPO, DAPO, and group-based RL baselines on metrics such as pass@1, navigation success rate, sample efficiency, and credit-assignment sharpness. SRGPO’s dense credit and low-variance gradients accelerate RL convergence, especially for long chains of reasoning or navigation steps.

Ablation studies consistently show that removing step-level bonuses or reducing them to uniform (DAPO-style) assignment degrades performance; overly large bonuses can destabilize training due to excessive variance contributions (Tan et al., 6 Aug 2025).

6. Implementation Guidance and Practical Considerations

Key practical recommendations include:

Hyperparameter Selection: Set the entropy bonus $\alpha$ in the range [0.5, 1.5] for stability; use a group size $G$ of 16–32 for LLMs and 4 for vision-language navigation; keep standard PPO clipping thresholds; and use policy learning rates mirroring RLHF/PPO defaults.
Reward Shaping: Detach entropy computations from the gradient graph to ensure unbiased gradient flow (Tan et al., 6 Aug 2025).
Monitoring and Tuning: Track the ratio of the entropy bonus to the base reward, and apply moderate length penalties if output lengths become excessive.
Generalization: Use random grouping for step-level advantages in tasks where state dependence is weak or process rewards are verifiable across contexts (Wang et al., 2 Dec 2025).
Hardware/Scaling: SRGPO is designed for high-throughput, group-based batch processing and is directly compatible with existing PPO/GRPO infrastructure.
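The recommendations above can be turned into lightweight checks. The following sketch encodes the suggested ranges and the bonus-to-base monitoring ratio; the helper names, the `task` labels, and the convention that a base reward of 1 marks a rewarded step are assumptions for illustration.

```python
def check_hyperparameters(alpha, group_size, task="llm"):
    """Return a list of warnings for settings outside the ranges suggested in
    the text. task is a hypothetical label: "llm" or "vln"."""
    warnings = []
    if not 0.5 <= alpha <= 1.5:
        warnings.append("alpha outside the suggested [0.5, 1.5] range")
    if task == "llm" and not 16 <= group_size <= 32:
        warnings.append("group size outside the suggested 16-32 range for LLMs")
    if task == "vln" and group_size != 4:
        warnings.append("group size differs from the suggested value of 4 for VLN")
    return warnings

def bonus_to_base_ratio(step_rewards, base_reward=1.0):
    """Mean entropy bonus divided by the base reward, over rewarded steps only.
    Steps with zero reward (failed trajectories) are ignored."""
    bonuses = [r - base_reward for r in step_rewards if r > 0.0]
    if not bonuses:
        return 0.0
    return (sum(bonuses) / len(bonuses)) / base_reward
```

Logging this ratio during training gives an early signal that the bonus is starting to dominate the base reward, the failure mode noted in the ablations.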

7. Relationship to Adjacent Methods and Future Directions

SRGPO occupies a central role in recent advances in LLM and MLLM reinforcement learning. Its development stems partly from the need to overcome the limitations of coarse terminal rewards and the inefficiency of step-grouping with matched states (as required by GiGPO). Variants such as WS-GRPO (Mundada et al., 19 Feb 2026) further generalize this by integrating learned prefix-level rewards for rollout efficiency, though these introduce additional preference models and architectural complexity.

A plausible implication is that future evolution of SRGPO-style techniques will incorporate adaptive or model-based reward shaping, richer process verifiability, and broader application to decision making in non-textual domains. Current evidence suggests that dense, entropy- and process-aware reward assignment remains the dominant methodology for efficient long-horizon RL training in high-capacity generative models.

References

Tan et al. (2025). GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy.

Wang et al. (2025). SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization.

Zhang et al. (2025). R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization.

Mundada et al. (2026). WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning.
