
DS-GRPO: Diffusion Group RL Optimization

Updated 26 November 2025
  • DS-GRPO is a framework that unifies group-normalized, critic-free reinforcement learning with diffusion-based generative sampling for policy optimization.
  • It employs group-relative advantage estimation and clipping strategies to stabilize convergence across high-dimensional, ODE/SDE-driven tasks.
  • Empirical results in text-to-image and text-to-video tasks demonstrate significant gains, achieving improvements up to +181% over baseline approaches.

DS-GRPO (Diffusion Sampling with Group Relative Policy Optimization) refers to a family of algorithms, notably GRPO and TIC-GRPO, that adapt group-normalized, critic-free reinforcement learning to diffusion- and flow-based generative sampling for policy optimization. DS-GRPO was motivated by limitations of existing RL paradigms, especially Proximal Policy Optimization (PPO), when applied to the high-dimensional, ODE/SDE-based sampling processes characteristic of modern generative models, such as those used in visual content generation. The framework enables direct fine-tuning of generative policies using sparse human feedback or reward models, eschewing explicit value networks by employing group-wise reward normalization and advantage estimation. It delivers both theoretical convergence guarantees and substantial empirical gains across multiple tasks and architectures (Xue et al., 12 May 2025; Pang et al., 4 Aug 2025).

1. Foundations: GRPO and its Extension to Diffusion Sampling

Group Relative Policy Optimization (GRPO) was originally proposed for robust policy-gradient fine-tuning in LLMs. The key mechanism involves grouping multiple trajectories by prompt or context, computing within-group normalized advantages, and employing PPO-style clipped policy-ratio updates. In the DS-GRPO variant, this methodology is extended to diffusion and rectified-flow models by reinterpreting the generative denoising process as a Markov Decision Process (MDP): each step of the denoising trajectory becomes an RL timestep, with the generative network acting as the policy (Xue et al., 12 May 2025).

Diffusion samplers, typically described by deterministic ODEs, are recast as reverse SDEs:

$$dz_t = f_t(z_t)\,dt + g_t\,dW_t$$

where $f_t$ and $g_t$ derive from the noise schedule and $W_t$ is Brownian motion. The denoising model is interpreted as parameterizing $f_t$, while the diffusion term introduces the stochasticity necessary for RL exploration.
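
As a minimal sketch of how this SDE is stepped during sampling (assuming a hypothetical denoiser callable `model(z_t, t, cond)` that predicts the drift; the function name and signature are not from the papers), an Euler-Maruyama update might look like:

```python
import torch

def reverse_sde_step(model, z_t, t, dt, g_t, cond):
    """One Euler-Maruyama step of the reverse SDE
    dz_t = f_t(z_t) dt + g_t dW_t,
    with the denoising network parameterizing the drift f_t.
    `model`, `cond`, and this signature are illustrative assumptions."""
    f_t = model(z_t, t, cond)              # drift predicted by the denoiser
    noise = torch.randn_like(z_t)          # Brownian increment (scaled by sqrt(|dt|) below)
    return z_t + f_t * dt + g_t * (abs(dt) ** 0.5) * noise
```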

2. Reinforcement Learning Formulation and Algorithmic Components

MDP Setup

  • States: $s_t = (c, t, z_t)$, with $z_t$ the latent, $t$ the current timestep, and $c$ the conditioning prompt.
  • Actions: $a_t = z_{t-1}$, sampled from the policy $\pi_\theta$, approximating the conditional next latent under the model and prompt.
  • Transition: Deterministic, advancing to $z_{t-1}$ and decrementing $t$.
  • Rewards: Given only at the terminal state ($t = 0$), defined by external black-box models (e.g., HPS-v2.1, CLIP Score, VideoAlign).
  • Policy: The denoising network (e.g., U-Net or transformer) serves as the RL policy. No explicit value networks are trained; instead, a group-relative baseline is computed.
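
For concreteness, the state, action, and reward interfaces above can be captured in a small container (a hypothetical sketch; the class and field names are illustrative):

```python
from dataclasses import dataclass
import torch

@dataclass
class DenoisingState:
    """State s_t = (c, t, z_t) of the denoising MDP (illustrative sketch)."""
    cond: torch.Tensor    # conditioning prompt embedding c
    timestep: int         # current denoising timestep t
    latent: torch.Tensor  # current latent z_t

# The action a_t is the next latent z_{t-1} sampled from pi_theta; the reward
# is queried from an external model (e.g., HPS-v2.1 or CLIP) only when t = 0.
```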

Group-Normalized Advantages

For a batch of GG trajectories per prompt:

  • Mean reward $\mu = \frac{1}{G}\sum_{i=1}^G r_i$
  • Standard deviation $\sigma = \sqrt{\frac{1}{G}\sum_{i=1}^G (r_i - \mu)^2}$
  • Group-relative advantage $A^g_i = \frac{r_i - \mu}{\sigma}$
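
A minimal sketch of this computation for one prompt's group of $G$ terminal rewards (the small `eps` guarding against zero variance is an added assumption, not part of the formula above):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages A_i = (r_i - mu) / sigma for a (G,) reward tensor."""
    mu = rewards.mean()
    sigma = ((rewards - mu) ** 2).mean().sqrt()  # population std, matching the 1/G formula
    return (rewards - mu) / (sigma + eps)
```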

Surrogate Objective

The core update maximizes a clipped, group-normalized advantage:

$$L(\theta) = -\mathbb{E}_\tau \left[\frac{1}{G}\sum_{i=1}^G \frac{1}{T}\sum_{t=1}^T \min\left\{\rho_{t,i} A^g_i,\ \mathrm{clip}(\rho_{t,i}, 1-\epsilon, 1+\epsilon)\, A^g_i\right\}\right]$$

where $\rho_{t,i} = \pi_\theta(a_{t,i} \mid s_{t,i}) / \pi_{\theta_\text{old}}(a_{t,i} \mid s_{t,i})$.
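
A sketch of this objective in PyTorch, assuming per-step log-probabilities under the current and old policies with shape (G, T) and per-trajectory advantages with shape (G,) (tensor names and shapes are illustrative assumptions):

```python
import torch

def ds_grpo_loss(logp_new: torch.Tensor,    # (G, T): log pi_theta(a_{t,i} | s_{t,i})
                 logp_old: torch.Tensor,    # (G, T): log pi_theta_old(a_{t,i} | s_{t,i})
                 advantages: torch.Tensor,  # (G,): group-relative advantages A^g_i
                 eps_clip: float = 0.2) -> torch.Tensor:
    """Clipped, group-normalized surrogate loss (negated so that minimizing it
    maximizes the surrogate objective)."""
    ratio = torch.exp(logp_new - logp_old.detach())             # rho_{t,i}
    adv = advantages.detach().unsqueeze(1)                      # broadcast over timesteps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()                # average over G and T
```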

3. Algorithmic Variants: GRPO and TIC-GRPO

GRPO Behavior and Bias

Original GRPO estimates the policy gradient at $\theta_\text{old}$ (the previous policy's parameters), introducing a minor bias. However, when $\theta_\text{old}$ is updated every $K$ steps, this bias is shown to be negligible in practice. Empirical ablations where even token-level importance sampling is dropped confirm little impact on performance (Pang et al., 4 Aug 2025).

TIC-GRPO: Trajectory Importance Correction

TIC-GRPO addresses the bias by replacing token-level importance ratios with a trajectory-level ratio:

$$R(\tau^{(i)}; \theta, \theta_\text{old}) = \frac{P_\theta(\tau^{(i)})}{P_{\theta_\text{old}}(\tau^{(i)})} = \prod_{t=1}^T \frac{\pi_\theta(a_t^{(i)} \mid s_{t-1}^{(i)})}{\pi_{\theta_\text{old}}(a_t^{(i)} \mid s_{t-1}^{(i)})}$$

This yields an unbiased policy gradient estimator:

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\left[R(\tau; \theta, \theta_\text{old})\, \nabla_\theta \log P_\theta(\tau)\, r(\tau)\right]$$
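
In code, the trajectory-level ratio is most stably accumulated in log space (a sketch under the same (G, T) log-probability convention as above):

```python
import torch

def trajectory_importance_ratio(logp_new: torch.Tensor,  # (G, T)
                                logp_old: torch.Tensor   # (G, T)
                                ) -> torch.Tensor:
    """R(tau; theta, theta_old) = prod_t pi_theta / pi_theta_old, returned as a (G,)
    tensor; the product is computed as a sum of log-ratios for numerical stability."""
    log_ratio = (logp_new - logp_old.detach()).sum(dim=1)
    return torch.exp(log_ratio)
```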

Unified Pseudocode Structure

Both DS-GRPO and TIC-GRPO operate with the following high-level steps:

  1. Initialize $\theta$ (parameters of the denoiser).
  2. For each outer iteration:
    • Set $\theta_\text{old} \leftarrow \theta$
    • For each prompt:
      • Sample $G$ trajectories using identical initial noise.
      • Compute terminal rewards and normalize groupwise.
      • For selected timesteps or entire trajectories, compute clipped (token- or trajectory-level) importance weights.
      • Accumulate gradients and update $\theta$.
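
A condensed sketch of this loop, reusing the helpers sketched earlier and assuming a hypothetical `sample_group(policy, prompt, G)` that returns trajectories and their old-policy log-probabilities sampled from shared initial noise (all names here are assumptions, not the authors' code):

```python
import copy
import torch

def ds_grpo_train(policy, optimizer, prompts, reward_model, sample_group,
                  num_iterations: int, G: int):
    """High-level DS-GRPO outer loop (illustrative sketch)."""
    for _ in range(num_iterations):
        policy_old = copy.deepcopy(policy)                         # freeze pi_theta_old
        for prompt in prompts:
            trajs, logp_old = sample_group(policy_old, prompt, G)  # shared z_T per group
            rewards = torch.tensor([reward_model(traj, prompt) for traj in trajs])
            adv = group_relative_advantages(rewards)               # group-wise normalization
            logp_new = policy.log_prob(trajs)                      # (G, T); assumed API
            loss = ds_grpo_loss(logp_new, logp_old, adv)           # or the TIC-GRPO variant
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```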

4. Empirical Results and Ablations

Extensive experiments on text-to-image, text-to-video, and image-to-video tasks exhibit strong empirical gains:

Task/Model | Main Metric(s) | Improvement Over Baseline
Text-to-Image (Stable Diffusion/FLUX/etc.) | HPS-v2.1, CLIP Score | +53% to +177% (HPS); +9% to +16% (CLIP)
Text-to-Video (HunyuanVideo) | Visual Quality, Motion Quality, Alignment | Visual Quality +56%, Motion Quality +181%
Image-to-Video (SkyReels-I2V) | Motion Quality | +91%

Ablations provide the following practical guidance:

  • Timestep Subsampling: Randomly using 60% of timesteps yields optimal stability and convergence.
  • Noise Level ($\epsilon_t$): $\epsilon_t = 0.3$ is optimal; lower values collapse the reward, higher values introduce artifacts.
  • Best-of-N Inference: Selecting top-$k$ and bottom-$k$ samples from larger pools accelerates convergence (~2×) at increased computational cost.
  • Sparse/Binary Reward: DS-GRPO robustly learns from discretized (0/1) feedback signals, e.g., obtained by thresholding HPS or CLIP scores (see the sketch below).
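
As a small illustration of the binary-reward setting, continuous reward-model scores can be thresholded before group normalization (the threshold value below is a placeholder assumption):

```python
import torch

def binarize_rewards(scores: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Map continuous scores (e.g., HPS or CLIP) to {0, 1} feedback by thresholding;
    group-relative advantages are then computed on these binary rewards."""
    return (scores >= threshold).float()
```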

5. Practical Insights and Stability Mechanisms

DS-GRPO's design integrates several stability and efficiency mechanisms:

  • Shared Initialization Noise: Using the same $z_T$ for all group samples reduces reward variance and discourages reward hacking, especially in video generation.
  • Best-of-N Update Focus: Concentrating updates on top and bottom performers sharpens both desired and undesired behavior without complex search or explicit value estimation (see the sketch after this list).
  • Sparse Feedback Handling: The group-normalized advantage formulation enables effective credit assignment even with highly sparse or binary feedback—a regime where value-based baselines often fail.
  • No Critic Requirement: The absence of value networks eliminates instability from inaccurate value estimation and reduces implementation complexity.
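
A minimal sketch of this selection step, assuming an oversampled pool of N rewards from which only the k best and k worst trajectories feed the update (names and signature are illustrative):

```python
import torch

def select_top_and_bottom(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of the top-k and bottom-k trajectories in a pool of N samples;
    only these are retained for the policy update (illustrative sketch)."""
    top = torch.topk(rewards, k).indices
    bottom = torch.topk(-rewards, k).indices
    return torch.cat([top, bottom])
```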

6. Theoretical Guarantees and Comparison to PPO

Both GRPO and TIC-GRPO satisfy nonconvex convergence rates matching standard on-policy RL algorithms. Specifically, with constant step size $\eta$, inner loop length $K$, and group size $|G|$:

$$\frac{1}{N}\sum_{n=1}^N \mathbb{E}\left\|\nabla J(\theta_{n,0})\right\|^2 = O(\eta K) + O(1/|G|)$$

This holds under mild smoothness and boundedness assumptions on the reward and log-policy. TIC-GRPO achieves an unbiased estimator for $\nabla J(\theta)$, with empirical results showing faster convergence (Pang et al., 4 Aug 2025).

Relative to PPO:

  • DS-GRPO methods are critic-free, relying exclusively on group-normalized terminal feedback.
  • Implementation is simplified; RAM and compute requirements decrease due to the absence of a value network.
  • Empirical sample efficiency and run-time are competitive with, or surpass, standard PPO approaches.

7. Application Scope and Significance

DS-GRPO serves as a unified reinforcement learning framework for generative policy optimization across both diffusion and flow-based models, with demonstrated efficacy on tasks including text-to-image, text-to-video, and image-to-video generation. It seamlessly adapts across diverse foundation models (e.g., Stable Diffusion, FLUX, HunyuanVideo) and reward models (image/video aesthetic, multimodal alignment, motion quality, and binary rules). Empirical performance extends to +181% improvement on motion quality in video generation tasks and robustness under both continuous and highly sparse reward signals (Xue et al., 12 May 2025). This suggests DS-GRPO is of practical and theoretical importance for advancing RLHF in high-dimensional generative modeling.
