
DS-GRPO: Diffusion Group RL Optimization

Updated 26 November 2025
  • DS-GRPO is a framework that unifies group-normalized, critic-free reinforcement learning with diffusion-based generative sampling for policy optimization.
  • It employs group-relative advantage estimation and clipping strategies to stabilize convergence across high-dimensional, ODE/SDE-driven tasks.
  • Empirical results in text-to-image and text-to-video tasks demonstrate significant gains, achieving improvements up to +181% over baseline approaches.

DS-GRPO (Diffusion Sampling with Group Relative Policy Optimization) refers to a family of algorithms, notably GRPO and TIC-GRPO, that adapt group-normalized, critic-free reinforcement learning to diffusion- and flow-based generative sampling for policy optimization. DS-GRPO was motivated by limitations of existing RL paradigms, especially Proximal Policy Optimization (PPO), when applied to the high-dimensional, ODE/SDE-based sampling processes characteristic of modern generative models, such as those used in visual content generation. The framework enables direct fine-tuning of generative policies using sparse human feedback or reward models, eschewing explicit value networks by employing group-wise reward normalization and advantage estimation. It delivers both theoretical convergence guarantees and substantial empirical gains across multiple tasks and architectures (Xue et al., 12 May 2025; Pang et al., 4 Aug 2025).

1. Foundations: GRPO and its Extension to Diffusion Sampling

Group Relative Policy Optimization (GRPO) was originally proposed for robust policy-gradient fine-tuning in LLMs. The key mechanism involves grouping multiple trajectories by prompt or context, computing within-group normalized advantages, and employing PPO-style clipped policy-ratio updates. In the DS-GRPO variant, this methodology is extended to diffusion and rectified-flow models by reinterpreting the generative denoising process as a Markov Decision Process (MDP): each step of the denoising trajectory becomes an RL timestep, with the generative network acting as the policy (Xue et al., 12 May 2025).

Diffusion samplers, typically described by deterministic ODEs, are recast as reverse SDEs:

$$dz_t = f_t(z_t)\,dt + g_t\,dW_t$$

where $f_t$ and $g_t$ derive from the noise schedule and $W_t$ is Brownian motion. The denoising model is interpreted as parameterizing $f_t$, while the diffusion term introduces the stochasticity necessary for RL exploration.
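
As a minimal sketch of how this SDE is stepped during sampling (assuming a hypothetical denoiser callable `model(z_t, t, cond)` that predicts the drift; the function name and signature are not from the papers), an Euler-Maruyama update might look like:

```python
import torch

def reverse_sde_step(model, z_t, t, dt, g_t, cond):
    """One Euler-Maruyama step of the reverse SDE
    dz_t = f_t(z_t) dt + g_t dW_t,
    with the denoising network parameterizing the drift f_t.
    `model`, `cond`, and this signature are illustrative assumptions."""
    f_t = model(z_t, t, cond)              # drift predicted by the denoiser
    noise = torch.randn_like(z_t)          # Brownian increment (scaled by sqrt(|dt|) below)
    return z_t + f_t * dt + g_t * (abs(dt) ** 0.5) * noise
```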

2. Reinforcement Learning Formulation and Algorithmic Components

MDP Setup

  • States: $s_t = (c, t, z_t)$, with $z_t$ the latent, $t$ the current timestep, and $c$ the conditioning prompt.
  • Actions: $a_t = z_{t-1}$, sampled from the policy $\pi_\theta$, approximating the conditional next latent under the model and prompt.
  • Transition: Deterministic, advancing to $z_{t-1}$ and decrementing $t$.
  • Rewards: Given only at the terminal state ($t = 0$), defined by external black-box models (e.g., HPS-v2.1, CLIP Score, VideoAlign).
  • Policy: The denoising network (e.g., U-Net or transformer) serves as the RL policy. No explicit value networks are trained; instead, a group-relative baseline is computed.
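
For concreteness, the state, action, and reward interfaces above can be captured in a small container (a hypothetical sketch; the class and field names are illustrative):

```python
from dataclasses import dataclass
import torch

@dataclass
class DenoisingState:
    """State s_t = (c, t, z_t) of the denoising MDP (illustrative sketch)."""
    cond: torch.Tensor    # conditioning prompt embedding c
    timestep: int         # current denoising timestep t
    latent: torch.Tensor  # current latent z_t

# The action a_t is the next latent z_{t-1} sampled from pi_theta; the reward
# is queried from an external model (e.g., HPS-v2.1 or CLIP) only when t = 0.
```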

Group-Normalized Advantages

For a batch of GG trajectories per prompt:

  • Mean reward $\mu = \frac{1}{G}\sum_{i=1}^G r_i$
  • Standard deviation $\sigma = \sqrt{\frac{1}{G}\sum_{i=1}^G (r_i - \mu)^2}$
  • Group-relative advantage $A^g_i = \frac{r_i - \mu}{\sigma}$
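
A minimal sketch of this computation for one prompt's group of $G$ terminal rewards (the small `eps` guarding against zero variance is an added assumption, not part of the formula above):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages A_i = (r_i - mu) / sigma for a (G,) reward tensor."""
    mu = rewards.mean()
    sigma = ((rewards - mu) ** 2).mean().sqrt()  # population std, matching the 1/G formula
    return (rewards - mu) / (sigma + eps)
```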

Surrogate Objective

The core update maximizes a clipped, group-normalized advantage:

$$L(\theta) = -\mathbb{E}_\tau \left[\frac{1}{G}\sum_{i=1}^G \frac{1}{T}\sum_{t=1}^T \min\left\{\rho_{t,i} A^g_i,\ \mathrm{clip}(\rho_{t,i}, 1-\epsilon, 1+\epsilon)\, A^g_i\right\}\right]$$

where $\rho_{t,i} = \pi_\theta(a_{t,i} \mid s_{t,i}) / \pi_{\theta_\text{old}}(a_{t,i} \mid s_{t,i})$.
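
A sketch of this objective in PyTorch, assuming per-step log-probabilities under the current and old policies with shape (G, T) and per-trajectory advantages with shape (G,) (tensor names and shapes are illustrative assumptions):

```python
import torch

def ds_grpo_loss(logp_new: torch.Tensor,    # (G, T): log pi_theta(a_{t,i} | s_{t,i})
                 logp_old: torch.Tensor,    # (G, T): log pi_theta_old(a_{t,i} | s_{t,i})
                 advantages: torch.Tensor,  # (G,): group-relative advantages A^g_i
                 eps_clip: float = 0.2) -> torch.Tensor:
    """Clipped, group-normalized surrogate loss (negated so that minimizing it
    maximizes the surrogate objective)."""
    ratio = torch.exp(logp_new - logp_old.detach())             # rho_{t,i}
    adv = advantages.detach().unsqueeze(1)                      # broadcast over timesteps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()                # average over G and T
```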

3. Algorithmic Variants: GRPO and TIC-GRPO

GRPO Behavior and Bias

Original GRPO estimates the policy gradient at $\theta_\text{old}$ (the previous policy's parameters), introducing a minor bias. However, when $\theta_\text{old}$ is updated every $K$ steps, this bias is shown to be negligible in practice. Empirical ablations where even token-level importance sampling is dropped confirm little impact on performance (Pang et al., 4 Aug 2025).

TIC-GRPO: Trajectory Importance Correction

TIC-GRPO addresses the bias by replacing token-level importance ratios with a trajectory-level ratio:

$$R(\tau^{(i)}; \theta, \theta_\text{old}) = \frac{P_\theta(\tau^{(i)})}{P_{\theta_\text{old}}(\tau^{(i)})} = \prod_{t=1}^T \frac{\pi_\theta(a_t^{(i)} \mid s_{t-1}^{(i)})}{\pi_{\theta_\text{old}}(a_t^{(i)} \mid s_{t-1}^{(i)})}$$

This yields an unbiased policy gradient estimator:

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}\left[R(\tau; \theta, \theta_\text{old})\, \nabla_\theta \log P_\theta(\tau)\, r(\tau)\right]$$
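
In code, the trajectory-level ratio is most stably accumulated in log space (a sketch under the same (G, T) log-probability convention as above):

```python
import torch

def trajectory_importance_ratio(logp_new: torch.Tensor,  # (G, T)
                                logp_old: torch.Tensor   # (G, T)
                                ) -> torch.Tensor:
    """R(tau; theta, theta_old) = prod_t pi_theta / pi_theta_old, returned as a (G,)
    tensor; the product is computed as a sum of log-ratios for numerical stability."""
    log_ratio = (logp_new - logp_old.detach()).sum(dim=1)
    return torch.exp(log_ratio)
```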

Unified Pseudocode Structure

Both DS-GRPO and TIC-GRPO operate with the following high-level steps:

  1. Initialize $\theta$ (parameters of the denoiser).
  2. For each outer iteration:
    • Set $\theta_\text{old} \leftarrow \theta$
    • For each prompt:
      • Sample $G$ trajectories using identical initial noise.
      • Compute terminal rewards and normalize groupwise.
      • For selected timesteps or entire trajectories, compute clipped (token- or trajectory-level) importance weights.
      • Accumulate gradients and update $\theta$.
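
A condensed sketch of this loop, reusing the helpers sketched earlier and assuming a hypothetical `sample_group(policy, prompt, G)` that returns trajectories and their old-policy log-probabilities sampled from shared initial noise (all names here are assumptions, not the authors' code):

```python
import copy
import torch

def ds_grpo_train(policy, optimizer, prompts, reward_model, sample_group,
                  num_iterations: int, G: int):
    """High-level DS-GRPO outer loop (illustrative sketch)."""
    for _ in range(num_iterations):
        policy_old = copy.deepcopy(policy)                         # freeze pi_theta_old
        for prompt in prompts:
            trajs, logp_old = sample_group(policy_old, prompt, G)  # shared z_T per group
            rewards = torch.tensor([reward_model(traj, prompt) for traj in trajs])
            adv = group_relative_advantages(rewards)               # group-wise normalization
            logp_new = policy.log_prob(trajs)                      # (G, T); assumed API
            loss = ds_grpo_loss(logp_new, logp_old, adv)           # or the TIC-GRPO variant
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```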

4. Empirical Results and Ablations

Extensive experiments on text-to-image, text-to-video, and image-to-video tasks exhibit strong empirical gains:

Task/Model | Main Metric(s) | Improvement Over Baseline
Text-to-Image (Stable Diffusion/FLUX/etc.) | HPS-v2.1, CLIP Score | +53% to +177% (HPS); +9% to +16% (CLIP)
Text-to-Video (HunyuanVideo) | Visual Quality, Motion Quality, Alignment | Visual Quality +56%, Motion Quality +181%
Image-to-Video (SkyReels-I2V) | Motion Quality | +91%

Ablations provide the following practical guidance:

  • Timestep Subsampling: Randomly using 60% of timesteps yields optimal stability and convergence.
  • Noise Level ($\epsilon_t$): $\epsilon_t = 0.3$ is optimal; lower values collapse the reward, higher values introduce artifacts.
  • Best-of-N Inference: Selecting top-$k$ and bottom-$k$ samples from larger pools accelerates convergence (~2×) at increased computational cost.
  • Sparse/Binary Reward: DS-GRPO robustly learns from discretized (0/1) feedback signals, e.g., obtained by thresholding HPS or CLIP scores (see the sketch below).
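
As a small illustration of the binary-reward setting, continuous reward-model scores can be thresholded before group normalization (the threshold value below is a placeholder assumption):

```python
import torch

def binarize_rewards(scores: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Map continuous scores (e.g., HPS or CLIP) to {0, 1} feedback by thresholding;
    group-relative advantages are then computed on these binary rewards."""
    return (scores >= threshold).float()
```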

5. Practical Insights and Stability Mechanisms

DS-GRPO's design integrates several stability and efficiency mechanisms:

  • Shared Initialization Noise: Using the same $z_T$ for all group samples reduces reward variance and discourages reward hacking, especially in video generation.
  • Best-of-N Update Focus: Concentrating updates on top and bottom performers sharpens both desired and undesired behavior without complex search or explicit value estimation (see the sketch after this list).
  • Sparse Feedback Handling: The group-normalized advantage formulation enables effective credit assignment even with highly sparse or binary feedback—a regime where value-based baselines often fail.
  • No Critic Requirement: The absence of value networks eliminates instability from inaccurate value estimation and reduces implementation complexity.
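
A minimal sketch of this selection step, assuming an oversampled pool of N rewards from which only the k best and k worst trajectories feed the update (names and signature are illustrative):

```python
import torch

def select_top_and_bottom(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of the top-k and bottom-k trajectories in a pool of N samples;
    only these are retained for the policy update (illustrative sketch)."""
    top = torch.topk(rewards, k).indices
    bottom = torch.topk(-rewards, k).indices
    return torch.cat([top, bottom])
```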

6. Theoretical Guarantees and Comparison to PPO

Both GRPO and TIC-GRPO satisfy nonconvex convergence rates matching standard on-policy RL algorithms. Specifically, with constant step size $\eta$, inner loop length $K$, and group size $|G|$:

$$\frac{1}{N}\sum_{n=1}^N \mathbb{E}\left\|\nabla J(\theta_{n,0})\right\|^2 = O(\eta K) + O(1/|G|)$$

This holds under mild smoothness and boundedness assumptions on the reward and log-policy. TIC-GRPO achieves an unbiased estimator for $\nabla J(\theta)$, with empirical results showing faster convergence (Pang et al., 4 Aug 2025).

Relative to PPO:

  • DS-GRPO methods are critic-free, relying exclusively on group-normalized terminal feedback.
  • Implementation is simplified; RAM and compute requirements decrease due to the absence of a value network.
  • Empirical sample efficiency and run-time are competitive with, or surpass, standard PPO approaches.

7. Application Scope and Significance

DS-GRPO serves as a unified reinforcement learning framework for generative policy optimization across both diffusion and flow-based models, with demonstrated efficacy on tasks including text-to-image, text-to-video, and image-to-video generation. It seamlessly adapts across diverse foundation models (e.g., Stable Diffusion, FLUX, HunyuanVideo) and reward models (image/video aesthetic, multimodal alignment, motion quality, and binary rules). Empirical performance extends to +181% improvement on motion quality in video generation tasks and robustness under both continuous and highly sparse reward signals (Xue et al., 12 May 2025). This suggests DS-GRPO is of practical and theoretical importance for advancing RLHF in high-dimensional generative modeling.
