DS-GRPO: Diffusion Group RL Optimization
- DS-GRPO is a framework that unifies group-normalized, critic-free reinforcement learning with diffusion-based generative sampling for policy optimization.
- It employs innovative group-relative advantage estimation and clipping strategies to stabilize convergence across high-dimensional, ODE/SDE-driven tasks.
- Empirical results in text-to-image and text-to-video tasks demonstrate significant gains, achieving improvements up to +181% over baseline approaches.
DS-GRPO (Diffusion Sampling with Group Relative Policy Optimization) refers to a set of algorithms, notably GRPO and TIC-GRPO, that unify group-normalized, critic-free reinforcement learning for diffusion or flow-based generative sampling and policy optimization. DS-GRPO was motivated by limitations of existing RL paradigms—especially Proximal Policy Optimization (PPO)—when applied to the high-dimensional, ODE/SDE-based sampling processes characteristic of modern generative models, such as those used in visual content generation. The framework enables direct fine-tuning of generative policies using sparse human feedback or reward models, eschewing explicit value networks by employing group-wise reward normalization and advantage estimation. It delivers both theoretical convergence guarantees and substantial empirical gains across multiple tasks and architectures (Xue et al., 12 May 2025, Pang et al., 4 Aug 2025).
1. Foundations: GRPO and its Extension to Diffusion Sampling
Group Relative Policy Optimization (GRPO) was originally proposed for robust policy-gradient fine-tuning in LLMs. The key mechanism involves grouping multiple trajectories by prompt or context, computing within-group normalized advantages, and employing PPO-style clipped policy-ratio updates. In the DS-GRPO variant, this methodology is extended to diffusion and rectified-flow models by reinterpreting the generative denoising process as a Markov Decision Process (MDP): each step of the denoising trajectory becomes an RL timestep, with the generative network acting as the policy (Xue et al., 12 May 2025).
Diffusion samplers, typically described by deterministic probability-flow ODEs, are recast as reverse-time SDEs:

$$\mathrm{d}x_t = \left[ f(t)\,x_t - g(t)^2\, \nabla_{x_t} \log p_t(x_t \mid c) \right] \mathrm{d}t + g(t)\, \mathrm{d}\bar{w}_t,$$

where $f(t)$ and $g(t)$ derive from the noise schedule, and $\bar{w}_t$ is reverse-time Brownian motion. The denoising model is interpreted as parameterizing the score $\nabla_{x_t} \log p_t(x_t \mid c)$, introducing the stochasticity necessary for RL exploration.
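As a concrete illustration, the sketch below performs one Euler-Maruyama step of such a reverse SDE. It is only a schematic under stated assumptions: `score_model(x, t, c)` is a hypothetical interface returning the learned score, and `f`, `g` stand in for whatever coefficients the underlying noise schedule prescribes.

```python
import torch

def reverse_sde_step(x_t, t, dt, c, score_model, f, g):
    """One Euler-Maruyama step of a reverse-time SDE (illustrative sketch).

    x_t: current latent; t: current time; dt: (negative) step size;
    c: conditioning, e.g. a text-prompt embedding.
    score_model(x, t, c) -> estimated score grad_x log p_t(x | c)  [assumed interface]
    f(t), g(t): drift/diffusion coefficients derived from the noise schedule.
    """
    score = score_model(x_t, t, c)
    drift = f(t) * x_t - g(t) ** 2 * score     # deterministic part of the reverse SDE
    noise = torch.randn_like(x_t)              # Brownian increment direction
    # The injected noise is what gives the RL policy a non-degenerate
    # action distribution to explore; a pure ODE sampler would omit it.
    return x_t + drift * dt + g(t) * (abs(float(dt)) ** 0.5) * noise
```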
2. Reinforcement Learning Formulation and Algorithmic Components
MDP Setup
- States: $s_t = (x_t, t, c)$, with $x_t$ the latent, $t$ the current timestep, and $c$ the conditioning prompt.
- Actions: $a_t = x_{t-1}$, sampled from the policy $\pi_\theta(x_{t-1} \mid x_t, t, c)$, approximating the conditional next latent under model parameters $\theta$ and prompt $c$.
- Transition: Deterministic, advancing via the denoising step $x_t \to x_{t-1}$ and decrementing $t$.
- Rewards: Only at the terminal state ($t = 0$), defined by external black-box models (e.g., HPS-v2.1, CLIP Score, VideoAlign); the sketch after this list shows how a rollout maps to such terminal-reward transitions.
- Policy: The denoising network (e.g., U-Net or transformer) serves as the RL policy. No explicit value networks are trained; instead, a group-relative baseline is computed.
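A minimal sketch of that mapping, assuming a hypothetical `reward_model` and a pre-recorded list of latents; nothing here is prescribed by the papers beyond the terminal-only reward structure:

```python
def trajectory_to_transitions(latents, timesteps, prompt, reward_model):
    """Map a recorded denoising trajectory to MDP transitions (illustrative).

    latents:   [x_T, x_{T-1}, ..., x_0] recorded during sampling
    timesteps: matching times [T, T-1, ..., 0]
    The reward is sparse: zero everywhere except at the terminal sample x_0.
    """
    terminal_reward = reward_model(latents[-1], prompt)        # black-box scorer
    transitions = []
    for i in range(len(latents) - 1):
        state = (latents[i], timesteps[i], prompt)             # s_t = (x_t, t, c)
        action = latents[i + 1]                                # a_t = x_{t-1}
        reward = terminal_reward if i == len(latents) - 2 else 0.0
        transitions.append((state, action, reward))
    return transitions
```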
Group-Normalized Advantages
For a batch of $G$ trajectories per prompt, with terminal rewards $r_1, \dots, r_G$:
- Mean reward: $\bar{r} = \frac{1}{G} \sum_{i=1}^{G} r_i$
- Stddev: $\sigma_r = \sqrt{\frac{1}{G} \sum_{i=1}^{G} (r_i - \bar{r})^2}$
- Group-relative advantage: $A_i = \dfrac{r_i - \bar{r}}{\sigma_r}$ (see the code sketch below)
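In code, this normalization is only a few lines. The torch sketch below adds a small epsilon to guard against zero variance; that epsilon is an implementation detail assumed here, not taken from the papers.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize terminal rewards within each prompt group.

    rewards: shape (num_prompts, G) -- G sampled trajectories per prompt.
    Returns advantages of the same shape; no value network is involved.
    """
    mean = rewards.mean(dim=1, keepdim=True)   # per-group baseline
    std = rewards.std(dim=1, keepdim=True)     # per-group scale
    return (rewards - mean) / (std + eps)
```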
Surrogate Objective
The core update maximizes a clipped surrogate built on the group-normalized advantages:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{T} \sum_{t=1}^{T} \min\Big( \rho_{i,t}(\theta)\, A_i,\; \operatorname{clip}\big(\rho_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_i \Big) \right],$$

where $\rho_{i,t}(\theta) = \dfrac{\pi_\theta(x_{t-1}^{i} \mid x_t^{i}, c)}{\pi_{\theta_{\text{old}}}(x_{t-1}^{i} \mid x_t^{i}, c)}$ is the per-step importance ratio and $\varepsilon$ is the clipping range.
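A minimal torch sketch of this clipped surrogate, assuming per-step log-probabilities under the current and old policies have already been computed (the `logp_new`/`logp_old` names and layout are illustrative, not the papers' API):

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective with group-relative advantages (sketch).

    logp_new, logp_old: shape (B, T) -- per-step log-probs under pi_theta / pi_old
    advantages:         shape (B,)   -- one group-relative advantage per trajectory
    Returns a scalar loss to minimize (negative of the surrogate objective).
    """
    ratio = (logp_new - logp_old).exp()                           # rho_{i,t}
    adv = advantages.unsqueeze(1)                                 # broadcast over timesteps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```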
3. Algorithmic Variants: GRPO and TIC-GRPO
GRPO Behavior and Bias
Original GRPO estimates the policy gradient at $\theta_{\text{old}}$ (the previous policy's parameters) rather than at the current $\theta$, introducing a minor bias. However, when $\theta_{\text{old}}$ is refreshed every few inner-loop steps, this bias is shown to be negligible in practice. Empirical ablations in which even the token-level importance sampling is dropped confirm little impact on performance (Pang et al., 4 Aug 2025).
TIC-GRPO: Trajectory Importance Correction
TIC-GRPO addresses the bias by replacing token-level importance ratios with a single trajectory-level ratio:

$$\rho_i(\theta) = \frac{\pi_\theta(\tau_i)}{\pi_{\theta_{\text{old}}}(\tau_i)} = \prod_{t=1}^{T} \frac{\pi_\theta(x_{t-1}^{i} \mid x_t^{i}, c)}{\pi_{\theta_{\text{old}}}(x_{t-1}^{i} \mid x_t^{i}, c)}.$$

This yields an unbiased estimator of the policy gradient at the current parameters $\theta$, rather than at $\theta_{\text{old}}$.
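Under the same assumed `logp` layout as the surrogate sketch above, the trajectory-level correction is essentially a one-line change: sum per-step log-probabilities before exponentiating, so each trajectory carries a single importance weight.

```python
def trajectory_importance_ratio(logp_new, logp_old):
    """Trajectory-level importance ratio used by TIC-GRPO (illustrative sketch).

    logp_new, logp_old: shape (B, T) per-step log-probs. Summing over T gives
    log pi(tau), so one importance weight is computed per trajectory; it can
    then be clipped and combined with the group-relative advantage as before.
    """
    return (logp_new - logp_old).sum(dim=1).exp()   # pi_theta(tau) / pi_old(tau)
```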
Unified Pseudocode Structure
Both DS-GRPO and TIC-GRPO operate with the following high-level steps:
- Initialize $\theta$ (parameters of the denoiser).
- For each outer iteration:
  - Set $\theta_{\text{old}} \leftarrow \theta$.
  - For each prompt:
    - Sample $G$ trajectories using identical initial noise.
    - Compute terminal rewards and normalize them groupwise.
    - For selected timesteps or entire trajectories, compute (token- or trajectory-level) importance weights, clipped.
    - Accumulate gradients and update $\theta$ (see the code sketch below).
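The outline can be made concrete with a short training-loop sketch. Everything below is an assumption-laden illustration: `sample_group`, `reward_model`, and `per_step_logprobs` are hypothetical helpers standing in for model-specific rollout, scoring, and likelihood code.

```python
import torch

def ds_grpo_update(policy, old_policy, optimizer, prompts, reward_model,
                   sample_group, per_step_logprobs,
                   group_size=8, clip_eps=0.2, timestep_frac=0.6):
    """One DS-GRPO outer step (illustrative sketch; helper functions assumed).

    sample_group(old_policy, prompt, G) -> G trajectories sharing the same
        initial noise, each a list of latents [x_T, ..., x_0].
    per_step_logprobs(model, traj, prompt) -> tensor of shape (T,) giving the
        log-probability of each denoising transition under `model`.
    """
    old_policy.load_state_dict(policy.state_dict())    # theta_old <- theta
    optimizer.zero_grad()

    for prompt in prompts:
        trajs = sample_group(old_policy, prompt, group_size)
        rewards = torch.tensor([reward_model(t[-1], prompt) for t in trajs])
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative

        for traj, a in zip(trajs, adv):
            logp_new = per_step_logprobs(policy, traj, prompt)      # requires grad
            with torch.no_grad():
                logp_old = per_step_logprobs(old_policy, traj, prompt)
            # Randomly subsample a fraction of timesteps for stability.
            T = logp_new.shape[0]
            k = max(1, int(timestep_frac * T))
            idx = torch.randperm(T)[:k]
            ratio = (logp_new[idx] - logp_old[idx]).exp()
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
            loss = -torch.min(ratio * a, clipped * a).mean()
            (loss / (len(prompts) * group_size)).backward()         # accumulate grads

    optimizer.step()
```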
4. Empirical Results and Ablations
Extensive experiments on text-to-image, text-to-video, and image-to-video tasks exhibit strong empirical gains:
| Task/Model | Main Metric(s) | Improvement Over Baseline |
|---|---|---|
| Text-to-Image (Stable Diff./FLUX/etc.) | HPS-v2.1, CLIP Score | +53% to +177% (HPS); +9% to +16% (CLIP) |
| Text-to-Video (HunyuanVideo) | Visual-Q., Motion-Q., Align. | Visual-Q. +56%, Motion-Q. +181% |
| Image-to-Video (SkyReels-I2V) | Motion-Q. | +91% |
Ablations provide the following practical guidance:
- Timestep Subsampling: Randomly subsampling roughly 60% of the denoising timesteps per update yields the best stability and convergence.
- Noise Level: An intermediate noise scale in the SDE sampler works best; too little noise collapses the reward signal, while too much introduces visual artifacts.
- Best-of-N Inference: Restricting updates to the top- and bottom-ranked samples of larger candidate pools accelerates convergence (roughly 2×), at increased sampling cost.
- Sparse/Binary Reward: DS-GRPO learns robustly from discretized (0/1) feedback signals, e.g., obtained by thresholding HPS or CLIP scores (a short sketch of these two mechanics follows this list).
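A small sketch of the two ablation mechanics above, under the same hypothetical reward interface as before: thresholding a continuous score into a binary reward, and keeping only the top and bottom performers of a sampled pool for the policy update.

```python
import torch

def binarize_reward(scores: torch.Tensor, threshold: float) -> torch.Tensor:
    """Discretize continuous scores (e.g., HPS/CLIP) into 0/1 rewards."""
    return (scores >= threshold).float()

def select_best_and_worst(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the top-k and bottom-k samples in a candidate pool;
    only these are used for the policy update (best-of-N focus)."""
    order = torch.argsort(scores, descending=True)
    return torch.cat([order[:k], order[-k:]])
```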
5. Practical Insights and Stability Mechanisms
DS-GRPO's design integrates several stability and efficiency mechanisms:
- Shared Initialization Noise: Using the same initial noise $x_T$ for all samples in a group reduces reward variance and discourages reward hacking, especially in video generation.
- Best-of-N Update Focus: Concentrating updates on top and bottom performers enables sharpening of both desired and undesired behavior without complex search or explicit value estimation.
- Sparse Feedback Handling: The group-normalized advantage formulation enables effective credit assignment even with highly sparse or binary feedback—a regime where value-based baselines often fail.
- No Critic Requirement: The absence of value networks eliminates instability from inaccurate value estimation and reduces implementation complexity.
6. Theoretical Guarantees and Comparison to PPO
Both GRPO and TIC-GRPO satisfy nonconvex convergence rates matching standard on-policy RL algorithms. Specifically, with constant step size $\eta$, inner loop length $K$, and group size $G$, the iterates satisfy a bound of the form

$$\min_{n \le N} \mathbb{E}\big\| \nabla J(\theta_n) \big\|^2 = O\!\left(\frac{1}{\sqrt{N}}\right),$$

where $N$ is the number of outer iterations and the constants depend on $\eta$, $K$, and $G$. This holds under mild smoothness and boundedness assumptions on the reward and log-policy. TIC-GRPO achieves an unbiased estimator of the true policy gradient $\nabla J(\theta)$, with empirical results showing faster convergence (Pang et al., 4 Aug 2025).
Relative to PPO:
- DS-GRPO methods are critic-free, relying exclusively on group-normalized terminal feedback.
- Implementation is simplified; memory and compute requirements decrease because no value network must be trained or stored.
- Empirical sample efficiency and run-time are competitive with, or surpass, standard PPO approaches.
7. Application Scope and Significance
DS-GRPO serves as a unified reinforcement learning framework for generative policy optimization across both diffusion and flow-based models, with demonstrated efficacy on tasks including text-to-image, text-to-video, and image-to-video generation. It seamlessly adapts across diverse foundation models (e.g., Stable Diffusion, FLUX, HunyuanVideo) and reward models (image/video aesthetic, multimodal alignment, motion quality, and binary rules). Empirical performance extends to +181% improvement on motion quality in video generation tasks and robustness under both continuous and highly sparse reward signals (Xue et al., 12 May 2025). This suggests DS-GRPO is of practical and theoretical importance for advancing RLHF in high-dimensional generative modeling.