
DanceGRPO: Unified Visual RL Framework

Updated 16 January 2026
  • DanceGRPO is a unified reinforcement learning framework that formalizes the multi-step denoising process as an MDP, enabling alignment of generative outputs with human feedback.
  • It integrates diffusion models and rectified flows through a modular algorithmic workflow, employing group-relative normalization to stabilize policy gradients.
  • Benchmark results show significant improvements in image and video generation tasks, demonstrating robust performance across diverse generative models and reward criteria.

DanceGRPO is a unified reinforcement learning framework that adapts Group Relative Policy Optimization (GRPO) to modern visual generative paradigms, including diffusion models and rectified flows. The approach generalizes across three core task types—text-to-image, text-to-video, and image-to-video—by framing the multi-step denoising process as a Markov Decision Process (MDP) suitable for RL from human feedback (RLHF). DanceGRPO introduces algorithmic and implementation-level unification over diverse generative models, tasks, and reward models, establishing a robust and scalable methodology for aligning generative outputs with complex, multi-criteria user preferences (Xue et al., 12 May 2025).

1. Mathematical Foundations and GRPO Objective

The denoising process underpinning both diffusion and flow-based models is formalized as an MDP, where states $s_t=(c, t, z_t)$ encode conditioning information, temporal context, and sample latents; actions $a_t=z_{t-1}$ correspond to generative steps; and the policy $\pi_\theta(a_t|s_t)$ is realized as a reverse SDE/ODE sampler. DanceGRPO restricts reward assignment exclusively to the terminal step, with $R(s_0,a_0)=r(z_0,c)$, and propagates a single advantage signal across all timesteps within a trajectory.
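The rollout structure implied by this MDP can be sketched compactly. The snippet below is an illustrative Python fragment, not the authors' implementation; `policy_step` and `reward_fn` are assumed callables standing in for the reverse sampler and the terminal reward model.

```python
# Minimal sketch of the denoising-as-MDP view: state = (prompt c, timestep t,
# latent z_t); action = next latent z_{t-1}; reward only at the terminal step.
from dataclasses import dataclass
import torch

@dataclass
class DenoiseState:
    cond: torch.Tensor   # prompt/conditioning embedding c
    t: int               # current timestep index
    z: torch.Tensor      # current latent z_t

def rollout(policy_step, reward_fn, cond, z_T, T):
    """Run one denoising trajectory; return its states, actions, and terminal reward."""
    states, actions = [], []
    z = z_T
    for t in range(T, 0, -1):
        s = DenoiseState(cond, t, z)
        z_prev = policy_step(s)           # sample a_t = z_{t-1} ~ pi_theta(. | s_t)
        states.append(s)
        actions.append(z_prev)
        z = z_prev
    terminal_reward = reward_fn(z, cond)  # R(s_0, a_0) = r(z_0, c), assigned only at the end
    return states, actions, terminal_reward
```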

The GRPO surrogate objective extends PPO through group-relative normalization, stabilizing large-scale RLHF for visual generation:

$$\mathcal{J}(\theta) = \mathbb{E}_{\{\tau_i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G}\sum_{i=1}^G \frac{1}{T}\sum_{t=1}^T \min\Bigl( \rho_{t,i}A_i,\; \mathrm{clip}(\rho_{t,i},1-\epsilon,1+\epsilon)A_i \Bigr) \right]$$

with key components as follows:

  • $\rho_{t,i}=\dfrac{\pi_\theta(a_{t,i}\mid s_{t,i})}{\pi_{\theta_{\mathrm{old}}}(a_{t,i}\mid s_{t,i})}$
  • $A_i=\dfrac{r_i-\mu_r}{\sigma_r}$, where $\mu_r=\tfrac{1}{G}\sum_j r_j$ and $\sigma_r=\mathrm{Std}(\{r_j\})$
  • Each $r_i = r(z_0^{(i)},c)$ may itself be a sum of $K$ individual reward components

A first-order expansion reveals a group-normalized policy gradient, with the clipping term acting as a regularizer that limits divergence from the old policy.
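As a concrete reading of the objective, the following PyTorch sketch computes the group-relative clipped surrogate from per-step log-probabilities and terminal rewards. Tensor names, shapes, and the `eps`/`clip_eps` defaults are assumptions, not values from the paper.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=1e-4, clip_eps=0.2):
    """
    logp_new, logp_old: [G, T] log-probs of the sampled actions under the current
                        and behavior policies, for G trajectories of T steps.
    rewards:            [G] terminal rewards r(z_0^(i), c) for the group.
    """
    # Group-relative advantage: normalize the terminal rewards within the prompt
    # group and broadcast the trajectory-level advantage to every timestep.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)        # [G]
    adv = adv.unsqueeze(1)                                          # [G, 1]

    ratio = torch.exp(logp_new - logp_old)                          # rho_{t,i}, [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv

    # Maximize J(theta) by minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```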

2. Unified Algorithmic Workflow

DanceGRPO's implementation is modular, adapting to both diffusion and rectified-flow sampling through a single pseudocode template. Each training iteration proceeds as follows:

  1. Sample a batch of prompts.
  2. For each prompt:

    • Sample $G$ denoising trajectories sharing an initial noise seed.
    • For diffusion: reverse SDE sampling, $\mathrm{d}z_t=\bigl(f_t z_t-\tfrac{1+\eta_t^2}{2}g_t^2\nabla\log p_t(z_t)\bigr)\,\mathrm{d}t + \eta_t g_t\,\mathrm{d}w$.
    • For rectified flow: inject noise into the ODE, $\mathrm{d}z_t=\bigl(u_t - \tfrac{1}{2}\varepsilon_t^2\nabla\log p_t(z_t)\bigr)\,\mathrm{d}t + \varepsilon_t\,\mathrm{d}w$.
    • Compute $K$ reward signals per trajectory.
    • Normalize advantages within each reward dimension, then sum:

    $$A_i = \sum_{k=1}^K \frac{r_i^k - \mu^k}{\sigma^k}$$

  3. Randomly subsample timesteps, retaining a fraction $\tau$ of the $T$ steps.
  4. Accumulate gradients and update parameters.

This process is repeated at each training iteration, yielding robust policy updates even in the presence of high-dimensional, multi-reward feedback and large group sizes.
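A compact sketch of one such iteration, mirroring the workflow above, is given below. `sample_group`, `compute_rewards`, and `clipped_surrogate` are hypothetical interfaces standing in for the sampler, the reward models, and the surrogate objective of Section 1 (here taking precomputed group advantages); hyperparameter defaults are illustrative.

```python
import torch

def train_iteration(policy, optimizer, prompts, sample_group, compute_rewards,
                    clipped_surrogate, G=16, T=50, tau=0.6, eps=1e-4):
    optimizer.zero_grad()
    for c in prompts:
        # Steps 1-2: G trajectories per prompt, all started from a shared noise seed.
        trajs = sample_group(c, num_trajectories=G, num_steps=T)

        # Per-reward group normalization, summed over the K reward dimensions.
        rewards = compute_rewards(trajs, c)                        # [G, K]
        adv = (rewards - rewards.mean(dim=0)) / (rewards.std(dim=0) + eps)
        adv = adv.sum(dim=1)                                       # [G]

        # Step 3: randomly keep a tau-fraction of timesteps for the gradient estimate.
        keep = torch.randperm(T)[: int(tau * T)]
        logp_new = policy.log_prob(trajs)[:, keep]                 # [G, |keep|]
        logp_old = trajs.logp_old[:, keep]

        # Step 4: accumulate gradients over the prompt batch, then update once.
        clipped_surrogate(logp_new, logp_old, adv).backward()
    optimizer.step()
```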

3. Task Adaptation and Foundation Models

DanceGRPO supports three main generative tasks, uniformly handling conditioning and reward fusion:

  • Text-to-Image: Utilizes Stable Diffusion v1.4, FLUX (flow-based), and HunyuanVideo-T2I.
  • Text-to-Video: Employs HunyuanVideo.
  • Image-to-Video: Relies on SkyReels-I2V with image-based conditioning.

Distinct adaptations per task include: employing the multi-dimensional VideoAlign reward for videos; restricting optimization to motion quality for image-to-video (where visual fidelity is determined by the input image); and alternating conditional/unconditional samples for classifier-free-guidance models. All tasks inject the prompt conditioning $c$ directly into the sampler state.
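For orientation, a hypothetical configuration pairing each task with the models and reward dimensions named above might look as follows; the keys and reward identifiers are illustrative labels, not the project's actual configuration.

```python
# Illustrative task/reward pairing (labels are assumptions drawn from the text).
TASK_CONFIGS = {
    "text-to-image": {
        "models": ["Stable Diffusion v1.4", "FLUX", "HunyuanVideo-T2I"],
        "rewards": ["hps_v2_1", "clip_score"],
    },
    "text-to-video": {
        "models": ["HunyuanVideo"],
        "rewards": ["videoalign_visual", "videoalign_motion"],
    },
    "image-to-video": {
        "models": ["SkyReels-I2V"],
        # Visual fidelity is fixed by the conditioning image, so only motion
        # quality is optimized.
        "rewards": ["videoalign_motion"],
    },
}
```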

4. Reward Model Integration and Normalization

DanceGRPO leverages up to five reward functions simultaneously, synthesizing them by summing group-normalized advantages rather than raw scores:

  • Image Aesthetics: HPS-v2.1
  • Text–Image Alignment: CLIP score
  • Video Aesthetics and Motion Quality: VideoAlign metrics
  • Binary Threshold Rewards: e.g., $r = 1$ if HPS-v2.1 $> 0.28$, else $r = 0$

Group-wise normalization avoids reward scale mismatches:

$$A_i = \sum_{k=1}^K \frac{r_i^k - \mu^k}{\sigma^k}$$

where $\mu^k$ and $\sigma^k$ are the group mean and standard deviation of the $k$-th reward.

This approach stabilizes training and supports efficient handling of sparse binary feedback, keeping gradients informative whenever the group statistics retain nonzero variance.
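The sketch below illustrates this fusion rule together with a thresholded binary reward; the `eps` guard against zero-variance groups and all names are assumptions.

```python
import torch

def fused_advantages(reward_matrix, eps=1e-4):
    """reward_matrix: [G, K] raw scores from K reward models for a group of G samples."""
    mu = reward_matrix.mean(dim=0, keepdim=True)         # per-reward group mean
    sigma = reward_matrix.std(dim=0, keepdim=True)       # per-reward group std
    return ((reward_matrix - mu) / (sigma + eps)).sum(dim=1)   # [G] fused advantages

# Example: HPS-v2.1 scores plus the binary reward r = 1[HPS > 0.28] for a group of 4.
hps = torch.tensor([0.31, 0.25, 0.29, 0.27])
binary = (hps > 0.28).float()
advantages = fused_advantages(torch.stack([hps, binary], dim=1))
```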

5. Experimental Protocol and Benchmark Results

Extensive empirical validation demonstrates DanceGRPO's scaling and robustness:

  • Hardware: 8–64 NVIDIA H800 GPUs
  • Datasets: >10k prompts for training, 1k for each evaluation split, 200–240 for human evaluation
  • Benchmarks: HPS-v2.1, CLIP score, Pick-a-Pic, GenEval, VideoAlign (multi-dim)

Representative results:

| Task | Baseline → DanceGRPO | % Improvement | Metric |
|------|----------------------|---------------|--------|
| T2I (Stable Diffusion) | HPS: 0.239 → 0.365 | +53% | HPS-v2.1 |
| T2I (Stable Diffusion) | CLIP: 0.363 → 0.395 | +9% | CLIP Score |
| T2I (FLUX) | HPS: 0.304 → 0.372 | +22% | HPS-v2.1 |
| T2I (FLUX) | CLIP: 0.405 → 0.427 | +5% | CLIP Score |
| T2V (HunyuanVideo) | Motion: 1.37 → 3.85 | +181% | VideoAlign |
| I2V (SkyReels-I2V) | Motion quality: baseline → +91% | +91% | -- |

Human evaluation preferred RLHF-tuned outputs ~70% of the time across major task/model combinations.

Stability plots confirm that DanceGRPO maintains smooth reward curves even for video models and sparse binary signals, whereas competing methods such as DDPO diverge on rectified flows. Best-of-N inference scaling with top/bottom-k sample selection further accelerates convergence.
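The top/bottom-k selection can be illustrated with a short fragment; the function name and the choice of k are illustrative, not taken from the paper.

```python
import torch

def select_extremes(rewards, k=2):
    """rewards: [G] group rewards; return indices of the k lowest and k highest samples."""
    order = torch.argsort(rewards)                       # ascending by reward
    return torch.cat([order[:k], order[-k:]])            # bottom-k then top-k

rewards = torch.tensor([0.31, 0.12, 0.45, 0.28, 0.05, 0.39])
idx = select_extremes(rewards, k=2)                      # tensor([4, 1, 5, 2])
```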

6. Practical Insights and Implementation Considerations

Key observations:

  • Stability in RLHF: Sharing initial noise seeds within prompt groups deters reward hacking; group-normalized advantages mitigate outlier gradients; subsampling timesteps ($\tau \approx 0.6$) reduces compute and variance.
  • Best-of-N Scaling: Focusing updates on informative extremes of the sample pool speeds training but requires additional sampling. This procedure operates orthogonally to the core algorithm.
  • Sparse Binary Feedback: The normalization mechanism retains gradient informativeness even with strict thresholding, supporting ascent on sparse reward landscapes.
  • ODE-based Sampler Compatibility: Injecting a calibrated SDE noise term into ODE-based generative samplers restores Markovian stochasticity for RL; this permits direct adaptation of GRPO without restructuring samplers into discrete-step MDPs (a minimal sketch follows below).
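A minimal sketch of such a noisy flow step, following the SDE form quoted in Section 2, is given below; the `velocity` and `score` callables and the handling of the step size are assumptions.

```python
import torch

def noisy_flow_step(z, t, dt, velocity, score, eps_t):
    """One stochastic rectified-flow step: dz = (u_t - 0.5*eps_t^2*score) dt + eps_t dw.
    Returns the next latent plus the Gaussian mean/std needed to evaluate
    log pi_theta(z_next | z_t) for the GRPO ratio."""
    drift = velocity(z, t) - 0.5 * (eps_t ** 2) * score(z, t)
    mean = z + drift * dt
    std = eps_t * abs(dt) ** 0.5
    z_next = mean + std * torch.randn_like(z)
    return z_next, mean, std
```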

DanceGRPO thus provides a single policy-gradient framework for visual generation, harmonizing RLHF objectives with modern denoising architectures. The methodology is notable for its empirical performance, mathematical rigor, and practical extensibility across contemporary generative pipelines (Xue et al., 12 May 2025).
