DanceGRPO: Unified Visual RL Framework
- DanceGRPO is a unified reinforcement learning framework that formalizes the multi-step denoising process as an MDP, enabling alignment of generative outputs with human feedback.
- It integrates diffusion models and rectified flows through a modular algorithmic workflow, employing group-relative normalization to stabilize policy gradients.
- Benchmark results show significant improvements in image and video generation tasks, demonstrating robust performance across diverse generative models and reward criteria.
DanceGRPO is a unified reinforcement learning framework that adapts Group Relative Policy Optimization (GRPO) to modern visual generative paradigms, including diffusion models and rectified flows. The approach generalizes across three core task types—text-to-image, text-to-video, and image-to-video—by framing the multi-step denoising process as a Markov Decision Process (MDP) suitable for RL from human feedback (RLHF). DanceGRPO introduces algorithmic and implementation-level unification over diverse generative models, tasks, and reward models, establishing a robust and scalable methodology for aligning generative outputs with complex, multi-criteria user preferences (Xue et al., 12 May 2025).
1. Mathematical Foundations and GRPO Objective
The denoising process underpinning both diffusion and flow-based models is formalized as an MDP, where states encode conditioning information, temporal context, and sample latents; actions correspond to generative steps; and the policy is realized as a reverse SDE/ODE sampler. DanceGRPO restricts reward assignment exclusively to the terminal step, with zero reward at every intermediate step and $r = R(x_0, c)$ assigned only once the final denoised sample $x_0$ is produced, and propagates a single advantage signal across all timesteps within a trajectory.
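This formalization can be stated compactly as follows (a restatement of the description above in illustrative notation, not necessarily the paper's exact symbols):

$$
\begin{aligned}
s_t &= (c,\ t,\ x_t) && \text{state: prompt, timestep, current latent}\\
a_t &= x_{t-1} && \text{action: the next, less-noisy latent}\\
\pi_\theta(a_t \mid s_t) &= p_\theta(x_{t-1} \mid x_t,\ c) && \text{policy: one reverse SDE/ODE step}\\
r(s_t, a_t) &= \begin{cases} R(x_0,\ c) & \text{final denoising step}\\ 0 & \text{otherwise}\end{cases}
\end{aligned}
$$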
The GRPO surrogate objective extends PPO through group-relative normalization, stabilizing large-scale RLHF for visual generation:

$$
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T}\sum_{t=1}^{T}\min\Big(\rho_{i,t}(\theta)\,A_i,\ \operatorname{clip}\big(\rho_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,A_i\Big)\right],
\qquad
\rho_{i,t}(\theta) = \frac{p_\theta(x_{i,t-1}\mid x_{i,t},\,c)}{p_{\theta_{\text{old}}}(x_{i,t-1}\mid x_{i,t},\,c)},
$$

with key components as follows:
- $A_i = \dfrac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}$, where $r_i$ is the terminal reward of the $i$-th trajectory in a group of $G$ sampled for the same prompt,
- Each $r_i$ may itself be a sum of individual reward components
First-order expansion reveals a group-normalized policy gradient, regularized by clipping to control KL divergence.
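A minimal PyTorch sketch of this surrogate loss, assuming per-step log-probabilities have already been computed for each trajectory in the group (tensor shapes and the clip range are illustrative, not the paper's settings):

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps_clip=0.2):
    """Group-relative PPO-style surrogate loss.

    logp_new, logp_old: [G, T] per-step log-probs of each trajectory's actions
                        under the current and the behavior (sampling) policy.
    rewards:            [G] terminal reward of each trajectory in the group.
    eps_clip:           PPO-style clip range (illustrative default).
    """
    # Group-relative advantage: normalize terminal rewards within the group
    # and broadcast the single per-trajectory advantage to every timestep.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # [G]
    adv = adv.unsqueeze(1)                                      # [G, 1]

    # Importance ratio per denoising step, clipped as in PPO.
    ratio = torch.exp(logp_new - logp_old)                      # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv

    # Maximize the surrogate objective -> minimize its negative.
    return -torch.min(unclipped, clipped).mean()
```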
2. Unified Algorithmic Workflow
DanceGRPO's implementation is modular, adapting to both diffusion and rectified-flow sampling through a single pseudocode template. For each prompt, the workflow proceeds as follows:
- Sample a batch of prompts.
- For each prompt:
- Sample a group of $G$ denoising trajectories sharing an initial noise seed.
- For diffusion: reverse SDE sampling, $dx_t = \big[f(x_t, t) - g(t)^2\,\nabla_{x_t}\log p_t(x_t)\big]\,dt + g(t)\,d\bar{w}_t$.
- For rectified flow: inject noise into the deterministic ODE $dx_t = v_\theta(x_t, t)\,dt$, yielding an SDE with a diffusion term $\sigma_t\,dw_t$ so that per-step transitions are stochastic.
- Compute reward signals per trajectory.
- Normalize and sum advantages across each reward dimension: $A_i = \sum_{k} \dfrac{r_i^{(k)} - \operatorname{mean}_j\big(r_j^{(k)}\big)}{\operatorname{std}_j\big(r_j^{(k)}\big)}$.
- Randomly subsample timesteps (a subset $\tau \subset \{1,\dots,T\}$ of the denoising steps) for the policy update.
- Accumulate gradients and update parameters.
This process is repeated for all iterations, yielding robust policy updates even in the presence of high-dimensional, multi-reward feedback and large group sizes.
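The loop can be sketched as follows (a schematic outline assuming that group sampling, reward evaluation, and the clipped surrogate loss are supplied as callables; none of the names below come from the paper's codebase):

```python
import torch

def dance_grpo_step(prompts, sample_group, reward_fns, grpo_loss, optimizer,
                    group_size=16, timestep_subsample=4):
    """One DanceGRPO update over a batch of prompts (schematic).

    sample_group(prompt, n): draws n denoising trajectories that share one
                             initial noise seed, using the current (old) policy.
    reward_fns:              callables mapping (sample, prompt) -> scalar reward.
    grpo_loss(trajs, adv, timesteps): clipped GRPO surrogate over the chosen steps.
    """
    optimizer.zero_grad()
    for prompt in prompts:
        trajs = sample_group(prompt, group_size)

        # Fuse rewards: normalize each dimension within the group, then sum.
        adv = torch.zeros(group_size)
        for rf in reward_fns:
            r = torch.tensor([rf(t.sample, prompt) for t in trajs], dtype=torch.float32)
            adv += (r - r.mean()) / (r.std() + 1e-8)

        # Update on a random subset of denoising timesteps to cut compute and variance.
        timesteps = torch.randperm(trajs[0].num_steps)[:timestep_subsample]

        grpo_loss(trajs, adv, timesteps).backward()  # gradients accumulate over prompts
    optimizer.step()
```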
3. Task Adaptation and Foundation Models
DanceGRPO supports three main generative tasks, uniformly handling conditioning and reward fusion:
- Text-to-Image: Utilizes Stable Diffusion v1.4, FLUX (flow-based), and HunyuanVideo-T2I.
- Text-to-Video: Employs HunyuanVideo.
- Image-to-Video: Relies on SkyReels-I2V with image-based conditioning.
Distinct adaptations per task include: employing VideoAlign multi-dimensional reward for videos; restricting motion-quality optimization for image-to-video (where fidelity is determined by the input image); and alternating conditional/unconditional samples for classifier-free guidance models. All tasks inject prompt conditioning directly into the sampler state.
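Since several of these foundation models rely on classifier-free guidance, each sampler step blends a conditional and an unconditional prediction; a generic sketch of that standard combination is shown below (illustrative names and guidance scale; this is the usual CFG formula, not DanceGRPO-specific batching of conditional/unconditional samples):

```python
def cfg_prediction(model, x_t, t, prompt_emb, null_emb, guidance_scale=5.0):
    """Standard classifier-free guidance combination at one sampler step.

    `model` may be a noise-prediction (diffusion) or velocity (rectified-flow)
    network; all argument names here are placeholders.
    """
    pred_cond = model(x_t, t, prompt_emb)    # conditioned on the text prompt
    pred_uncond = model(x_t, t, null_emb)    # null-prompt ("unconditional") branch
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```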
4. Reward Model Integration and Normalization
DanceGRPO leverages up to five reward functions simultaneously, synthesizing them by summing group-normalized advantages rather than raw scores:
- Image Aesthetics: HPS-v2.1
- Text–Image Alignment: CLIP score
- Video Aesthetics and Motion Quality: VideoAlign metrics
- Binary Threshold Rewards: e.g., $r = 1$ if HPS > 0.28, else $r = 0$
Group-wise normalization avoids reward scale mismatches by standardizing each reward dimension within its prompt group before summation:

$$
A_i = \sum_{k} \frac{r_i^{(k)} - \operatorname{mean}\big(\{r_j^{(k)}\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j^{(k)}\}_{j=1}^{G}\big)}.
$$
This approach stabilizes training and supports efficient handling of sparse binary feedback, maintaining informative gradients by ensuring nonzero variance in group statistics.
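A small NumPy sketch of this fusion, with a binary HPS-threshold reward and a CLIP-style score as the two dimensions (the numeric values besides the 0.28 threshold are made up for illustration):

```python
import numpy as np

def fuse_rewards(reward_matrix, eps=1e-8):
    """Sum group-normalized advantages across reward dimensions.

    reward_matrix: array of shape [G, K] with K reward scores
                   (e.g., HPS-v2.1, CLIP, VideoAlign dims) per trajectory.
    Returns one fused advantage per trajectory, shape [G].
    """
    mu = reward_matrix.mean(axis=0, keepdims=True)
    sigma = reward_matrix.std(axis=0, keepdims=True)
    return ((reward_matrix - mu) / (sigma + eps)).sum(axis=1)

# A binary threshold reward stays informative after normalization as long as the
# group contains both 0s and 1s (nonzero within-group variance).
hps = np.array([0.31, 0.27, 0.29, 0.25])
binary = (hps > 0.28).astype(float)          # 1 if HPS-v2.1 > 0.28, else 0
clip_score = np.array([0.40, 0.42, 0.39, 0.41])
adv = fuse_rewards(np.stack([binary, clip_score], axis=1))
print(adv)
```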
5. Experimental Protocol and Benchmark Results
Extensive empirical validation demonstrates DanceGRPO's scaling and robustness:
- Hardware: 8–64 NVIDIA H800 GPUs
- Datasets: >10k prompts for training, 1k for each evaluation split, 200–240 for human evaluation
- Benchmarks: HPS-v2.1, CLIP score, Pick-a-Pic, GenEval, VideoAlign (multi-dim)
Representative results:
| Task | Baseline → DanceGRPO | % Improvement | Metric |
|---|---|---|---|
| T2I (Stable Diffusion) | HPS: 0.239 → 0.365 | +53% | HPS-v2.1 |
| T2I (Stable Diffusion) | CLIP: 0.363 → 0.395 | +9% | CLIP Score |
| T2I (FLUX) | HPS: 0.304 → 0.372 | +22% | HPS-v2.1 |
| T2I (FLUX) | CLIP: 0.405 → 0.427 | +5% | CLIP Score |
| T2V (HunyuanVideo) | Motion: 1.37 → 3.85 | +181% | VideoAlign |
| I2V (SkyReels-I2V) | Motion quality: +91% over baseline | +91% | VideoAlign (motion quality) |
Human evaluation preferred RLHF-tuned outputs ~70% of the time across major task/model combinations.
Stability plots confirm that DanceGRPO maintains smooth reward curves even for video models and sparse binary signals, with competing methods (DDPO) diverging on rectified flows. Best-of-N inference scaling with top/bottom-k sample selection accelerates convergence.
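The top/bottom-k selection can be sketched as follows (illustrative; `k` and the exact selection rule are assumptions about the general recipe rather than the paper's configuration):

```python
import torch

def select_extremes(rewards, k=2):
    """Keep the k highest- and k lowest-reward samples from a group.

    Training only on these extremes concentrates the policy update on the most
    informative trajectories, at the cost of generating more samples per prompt.
    rewards: tensor of shape [G]; returns indices into the group.
    """
    order = torch.argsort(rewards)           # ascending by reward
    return torch.cat([order[:k], order[-k:]])
```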
6. Practical Insights and Implementation Considerations
Key observations:
- Stability in RLHF: Sharing initial noise seeds within prompt groups deters reward hacking; group-normalized advantages mitigate outlier gradients; subsampling timesteps (updating on only a random subset of denoising steps) reduces compute and variance.
- Best-of-N Scaling: Focusing updates on informative extremes of the sample pool speeds training but requires additional sampling. This procedure operates orthogonally to the core algorithm.
- Sparse Binary Feedback: The normalization mechanism retains gradient informativeness even with strict thresholding, supporting ascent on sparse reward landscapes.
- ODE-based Sampler Compatibility: Injecting a calibrated SDE noise term into ODE-based generative samplers restores Markovian stochasticity for RL; this permits direct adaptation of GRPO without restructuring samplers into discrete-step MDPs (see the sketch after this list).
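A schematic Euler–Maruyama step illustrating this noise injection (a sketch assuming a generic velocity field and a constant noise scale; the paper's exact drift correction and noise schedule may differ):

```python
import torch

def stochastic_flow_step(velocity, x, t, dt, sigma=0.3):
    """One Euler–Maruyama step for a rectified-flow sampler with injected noise.

    Deterministic ODE step:   x <- x + v(x, t) * dt
    Adding a diffusion term   sigma * sqrt(dt) * eps   makes each transition a
    proper stochastic kernel, so per-step log-probabilities (and hence GRPO
    importance ratios) are well defined. In practice the drift is also
    corrected so the marginals are preserved; that correction is omitted here.
    """
    drift = velocity(x, t)
    noise = torch.randn_like(x)
    return x + drift * dt + sigma * (dt ** 0.5) * noise
```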
DanceGRPO achieves a single policy-gradient framework for visual generation, harmonizing RLHF objectives with advanced denoising architectures. The methodology is notable for its empirical performance, mathematical rigor, and practical extensibility across contemporary generative pipelines (Xue et al., 12 May 2025).