
MixGRPO/MixGRPO-Flash: Efficient RLHF Diffusion Models

Updated 1 January 2026
  • The paper's main contribution is the introduction of a mixed ODE-SDE sampling strategy with a sliding window to reduce computational overhead while maintaining or improving alignment quality.
  • MixGRPO is defined as an efficient reinforcement learning framework for text-to-image diffusion models that optimizes only a subset of denoising steps using a PPO-based approach.
  • MixGRPO-Flash further accelerates training by integrating high-order ODE solvers outside the optimization window, achieving up to a 71% reduction in training time without significant performance loss.

MixGRPO and its accelerated variant MixGRPO-Flash are frameworks for efficient reinforcement learning from human preferences in text-to-image diffusion models. MixGRPO leverages a mixed ODE-SDE sampling strategy with a sliding window to optimize only a subset of denoising steps, substantially reducing computational overhead while improving or maintaining alignment quality compared to prior group relative policy optimization (GRPO) methods. MixGRPO-Flash further accelerates training by employing higher-order ODE solvers outside the windowed optimization region. These innovations target the inefficiency of earlier approaches that require sampling and optimizing over all timesteps, exploiting the interchangeable use of SDE and ODE routes in diffusion-based models for preference-based fine-tuning (Li et al., 29 Jul 2025, Sheng et al., 12 Oct 2025).

1. Mathematical Formulation and Mixed Sampling

MixGRPO builds upon flow-matching and score-based diffusion models where the data dynamics admit both stochastic (SDE) and deterministic (ODE) representations—typically referred to as the probability flow ODE and the equivalent forward SDE. Given a diffusion trajectory $\{x_t\}_{t=0}^{T}$, the model defines dynamics:

  • SDE: $dx_t = f(x_t, t)\,dt + g(t)\,dW_t$
  • ODE: $\dot{x}_t = f(x_t, t)$

In flow-matching, the vector field $v_\theta(x_t, t)$ parametrizes the drift. MixGRPO discretizes sampling as follows:

$$x_{t+1} = \begin{cases} x_t + \left[v_\theta(x_t, t) + \dfrac{\sigma_t^2}{2t}\bigl(x_t + (1-t)\,v_\theta(x_t, t)\bigr)\right]\Delta t + \sigma_t\sqrt{\Delta t}\,\epsilon_t, & t \in S \\[6pt] x_t + v_\theta(x_t, t)\,\Delta t, & t \notin S \end{cases}$$

with $\epsilon_t \sim \mathcal{N}(0, I)$ and $S$ the set of steps within the current sliding window.
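
A minimal PyTorch-style sketch of this per-step rule is given below, assuming a flow-matching velocity network `v_theta(x, t)` and scalar `t`, `dt`, and `sigma_t`; the function and argument names are illustrative rather than taken from the released code.

```python
import torch

def mixed_ode_sde_step(x_t, t, dt, v_theta, sigma_t, in_window):
    """One discretized step of the mixed ODE-SDE scheme (sketch).

    Applies the stochastic SDE update when the timestep lies inside the
    sliding window S, and the deterministic Euler ODE update otherwise,
    mirroring the case split above (t > 0 and dt > 0 assumed).
    """
    v = v_theta(x_t, t)
    if in_window:  # t in S: SDE step with noise injection
        drift = v + (sigma_t ** 2) / (2 * t) * (x_t + (1 - t) * v)
        return x_t + drift * dt + sigma_t * (dt ** 0.5) * torch.randn_like(x_t)
    # t not in S: first-order (Euler) ODE step
    return x_t + v * dt
```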

The policy-gradient objective, inherited from GRPO, is computed only over the windowed steps using a PPO-style clipped surrogate:

$$\mathcal{J}_{\mathrm{MixGRPO}}(\theta) = \mathbb{E}_{c,\; x_T^i \sim \pi_{\theta_{\mathrm{old}}}}\left[ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|S|} \sum_{t \in S} \min\Bigl(r_t^i(\theta)\,A^i,\; \operatorname{clip}\bigl(r_t^i(\theta), 1-\varepsilon, 1+\varepsilon\bigr)\,A^i\Bigr) \right]$$

with $r_t^i(\theta)$ the per-action likelihood ratio and $A^i$ the normalized terminal advantage.
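
As a concrete illustration, a hedged sketch of this clipped surrogate in PyTorch follows; the tensor shapes and names are assumptions rather than the authors' implementation, and the default clip value mirrors the $\varepsilon$ reported in (Sheng et al., 12 Oct 2025).

```python
import torch

def mixgrpo_surrogate(logp_new, logp_old, advantages, eps=1e-4):
    """PPO-style clipped surrogate over the windowed steps (sketch).

    logp_new, logp_old: (N, |S|) per-step action log-likelihoods under the
    current and old policies; advantages: (N,) normalized terminal A^i.
    Returns the objective J to be maximized (negate to use as a loss).
    """
    ratio = torch.exp(logp_new - logp_old)                # r_t^i(theta)
    adv = advantages.unsqueeze(1)                         # broadcast A^i over t in S
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return torch.min(unclipped, clipped).mean()           # average over i and t
```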

2. Sliding-Window Optimization and Algorithmic Workflow

The MixGRPO training procedure introduces a sliding window $W(l)$ of size $w$ over the $T$ denoising steps, applying stochastic SDE sampling and policy optimization only within this window. At each training iteration the window is positioned at $l$; after a fixed number of iterations $\tau$ it shifts by stride $s$. The rest of the trajectory outside $W(l)$ uses the deterministic ODE, concentrating optimization and noise injection on a limited region.

The algorithm proceeds as:

  1. For each prompt and sample, generate a trajectory, applying SDE or ODE steps depending on window membership.
  2. Compute terminal rewards and normalized advantages.
  3. Apply policy-gradient updates only to the windowed steps.
  4. After every $\tau$ iterations, increment the window position $l$ (or follow an exponential shift schedule).

This approach constrains the number of timesteps requiring full stochastic rollout and backward computation per iteration, reducing the gradient accumulation burden from order $T$ to $w$ (Li et al., 29 Jul 2025).
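
A skeleton of this sliding-window schedule, under the hyperparameter names used above, might look as follows; the sampling, reward, and update calls are left as placeholders.

```python
def window_steps(l, w):
    """Timesteps belonging to the sliding window W(l) of size w."""
    return set(range(l, l + w))

def sliding_window_schedule(num_iters, T=25, w=4, stride=1, tau=25):
    """Sketch of the MixGRPO outer loop: SDE sampling and policy-gradient
    updates only inside W(l), deterministic ODE elsewhere; the window shifts
    by `stride` every `tau` iterations (the exponential schedule is analogous)."""
    l = 0
    for it in range(num_iters):
        S = window_steps(l, w)
        for t in range(T):
            in_window = t in S
            # ... advance x_t with an SDE step if in_window, else an ODE step
        # ... compute terminal rewards and normalized advantages A^i,
        # ... then apply clipped policy-gradient updates only for t in S
        if (it + 1) % tau == 0:
            l = min(l + stride, T - w)  # shift the window along the trajectory
```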

3. MixGRPO-Flash: High-Order ODE Compression

MixGRPO-Flash further accelerates the base MixGRPO by employing high-order ODE solvers (e.g., DPM-Solver++ 2nd-order midpoint) outside the sliding window. Specifically:

  • Before the window: first-order Euler ODE integration.
  • Within the window: standard SDE updates.
  • After the window: ODE integration with high-order solvers, compressing the remaining $(T-l-w)$ steps by a factor $r$.

The effective step count is $\tilde{T} = l + w + \lceil (T-l-w)\,r \rceil$. This partitioning dramatically reduces the number of forward passes through the model, exploiting the fact that neither optimization nor stochasticity is needed outside the window (Li et al., 29 Jul 2025). The Flash variant yields empirical training-time reductions of roughly 71%.
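
For illustration, the effective step count can be computed directly; the values below (window start $l=10$, $w=4$, compression ratio $r=0.5$) are hypothetical and not taken from the paper.

```python
import math

def effective_steps(T, l, w, r):
    """Effective number of sampling steps under post-window ODE compression."""
    return l + w + math.ceil((T - l - w) * r)

print(effective_steps(T=25, l=10, w=4, r=0.5))  # 10 + 4 + ceil(11 * 0.5) = 20
```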

4. Sampler Stochasticity, gDDIM, and Reward Gap Analysis

MixGRPO exploits the generalized denoising diffusion implicit model (gDDIM) sampler, which supports arbitrary interpolation between deterministic (ODE/DDIM, $\eta=0$) and stochastic (SDE/DDPM, $\eta=1$) dynamics per time step. MixGRPO alternates or slides between these regimes to facilitate exploration and efficient optimization (Sheng et al., 12 Oct 2025).
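
The interpolation can be illustrated with the standard DDIM-style update below, in which $\eta$ scales the injected per-step noise; this is a simplified stand-in for the gDDIM formulation rather than the exact sampler used in the papers.

```python
import torch

def ddim_eta_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta):
    """DDIM-style step interpolating deterministic (eta=0) and stochastic
    (eta=1) sampling; a simplified sketch, not the full gDDIM sampler.

    eps_pred: noise prediction eps_theta(x_t, t); alpha_bar_*: cumulative
    noise-schedule products at the current and previous timesteps.
    """
    # Predicted clean sample recovered from the noise prediction
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
    # eta controls the noise scale; eta=0 recovers the deterministic DDIM/ODE limit
    sigma = eta * (((1 - alpha_bar_prev) / (1 - alpha_bar_t))
                   * (1 - alpha_bar_t / alpha_bar_prev)) ** 0.5
    direction = (1 - alpha_bar_prev - sigma ** 2) ** 0.5 * eps_pred
    noise = sigma * torch.randn_like(x_t) if eta > 0 else 0.0
    return alpha_bar_prev ** 0.5 * x0_pred + direction + noise
```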

Theoretical analysis of reward gaps—i.e., the difference in expected reward when using SDE versus ODE sampling—reveals distinct behaviors in this framework:

  • For MixGRPO, $J_{\mathrm{SDE}} > J_{\mathrm{ODE}}$, in contrast to DDPO's $J_{\mathrm{ODE}} > J_{\mathrm{SDE}}$ (where $J$ denotes the expected alignment reward at the final time).
  • Empirically, the SDE–ODE reward gap decreases with training, and ODE-generated samples after SDE-based fine-tuning attain high human preference scores.
  • The use of gDDIM guarantees preservation of training-time marginals, justifying evaluation under ODE (deterministic) sampling at test time (Sheng et al., 12 Oct 2025).

5. Experimental Results and Comparative Metrics

MixGRPO and MixGRPO-Flash were benchmarked against prior work, including DanceGRPO and baseline Flux models, using multi-reward human preference metrics (HPS-v2.1, Pick Score, ImageReward, Unified):

Method            | NFE ($\pi_{\mathrm{old}}$) | NFE ($\pi_\theta$) | Time (s) ↓ | HPS-v2.1 | Pick Score | ImageReward | Unified
------------------|----------------------------|--------------------|------------|----------|------------|-------------|--------
DanceGRPO (std)   | 25                         | 14                 | 291.28     | 0.356    | 0.233      | 1.436       | 3.397
DanceGRPO (NFE=4) | 25                         | 4                  | 149.98     | 0.334    | 0.225      | 1.335       | 3.374
MixGRPO           | 25                         | 4                  | 150.84     | 0.367 ↑  | 0.237 ↑    | 1.629 ↑     | 3.418 ↑
MixGRPO-Flash     | 16 (avg)                   | 4                  | 112.37     | 0.358 ↑  | 0.236 ↑    | 1.528 ↑     | 3.407 ↑
MixGRPO-Flash*    | 8                          | 4*                 | 83.28      | 0.357 ↑  | 0.232 ↑    | 1.624 ↑     | 3.402 ↑

MixGRPO achieved approximately a 50% reduction in training time over DanceGRPO with improved reward metrics. MixGRPO-Flash* (window frozen) reduced training time by roughly 71% with negligible performance degradation (Li et al., 29 Jul 2025). In single- and two-reward settings, MixGRPO consistently improved both in-domain and out-of-domain reward metrics.

6. Implementation and Empirical Protocols

Key implementation details included:

  • Denoising step count $T=25$; sliding-window size $w=4$; shift interval $\tau=25$ (or decaying exponentially); stride $s=1$.
  • Noise schedule: $\tilde{t} = t/(1-(\tilde{s}-1)t)$ with $\tilde{s}=3$, and $\sigma_t = \eta\sqrt{\tilde{t}/(1-\tilde{t})}$ with $\eta=0.7$.
  • Sampling: $N=12$ images per prompt per iteration; advantage clipping to $[-5, 5]$; gradient accumulation (3 micro-batches, 4 updates per iteration).
  • Optimization: AdamW (learning rate $1\times 10^{-5}$, weight decay $1\times 10^{-4}$), mixed-precision training (bf16 with fp32 master weights), 32 A100 GPUs, batch size 1 per GPU.
  • High-order ODE solver: DPM-Solver++ (2nd-order midpoint) for ODE compression (Li et al., 29 Jul 2025).
  • In the MixGRPO experiments reported in (Sheng et al., 12 Oct 2025): $T=15$, group size $G=12$, PPO clip $\epsilon = 1\times 10^{-4}$, 12 gradient updates per iteration, trained for 25 iterations.
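
Gathering the hyperparameters reported by (Li et al., 29 Jul 2025) into a single configuration object gives a compact reference; this is a sketch with illustrative field names, not the authors' actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class MixGRPOConfig:
    """Reported MixGRPO hyperparameters collected in one place (sketch)."""
    T: int = 25                    # denoising steps
    window_size: int = 4           # sliding-window size w
    window_stride: int = 1         # stride s
    shift_interval: int = 25       # tau: iterations between window shifts
    images_per_prompt: int = 12    # N samples per prompt per iteration
    adv_clip: float = 5.0          # advantages clipped to [-5, 5]
    lr: float = 1e-5               # AdamW learning rate
    weight_decay: float = 1e-4     # AdamW weight decay
    eta: float = 0.7               # coefficient in sigma_t = eta * sqrt(t~/(1-t~))
    time_shift: float = 3.0        # \tilde{s} in the noise schedule
    micro_batches: int = 3         # gradient-accumulation micro-batches
    updates_per_iteration: int = 4
```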

Empirical findings confirmed that reward gaps between stochastic training and deterministic inference collapsed during training, and that samples drawn with the ODE sampler after MixGRPO fine-tuning achieved high human preference scores.

7. Relations to Other Methods and Theoretical Context

MixGRPO is positioned as a direct evolution from FlowGRPO and DanceGRPO, offering greater optimization efficiency by eliminating the need for optimization over all denoising steps. The mix of ODE and SDE steps, controlled via a sliding window and realized through the gDDIM framework, presents a principled mechanism for balancing exploration (via SDE) and efficiency/stability (via ODE).

Theoretical analysis in (Sheng et al., 12 Oct 2025) emphasizes the role of sampler-type stochasticity in RLHF for diffusion models, showing the critical tradeoff between exploration and reward alignment. The sliding-window hybrid policy in MixGRPO leverages this understanding for practical efficiency without sacrificing alignment quality.

No formal description or analysis of MixGRPO-Flash appears in (Sheng et al., 12 Oct 2025), which focuses on the core MixGRPO approach. All reported algorithmic and empirical findings for MixGRPO-Flash derive from (Li et al., 29 Jul 2025).


References:

"MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE" (Li et al., 29 Jul 2025) "Understanding Sampler Stochasticity in Training Diffusion Models for RLHF" (Sheng et al., 12 Oct 2025)
