
Separable GRPO-Based Training (SepGRPO)

Updated 5 January 2026
  • SepGRPO is a multimodal RL methodology that alternates policy optimization between an MLLM and a DiT to jointly align the two modules, enabling effective Chain-of-Thought reasoning for visual generation.
  • It employs clipped surrogate objectives, KL penalties, and module-specific rewards to decouple the two training processes and ensure stable, scalable updates.
  • Empirical benchmarks on GenEval, WISE, CVTG, and RISEBench demonstrate significant improvements in reasoning generation, text rendering, and editing performance.

The separable GRPO-based training paradigm (SepGRPO) is a reinforcement learning methodology designed for joint alignment of Multimodal LLMs (MLLMs) and Diffusion Transformers (DiTs) within the ThinkGen framework. It employs an alternating policy-gradient approach via Group-Relative Policy Optimization (GRPO), utilizing clipped surrogate objectives with KL penalties and module-specific rewards. SepGRPO enables effective Chain-of-Thought (CoT) reasoning for general visual generation tasks, supporting flexible multi-scenario training while maintaining strict decoupling between the instruction-generating MLLM and the image-generating DiT modules (Jiao et al., 29 Dec 2025).

1. Mathematical Foundation and Policy Representation

SepGRPO formalizes joint RL as alternated optimization over two parameterized policies:

  • MLLM Policy $\pi_M(o \mid q; \theta_M)$:

Inputs $q$ (captions, or a reference image plus edit instructions) yield autoregressive “think-phase” token sequences $o = (o_1, \ldots, o_T)$ under parameters $\theta_M$.

  • DiT Policy $\pi_D(x_{0:T} \mid z, c; \theta_D)$:

Generates denoising trajectories $x_{0:T}$ from latent noise $z \sim \mathcal{N}(0, I)$ and conditional input $c$, extracted from post-</think> hidden states by VGI-Refine, under parameters $\theta_D$.

Rewards are module-specific (a minimal sketch follows the list):

  • $R_M(o) := R_{\text{rule}}(\mathrm{DiT}_{\text{generate}}(o))$: MLLM tokens are fed to a frozen DiT, and the resulting images are scored by scenario-based rule models (GenEval, HPSv3, OCR word accuracy, SigLIP2, NED).
  • $R_D(x_{0:T}) := R_{\text{rule}}(x_T)$: DiT rollouts are scored directly on the final sample while the MLLM is frozen.
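
The following Python sketch shows how these two reward paths could be wired up. It is an illustration under assumed interfaces, not the authors' implementation: `generate_with_frozen_dit`, `rule_model`, and the argument types are hypothetical placeholders.

from typing import Callable, List

def reward_mllm(cot_tokens, generate_with_frozen_dit: Callable, rule_model: Callable) -> float:
    # R_M(o): render the MLLM's think-phase output into an image with a frozen
    # DiT, then score that image with a scenario-specific rule model
    # (e.g. GenEval, HPSv3, OCR word accuracy, SigLIP2, NED).
    image = generate_with_frozen_dit(cot_tokens)
    return float(rule_model(image))

def reward_dit(trajectory: List, rule_model: Callable) -> float:
    # R_D(x_{0:T}): score only the final sample of a DiT denoising rollout,
    # with the MLLM held frozen during this phase.
    final_image = trajectory[-1]
    return float(rule_model(final_image))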

Advantage estimation employs group-relative normalization:

$$\hat{A}_i = \frac{R_i - \frac{1}{G}\sum_{j=1}^{G} R_j}{\sqrt{\frac{1}{G}\sum_{j=1}^{G}\left(R_j - \bar{R}\right)^2} + \epsilon_{\mathrm{norm}}}$$

GRPO applies clipped probability ratios $r_{i,t}(\theta)$ at token-step granularity:

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid o_{i,<t}, q)}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid o_{i,<t}, q)}$$

The surrogate objective is:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}\sim\pi_{\mathrm{old}}}\!\left[\frac{1}{\sum_i |o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\!\left(r_{i,t}(\theta)\hat{A}_i,\; \mathrm{clip}\!\left(r_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid q)\,\Vert\,\pi_{\mathrm{old}}(\cdot \mid q)\right)\right]$$
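
To make the normalization and the clipped surrogate concrete, here is a minimal PyTorch sketch. It assumes per-token log-probabilities have already been gathered for each rollout; the function names are illustrative, and the KL term uses one common per-token estimator rather than the paper's exact implementation.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps_norm: float = 1e-6) -> torch.Tensor:
    # Normalize a group of G scalar rewards: subtract the group mean and
    # divide by the population standard deviation plus a small epsilon.
    mean = rewards.mean()
    std = rewards.std(unbiased=False)
    return (rewards - mean) / (std + eps_norm)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, token_mask: torch.Tensor,
              clip_eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    # logp_new / logp_old: (G, T) per-token log-probs under current / old policy.
    # advantages: (G,) group-relative advantages; token_mask: (G, T), 1 for real tokens.
    ratio = torch.exp(logp_new - logp_old)                       # r_{i,t}(theta)
    adv = advantages.unsqueeze(1)                                # broadcast over tokens
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # One common per-token estimator of D_KL(pi_theta || pi_old) used with GRPO.
    log_ratio_old = logp_old - logp_new
    kl = torch.exp(log_ratio_old) - log_ratio_old - 1.0
    objective = ((surrogate - beta * kl) * token_mask).sum() / token_mask.sum()
    return -objective                                            # minimize the negated objective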

Parameter updates are blockwise decoupled:

$$\theta_M \leftarrow \theta_M + \alpha_M\, \mathbb{E}_{\tau_M \sim \pi_M}\!\left[\hat{A}_M(\tau_M)\, \nabla_{\theta_M} \log \pi_M(\tau_M; q, \theta_M)\right]$$

$$\theta_D \leftarrow \theta_D + \alpha_D\, \mathbb{E}_{\tau_D \sim \pi_D}\!\left[\hat{A}_D(\tau_D)\, \nabla_{\theta_D} \log \pi_D(\tau_D; z, c, \theta_D)\right]$$
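
In practice, this decoupling amounts to freezing one module's parameters while the other is updated. The sketch below shows that pattern in PyTorch; the module objects, optimizers, and loss closures are assumptions for illustration, and the gradient clip of 1.0 follows the hyperparameters quoted in Section 3.

import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    # Freeze or unfreeze every parameter of a module.
    for p in module.parameters():
        p.requires_grad_(trainable)

def alternate_update(mllm, dit, opt_mllm, opt_dit, mllm_loss_fn, dit_loss_fn):
    # One alternation: update theta_M with theta_D frozen, then the reverse.
    set_trainable(dit, False); set_trainable(mllm, True)     # MLLM-GRPO step
    opt_mllm.zero_grad()
    mllm_loss_fn().backward()
    torch.nn.utils.clip_grad_norm_(mllm.parameters(), 1.0)
    opt_mllm.step()

    set_trainable(mllm, False); set_trainable(dit, True)     # DiT-GRPO step
    opt_dit.zero_grad()
    dit_loss_fn().backward()
    torch.nn.utils.clip_grad_norm_(dit.parameters(), 1.0)
    opt_dit.step()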

2. Alternating Training Algorithm and Data Flow

SepGRPO alternates between module-specific RL epochs, as described by the following pseudocode:

for epoch in 1..N_epochs:
    # Stage 4: MLLM-GRPO
    freeze(θ_D)
    for batch in MLLM_scenarios_loader:
        inputs ← batch.prompts
        trajectories ← []
        for q in inputs:
            for i in 1..N1:
                o_i ← MLLM.sample_chain_of_thought(q; θ_M_old)
                trajectories.append((q, o_i))
        images ← DiT.generate_from_trajectories(trajectories, fixed_noise; θ_D)
        rewards ← rule_models.score(images, trajectories)
        advantages ← compute_group_relative_advantages(rewards)
        L_GRPO ← build_clipped_surrogate_loss(trajectories, advantages)
        θ_M ← θ_M − α_M ∇_{θ_M} L_GRPO

    # Stage 5: DiT-GRPO
    freeze(θ_M)
    for batch in DiT_scenarios_loader:
        inputs ← batch.prompts
        o ← MLLM.sample_chain_of_thought(inputs; θ_M)
        c ← VGI_refine_hidden_states(o)
        trajectories ← []
        for i in 1..N2:
            x_i ← DiT.sample_diffusion(c; θ_D_old)
            trajectories.append(x_i)
        rewards ← rule_models.score(trajectories)
        advantages ← compute_group_relative_advantages(rewards)
        L_GRPO ← build_clipped_surrogate_loss(trajectories, advantages)
        θ_D ← θ_D − α_D ∇_{θ_D} L_GRPO

Data flow distinctions:

  • MLLM-GRPO: User input prompt → autoregressive MLLM token generation up to </think>; VGI-Refine extracts hidden states; “Prepadding States” are prepended; the frozen DiT receives these as instructions for sampling images; rewards are computed; only $\theta_M$ is updated (a hedged sketch of this conditioning hand-off follows the list).
  • DiT-GRPO: The frozen MLLM provides the CoT rollout and VGI-Refine output; the DiT samples multiple denoising trajectories with fixed $\theta_M$; rewards are computed by rule models; only $\theta_D$ is updated.
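
The conditioning hand-off is the one step the pseudocode leaves implicit. The sketch below illustrates one plausible way post-</think> hidden states could be selected and prepended with "prepadding" states to form the DiT condition; the tensor shapes, the learned prepadding embedding, and the function name are assumptions for illustration, not the paper's VGI-Refine implementation.

import torch

def build_dit_condition(hidden_states: torch.Tensor,
                        token_ids: torch.Tensor,
                        think_end_id: int,
                        prepadding: torch.Tensor) -> torch.Tensor:
    # Illustrative conditioning hand-off (not the exact VGI-Refine module).
    # hidden_states: (T, d) final-layer MLLM hidden states for one rollout.
    # token_ids:     (T,) generated token ids.
    # think_end_id:  assumed id of the </think> token.
    # prepadding:    (P, d) learned "prepadding" states prepended to the condition.
    end_positions = (token_ids == think_end_id).nonzero(as_tuple=True)[0]
    start = int(end_positions[0]) + 1 if len(end_positions) > 0 else 0
    post_think = hidden_states[start:]                 # keep post-</think> states
    return torch.cat([prepadding, post_think], dim=0)  # condition c of shape (P + T', d)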

3. Training Regimen and Hyperparameter Specification

Rollout counts:

  • $N_1$ (MLLM-GRPO): 8
  • $N_2$ (DiT-GRPO): 24

Clipping and KL penalty:

  • $\varepsilon$ (clip): 0.2
  • $\beta$ (KL penalty): $\approx 0.01$

Learning rates:

  • $\alpha_M$ (MLLM-GRPO): $1 \times 10^{-5}$ (lower than in supervised training)
  • $\alpha_D$ (DiT-GRPO): $5 \times 10^{-5}$

Batch structure:

  • MLLM batches: 32 prompts × $N_1$ rollouts
  • DiT batches: 16 instructions × $N_2$ rollouts
  • Optimizer: AdamW (no weight decay, gradient clipping at 1.0)

Datasets:

  • MLLM-GRPO: 5K GenEval, 10K reasoning, 3K rendering, 3K editing, 3K reflection
  • DiT-GRPO: Simple-Scene (GenEval), Text-Rendering (CVTG), $\approx 6$K samples

Sampling efficiency:

  • Denoising steps reduced to 20 per sample at $512 \times 512$ px
  • CFG $= 4$ for the initial 60% of steps
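
For reference, these reported settings can be collected into a single configuration object. The field names in this Python sketch are illustrative; the values are the ones quoted above.

from dataclasses import dataclass

@dataclass
class SepGRPOConfig:
    # Rollout counts
    n_rollouts_mllm: int = 8          # N_1, MLLM-GRPO
    n_rollouts_dit: int = 24          # N_2, DiT-GRPO
    # Clipping and KL penalty
    clip_eps: float = 0.2             # epsilon
    kl_beta: float = 0.01             # beta (approximate)
    # Learning rates
    lr_mllm: float = 1e-5             # alpha_M
    lr_dit: float = 5e-5              # alpha_D
    # Batch structure (AdamW, no weight decay)
    mllm_prompts_per_batch: int = 32
    dit_instructions_per_batch: int = 16
    grad_clip_norm: float = 1.0
    # Sampling efficiency
    denoise_steps: int = 20           # at 512 x 512 px
    cfg_scale: float = 4.0            # applied to the first 60% of steps
    cfg_fraction: float = 0.6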

4. Theoretical Analysis and Convergence

Blockwise coordinate ascent in SepGRPO, under standard smoothness and bounded-reward assumptions, ensures monotonic improvement in module-level reward objectives. The KL-penalized surrogate objectives

$$J_M(\theta_M; \theta_D), \qquad J_D(\theta_D; \theta_M)$$

are maximized alternately, and the sequence of updates converges to a stationary point of

$$J(\theta_M, \theta_D) = J_M(\theta_M; \theta_D) + J_D(\theta_D; \theta_M),$$

as per classical blockwise coordinate ascent theory [Nocedal & Wright, 2006].
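
For concreteness, the monotonicity behind this claim can be written as one chain of inequalities. This is the generic coordinate-ascent argument, stated under the assumption that each half-step does not decrease the joint objective over its own block, rather than a result quoted from the paper.

% Assumption: the theta_M half-step (theta_D fixed) and the theta_D half-step
% (theta_M fixed) each do not decrease the joint objective J. Then one
% alternation of SepGRPO is monotone:
J\bigl(\theta_M^{(k)},\, \theta_D^{(k)}\bigr)
\;\le\; J\bigl(\theta_M^{(k+1)},\, \theta_D^{(k)}\bigr)
\;\le\; J\bigl(\theta_M^{(k+1)},\, \theta_D^{(k+1)}\bigr).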

Advantages of the separable paradigm:

  1. Module-local rewards: Each policy is optimized using tailored reward signals without cross-module gradient entanglement.
  2. Lower memory footprint: Only a single module’s rollout graph is instantiated during its respective update.
  3. Variance reduction: Sampling distributions remain module-specific (token generation for MLLM; latent denoising for DiT), resulting in more stable gradient estimates.

A plausible implication is that such strict separation mitigates delayed reward propagation and credit assignment difficulties, common in end-to-end multimodal RL pipelines.

5. Empirical Evaluation and Benchmark Performance

Ablation results across GenEval, WISE, CVTG, and RISEBench demonstrate the quantitative benefits of SepGRPO. Summary table:

Task      | w/o RL | +MLLM-GRPO* | +DiT-GRPO*
GenEval   | 0.88   | 0.86        | 0.89
WISE      | 0.55   | 0.76        | 0.76
CVTG      | 0.75   | 0.79        | 0.84
RISEBench | 3.6    | 13.0        | 13.0

(* denotes CoT reasoning enabled)

Notable metrics:

  • WISE (reasoning generation): +MLLM-GRPO achieves 0.76 vs. 0.55 (w/o RL), an absolute gain of 0.21.
  • RISEBench (reasoning editing): SepGRPO CoT yields 13.0 average vs. 3.6 (w/o CoT), a gain of 9.4.
  • CVTG text rendering accuracy: Stage 3: 0.75; +MLLM-GRPO: 0.79; +DiT-GRPO: 0.84.

This suggests consistent advantages in reasoning, text rendering, semantic alignment, and editing over supervised-only or non-CoT RL training.

6. Context and Research Significance

SepGRPO represents a principled approach for multimodal generative model alignment under RL, addressing limitations of scenario-specific or coupled update strategies. Its blockwise policy optimization enhances the generalizability of CoT reasoning for both instruction and image generation, validated across multiple diverse datasets and benchmarks.

A plausible implication is that such a modular RL training regime can facilitate scalable adaptation to novel generation scenarios, lower computational overhead during training, and enable more interpretable module-specific improvements (Jiao et al., 29 Dec 2025).
