
Separable GRPO-Based Training (SepGRPO)

Updated 5 January 2026
  • SepGRPO is a multimodal RL methodology that alternates policy optimization between an MLLM and a DiT to jointly align the two modules, enabling effective Chain-of-Thought reasoning for visual generation.
  • It employs clipped surrogate objectives, KL penalties, and module-specific rewards to decouple the two training processes and ensure stable, scalable updates.
  • Empirical benchmarks on GenEval, WISE, CVTG, and RISEBench demonstrate significant improvements in reasoning generation, text rendering, and editing performance.

The separable GRPO-based training paradigm (SepGRPO) is a reinforcement learning methodology designed for joint alignment of Multimodal LLMs (MLLMs) and Diffusion Transformers (DiTs) within the ThinkGen framework. It employs an alternating policy-gradient approach via Group-Relative Policy Optimization (GRPO), utilizing clipped surrogate objectives with KL penalties and module-specific rewards. SepGRPO enables effective Chain-of-Thought (CoT) reasoning for general visual generation tasks, supporting flexible multi-scenario training while maintaining strict decoupling between the instruction-generating MLLM and the image-generating DiT modules (Jiao et al., 29 Dec 2025).

1. Mathematical Foundation and Policy Representation

SepGRPO formalizes joint RL as alternated optimization over two parameterized policies:

  • MLLM Policy $\pi_M(o \mid q; \theta_M)$:

Inputs $q$ (captions, or a reference image plus edit instructions) yield autoregressive “think-phase” token sequences $o = (o_1, \ldots, o_T)$ under parameters $\theta_M$.

  • DiT Policy $\pi_D(x_{0:T} \mid z, c; \theta_D)$:

Generates denoising trajectories $x_{0:T}$ from latent noise $z \sim \mathcal{N}(0, I)$ and conditional input $c$, extracted from post-</think> hidden states by VGI-Refine, under parameters $\theta_D$.

Rewards are module-specific (a minimal sketch follows the list):

  • $R_M(o) := R_{\text{rule}}(\mathrm{DiT}_{\text{generate}}(o))$: MLLM tokens are fed to a frozen DiT, and the resulting images are scored by scenario-based rule models (GenEval, HPSv3, OCR word accuracy, SigLIP2, NED).
  • $R_D(x_{0:T}) := R_{\text{rule}}(x_T)$: DiT rollouts are scored directly on the final sample while the MLLM is frozen.
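
The following Python sketch shows how these two reward paths could be wired up. It is an illustration under assumed interfaces, not the authors' implementation: `generate_with_frozen_dit`, `rule_model`, and the argument types are hypothetical placeholders.

from typing import Callable, List

def reward_mllm(cot_tokens, generate_with_frozen_dit: Callable, rule_model: Callable) -> float:
    # R_M(o): render the MLLM's think-phase output into an image with a frozen
    # DiT, then score that image with a scenario-specific rule model
    # (e.g. GenEval, HPSv3, OCR word accuracy, SigLIP2, NED).
    image = generate_with_frozen_dit(cot_tokens)
    return float(rule_model(image))

def reward_dit(trajectory: List, rule_model: Callable) -> float:
    # R_D(x_{0:T}): score only the final sample of a DiT denoising rollout,
    # with the MLLM held frozen during this phase.
    final_image = trajectory[-1]
    return float(rule_model(final_image))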

Advantage estimation employs group-relative normalization:

$$\hat{A}_i = \frac{R_i - \frac{1}{G}\sum_{j=1}^{G} R_j}{\sqrt{\frac{1}{G}\sum_{j=1}^{G}\left(R_j - \bar{R}\right)^2} + \epsilon_{\mathrm{norm}}}$$

GRPO applies clipped probability ratios $r_{i,t}(\theta)$ at token-step granularity:

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid o_{i,<t}, q)}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid o_{i,<t}, q)}$$

The surrogate objective is:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}\sim\pi_{\mathrm{old}}}\!\left[\frac{1}{\sum_i |o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\!\left(r_{i,t}(\theta)\hat{A}_i,\; \mathrm{clip}\!\left(r_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid q)\,\Vert\,\pi_{\mathrm{old}}(\cdot \mid q)\right)\right]$$
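
To make the normalization and the clipped surrogate concrete, here is a minimal PyTorch sketch. It assumes per-token log-probabilities have already been gathered for each rollout; the function names are illustrative, and the KL term uses one common per-token estimator rather than the paper's exact implementation.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps_norm: float = 1e-6) -> torch.Tensor:
    # Normalize a group of G scalar rewards: subtract the group mean and
    # divide by the population standard deviation plus a small epsilon.
    mean = rewards.mean()
    std = rewards.std(unbiased=False)
    return (rewards - mean) / (std + eps_norm)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, token_mask: torch.Tensor,
              clip_eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    # logp_new / logp_old: (G, T) per-token log-probs under current / old policy.
    # advantages: (G,) group-relative advantages; token_mask: (G, T), 1 for real tokens.
    ratio = torch.exp(logp_new - logp_old)                       # r_{i,t}(theta)
    adv = advantages.unsqueeze(1)                                # broadcast over tokens
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # One common per-token estimator of D_KL(pi_theta || pi_old) used with GRPO.
    log_ratio_old = logp_old - logp_new
    kl = torch.exp(log_ratio_old) - log_ratio_old - 1.0
    objective = ((surrogate - beta * kl) * token_mask).sum() / token_mask.sum()
    return -objective                                            # minimize the negated objective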

Parameter updates are blockwise decoupled:

$$\theta_M \leftarrow \theta_M + \alpha_M\, \mathbb{E}_{\tau_M \sim \pi_M}\!\left[\hat{A}_M(\tau_M)\, \nabla_{\theta_M} \log \pi_M(\tau_M; q, \theta_M)\right]$$

$$\theta_D \leftarrow \theta_D + \alpha_D\, \mathbb{E}_{\tau_D \sim \pi_D}\!\left[\hat{A}_D(\tau_D)\, \nabla_{\theta_D} \log \pi_D(\tau_D; z, c, \theta_D)\right]$$
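
In practice, this decoupling amounts to freezing one module's parameters while the other is updated. The sketch below shows that pattern in PyTorch; the module objects, optimizers, and loss closures are assumptions for illustration, and the gradient clip of 1.0 follows the hyperparameters quoted in Section 3.

import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    # Freeze or unfreeze every parameter of a module.
    for p in module.parameters():
        p.requires_grad_(trainable)

def alternate_update(mllm, dit, opt_mllm, opt_dit, mllm_loss_fn, dit_loss_fn):
    # One alternation: update theta_M with theta_D frozen, then the reverse.
    set_trainable(dit, False); set_trainable(mllm, True)     # MLLM-GRPO step
    opt_mllm.zero_grad()
    mllm_loss_fn().backward()
    torch.nn.utils.clip_grad_norm_(mllm.parameters(), 1.0)
    opt_mllm.step()

    set_trainable(mllm, False); set_trainable(dit, True)     # DiT-GRPO step
    opt_dit.zero_grad()
    dit_loss_fn().backward()
    torch.nn.utils.clip_grad_norm_(dit.parameters(), 1.0)
    opt_dit.step()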

2. Alternating Training Algorithm and Data Flow

SepGRPO alternates between module-specific RL epochs, as described by the following pseudocode:

for epoch in 1..N_epochs:
    # Stage 4: MLLM-GRPO
    freeze(θ_D)
    for batch in MLLM_scenarios_loader:
        inputs ← batch.prompts
        trajectories ← []
        for q in inputs:
            for i in 1..N1:
                o_i ← MLLM.sample_chain_of_thought(q; θ_M_old)
                trajectories.append((q, o_i))
        images ← DiT.generate_from_trajectories(trajectories, fixed_noise; θ_D)
        rewards ← rule_models.score(images, trajectories)
        advantages ← compute_group_relative_advantages(rewards)
        L_GRPO ← build_clipped_surrogate_loss(trajectories, advantages)
        θ_M ← θ_M − α_M ∇_{θ_M} L_GRPO

    # Stage 5: DiT-GRPO
    freeze(θ_M)
    for batch in DiT_scenarios_loader:
        inputs ← batch.prompts
        o ← MLLM.sample_chain_of_thought(inputs; θ_M)
        c ← VGI_refine_hidden_states(o)
        trajectories ← []
        for i in 1..N2:
            x_i ← DiT.sample_diffusion(c; θ_D_old)
            trajectories.append(x_i)
        rewards ← rule_models.score(trajectories)
        advantages ← compute_group_relative_advantages(rewards)
        L_GRPO ← build_clipped_surrogate_loss(trajectories, advantages)
        θ_D ← θ_D − α_D ∇_{θ_D} L_GRPO

Data flow distinctions:

  • MLLM-GRPO: User input prompt → autoregressive MLLM token generation up to </think>; VGI-Refine extracts hidden states; “Prepadding States” are prepended; the frozen DiT receives these as instructions for sampling images; rewards are computed; only $\theta_M$ is updated (a hedged sketch of this conditioning hand-off follows the list).
  • DiT-GRPO: The frozen MLLM provides the CoT rollout and VGI-Refine output; the DiT samples multiple denoising trajectories with fixed $\theta_M$; rewards are computed by rule models; only $\theta_D$ is updated.
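
The conditioning hand-off is the one step the pseudocode leaves implicit. The sketch below illustrates one plausible way post-</think> hidden states could be selected and prepended with "prepadding" states to form the DiT condition; the tensor shapes, the learned prepadding embedding, and the function name are assumptions for illustration, not the paper's VGI-Refine implementation.

import torch

def build_dit_condition(hidden_states: torch.Tensor,
                        token_ids: torch.Tensor,
                        think_end_id: int,
                        prepadding: torch.Tensor) -> torch.Tensor:
    # Illustrative conditioning hand-off (not the exact VGI-Refine module).
    # hidden_states: (T, d) final-layer MLLM hidden states for one rollout.
    # token_ids:     (T,) generated token ids.
    # think_end_id:  assumed id of the </think> token.
    # prepadding:    (P, d) learned "prepadding" states prepended to the condition.
    end_positions = (token_ids == think_end_id).nonzero(as_tuple=True)[0]
    start = int(end_positions[0]) + 1 if len(end_positions) > 0 else 0
    post_think = hidden_states[start:]                 # keep post-</think> states
    return torch.cat([prepadding, post_think], dim=0)  # condition c of shape (P + T', d)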

3. Training Regimen and Hyperparameter Specification

Rollout counts:

  • $N_1$ (MLLM-GRPO): 8
  • $N_2$ (DiT-GRPO): 24

Clipping and KL penalty:

  • $\varepsilon$ (clip): 0.2
  • $\beta$ (KL penalty): $\approx 0.01$

Learning rates:

  • $\alpha_M$ (MLLM-GRPO): $1 \times 10^{-5}$ (lower than in supervised training)
  • $\alpha_D$ (DiT-GRPO): $5 \times 10^{-5}$

Batch structure:

  • MLLM batches: 32 prompts × $N_1$ rollouts
  • DiT batches: 16 instructions × $N_2$ rollouts
  • Optimizer: AdamW (no weight decay, gradient clipping at 1.0)

Datasets:

  • MLLM-GRPO: 5K GenEval, 10K reasoning, 3K rendering, 3K editing, 3K reflection
  • DiT-GRPO: Simple-Scene (GenEval), Text-Rendering (CVTG), $\approx 6$K samples

Sampling efficiency:

  • Denoising steps reduced to 20 per sample at $512 \times 512$ px
  • CFG $= 4$ for the initial 60% of steps
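
For reference, these reported settings can be collected into a single configuration object. The field names in this Python sketch are illustrative; the values are the ones quoted above.

from dataclasses import dataclass

@dataclass
class SepGRPOConfig:
    # Rollout counts
    n_rollouts_mllm: int = 8          # N_1, MLLM-GRPO
    n_rollouts_dit: int = 24          # N_2, DiT-GRPO
    # Clipping and KL penalty
    clip_eps: float = 0.2             # epsilon
    kl_beta: float = 0.01             # beta (approximate)
    # Learning rates
    lr_mllm: float = 1e-5             # alpha_M
    lr_dit: float = 5e-5              # alpha_D
    # Batch structure (AdamW, no weight decay)
    mllm_prompts_per_batch: int = 32
    dit_instructions_per_batch: int = 16
    grad_clip_norm: float = 1.0
    # Sampling efficiency
    denoise_steps: int = 20           # at 512 x 512 px
    cfg_scale: float = 4.0            # applied to the first 60% of steps
    cfg_fraction: float = 0.6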

4. Theoretical Analysis and Convergence

Blockwise coordinate ascent in SepGRPO, under standard smoothness and bounded-reward assumptions, ensures monotonic improvement in module-level reward objectives. The KL-penalized surrogate objectives

$$J_M(\theta_M; \theta_D), \qquad J_D(\theta_D; \theta_M)$$

are maximized alternately, and the sequence of updates converges to a stationary point of

$$J(\theta_M, \theta_D) = J_M(\theta_M; \theta_D) + J_D(\theta_D; \theta_M),$$

as per classical blockwise coordinate ascent theory [Nocedal & Wright, 2006].
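
For concreteness, the monotonicity behind this claim can be written as one chain of inequalities. This is the generic coordinate-ascent argument, stated under the assumption that each half-step does not decrease the joint objective over its own block, rather than a result quoted from the paper.

% Assumption: the theta_M half-step (theta_D fixed) and the theta_D half-step
% (theta_M fixed) each do not decrease the joint objective J. Then one
% alternation of SepGRPO is monotone:
J\bigl(\theta_M^{(k)},\, \theta_D^{(k)}\bigr)
\;\le\; J\bigl(\theta_M^{(k+1)},\, \theta_D^{(k)}\bigr)
\;\le\; J\bigl(\theta_M^{(k+1)},\, \theta_D^{(k+1)}\bigr).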

Advantages of the separable paradigm:

  1. Module-local rewards: Each policy is optimized using tailored reward signals without cross-module gradient entanglement.
  2. Lower memory footprint: Only a single module’s rollout graph is instantiated during its respective update.
  3. Variance reduction: Sampling distributions remain module-specific (token generation for MLLM; latent denoising for DiT), resulting in more stable gradient estimates.

A plausible implication is that such strict separation mitigates delayed reward propagation and credit assignment difficulties, common in end-to-end multimodal RL pipelines.

5. Empirical Evaluation and Benchmark Performance

Ablation results across GenEval, WISE, CVTG, and RISEBench demonstrate the quantitative benefits of SepGRPO. Summary table:

Task      | w/o RL | +MLLM-GRPO* | +DiT-GRPO*
GenEval   | 0.88   | 0.86        | 0.89
WISE      | 0.55   | 0.76        | 0.76
CVTG      | 0.75   | 0.79        | 0.84
RISEBench | 3.6    | 13.0        | 13.0

(* denotes CoT reasoning enabled)

Notable metrics:

  • WISE (reasoning generation): +MLLM-GRPO achieves 0.76 vs. 0.55 (w/o RL), an absolute gain of 0.21.
  • RISEBench (reasoning editing): SepGRPO CoT yields 13.0 average vs. 3.6 (w/o CoT), a gain of 9.4.
  • CVTG text rendering accuracy: Stage 3: 0.75; +MLLM-GRPO: 0.79; +DiT-GRPO: 0.84.

This suggests consistent advantages in reasoning, text rendering, semantic alignment, and editing over supervised-only or non-CoT RL training.

6. Context and Research Significance

SepGRPO represents a principled approach for multimodal generative model alignment under RL, addressing limitations of scenario-specific or coupled update strategies. Its blockwise policy optimization enhances the generalizability of CoT reasoning for both instruction and image generation, validated across multiple diverse datasets and benchmarks.

A plausible implication is that such a modular RL training regime can facilitate scalable adaptation to novel generation scenarios, lower computational overhead during training, and enable more interpretable module-specific improvements (Jiao et al., 29 Dec 2025).
