Separable GRPO-Based Training (SepGRPO)
- SepGRPO is a multimodal RL methodology that alternates policy optimization between an MLLM and a DiT to jointly align the two modules, enabling effective Chain-of-Thought reasoning for visual generation.
- It employs clipped surrogate objectives, KL penalties, and module-specific rewards to decouple training processes and ensure stable, scalable updates.
- Empirical benchmarks from GenEval, WISE, and RISEBench demonstrate significant improvements in reasoning generation and visual rendering performance.
The separable GRPO-based training paradigm (SepGRPO) is a reinforcement learning methodology designed for joint alignment of Multimodal LLMs (MLLMs) and Diffusion Transformers (DiTs) within the ThinkGen framework. It employs an alternating policy-gradient approach via Group-Relative Policy Optimization (GRPO), utilizing clipped surrogate objectives with KL penalties and module-specific rewards. SepGRPO enables effective Chain-of-Thought (CoT) reasoning for general visual generation tasks, supporting flexible multi-scenario training while maintaining strict decoupling between the instruction-generating MLLM and the image-generating DiT modules (Jiao et al., 29 Dec 2025).
1. Mathematical Foundation and Policy Representation
SepGRPO formalizes joint RL as alternated optimization over two parameterized policies:
- MLLM Policy $\pi_{\theta_M}(o \mid q)$:
Inputs $q$ (captions, or a reference image plus edit instructions) yield autoregressive “think-phase” token sequences $o$ under parameters $\theta_M$.
- DiT Policy $\pi_{\theta_D}(x_{0:T} \mid c)$:
Generates denoising trajectories $x_{0:T}$ from latent noise $x_T$; the conditional input $c$ is extracted from post-</think> hidden states by VGI-Refine; parameters $\theta_D$.
Rewards are module-specific:
- $r_M$: MLLM think-phase tokens are fed to a frozen DiT, and the resulting images are scored by scenario-based reward models (GenEval, HPSv3, OCR word-accuracy, SigLIP2, NED); a routing sketch follows this list.
- $r_D$: DiT rollouts are scored directly by the same rule models while the MLLM is frozen.
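A minimal sketch of scenario-based reward routing, assuming placeholder scorer callables (the actual GenEval, HPSv3, OCR, SigLIP2, and NED scorers are not reproduced here):

```python
from typing import Callable, Dict, List

# Placeholder scorer type: takes generated images and the prompts/trajectories
# that produced them, and returns one scalar reward per sample.
Scorer = Callable[[List[object], List[str]], List[float]]

def route_rewards(scenario: str, scorers: Dict[str, Scorer],
                  images: List[object], prompts: List[str]) -> List[float]:
    """Dispatch to the scenario-specific rule/reward model, e.g. 'geneval',
    'text_rendering' (OCR word-accuracy), or 'editing' (SigLIP2 / NED)."""
    return scorers[scenario](images, prompts)
```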
Advantage estimation employs group-relative normalization over each group of $G$ rollouts:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{r_j\}_{j=1}^{G}\big)}$$

GRPO applies clipped probability ratios at token-step granularity:

$$\rho_{i,t}(\theta) = \frac{\pi_\theta\big(o_{i,t} \mid q,\, o_{i,<t}\big)}{\pi_{\theta_{\mathrm{old}}}\big(o_{i,t} \mid q,\, o_{i,<t}\big)}$$

The surrogate objective is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\big(\rho_{i,t}\hat{A}_i,\ \mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$$

Parameter updates are blockwise decoupled: $\theta_M$ is updated by ascent on $\mathcal{J}_{\mathrm{GRPO}}^{M}$ with $\theta_D$ frozen, and $\theta_D$ by ascent on $\mathcal{J}_{\mathrm{GRPO}}^{D}$ with $\theta_M$ frozen:

$$\theta_M \leftarrow \theta_M + \alpha_M \nabla_{\theta_M}\mathcal{J}_{\mathrm{GRPO}}^{M}(\theta_M), \qquad \theta_D \leftarrow \theta_D + \alpha_D \nabla_{\theta_D}\mathcal{J}_{\mathrm{GRPO}}^{D}(\theta_D)$$
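The group-relative advantage and token-level clipped surrogate can be sketched as follows in PyTorch; tensor names, shapes, and the optional precomputed KL term are illustrative assumptions rather than the reference implementation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) scalar rewards for one group of rollouts sharing a prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate_loss(logp_new, logp_old, advantages, token_mask,
                        clip_eps: float = 0.2, kl_beta: float = 0.0, kl_to_ref=None):
    """
    logp_new, logp_old: (G, T) per-token log-probs under current / rollout policy.
    advantages:         (G,)  group-relative advantages, broadcast over tokens.
    token_mask:         (G, T) 1 for generated tokens, 0 for padding.
    Returns a scalar loss (negative clipped surrogate plus optional KL penalty).
    """
    ratio = torch.exp(logp_new - logp_old)                        # rho_{i,t}
    adv = advantages.unsqueeze(-1)                                # (G, 1)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = torch.min(unclipped, clipped) * token_mask
    # Mean over valid tokens per rollout, then mean over the group.
    surrogate = (per_token.sum(-1) / token_mask.sum(-1).clamp(min=1)).mean()
    loss = -surrogate
    if kl_to_ref is not None:                                     # optional KL penalty
        loss = loss + kl_beta * kl_to_ref
    return loss
```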
2. Alternating Training Algorithm and Data Flow
SepGRPO alternates between module-specific RL epochs, described as:
```
for epoch in 1…N_epochs:
    # Stage 4: MLLM-GRPO
    freeze(θ_D)
    for batch in MLLM_scenarios_loader:
        inputs ← batch.prompts
        trajectories ← []
        for q in inputs:
            for i in 1…N1:
                o_i ← MLLM.sample_chain_of_thought(q; θ_M_old)
                trajectories.append((q, o_i))
        images ← DiT.generate_from_trajectories(trajectories, fixed_noise; θ_D)
        rewards ← rule_models.score(images, trajectories)
        advantages ← compute_group_relative_advantages(rewards)
        L_GRPO ← build_clipped_surrogate_loss(trajectories, advantages)
        θ_M ← θ_M − α_M ∇_{θ_M} L_GRPO

    # Stage 5: DiT-GRPO
    freeze(θ_M)
    for batch in DiT_scenarios_loader:
        inputs ← batch.prompts
        o ← MLLM.sample_chain_of_thought(inputs; θ_M)
        c ← VGI_refine_hidden_states(o)
        trajectories ← []
        for i in 1…N2:
            x_i ← DiT.sample_diffusion(c; θ_D_old)
            trajectories.append(x_i)
        rewards ← rule_models.score(trajectories)
        advantages ← compute_group_relative_advantages(rewards)
        L_GRPO ← build_clipped_surrogate_loss(trajectories, advantages)
        θ_D ← θ_D − α_D ∇_{θ_D} L_GRPO
```
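As a concrete illustration of the blockwise decoupling, the following sketch toggles requires_grad so that only the active module receives gradients during its stage; module, optimizer, and loss names are assumptions for illustration:

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)

def run_stage(train_module, frozen_module, optimizer, compute_loss, batches):
    """Generic SepGRPO stage: update one module while the other stays frozen."""
    set_trainable(train_module, True)
    set_trainable(frozen_module, False)
    for batch in batches:
        loss = compute_loss(batch)          # e.g. grpo_surrogate_loss(...)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(train_module.parameters(), max_norm=1.0)
        optimizer.step()
```

In the MLLM stage, train_module would be the MLLM and frozen_module the DiT; the roles swap in the DiT stage.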
Data flow distinctions:
- MLLM-GRPO: user inputs prompt autoregressive MLLM token generation up to </think>; VGI-Refine extracts the post-</think> hidden states and “Prepadding States” are prepended (a hidden-state extraction sketch follows this list); the frozen DiT receives these as instructions for sampling images; rewards are computed; only $\theta_M$ is updated.
- DiT-GRPO: the frozen MLLM provides the CoT rollout and VGI-Refine output; the DiT samples multiple denoising trajectories with the conditioning $c$ fixed; rewards are computed by rule models; only $\theta_D$ is updated.
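A minimal sketch of the post-</think> hidden-state extraction and prepadding step, under assumed tensor layouts (the actual VGI-Refine module is not reproduced here):

```python
import torch

def extract_condition(hidden_states: torch.Tensor,
                      token_ids: torch.Tensor,
                      think_end_id: int,
                      prepad_states: torch.Tensor) -> torch.Tensor:
    """
    hidden_states: (T, D) last-layer MLLM hidden states for one rollout.
    token_ids:     (T,)  generated token ids.
    think_end_id:  id of the </think> token.
    prepad_states: (P, D) learned "Prepadding States" to prepend.
    Returns the (P + T_post, D) conditioning sequence handed to the DiT.
    """
    end_positions = (token_ids == think_end_id).nonzero(as_tuple=True)[0]
    start = int(end_positions[0]) + 1 if len(end_positions) > 0 else 0
    post_think = hidden_states[start:]            # states after </think>
    return torch.cat([prepad_states, post_think], dim=0)
```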
3. Training Regimen and Hyperparameter Specification
Rollout counts:
- $N_1$ (MLLM-GRPO): 8
- $N_2$ (DiT-GRPO): 24
Clipping and KL-Penalty:
- $\epsilon$ (clip): 0.2
- $\beta$ (KL-penalty):
Learning rates:
- $\alpha_M$ (MLLM-GRPO): (lower than the supervised-stage rate)
- $\alpha_D$ (DiT-GRPO):
Batch structure:
- MLLM batches: 32 prompts × $N_1 = 8$ rollouts each
- DiT batches: 16 instructions × $N_2 = 24$ rollouts each
- Optimizer: AdamW (no weight decay, gradient clip 1.0)
Datasets:
- MLLM-GRPO: 5K GenEval, 10K reasoning, 3K rendering, 3K editing, 3K reflection
- DiT-GRPO: Simple-Scene (GenEval), Text-Rendering (CVTG), K samples
Sampling efficiency:
- Denoising steps reduced to 20 per sample at a fixed output resolution
- CFG applied only for the initial denoising steps (a consolidated configuration sketch follows)
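For reference, a minimal configuration sketch collecting the hyperparameters reported above; field names are illustrative, and values not preserved in the text are left as None:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SepGRPOConfig:
    # Rollout counts per prompt / instruction
    n_rollouts_mllm: int = 8          # N1, MLLM-GRPO
    n_rollouts_dit: int = 24          # N2, DiT-GRPO
    # Surrogate-objective hyperparameters
    clip_eps: float = 0.2             # epsilon in the clipped ratio
    kl_beta: Optional[float] = None   # KL-penalty coefficient (not preserved in the text)
    # Optimization
    lr_mllm: Optional[float] = None   # alpha_M, lower than the supervised-stage rate
    lr_dit: Optional[float] = None    # alpha_D (not preserved in the text)
    grad_clip: float = 1.0            # AdamW, no weight decay
    # Batch structure
    mllm_batch_prompts: int = 32
    dit_batch_instructions: int = 16
    # Sampling efficiency
    denoise_steps: int = 20
```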
4. Theoretical Analysis and Convergence
Blockwise coordinate ascent in SepGRPO, under standard smoothness and bounded-reward assumptions, ensures monotonic improvement in the module-level reward objectives. The KL-penalized surrogate objectives are maximized alternately:

$$\theta_M^{(k+1)} = \arg\max_{\theta_M}\, \mathcal{J}_{\mathrm{GRPO}}^{M}\bigl(\theta_M, \theta_D^{(k)}\bigr), \qquad \theta_D^{(k+1)} = \arg\max_{\theta_D}\, \mathcal{J}_{\mathrm{GRPO}}^{D}\bigl(\theta_M^{(k+1)}, \theta_D\bigr)$$

The sequence of updates converges to a stationary point of the combined objective $\mathcal{J}(\theta_M, \theta_D)$, as per classical blockwise coordinate ascent theory [Nocedal & Wright, 2006].
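The monotonicity argument can be made explicit; a minimal sketch, assuming each block update does not decrease its own surrogate and rewards are bounded:

```latex
% Monotone improvement of the combined objective J under alternating block
% updates: MLLM step with theta_D fixed, then DiT step with theta_M fixed.
% Requires \usepackage{amsmath}.
\begin{align*}
\mathcal{J}\bigl(\theta_M^{(k+1)}, \theta_D^{(k)}\bigr)
  &\ge \mathcal{J}\bigl(\theta_M^{(k)}, \theta_D^{(k)}\bigr)
  && \text{(MLLM-GRPO ascent step)} \\
\mathcal{J}\bigl(\theta_M^{(k+1)}, \theta_D^{(k+1)}\bigr)
  &\ge \mathcal{J}\bigl(\theta_M^{(k+1)}, \theta_D^{(k)}\bigr)
  && \text{(DiT-GRPO ascent step)}
\end{align*}
```

The sequence $\mathcal{J}(\theta_M^{(k)}, \theta_D^{(k)})$ is therefore non-decreasing and, with bounded rewards, bounded above, hence convergent; convergence of the iterates to a stationary point additionally relies on the smoothness assumptions stated above.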
Advantages of the separable paradigm:
- Module-local rewards: Each policy is optimized using tailored reward signals without cross-module gradient entanglement.
- Lower memory footprint: Only a single module’s rollout graph is instantiated during its respective update.
- Variance reduction: Sampling distributions remain module-specific (token generation for MLLM; latent denoising for DiT), resulting in more stable gradient estimates.
A plausible implication is that such strict separation mitigates delayed reward propagation and credit assignment difficulties, common in end-to-end multimodal RL pipelines.
5. Empirical Evaluation and Benchmark Performance
Ablation results across GenEval, WISE, CVTG, and RISEBench demonstrate the quantitative benefits of SepGRPO. Summary table:
| Task | w/o RL | +MLLM-GRPO* | +DiT-GRPO* |
|---|---|---|---|
| GenEval | 0.88 | 0.86 | 0.89 |
| WISE | 0.55 | 0.76 | 0.76 |
| CVTG | 0.75 | 0.79 | 0.84 |
| RISEBench | 3.6 | 13.0 | 13.0 |
(* denotes CoT reasoning enabled)
Notable metrics:
- WISE (reasoning generation): +MLLM-GRPO achieves 0.76 vs. 0.55 (w/o RL), an absolute gain of 0.21.
- RISEBench (reasoning editing): SepGRPO CoT yields 13.0 average vs. 3.6 (w/o CoT), a gain of 9.4.
- CVTG text rendering accuracy: Stage 3: 0.75; +MLLM-GRPO: 0.79; +DiT-GRPO: 0.84.
This suggests consistent advantages in reasoning, text rendering, semantic alignment, and editing over supervised-only or non-CoT RL training.
6. Context and Research Significance
SepGRPO represents a principled approach for multimodal generative model alignment under RL, addressing limitations of scenario-specific or coupled update strategies. Its blockwise policy optimization enhances the generalizability of CoT reasoning for both instruction and image generation, validated across multiple diverse datasets and benchmarks.
A plausible implication is that such a modular RL training regime can facilitate scalable adaptation to novel generation scenarios, lower computational overhead during training, and enable more interpretable module-specific improvements (Jiao et al., 29 Dec 2025).