- The paper introduces a novel method that transforms deterministic ODE sampling into a stochastic SDE framework for effective RL exploration in text-to-image models.
- It employs a denoising reduction strategy that uses far fewer denoising steps when collecting training rollouts, speeding up training by over 4× while the full step count at inference preserves high image quality.
- Empirical evaluations demonstrate significant improvements in compositional image generation, visual text rendering, and human preference alignment against state-of-the-art benchmarks.
Flow-GRPO is a novel method that integrates online reinforcement learning (RL) with flow matching models, specifically targeting text-to-image (T2I) generation tasks. Flow matching models, like those used in Stable Diffusion 3 (SD3), are known for their high-quality image generation via deterministic Ordinary Differential Equation (ODE) solvers. However, they often struggle with complex compositional prompts and accurate text rendering. Furthermore, their deterministic nature and computationally intensive multi-step generation process pose significant challenges for applying online RL methods, which rely on stochastic sampling for exploration and efficient data collection.
To address these challenges, Flow-GRPO proposes two key strategies:
- ODE-to-SDE Conversion: This strategy transforms the deterministic ODE-based sampling process into an equivalent Stochastic Differential Equation (SDE). The conversion is crucial because it introduces stochasticity into the generation trajectory while mathematically preserving the marginal distributions of the original flow model. The SDE formulation allows stochastic sampling, which is essential for RL exploration, and it enables computation of the policy's transition probability $p_\theta(x_{t-1} \mid x_t, c)$, which is required for the probability ratios used in policy gradient algorithms like GRPO. Applying an Euler-Maruyama discretization to the transformed SDE yields the stochastic update rule (see the sketch after this list):

$$x_{t+\Delta t} = x_t + \left[v_\theta(x_t, t) + \frac{\sigma_t^2}{2t}\bigl(x_t + (1-t)\,v_\theta(x_t, t)\bigr)\right]\Delta t + \sigma_t \sqrt{|\Delta t|}\,\epsilon,$$

where $\epsilon \sim \mathcal{N}(0, I)$ injects the stochasticity and $\sigma_t$ controls the noise level (parameterized by a scale hyperparameter $a$). The policy $\pi_\theta(x_{t-1} \mid x_t, c)$ is an isotropic Gaussian distribution, allowing a closed-form computation of the KL divergence term used in the RL objective.
- Denoising Reduction: Online RL training requires generating many samples (trajectories) to estimate policy gradients. Generating high-quality images usually involves many denoising steps (e.g., 40+). Flow-GRPO finds that for collecting training data, significantly fewer denoising steps can be used (e.g., 10 steps) without degrading the final performance achieved after training. This substantially accelerates the data collection phase during online RL training, improving sampling efficiency by over 4× in experiments. The full number of denoising steps is still used during inference to maintain high image quality.
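As a concrete illustration of the two strategies above, here is a minimal PyTorch sketch of one stochastic Euler-Maruyama step under stated assumptions: `velocity_model` stands in for the flow model's velocity prediction, the $\sigma_t = a\sqrt{t/(1-t)}$ schedule is one choice consistent with a noise-level parameter $a$ (an assumption here, not taken from the paper's code), and the number of times this step is applied is exactly where Denoising Reduction plugs in. Names and signatures are illustrative.

```python
import torch

def sde_step(velocity_model, x_t, t, dt, a=0.7):
    """One stochastic Euler-Maruyama step of the ODE-to-SDE sampler (sketch).

    x_t: batched latents; t: current time strictly inside (0, 1);
    dt: signed step toward the data end of the trajectory; a: noise-level scale.
    """
    v = velocity_model(x_t, t)                    # predicted velocity v_theta(x_t, t)
    sigma_t = a * (t / (1.0 - t)) ** 0.5          # assumed noise schedule for the parameter a
    drift = v + (sigma_t ** 2) / (2.0 * t) * (x_t + (1.0 - t) * v)
    mean = x_t + drift * dt                       # Gaussian mean of pi_theta(x_{t-1} | x_t, c)
    std = sigma_t * abs(dt) ** 0.5                # isotropic Gaussian standard deviation
    x_next = mean + std * torch.randn_like(x_t)   # stochasticity that enables RL exploration
    # Per-sample log-prob of the transition, needed for GRPO's probability ratios
    log_prob = torch.distributions.Normal(mean, std).log_prob(x_next)
    log_prob = log_prob.flatten(1).sum(dim=1)
    return x_next, log_prob
```

During RL data collection this step would be iterated for only a handful of timesteps (e.g., 10), while inference keeps the full 40+ step schedule for image quality.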
Flow-GRPO utilizes the Group Relative Policy Optimization (GRPO) algorithm, a memory-efficient alternative to PPO that does not require training a separate value network. GRPO estimates the advantage of a generated sample by comparing its reward to the mean reward of a group of samples generated from the same prompt. The training objective is a clipped policy gradient with a KL divergence penalty between the current policy and the reference policy (the pretrained model), which helps stabilize training and prevent reward hacking.
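The objective can be summarized in a short, hedged sketch: `grpo_loss` below computes the group-relative advantage and the clipped, KL-regularized loss described above. The per-trajectory log-probabilities (current, behavior, and frozen reference policy) are assumed to be sums of the per-step Gaussian log-probs from the sampler; the KL estimator shown is one common sample-based choice, and Flow-GRPO's Gaussian policy also admits a closed-form KL.

```python
import torch

def grpo_loss(rewards, log_probs, old_log_probs, ref_log_probs,
              clip_eps=0.2, beta=1e-3):
    """Group Relative Policy Optimization objective (sketch).

    All tensors have shape (G,), one entry per sample in a group
    generated from the same prompt.
    """
    # Group-relative advantage: no value network, just group statistics
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped policy-gradient term (PPO-style ratio between current and behavior policy)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Non-negative sample-based KL estimator toward the frozen reference
    # (pretrained) model, discouraging reward hacking
    log_ratio = ref_log_probs - log_probs
    kl = (torch.exp(log_ratio) - log_ratio - 1.0).mean()

    return pg_loss + beta * kl
```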
For practical implementation, Flow-GRPO fine-tunes the base flow matching model (e.g., Stable Diffusion 3.5 Medium) using Low-Rank Adaptation (LoRA) for memory efficiency. Key hyperparameters include the group size (G), noise level (a), KL ratio (β), and the number of training/inference timesteps.
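To make these knobs concrete, a purely illustrative configuration sketch is shown below; the field names and default values are assumptions for exposition, not the paper's reported settings.

```python
from dataclasses import dataclass

@dataclass
class FlowGRPOConfig:
    group_size: int = 24          # G: samples per prompt for group-relative advantages (illustrative)
    noise_level: float = 0.7      # a: scale of sigma_t in the SDE sampler (illustrative)
    kl_beta: float = 1e-3         # beta: weight of the KL penalty toward the reference model (illustrative)
    rollout_steps: int = 10       # denoising steps for RL data collection (Denoising Reduction)
    inference_steps: int = 40     # full denoising steps kept at inference
    lora_rank: int = 32           # LoRA rank for memory-efficient fine-tuning (illustrative)
    clip_eps: float = 0.2         # clipping range for the policy-gradient ratio (illustrative)
```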
The method was empirically evaluated on three T2I tasks:
- Compositional Image Generation (GenEval): Flow-GRPO significantly boosted the accuracy of SD3.5-M from 63% to 95%, outperforming state-of-the-art models like GPT-4o on complex compositional prompts requiring precise object counting, spatial relations, and attribute binding.
- Visual Text Rendering: Accuracy improved from 59% to 92% on rendering specific text within images.
- Human Preference Alignment: Flow-GRPO successfully aligned the model with human preferences as measured by PickScore.
Crucially, these improvements were achieved while maintaining or improving image quality and diversity metrics (Aesthetic score, DeQA, ImageReward, UnifiedReward) evaluated on the diverse DrawBench benchmark. The paper highlights the importance of the KL constraint in the RL objective for preventing reward hacking, which otherwise could lead to decreased image quality or collapse in visual diversity despite higher task-specific rewards. Ablation studies confirmed that Denoising Reduction provides substantial speedups and that an appropriate noise level in the SDE is vital for effective exploration and performance.
Flow-GRPO also demonstrated generalization capabilities, achieving strong performance on unseen object classes, higher object counts than trained on, and significant gains on the T2I-CompBench++ benchmark, which tests generalization to compositional scenarios different from the training data.
Limitations and future work include extending Flow-GRPO to video generation, which presents challenges in designing effective video-specific reward functions, balancing multiple objectives, and scaling the computationally intensive process.