Best general method to retain variation during RL post-training

Determine an effective, general method for retaining variation (diversity) in images generated by diffusion-based text-to-image models during reinforcement learning post-training, given that classifier-free guidance can reduce diversity and that Kullback–Leibler (KL) regularization is commonly used, while the optimal strategy for preserving variation remains unknown.

Background

The paper proposes Finite Difference Flow Optimization (FDFO) as an RL post-training method for diffusion-based image generators and compares it to Flow-GRPO. While the authors primarily disable KL regularization in main experiments to isolate algorithmic effects, they note that KL regularization is typically used in practice to help preserve diversity of outputs during post-training.

They observe that diversity often decreases as quality- and alignment-focused rewards are optimized. KL regularization is compatible with their method, but a principled, generally best approach to preserving variation remains unresolved.
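As a concrete illustration of the KL-regularized setup discussed above, the sketch below shapes per-sample rewards with a KL penalty against a frozen reference policy before group-normalizing them into advantages (GRPO-style). This is a minimal, hypothetical sketch: the function name, the single-sample KL estimate (`logp_current - logp_reference`), the `beta` value, and the group normalization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl_shaped_advantages(rewards, logp_current, logp_reference, beta=0.04):
    """Hypothetical sketch: subtract a KL penalty from rewards, then
    group-normalize into advantages (GRPO-style normalization).

    rewards        : per-sample scalar rewards for one prompt group
    logp_current   : log-probs of sampled outputs under the current policy
    logp_reference : log-probs of the same outputs under the frozen
                     pre-trained reference policy
    beta           : KL penalty weight (illustrative value)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Single-sample Monte Carlo estimate of KL(current || reference):
    # E[log p_cur - log p_ref] over samples drawn from the current policy.
    kl_est = np.asarray(logp_current) - np.asarray(logp_reference)
    shaped = rewards - beta * kl_est
    # Normalize within the group so advantages are zero-mean, unit-scale.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)

# Example: a group of 4 samples for one prompt. Samples that drift far
# from the reference (large log-prob gap) have their advantage reduced.
adv = kl_shaped_advantages(
    rewards=[0.9, 0.7, 0.8, 0.2],
    logp_current=[-10.0, -9.5, -4.0, -11.0],
    logp_reference=[-10.2, -9.6, -9.0, -11.1],
)
```

The shaping trades reward maximization against staying close to the pre-trained distribution; larger `beta` preserves more diversity at the cost of slower reward improvement, which is exactly the tension the open problem asks to resolve.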

References

KL regularization is typically used to better retain variation in the results, but the best way to do this in general remains an open problem.

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models  (2603.12893 - McAllister et al., 13 Mar 2026) in Section 6, Discussion and Future Work