AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

Published 5 Jun 2026 in cs.CV and cs.LG | (2606.06828v1)

Abstract: Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, suffering from critical blind spots at both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for LLMs; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.


Authors (11)

Summary

The paper introduces AdaGRPO, which uses online curriculum filtering to select prompts matched to the model's evolving capability in flow-based generative tasks.
It integrates cross-level advantage fusion to combine local and global signals, yielding unbiased policy updates without relying on value networks.
Extensive experiments confirm improvements in training stability, visual fidelity, and alignment with human feedback over existing GRPO methods.

AdaGRPO: Capability-Aware Adaptive Enhancement for Flow-Based GRPO

Motivation and Problem Identification

The alignment of text-to-image (T2I) generative models with human preferences remains a pivotal challenge in controllable generation. Flow-based models have become state-of-the-art for high-fidelity content synthesis; however, their generative process, even after extensive pretraining, exhibits persistent misalignment with user intents, especially for complex or nuanced prompts. Reinforcement Learning from Human Feedback (RLHF) has become a preferred alignment mechanism, with Group Relative Policy Optimization (GRPO) offering a value-network-free approach that uses intra-group sample comparisons for policy updates.

Despite the recent adaptation of GRPO to flow models in Flow-GRPO and DanceGRPO, critical limitations persist. Existing pipelines sample prompts randomly, without regard to the learner's evolving capability, so the model is often trained on tasks that are too easy or too hard, which yields high-variance or uninformative updates. Furthermore, advantage estimation uses only intra-group statistics; lacking a global reference, it produces myopic, locally biased credit assignment that does not capture genuine policy progress. This structural decoupling destabilizes training and limits the alignment efficacy of flow-based models.

Figure 1: AdaGRPO consistently improves generation quality, especially for intricate textures and compositional fidelity, over baseline Flow-GRPO implementations.

AdaGRPO: Methodological Advancements

AdaGRPO introduces two orthogonal, lightweight components, one addressing prompt selection and the other advantage estimation:

Online Curriculum Filtering Strategy

Rather than random prompt selection, AdaGRPO incorporates a curriculum-inspired, capability-aware filtering mechanism. At each iteration, a batch of candidate prompts is scored via deterministic ODE rollouts, and the model's proficiency is dynamically anchored by an exponential moving average (EMA) of historical rewards. The prompt whose reward most closely matches this global capability anchor is selected, ensuring that the task presented is at the model's evolving learning boundary. This approach stabilizes reward variance and mitigates the inefficiency of local median heuristics, which can be biased by the candidate batch distribution.

Figure 2: Online Curriculum Filtering dynamically identifies tasks matched to the global capability anchor, avoiding the pitfalls of local-only or extreme sampling.
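
The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `CapabilityAnchor`, `select_prompt`, and `score_fn` are hypothetical, the momentum default of 0.9 is a placeholder, the cold-start fallback to the batch mean is an assumption, and `score_fn` stands in for a deterministic ODE rollout followed by a reward model. The summary does not specify which rewards feed the EMA, so the update is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class CapabilityAnchor:
    """EMA of historical rewards, used as the global capability anchor."""
    momentum: float = 0.9          # hypothetical default; the source gives no value
    value: Optional[float] = None  # None until the first reward is observed

    def update(self, reward: float) -> float:
        if self.value is None:
            self.value = float(reward)  # cold start: adopt the first reward
        else:
            self.value = self.momentum * self.value + (1.0 - self.momentum) * float(reward)
        return self.value


def select_prompt(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    anchor: CapabilityAnchor,
) -> Tuple[str, float]:
    """Return the candidate whose profiled reward is closest to the anchor."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    scores = [float(score_fn(p)) for p in candidates]
    # Assumed cold-start fallback: use the batch mean when no anchor exists yet.
    target = anchor.value if anchor.value is not None else sum(scores) / len(scores)
    best = min(range(len(candidates)), key=lambda i: abs(scores[i] - target))
    return candidates[best], scores[best]


# Toy usage with a stand-in scorer (made-up rewards, for illustration only).
toy_rewards = {"a red cube": 0.10, "a cat in a mirror": 0.45, "crowded battle scene": 0.90}
anchor = CapabilityAnchor(momentum=0.9)
anchor.update(0.5)  # pretend the model's historical reward level is 0.5
prompt, reward = select_prompt(list(toy_rewards), lambda p: toy_rewards[p], anchor)
```

In this toy run the anchor sits at 0.5, so the mid-difficulty prompt is chosen over the very easy and very hard ones. A higher `momentum` makes the anchor change more slowly, which corresponds to the responsiveness versus historical stability trade-off discussed in the ablations.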

Cross-Level Advantage Fusion

To resolve the myopia of intra-group comparison, AdaGRPO introduces a dual-stage estimation process. The local advantage is computed within the group, while the global advantage leverages the EMA baseline to gauge absolute progression. To reconcile these signals, AdaGRPO performs a conditional, sign-preserving normalization ensuring a zero-mean global advantage while preserving progression directionality. The fused advantage aggregates both local and global signals before policy update, yielding unbiased and robust gradient estimation aligned with true model enhancements.

Figure 3: AdaGRPO’s pipeline: curriculum filtering evaluates candidate prompts against the EMA anchor; Cross-Level Advantage Fusion integrates unbiased global and local advantages for robust policy optimization.
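
A sketch of how the two advantage levels might be combined is given below. The summary does not state the exact formulas, so everything beyond "local advantage within the group" and "global advantage against the EMA baseline" is an assumption: the function names, the fusion weight `weight`, and in particular the reading of "conditional, sign-preserving normalization". Here it is interpreted as follows: when a group contains samples both above and below the EMA baseline, the positive and negative parts are rescaled separately so that they sum to zero while each sample keeps its sign; when all samples lie on one side, a zero mean is impossible without flipping signs, so only the scale is normalized.

```python
import numpy as np


def local_advantage(rewards, eps=1e-6):
    """GRPO-style intra-group advantage: standardize rewards within the group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def global_advantage(rewards, ema_baseline, eps=1e-6):
    """Advantage relative to the EMA baseline (assumed normalization, see lead-in)."""
    r = np.asarray(rewards, dtype=np.float64)
    g = r - ema_baseline                 # sign encodes above/below historical capability
    pos_mass = g[g > 0].sum()
    neg_mass = -g[g < 0].sum()
    if pos_mass > 0 and neg_mass > 0:
        # Mixed signs: rescale each side to the same total mass -> zero sum, signs kept.
        target = 0.5 * (pos_mass + neg_mass)
        g = np.where(g > 0, g * (target / pos_mass), g * (target / neg_mass))
    # Scale normalization only; dividing by a positive number preserves signs.
    return g / (np.abs(g).mean() + eps)


def fused_advantage(rewards, ema_baseline, weight=0.5):
    """Hypothetical additive fusion; `weight` is an illustrative hyperparameter."""
    return local_advantage(rewards) + weight * global_advantage(rewards, ema_baseline)


group_rewards = [0.2, 0.4, 0.6, 0.8]
g_mixed = global_advantage(group_rewards, ema_baseline=0.3)   # one sample below baseline
g_above = global_advantage(group_rewards, ema_baseline=0.0)   # whole group above baseline
fused = fused_advantage(group_rewards, ema_baseline=0.3)
```

With a mixed-sign group the global term sums to zero, so adding it does not shift the mean of the fused advantage. With a group entirely above the baseline the global term stays positive for every sample, which is one way to keep the "progression directionality" that purely intra-group statistics discard.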

Experimental Validation

Extensive experimentation demonstrates the architecture-agnostic applicability and efficacy of AdaGRPO:

Quantitative benchmarks on Flux.1-dev and three flow-based GRPO frameworks (Flow-GRPO, DanceGRPO, Flow-CPS) consistently show substantial performance improvements. AdaGRPO yields the best or near-best results on HPS-v2/v3, Coherence, Style, and ImageReward metrics under both single- and multi-reward configurations.

Training reward curves indicate smoother learning dynamics and higher performance ceilings, with AdaGRPO improving stability and convergence speed (see Figure 4).

Figure 4: AdaGRPO stabilizes and accelerates training reward dynamics across paradigms.

Qualitative comparison (Figures 5–6) further corroborates that AdaGRPO achieves superior visual fidelity, semantic adherence, and compositional alignment compared to GRPO baselines. Improvements manifest in natural texture rendering, complex compositional understanding (e.g., accurate spatial relations with mirrors, action scenes), and intricate detailing (e.g., material reflections, fine lighting gradients).

Figure 5: AdaGRPO delivers improved image quality and prompt adherence versus strong baselines (HPS-v2).

Figure 6: On challenging HPS-v3 prompts, AdaGRPO produces richer details and robust alignment.

Ablation studies confirm that both the adaptive curriculum filter and cross-level fusion are critical for optimal performance. The curriculum filter's batch size and momentum coefficient control the trade-off between responsiveness and historical stability, while cross-level advantage fusion is necessary to avoid local-optimum traps.

Theoretical and Practical Implications

AdaGRPO is the first work to systematically introduce dynamic data selection, based on curriculum learning principles, to flow-based GRPO alignment. The explicit modeling of current learner capability allows for optimally challenging tasks, drawing from pedagogical concepts such as the "zone of proximal development." This context-sensitive data selection can be extended to other RL-aligned vision or multimodal tasks, especially where training resource constraints and stability are essential.

Integrating cross-level advantage calibration addresses underexplored issues in policy gradient estimation for value-network-free RL frameworks, indicating broader applicability in alignment protocols beyond vision or T2I. The practical implication is that AdaGRPO can be plugged into any GRPO-capable framework with minimal code changes, immediately improving preference alignment, sample quality, and robustness of training.

Future Directions

The work indicates two promising research threads:

Efficient and scalable curriculum filtering, for example by using quantized lightweight models during the prompt selection phase to further reduce overhead.
Extending the cross-level advantage approach to RL frameworks with more complex, multidimensional reward landscapes or multi-modal context, where local and global reward statistics vary on separate factors.

The main limitation is the added computational cost of performing ODE-based profiling for each candidate batch, though empirical evidence suggests that with moderate batch sizes and parallelization, the overhead is manageable.

Conclusion

AdaGRPO delivers a principled, capability-aware RL enhancement for flow-based GRPO, resolving both prompt selection and advantage estimation deficiencies inherent in earlier frameworks. Its curriculum-inspired filter and unbiased cross-level fusion yield consistent improvements in alignment stability, sample quality, and applicability across GRPO architectures. AdaGRPO establishes new standards for data-aware RL alignment in visual generation and motivates broader exploration of curriculum-based adaptive data selection and multi-level policy evaluation for RL with human feedback (2606.06828).
