TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models (2512.08153v1)

Published 9 Dec 2025 in cs.LG, cs.AI, and cs.CV

Abstract: Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) \emph{High sample efficiency}, achieving better performance under the same training samples; (2) \emph{Fine-grained credit assignment} via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods, and (3) \emph{Amortized computation} where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves \textbf{2.4$\times$ faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.

Summary

  • The paper introduces a tree-based RL framework that recasts denoising as a tree search, achieving superior sample efficiency compared to traditional methods.
  • The method computes per-edge advantages through reward backpropagation, enabling precise, step-wise credit assignment and mitigating uniform reward issues.
  • Experimental results on SD3.5-medium models show a 2.4x speedup and enhanced aesthetic alignment, establishing a new benchmark for RL post-training.

TreeGRPO: Tree-Structured RL for Efficient Post-Training of Diffusion Models

Introduction and Motivation

The alignment of generative models with human preferences has become a critical focus in visual generation, where diffusion and flow-based models—despite strong pretraining priors—require further fine-tuning for preference and aesthetic alignment. Reinforcement learning (RL) has substantially advanced the alignment of LLMs, and recent adaptation to visual generative models has produced frameworks such as DDPO, DPOK, GRPO, DanceGRPO, and FlowGRPO. However, these methods are hampered by sample inefficiency and coarse credit assignment: each policy update entails full trajectory sampling, and uniformly attributed terminal rewards obscure step-wise action contributions, resulting in limited effectiveness for visual domains.

TreeGRPO addresses these deficiencies by recasting the denoising process as tree search, strategically branching from shared noise and thus enabling concurrent exploration of multiple candidate paths while reusing common prefixes. This framework provides superior sample efficiency, fine-grained credit assignment through reward backpropagation, and computational amortization, achieving better alignment and speed over previous methods.

Methodology

Tree-Advantage Formulation

TreeGRPO’s core insight is the representation of the denoising trajectory as a sparse search tree. At each predefined SDE window along a fixed denoising horizon, branching is introduced by stochastic perturbations, whereas ODE steps proceed deterministically without branching. This structure enables high path diversity with minimal redundant computation. Rewards from final leaf nodes—evaluated via various preference metrics—are backpropagated to compute per-edge advantages, overcoming the uniform attribution of terminal rewards.
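
To make this rollout structure concrete, the sketch below builds such a tree from a shared initial latent, branching only at designated SDE steps and reusing the common prefix everywhere else. The `ode_step` and `sde_step` helpers are simplified placeholders standing in for the actual diffusion/flow sampler, so this is an illustrative sketch rather than the paper's implementation.

```python
import numpy as np
from dataclasses import dataclass, field

def ode_step(x, t):
    """Placeholder deterministic (ODE) update; stands in for the real sampler."""
    return x - 0.1 * x

def sde_step(x, t, sigma=0.1):
    """Placeholder stochastic (SDE) update; returns the next latent and edge log-prob."""
    eps = np.random.randn(*x.shape)
    x_next = x - 0.1 * x + sigma * eps
    logp = float(-0.5 * np.sum(eps ** 2))  # log N(eps; 0, I) up to an additive constant
    return x_next, logp

@dataclass
class Node:
    state: np.ndarray                      # latent at this depth of the tree
    logprob: float = 0.0                   # log-prob of the edge leading into this node
    advantage: float = 0.0                 # filled in later by reward backpropagation
    children: list = field(default_factory=list)

def rollout_tree(x_init, num_steps, branch_steps, k):
    """Expand a denoising tree: branch k ways at SDE steps, keep one child otherwise."""
    root = Node(state=x_init)
    frontier = [root]
    for t in range(num_steps):
        next_frontier = []
        for node in frontier:
            if t in branch_steps:
                # Stochastic SDE step: k children share this node's common prefix.
                for _ in range(k):
                    x_next, logp = sde_step(node.state, t)
                    child = Node(state=x_next, logprob=logp)
                    node.children.append(child)
                    next_frontier.append(child)
            else:
                # Deterministic ODE step: no branching, the prefix is reused as-is.
                child = Node(state=ode_step(node.state, t))
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root, frontier                  # frontier holds the leaves (final samples)
```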

Candidate paths are generated from a shared seed and expanded deterministically except at strategic stochastic branching points. For each edge during branching, log-probabilities under a frozen sampler are stored, and summarized advantages are calculated via logprob-based weighted averaging, analogous to Rao-Blackwellization. These per-edge advantages form the basis of GRPO-style policy updates, utilizing PPO-like clipping for stability.
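
A minimal sketch of how such per-edge advantages could enter a GRPO-style clipped surrogate is given below; the flat tensor layout and the clipping range of 0.2 are illustrative assumptions rather than settings reported in the paper.

```python
import torch

def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-like clipped surrogate over a flat batch of tree edges.

    logp_new:   log-probs of the stored actions under the current policy
    logp_old:   log-probs recorded at sampling time under the frozen sampler
    advantages: per-edge advantages obtained from reward backpropagation
    """
    ratio = torch.exp(logp_new - logp_old)   # importance ratio per edge
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The surrogate is maximized, so its negation is returned as a loss to minimize.
    return -torch.minimum(unclipped, clipped).mean()
```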

Credit Assignment and Advantage Aggregation

Leaf advantages are aggregated from group-normalized reward scores over one or multiple evaluators (e.g., HPSv2.1, ImageReward, Aesthetic Score, ClipScore). Rewards are weighted and standardized within prompt groups, and the resulting prompt-conditioned advantages serve as boundary conditions for backward propagation. Internal node advantages are assigned via mixture weights derived from the policy behavior log-probabilities, yielding step-specific, prompt-relative advantages for all actions in the denoising process.
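
One plausible reading of this aggregation is sketched below, reusing the `Node` structure from the earlier rollout sketch: leaf advantages are group-normalized weighted blends of reward scores, and internal advantages are mixtures of child values weighted by normalized sibling log-probabilities. The softmax-style mixture used here is an assumption for illustration, not the paper's exact formula.

```python
import numpy as np

def leaf_advantages(rewards_per_model, weights):
    """Blend several reward models and normalize within one prompt group of leaves."""
    combined = sum(w * np.asarray(rewards_per_model[name]) for name, w in weights.items())
    return (combined - combined.mean()) / (combined.std() + 1e-8)

def backpropagate(node):
    """Assign internal advantages as a log-prob-weighted mixture of child values."""
    if not node.children:
        return node.advantage                         # leaf: group-normalized reward score
    child_adv = np.array([backpropagate(c) for c in node.children])
    logps = np.array([c.logprob for c in node.children])
    mix = np.exp(logps - logps.max())                 # softmax over sibling log-probs
    mix /= mix.sum()
    node.advantage = float(mix @ child_adv)
    return node.advantage
```

With `weights = {"HPSv2.1": 0.8, "ClipScore": 0.2}`, for instance, this corresponds to the 0.8:0.2 multi-reward setting used in the experiments.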

This dense credit assignment is fundamental for improving policy optimization, reducing estimator variance, and providing robustness regularization by avoiding reliance on single high-reward noise seeds. The expectation-based update process naturally penalizes sharp reward peaks, favoring smoother, robust action distributions.

Sampling and Training Configuration

TreeGRPO utilizes a random window strategy for SDE branching, typically biased toward early denoising steps via a truncated geometric distribution. Adaptive adjustment of the window parameter r can be performed based on reward progress during training, though the main experiments fix r = 0.5 for balanced performance.
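
A possible implementation of such a window schedule is sketched below; treating r as the decay rate of a truncated geometric distribution over window start positions is an assumption about the parameterization rather than the paper's exact scheme.

```python
import numpy as np

def sample_branch_window(num_steps, r=0.5, window_size=1, rng=None):
    """Draw the start index of the SDE branching window.

    Start positions follow a truncated geometric distribution: indices near 0
    (taken here to be early denoising steps) are favored for r < 1, and the
    distribution approaches uniform as r approaches 1.
    """
    rng = rng or np.random.default_rng()
    starts = np.arange(num_steps - window_size + 1)
    probs = r ** starts              # geometric decay, truncated at the horizon
    probs = probs / probs.sum()
    return int(rng.choice(starts, p=probs))
```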

Training occurs under fixed NFE budgets, leveraging parallel sampling on multi-GPU hardware, with prompt batching, AdamW optimization, and consistent random seeds for reproducibility. The method is benchmarked against strong baselines under single and multi-reward training regimes.

Theoretical Analysis

The tree-structured aggregation in TreeGRPO acts as a principled variance reduction mechanism, since the advantage estimator is a probability-weighted sum over multiple future branches from a shared state. Effective sample size directly controls variance reduction, and empirical results confirm improved stability and update strength with increased branching.
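
As a minimal illustration of this argument (assuming approximately independent leaf advantages with common variance $\sigma^2$ and normalized mixture weights $w_i$, a simplification rather than the paper's exact derivation):

```latex
% Per-edge advantage as a probability-weighted sum over the n leaves beneath it
\hat{A}_{\mathrm{edge}} = \sum_{i=1}^{n} w_i \,\hat{A}_i ,
\qquad \sum_{i=1}^{n} w_i = 1 .
% Under (approximately) independent leaves with common variance \sigma^2:
\operatorname{Var}\!\left[\hat{A}_{\mathrm{edge}}\right]
  = \sigma^{2} \sum_{i=1}^{n} w_i^{2}
  = \frac{\sigma^{2}}{n_{\mathrm{eff}}} ,
\qquad
n_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{n} w_i^{2}} .
```

Uniform weights over $n$ leaves give $n_{\mathrm{eff}} = n$, recovering the familiar $1/n$ variance reduction, while heavily skewed weights shrink $n_{\mathrm{eff}}$ and with it the benefit.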

Moreover, the weighted averaging process regularizes optimization against overfitting to particular noise realizations, promoting smoothness in local reward landscapes and discouraging collapse to narrow optima. This phenomenon is supported theoretically via Taylor expansion analyses and is consistent with the superior Pareto frontier TreeGRPO exhibits in efficiency-reward space.

Experimental Results

TreeGRPO achieves marked improvements in both efficiency and alignment metrics. On the HPDv2 dataset for SD3.5-medium base models, TreeGRPO attains the highest HPSv2.1 and aesthetic scores across all reward models, with iteration times of 72.0–79.2s—2.4x faster than the strongest baselines (DanceGRPO, MixGRPO; 145.4–184.0s).

Single-reward training with HPSv2.1 yields appreciable gains in performance and speed, while multi-reward training (HPSv2.1:ClipScore ratio of 0.8:0.2) maintains strong metrics in all categories and accentuates ImageReward and aesthetic alignment. Ablation studies demonstrate that the optimal tree configuration for efficiency and performance is k = 3, d = 3; increasing branching (k) further improves performance but at higher compute cost, while deeper trees (d) show diminishing returns.

Sampling strategy comparisons reveal that r = 0.5 provides balanced outcomes, with lower r improving aesthetic metrics and higher r favoring text-image alignment. Adaptive sampling can marginally enhance performance, though fixed strategies suffice for robust results.

Multi-reward advantage weighting (0.8:0.2 vs 0.5:0.5) reveals that weighting the preference score more heavily yields better-balanced overall alignment, whereas equal weighting tends to overfit secondary metrics at the expense of the primary preference objective.

Practical and Theoretical Implications

TreeGRPO’s tree-structured RL framework fundamentally improves fine-tuning efficiency and policy credit assignment for diffusion models. By leveraging prefix reuse and strategic branching, it amplifies sample efficiency and enables precise credit propagation, addressing fundamental weaknesses of prior trajectory-based RL post-training. These mechanisms facilitate the alignment of visual generative models with human preference and establish a new standard for the efficiency-reward trade-off in RL-based model alignment.

The approach introduces additional hyperparameters—branching factor, window scheduling, advantage aggregation ratio—whose optimal settings may be context-dependent. Memory overhead during training is higher due to tree storage. Future work will focus on adaptive scheduling for these parameters, early tree pruning via learned value functions, further scaling to more complex domains (video, 3D), and exploring tree-based advantage propagation in other RL post-training frameworks.

Conclusion

TreeGRPO presents a tree-based RL post-training framework that delivers substantial gains in sample efficiency, step-wise credit assignment, and training speed for diffusion and flow-based generative models. By recasting denoising as tree search and exploiting structured reward propagation, TreeGRPO achieves superior alignment with human preferences and establishes robust Pareto-frontier performance. The method’s practical strengths and theoretical regularization properties suggest its utility for future advances in scalable generative model alignment and beyond.


Citation: "TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models" (2512.08153)
