- The paper introduces Direct-Align, which enables efficient reward optimization across all timesteps, avoiding overfitting and reward hacking.
- It proposes SRPO, a text-conditioned framework that shapes rewards through relative preference differences for fine-grained aesthetic control.
- Experiments demonstrate significant improvements in realism (up to 3.7x) and efficiency (75x faster training) compared to prior RL methods.
Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
Introduction
This work addresses two central challenges in aligning text-to-image diffusion models with human preferences: (1) the computational and optimization bottlenecks of direct reward-based reinforcement learning (RL) across the full diffusion trajectory, and (2) the inflexibility and bias of reward models, which often require costly offline adaptation to new aesthetic or semantic targets. The authors introduce a two-part solution: Direct-Align, a method for efficient, stable reward-based optimization at any diffusion timestep, and Semantic Relative Preference Optimization (SRPO), a framework for online, text-conditional reward shaping that robustly mitigates reward hacking and enables fine-grained, prompt-driven control.
Methodology
Direct-Align: Full-Trajectory Reward Optimization
Conventional direct reward optimization in diffusion models is limited to late denoising steps due to gradient instability and computational cost. This restriction leads to overfitting and reward hacking, as models exploit reward model biases at the end of the trajectory. Direct-Align circumvents this by leveraging the closed-form relationship between noisy and clean images in the diffusion process:
x_t = \alpha_t x_0 + \sigma_t \epsilon_{\text{gt}}
Given a noisy image x_t and the known injected noise ε_gt, the clean image x_0 can be exactly recovered via interpolation, enabling accurate reward assignment and gradient propagation at any timestep. This approach eliminates the need for iterative denoising and allows for efficient, stable optimization across the entire diffusion trajectory.
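A minimal sketch of this recovery step is shown below, assuming a generic (α_t, σ_t) noise schedule over latent tensors; the function and tensor shapes are illustrative, not the authors' implementation.

```python
import torch

def recover_clean_image(x_t: torch.Tensor,
                        noise_gt: torch.Tensor,
                        alpha_t: float,
                        sigma_t: float) -> torch.Tensor:
    """Invert the forward interpolation x_t = alpha_t * x_0 + sigma_t * eps_gt
    when the injected noise eps_gt is recorded and therefore known."""
    return (x_t - sigma_t * noise_gt) / alpha_t

# Illustration: noise a latent heavily, then recover it in a single analytic step.
x0 = torch.randn(1, 16, 64, 64)      # hypothetical clean latent
eps = torch.randn_like(x0)           # injected (and recorded) noise
alpha_t, sigma_t = 0.05, 0.95        # roughly 95% noise
x_t = alpha_t * x0 + sigma_t * eps
x0_rec = recover_clean_image(x_t, eps, alpha_t, sigma_t)
assert torch.allclose(x0, x0_rec, atol=1e-4)
```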
Figure 1: Method overview. SRPO combines Direct-Align for full-trajectory optimization and a single reward model with both positive and negative prompt conditioning.
Figure 2: One-step prediction at early timesteps. Direct-Align achieves high-quality reconstructions even with 95% noise, outperforming standard one-step methods.
The method further aggregates rewards across multiple timesteps using a decaying discount factor, which regularizes optimization and reduces late-timestep overfitting.
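A hedged sketch of such discounted aggregation follows; the discount value and normalization are illustrative assumptions rather than the paper's exact settings.

```python
import torch

def aggregate_rewards(rewards, gamma=0.9):
    """Weight per-timestep rewards with a decaying discount so that later
    (low-noise) timesteps contribute less, reducing late-timestep overfitting.
    rewards[0] is the earliest optimized timestep; gamma is an assumed value."""
    weights = torch.tensor([gamma ** i for i in range(len(rewards))])
    weights = weights / weights.sum()
    return sum(w * r for w, r in zip(weights, rewards))

# Example with three scalar rewards evaluated at successively later timesteps.
total = aggregate_rewards([torch.tensor(0.8), torch.tensor(0.6), torch.tensor(0.9)])
```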
Semantic Relative Preference Optimization (SRPO)
SRPO reformulates reward signals as text-conditioned preferences, enabling online adjustment via prompt augmentation. Instead of relying on static, potentially biased reward models, SRPO computes the relative difference between rewards for positive and negative prompt augmentations:
r_{\text{SRP}}(x) = r_1 - r_2 = f_{\text{img}}(x)^\top \cdot (C_1 - C_2)
where C_1 and C_2 are text embeddings for desired and undesired attributes, respectively. This formulation penalizes irrelevant directions and aligns optimization with fine-grained, user-specified semantics.
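The relative reward can be sketched as follows, assuming a CLIP-style reward model whose score is an inner product between image and text embeddings; the normalization and embedding extraction details are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def srp_reward(image_emb: torch.Tensor,
               pos_text_emb: torch.Tensor,
               neg_text_emb: torch.Tensor) -> torch.Tensor:
    """Semantic relative preference: score the image against the difference
    between positively and negatively augmented prompt embeddings,
    i.e. r_SRP(x) = f_img(x)^T (C_1 - C_2)."""
    f_img = F.normalize(image_emb, dim=-1)
    c1 = F.normalize(pos_text_emb, dim=-1)
    c2 = F.normalize(neg_text_emb, dim=-1)
    return f_img @ (c1 - c2)
```

Because both scores share the same image embedding, reward components common to C_1 and C_2 cancel in the difference, which is how the relative formulation suppresses reward-model bias along directions the user did not specify.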
SRPO supports both denoising and inversion directions, allowing for gradient ascent (reward maximization) and descent (penalty propagation) at different timesteps, further enhancing robustness against reward hacking.
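One way to realize this ascent/descent pattern is sketched below; this is a deliberately simplified conceptual objective, not the paper's exact loss, and the branch naming is an assumption.

```python
def srpo_objective(reward_fn, x_denoised, x_inverted):
    """Conceptual sketch: maximize the relative reward on the sample obtained
    along the denoising direction (gradient ascent) while treating the reward
    of the inversion-direction sample as a penalty (gradient descent).
    Minimizing this quantity implements both effects at once."""
    return -reward_fn(x_denoised) + reward_fn(x_inverted)
```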
Figure 3: Comparison of optimization effects. SRPO penalizes irrelevant reward directions, effectively preventing reward hacking and enhancing image texture.
Experimental Results
Quantitative and Qualitative Evaluation
The authors conduct extensive experiments on the FLUX.1.dev model using the HPDv2 benchmark, comparing SRPO and Direct-Align to state-of-the-art online RL methods (ReFL, DRaFT, DanceGRPO). Evaluation metrics include Aesthetic Predictor 2.5, PickScore, ImageReward, HPSv2.1, GenEval, DeQA, and comprehensive human assessments.
SRPO achieves 3.7x improvement in perceived realism and 3.1x improvement in aesthetic quality over the baseline, with a 75x increase in training efficiency compared to DanceGRPO (10 minutes on 32 H20 GPUs). Notably, SRPO is the only method to substantially improve realism without introducing reward hacking artifacts.
Figure 4: Human evaluation results. SRPO demonstrates significant improvements in aesthetics and realism, with a substantial reduction in AIGC artifacts.
Figure 5: Qualitative comparison. SRPO yields superior realism and detail complexity compared to FLUX and DanceGRPO.
Reward Model Generalization and Robustness
SRPO is evaluated with multiple reward models (CLIP, PickScore, HPSv2.1) and consistently enhances realism and detail complexity. The method is robust to reward model biases and does not exhibit reward hacking, in contrast to prior approaches.
Figure 6: Cross-reward results. SRPO generalizes across different reward models, maintaining high image quality and robustness.
Fine-Grained and Style Control
By conditioning on style-related control words, SRPO enables prompt-driven fine-tuning for attributes such as brightness, artistic style, and realism. The effectiveness of control depends on the reward model's ability to recognize the style terms, with high-frequency words in the reward training set yielding stronger controllability.
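As a hypothetical illustration of this prompt-driven control, the positive variant prepends a style control word while the negative variant omits or inverts it; the template and words below are examples, not the paper's wordlist.

```python
def augment_prompts(prompt: str, control_word: str, anti_word: str = ""):
    """Build the positive/negative prompt pair used for relative reward shaping.
    The control word steers toward the desired attribute; the optional anti_word
    names the attribute to penalize."""
    positive = f"{control_word}. {prompt}"
    negative = f"{anti_word}. {prompt}" if anti_word else prompt
    return positive, negative

pos, neg = augment_prompts("a portrait of an old fisherman at dawn",
                           control_word="photorealistic, natural skin texture",
                           anti_word="oversaturated, artificial lighting")
```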
Figure 7: SRPO-controlled results for different style words, demonstrating prompt-driven fine-grained control.
Figure 8: Experimental overview. SRPO improves realism, enables enhanced style control, and ablation studies confirm the importance of early timestep optimization and inversion.
Analysis and Ablations
Denoising Efficiency
Direct-Align enables accurate reward-based optimization at early timesteps, where standard one-step prediction fails due to noise-induced artifacts. Reconstructions that rely on a smaller proportion of model-predicted steps, with the remainder recovered analytically via interpolation, yield higher final image quality.
Optimization Interval
Training restricted to late timesteps increases reward hacking rates. Early or full-trajectory optimization, as enabled by Direct-Align, mitigates this effect.
Ablation Studies
Removing early timestep optimization or the late-timestep discount in Direct-Align degrades realism and increases vulnerability to reward hacking. Inversion-based regularization further improves robustness.
Implications and Future Directions
This work demonstrates that full-trajectory, reward-based RL is both feasible and highly effective for aligning diffusion models with human preferences, provided that optimization is stabilized and reward signals are regularized via semantic relativity. The SRPO framework enables prompt-driven, fine-grained control without the need for costly reward model retraining or large-scale data collection.
Limitations include reduced controllability for rare or out-of-domain control tokens and limited interpretability due to reliance on latent space similarity. Future work should focus on systematic control strategies, learnable control tokens, and reward models explicitly responsive to prompt structure. The SRPO approach is extensible to other online RL algorithms and modalities.
Conclusion
The combination of Direct-Align and SRPO constitutes a significant advance in the practical alignment of diffusion models with nuanced human preferences. By enabling efficient, robust, and prompt-driven optimization across the full diffusion trajectory, this framework sets a new standard for controllable, high-fidelity text-to-image generation. The methodology is broadly applicable and opens new avenues for research in reward modeling, RL-based generative model alignment, and fine-grained user control in generative AI.