
Diffusion-DRO: Denoising Ranking Optimization

Updated 28 October 2025
  • Diffusion-DRO is a preference optimization framework that redefines alignment by casting it as a max-margin denoising ranking problem, eliminating explicit reward models.
  • It leverages inverse reinforcement learning principles and trajectory-level noise prediction errors to separate expert demonstrations from policy samples.
  • Empirical evaluations show 70–80% win rates against baselines, demonstrating robust performance and scalability even with limited expert demonstration data.

Diffusion Denoising Ranking Optimization (Diffusion-DRO) is a preference optimization framework for aligning diffusion model outputs with implicit human feedback, grounded in principles from inverse reinforcement learning. It removes reliance on explicit reward models and pairwise preference data by casting the learning objective as a denoising-based ranking problem, fundamentally restructuring how the alignment objective is expressed and solved within the diffusion modeling paradigm.

1. Background and Motivation

Diffusion models have demonstrated state-of-the-art performance across text-to-image generation tasks but present challenges for preference learning due to three interrelated issues: (i) the non-linear preference probability structure induced by DPO-style sigmoid objectives, which is poorly suited to the regression-based losses standard in denoising diffusion training, (ii) a practical dependence on paired training samples—often laboriously curated or prone to semantic bias, and (iii) insufficient generalization when models are fine-tuned on limited or narrowly distributed data. Diffusion-DRO directly addresses these bottlenecks by reframing the preference learning problem through an inverse reinforcement learning lens.

2. Theoretical Foundations: Max-Margin Inverse Reinforcement Learning

Diffusion-DRO’s objective is formulated as a max-margin ranking task over expert demonstrations and online policy samples. Unlike probabilistic approaches that use cross-entropy or Bradley-Terry models, this method leverages the margin between the expert and policy noise prediction losses at each denoising step. Formally, for a prompt $\bm{c}$, with expert demonstration set $\mathcal{D}(\bm{c})$ and policy $p_\theta(\bm{x}_0 \mid \bm{c})$, the objective enforces

$$\mathbb{E}_{\bm{c},\, \bar{\bm{x}}_0 \sim \mathcal{D}(\bm{c})} \left[ r(\bar{\bm{x}}_0, \bm{c}) \right] \;\geq\; \mathbb{E}_{\bm{c},\, \bm{x}_0 \sim p_\theta} \left[ r(\bm{x}_0, \bm{c}) \right]$$

where $r$ is the implicit utility function. The optimal policy assumes the form

$$p^*_\theta(\bm{x}_0 \mid \bm{c}) \;\propto\; p_{\theta_\text{ref}}(\bm{x}_0 \mid \bm{c}) \, \exp\!\left( r(\bm{c}, \bm{x}_0)/\beta \right)$$

Yet, diffusion models do not admit tractable evaluation of marginal probabilities over $\bm{x}_0$ due to the complexity of the denoising chain. The solution is to leverage trajectory-level noise prediction errors as surrogate preference signals, operationalized in noise prediction space.
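For context, the reason noise prediction errors can stand in for intractable log-likelihoods is the standard variational bound used to train denoising diffusion models (a general result, not specific to Diffusion-DRO): up to per-timestep weights $w_t$ and terms independent of $\theta$,

$$-\log p_\theta(\bm{x}_0 \mid \bm{c}) \;\leq\; \sum_{t=1}^{T} \mathbb{E}_{\bm{\epsilon},\, \bm{x}_t}\!\left[ w_t \left\| \bm{\epsilon} - \bm{\epsilon}_\theta(\bm{x}_t, \bm{c}, t) \right\|^2 \right] + \text{const.}$$

Ranking samples by their accumulated noise prediction errors therefore amounts to ranking them by a bound on their log-likelihood under the model, which is what licenses the denoising-space formulation in the next section.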

3. Ranking-Based Denoising Margin Formulation

Diffusion-DRO employs a denoising margin loss at each diffusion timestep $t$:

$$\mathcal{L}_{\mathrm{mm}}(\phi) = \sum_{t=1}^T \mathbb{E}\left[ \left\| \bar{\bm{\epsilon}} - \bm{\epsilon}_\phi(\bar{\bm{x}}_t, \bm{c}, t) \right\|^2 - \left\| \bm{\epsilon}_\theta(\bm{x}_t, \bm{c}, t) - \bm{\epsilon}_\phi(\bm{x}_t, \bm{c}, t) \right\|^2 \right]$$

where $\bar{\bm{x}}_t$ is a noisy expert sample, $\bm{x}_t$ is a policy trajectory sample, $\bm{\epsilon}_\phi$ is the reward/candidate model's predicted noise, and $\bm{\epsilon}_\theta$ is the policy's predicted noise. This formulation pushes the distributions of expert and policy samples apart in noise prediction space, implicitly enforcing the max-margin preference criterion.
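A minimal PyTorch-style sketch of this loss is given below, assuming `eps_phi` and `eps_theta` are noise-prediction networks callable as `net(x, c, t)`; the names, shapes, and reduction choices are illustrative assumptions, not taken from the released implementation.

```python
import torch

def denoising_margin_loss(eps_phi, eps_theta, x_bar_t, x_t, eps_bar, c, t):
    """Margin between expert and policy denoising errors at timestep t.

    Illustrative sketch; function and tensor names are not from the released codebase.
    eps_phi:   candidate/reward model predicting noise (trained by this loss)
    eps_theta: policy model predicting noise (held fixed here)
    x_bar_t:   noised expert sample; x_t: noised policy sample
    eps_bar:   the true noise that was added to the expert sample
    """
    # Candidate model's error on the expert sample: driven down by the loss.
    expert_err = (eps_bar - eps_phi(x_bar_t, c, t)).pow(2).flatten(1).sum(dim=1)

    # Candidate model's disagreement with the policy on the policy sample: driven up.
    with torch.no_grad():
        policy_pred = eps_theta(x_t, c, t)
    policy_err = (policy_pred - eps_phi(x_t, c, t)).pow(2).flatten(1).sum(dim=1)

    # Minimizing the difference widens the expert-versus-policy margin
    # in noise-prediction space.
    return (expert_err - policy_err).mean()
```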

The framework introduces a thresholded ranking loss (TRL) to avoid excessive optimization on already well-ranked samples:

$$\mathcal{L}_{\mathrm{TRL}}(\phi) = \sum_{t=1}^T \mathbb{E}\left[ \max\!\left( m,\; -\left(\|\bar{\bm{\epsilon}}-\bm{\epsilon}_{\theta_\text{ref}}(\bar{\bm{x}}_t, \bm{c}, t)\|^2 - \|\bar{\bm{\epsilon}}-\bm{\epsilon}_\phi(\bar{\bm{x}}_t, \bm{c}, t)\|^2\right) + \left(\|\bm{\epsilon} - \bm{\epsilon}_{\theta_\text{ref}}(\bm{x}_t, \bm{c}, t)\|^2 - \|\bm{\epsilon} - \bm{\epsilon}_\phi(\bm{x}_t, \bm{c}, t)\|^2\right) \right) \right]$$

where $m$ is a margin clipping parameter. This mechanism enhances stability and generalization by preventing the model from collapsing onto extreme solutions or overfitting select preference signals.
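A companion sketch in the same illustrative style follows; the clipping is implemented with `torch.clamp`, and the default value of `m` is an assumption rather than a reported hyperparameter.

```python
import torch

def thresholded_ranking_loss(eps_phi, eps_theta_ref, x_bar_t, x_t,
                             eps_bar, eps, c, t, m=-1.0):
    """Thresholded ranking loss: the per-sample ranking term is clipped at m,
    so samples that are already well ranked (term below m) stop contributing
    gradient, which limits over-optimization. The default for m is a placeholder.

    eps_phi:       candidate model being trained
    eps_theta_ref: frozen reference model
    eps_bar / eps: true noise added to the expert / policy samples
    """
    def sq_err(pred, target):
        return (target - pred).pow(2).flatten(1).sum(dim=1)

    with torch.no_grad():
        ref_expert = sq_err(eps_theta_ref(x_bar_t, c, t), eps_bar)
        ref_policy = sq_err(eps_theta_ref(x_t, c, t), eps)

    cand_expert = sq_err(eps_phi(x_bar_t, c, t), eps_bar)
    cand_policy = sq_err(eps_phi(x_t, c, t), eps)

    # -(reference - candidate) gap on expert samples plus (reference - candidate)
    # gap on policy samples, clipped from below at the margin m.
    term = -(ref_expert - cand_expert) + (ref_policy - cand_policy)
    return torch.clamp(term, min=m).mean()
```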

4. Algorithmic Structure and Training Procedure

Diffusion-DRO integrates both offline and online learning sources. Offline data consist purely of expert demonstrations—images with high preference scores. Online negatives are generated with the evolving policy during fine-tuning. Unlike prior schemes (Diffusion-KTO, SPIN-Diffusion), which risk static semantic bias in negative sampling, Diffusion-DRO’s adaptive online negatives ensure diversified and competitive contrast throughout training.

The algorithm proceeds as follows:

  1. Initialize both policy and reward/candidate model parameters, typically from pretrained weights.
  2. For each update batch:
    • Sample prompts, demonstrations, and timesteps.
    • Apply forward-diffusion noise to the expert and policy samples and compute noise predictions on the resulting noised states.
    • Evaluate the margin and thresholded losses, updating both the reward and policy models (with periodic synchronization).
  3. Iterate for a fixed number of steps; a schematic loop illustrating these steps is sketched below.
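The schematic loop below strings these steps together. It reuses `thresholded_ranking_loss` from the sketch above; `sample_batch`, `generate_online`, the `scheduler` interface, and all hyperparameter values are assumed placeholders rather than the paper's or repository's actual interfaces.

```python
import copy
import torch

def train_diffusion_dro(policy, scheduler, sample_batch, generate_online,
                        num_steps, sync_every=100, m=-1.0, lr=1e-6):
    """Schematic Diffusion-DRO training loop (illustrative only; defaults are placeholders).

    sample_batch()                 -> (prompts, expert_x0): offline expert demonstrations
    generate_online(policy, c)     -> policy samples used as online negatives
    scheduler.add_noise(x, eps, t) -> forward-diffused (noised) sample
    """
    reward_model = copy.deepcopy(policy)                       # candidate model, trained every step
    reference = copy.deepcopy(policy).requires_grad_(False)    # frozen reference model
    opt = torch.optim.AdamW(reward_model.parameters(), lr=lr)

    for step in range(num_steps):
        prompts, expert_x0 = sample_batch()
        with torch.no_grad():
            policy_x0 = generate_online(policy, prompts)       # adaptive online negatives

        t = torch.randint(0, scheduler.num_timesteps, (expert_x0.size(0),))
        eps_bar, eps = torch.randn_like(expert_x0), torch.randn_like(policy_x0)
        x_bar_t = scheduler.add_noise(expert_x0, eps_bar, t)   # noised expert sample
        x_t = scheduler.add_noise(policy_x0, eps, t)           # noised policy sample

        loss = thresholded_ranking_loss(reward_model, reference, x_bar_t, x_t,
                                        eps_bar, eps, prompts, t, m=m)
        opt.zero_grad()
        loss.backward()
        opt.step()

        if (step + 1) % sync_every == 0:                       # periodic policy synchronization
            policy.load_state_dict(reward_model.state_dict())
```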

5. Conceptual Distinctions from DPO and Prior Preference Alignment Methods

Diffusion-DRO eschews auxiliary reward models and ad-hoc weighting strategies. Preference is induced implicitly in the denoising trajectory, not via explicit cross-entropy, sigmoid, or pairwise ranking objectives. Negative samples are neither statically mined nor heuristically chosen: they arise naturally from the online policy, mitigating both overfitting and semantic collapse. The inverse RL derivation provides convergence and stability guarantees for the preference-aligned distribution.

In contrast, DPO and related schemes (e.g., Diffusion-KTO) depend on cross-entropy objectives, require negative example mining, and may necessitate additional reward model training—introducing variance and data bias absent in Diffusion-DRO.

6. Empirical Performance and Evaluation

Diffusion-DRO is benchmarked on large-scale datasets (Pick-a-Pic v2, HPDv2), using PickScore, Human Preference Score v2, Aesthetic Score, CLIP Score, and ImageReward metrics for both in-domain and out-of-domain prompts. Empirically:

  • Win rates for Diffusion-DRO against SOTA baselines (DPO, KTO, SPIN-Diffusion) consistently reach 70–80% across all evaluation metrics.
  • User studies (MTurk head-to-head comparisons on HPDv2) indicate a 56–75% preference rate in favor of Diffusion-DRO over standard fine-tuned diffusion models and DPO variants.
  • Ablations confirm robust preference alignment even with limited expert demonstration data.
  • Diffusion-DRO generalizes without loss of diversity or coverage, maintaining preference alignment on unseen domains and diverse generation tasks.

7. Practical Implementation and Model Resources

The complete codebase and pretrained models are released at https://github.com/basiclab/DiffusionDRO. Integration is compatible with large pretrained diffusion backbones (e.g., SDXL), requiring only expert demonstration sets and policy-generated negatives; no paired preference data or explicit reward networks are needed.
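As a brief illustration of the backbone side only, one plausible way to obtain an SDXL policy and a frozen reference copy via Hugging Face diffusers is sketched below; the actual integration and training configuration should be taken from the linked repository.

```python
import copy
import torch
from diffusers import StableDiffusionXLPipeline

# Load a pretrained SDXL pipeline; the UNet is the noise-prediction network
# that would serve as the policy/candidate model during fine-tuning.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
policy_unet = pipe.unet

# Frozen copy of the pretrained weights to act as the reference model.
reference_unet = copy.deepcopy(pipe.unet).requires_grad_(False)
```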

| Aspect | Diffusion-DRO | Prior DPO/Preference Methods |
|---|---|---|
| Reward model | Implicit (margin in denoising space) | Explicit reward or evaluator |
| Data needs | Expert demonstrations + online negatives | Paired comparisons, negative sets |
| Training objective | Max-margin ranking (denoising trajectory) | Cross-entropy, sigmoid ranking |
| Theoretical grounding | Max-margin IRL, stationary policy alignment | Pairwise ranking, not IRL-grounded |
| Avoids semantic bias | Yes (negatives online, adaptive) | No (static negatives, possible bias) |
| Quantitative performance | Outperforms SOTA baselines across metrics | Lower, subject to instability |

8. Significance and Implications

Diffusion-DRO reframes preference optimization in diffusion models as a principled max-margin IRL problem, allowing stable, preference-aligned fine-tuning with minimal annotation effort and without auxiliary models. Its loss formulation, operating in denoising space, exploits the structure of the diffusion process for robust generalization, exceeding previous baselines in extensive empirical studies.

A plausible implication is that the removal of explicit pairwise ranking and reward model dependence results in more scalable, generalizable, and interpretable preference alignment pipelines—potentially applicable beyond text-to-image diffusion to broader reward-aligned generative modeling. The framework’s robustness to domain shifts, semantic diversity, and annotation scarcity marks a significant advance in scalable human-centric fine-tuning of generative diffusion technologies.
