Smoothed Preference Optimization in Diffusion Models
- SmPO-Diffusion is a class of algorithms that aligns generative diffusion models with human preferences by incorporating smoothing via soft probabilistic label mixtures and forward KL regularization.
- It employs a two-stage pipeline—behavior cloning followed by preference alignment—to prevent mode collapse and overfitting while maintaining multi-modal data coverage.
- Empirical evaluations demonstrate state-of-the-art results on benchmarks in sequential decision-making, text-to-image generation, and LLM alignment, with notable improvements in success rates and computational efficiency.
Smoothed Preference Optimization for Diffusion Models (SmPO-Diffusion) is a class of algorithms that align generative diffusion models and diffusion-based policies with human preferences via direct, preference-driven optimization. Unlike conventional reward-based or binary preference approaches, SmPO strategically incorporates smoothing techniques—such as forward Kullback–Leibler (KL) regularization or soft, probabilistic label distributions—to prevent mode collapse and overfitting to strict win/loss preferences, enabling robust preference alignment for sequential decision-making, LLM alignment, and text-to-image generation.
1. Mathematical Principles and Smoothed Preference Objectives
SmPO-Diffusion is founded on the modification and smoothing of canonical preference optimization objectives, especially Direct Preference Optimization (DPO). For preference pairs $(x^w, x^l)$ under prompt $c$, the original binary DPO loss is

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(c,\,x^w,\,x^l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{p_\theta(x^w\mid c)}{p_{\mathrm{ref}}(x^w\mid c)} - \beta\log\frac{p_\theta(x^l\mid c)}{p_{\mathrm{ref}}(x^l\mid c)}\right)\right],$$

where $\sigma$ is the sigmoid, $p_{\mathrm{ref}}$ is a reference policy, and $\beta$ is the regularization weight.
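For concreteness, here is a minimal PyTorch sketch of this binary objective; tensor names and the calling convention are illustrative, not taken from the cited papers:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Binary DPO loss from per-sample log-likelihoods (illustrative names)."""
    # Implicit reward gap: how much more the policy prefers the winner over
    # the loser, measured relative to the frozen reference model.
    delta = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log sigma(beta * delta), averaged over the batch.
    return -F.logsigmoid(beta * delta).mean()
```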
SmPO modifies this by introducing smoothed preference mixtures. Instead of treating human feedback as binary, SmPO defines a soft preference probability $\tilde{p} \in [0,1]$ for each pair, computed from the scores of a pretrained reward model, and mixes the winner and loser terms accordingly. The resulting SmPO-Diffusion loss takes the form

$$\mathcal{L}_{\mathrm{SmPO}}(\theta) = -\,\mathbb{E}\left[\tilde{p}\,\log\sigma(\beta\,\Delta_\theta) + (1-\tilde{p})\,\log\sigma(-\beta\,\Delta_\theta)\right], \qquad \Delta_\theta = \log\frac{p_\theta(x^w\mid c)}{p_{\mathrm{ref}}(x^w\mid c)} - \log\frac{p_\theta(x^l\mid c)}{p_{\mathrm{ref}}(x^l\mid c)}.$$

Setting $\tilde{p} = \tfrac{1}{2}$ annihilates the preference loss, providing strong regularization when preference confidence is low (Lu et al., 3 Jun 2025).
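A hedged sketch of the smoothed variant follows, assuming a Bradley–Terry-style sigmoid of the reward-model gap as the soft label; the exact smoothing used by Lu et al. may differ:

```python
import torch
import torch.nn.functional as F

def soft_label(reward_w, reward_l, gamma=1.0):
    """Hypothetical smoothing: sigmoid of the reward-model score gap.
    Equal rewards give p_tilde = 0.5, i.e. no preference."""
    return torch.sigmoid(gamma * (reward_w - reward_l))

def smpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
              reward_w, reward_l, beta, gamma=1.0):
    """Smoothed preference loss: cross-entropy against the soft label
    p_tilde instead of a hard win/loss target (sketch, not the exact form)."""
    delta = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    p_tilde = soft_label(reward_w, reward_l, gamma)
    # At p_tilde = 0.5 the two terms balance, the preference signal cancels,
    # and the optimum pulls the policy back toward the reference model.
    return -(p_tilde * F.logsigmoid(beta * delta)
             + (1 - p_tilde) * F.logsigmoid(-beta * delta)).mean()
```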
In sequential decision-making, SmPO-Diffusion incorporates forward KL regularization into the DPO objective to produce the Forward KL Preference-Diffusion (FKPD) loss

$$\mathcal{L}_{\mathrm{FKPD}}(\theta) = -\,\mathbb{E}\left[\log\sigma\big(-\rho\,(\Delta_{\mathrm{pref}} + \mu\,\Delta_{\mathrm{KL}})\big)\right],$$

where $\Delta_{\mathrm{pref}}$ is the preference score difference between preferred and non-preferred segments, $\Delta_{\mathrm{KL}}$ is the forward KL regularization term estimated on reference data, and $\rho$, $\mu$ are temperature and regularization hyperparameters (Shan et al., 2024).
A central insight is that smoothing achieves a mass-covering constraint: the forward KL penalizes under-coverage, ensuring the policy remains close to the original multi-modal data distribution while aligning with preferences and avoiding catastrophic out-of-distribution sampling.
2. Algorithmic Frameworks and Stages
SmPO-Diffusion algorithms implement a two-stage pipeline in both policy alignment and generative modeling:
Stage 1: Behavior Cloning / Reference Model Training
- Offline data is used to train a reference diffusion model (score-based or autoregressive) to fit the empirical, multi-modal distribution without reference to preferences.
- The policy or generation model is trained with standard DDPM objectives (mean-squared error on denoising steps) or autoregressive likelihoods.
Stage 2: Preference Alignment with Smoothing
- A preference dataset of winner/loser pairs (trajectory segments or generated samples) is constructed by human annotation or a scripted teacher.
- For each batch, losses are evaluated using preference-driven objectives smoothed by forward KL terms, soft preference distributions, or trajectory-based mixture estimations.
- For diffusion models, preference alignment steps involve noising segments, computing score differences using either naive forward sampling or—preferably—ReNoise-inverted latents for tighter estimates (Lu et al., 3 Jun 2025).
- The loss is minimized via gradient descent on model parameters.
Pseudocode for FKPD-style SmPO-Diffusion:
```
for epoch in alignment_phase:
    sample preference pairs
    sample reference data for regularization
    sample time steps and noise vectors
    calculate preference score difference  pref_diff
    calculate forward KL regularization term  reg_term
    total_loss = -log σ(-ρ · (pref_diff + μ · reg_term))
    update model parameters by gradient descent
```
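A more concrete PyTorch sketch of one such alignment step for a diffusion policy is given below; the scheduler interface, equal batch shapes, and the use of per-sample denoising error as the preference score are assumptions of this sketch, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def fkpd_step(model, scheduler, ref_x0, win_x0, lose_x0, rho=1.0, mu=0.1):
    """One FKPD-style alignment step (sketch; names are illustrative).

    model     : noise-prediction network eps_theta(x_t, t)
    scheduler : assumed to expose num_train_timesteps and add_noise(x0, noise, t)
    *_x0      : equally shaped batches of reference / preferred / non-preferred data
    """
    b = win_x0.shape[0]
    t = torch.randint(0, scheduler.num_train_timesteps, (b,), device=win_x0.device)
    noise = torch.randn_like(win_x0)

    def denoise_err(x0):
        # Per-sample denoising error, a proxy for -log p_theta(x0).
        x_t = scheduler.add_noise(x0, noise, t)
        return (model(x_t, t) - noise).pow(2).flatten(1).mean(dim=1)

    # Preference score difference: the preferred sample should be easier to denoise.
    pref_diff = denoise_err(win_x0) - denoise_err(lose_x0)
    # Forward-KL surrogate: denoising error on reference data keeps the policy
    # mass-covering around the behavior-cloned distribution.
    reg_term = denoise_err(ref_x0)

    loss = -F.logsigmoid(-rho * (pref_diff + mu * reg_term)).mean()
    loss.backward()
    return loss.detach()
```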
3. Architectural and Implementation Details
SmPO-Diffusion utilizes architectures consistent with the task domain:
- Policy Alignment (RL / Control):
- Network backbone: U-Net or Transformer-style, conditioned on the diffusion timestep and the environment state.
- In MetaWorld, a 6-layer per-timestep MLP with FiLM conditioning; in D4RL, a lightweight Transformer-style backbone.
- Training uses the model's standard diffusion schedule; rollouts use 10–20 PLMS sampling steps.
- Preference data comprise fixed-length trajectory segments, with segment length depending on the benchmark.
- Text-to-Image:
- DDPM/Score-based models with reference and target architectures unchanged.
- Reward models (e.g., PickScore) provide soft likelihood ratios for mixture coefficients.
- ReNoise inversion for trajectory score estimation leverages DDIM-type reversal, using 9 inversion steps plus a denoising step (see the sketch at the end of this list) (Lu et al., 3 Jun 2025).
- Hyperparameters: regularization weight β (reported values up to 5000), batch size, and classifier-free guidance applied during inversion.
- LLMs:
- In the LLM setting, the DiffPO variant acts as a plug-and-play module (e.g., a lightweight LLM such as Gemma-2 2B or 9B) trained solely on consistency and AR objectives, never modifying the base LLM's weights.
- Parallel decoding refines full sentences in batch within a few diffusion-style smoothing steps, enabling model-agnostic alignment (Chen et al., 6 Mar 2025).
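As referenced in the text-to-image notes above, the following is a minimal DDIM-style inversion loop with a ReNoise-flavored refinement of each step's noise estimate; the unet signature, evenly spaced timesteps, and omission of classifier-free guidance are simplifications rather than the exact procedure of Lu et al.:

```python
import torch

@torch.no_grad()
def ddim_invert(unet, x0, text_emb, alphas_cumprod, num_steps=9, renoise_iters=1):
    """DDIM-style inversion from a clean latent x0 toward noise (illustrative)."""
    def step(x, eps, a_cur, a_next):
        # Deterministic DDIM update rewritten to move from a_cur to a_next.
        pred_x0 = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        return a_next.sqrt() * pred_x0 + (1 - a_next).sqrt() * eps

    T = len(alphas_cumprod)
    ts = torch.linspace(0, T - 1, num_steps + 1).long()  # data -> noise
    x = x0
    for i in range(num_steps):
        t_cur, t_next = ts[i], ts[i + 1]
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = unet(x, t_cur, text_emb)
        for _ in range(renoise_iters):
            # ReNoise-flavored refinement: re-estimate the noise at the target
            # timestep using the provisional inverted latent, then redo the step.
            eps = unet(step(x, eps, a_cur, a_next), t_next, text_emb)
        x = step(x, eps, a_cur, a_next)
    return x  # approximately the noise that regenerates x0 under DDIM sampling
```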
4. Empirical Results and Evaluation
SmPO-Diffusion establishes state-of-the-art performance across diverse benchmarks:
Sequential Manipulation/Control (MetaWorld, D4RL):
- FKPD outperforms SFT, CPL, and Preference-IQL on success rates; e.g., DrawerOpen: FKPD ~90% vs CPL ~83%, ButtonPress: FKPD ~35% vs CPL ~24% (Shan et al., 2024).
- Ablations confirm the necessity of forward KL regularization: removing it (NRPD) produces severe out-of-distribution sampling; reverse KL (RKPD) partially collapses support.
Text-to-Image Alignment:
- On HPDv2 and Parti-Prompts, SmPO-Diffusion achieves highest median rewards: PickScore 23.62 vs baselines 22.17/23.13, HPS v2.1 32.53 vs 28.39/30.06, ImageReward 1.331 vs 0.756/1.184.
- User study: SmPO-SDXL is chosen as best in 72% of cases vs. DPO at 22% (Lu et al., 3 Jun 2025).
- Training cost reduction: SmPO-SDXL requires only 150 GPU-hours compared to 976 for DPO, indicating improved computational efficiency.
LLM Alignment:
- Sentence-level DiffPO increases win-rate and judge scores on MT-bench, AlpacaEval 2, and HH-RLHF; MT-bench GPT-4 score improves from 6.21 → 7.45 as module size increases (Chen et al., 6 Mar 2025).
- Ablations document reduced inference latency: a block size of 32 cuts wall-clock time from 1937 s to 1012 s with negligible quality loss.
Performance results are summarized in the table below:
| Domain | Task / Metric | SmPO-Diffusion | Best Baseline |
|---|---|---|---|
| MetaWorld | DrawerOpen | 90% | 83% (CPL) |
| MetaWorld | ButtonPress | 35% | 24% (CPL) |
| T2I (SDXL) | PickScore | 23.62 | 23.13 (DPO) |
| T2I (SDXL) | HPS v2.1 | 32.53 | 30.06 (DPO) |
| LLM: MT-bench | GPT-4 Score | 7.45 (9B) | 6.21 (SFT) |
5. Smoothing Mechanisms and Theoretical Insights
The smoothing aspect of SmPO-Diffusion is implemented via two complementary mechanisms:
- Forward KL Regularization:
- Ensures mass-covering around the reference data distribution, penalizing under-coverage and constraining the model from drifting into low-density regions, thereby offering trust-region safety (Shan et al., 2024).
- This is contrasted with reverse KL (mode-seeking, collapsing support) and alternative divergences (Jensen–Shannon, Wasserstein); a toy numerical comparison of the two KL directions is given at the end of this section.
- Smoothed Preference Distributions:
- Soft mixture of winner/loser likelihoods mitigates excessive objective sensitivity and calibrates preference influence, reducing optimization instability (Lu et al., 3 Jun 2025).
- For uncertain or ambiguous feedback, the SmPO loss naturally neutralizes preference gradients, offering calibrated regularization.
These mechanisms promote robust generalization and resist overfitting to noisy or overly strict preference annotations.
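The mass-covering versus mode-seeking contrast can be made concrete with a toy experiment (not drawn from the cited papers): fit a single Gaussian to a bimodal target by minimizing Monte Carlo estimates of each KL direction.

```python
import torch
from torch import distributions as D

def fit(direction="forward", steps=2000, seed=0):
    """Toy illustration: fit q = N(mu, sigma) to a bimodal 'data' distribution p."""
    torch.manual_seed(seed)
    p = D.MixtureSameFamily(
        D.Categorical(torch.tensor([0.5, 0.5])),
        D.Normal(torch.tensor([-2.0, 2.0]), torch.tensor([0.5, 0.5])),
    )
    mu = torch.tensor([0.5], requires_grad=True)      # off-center init so that
    log_sigma = torch.zeros(1, requires_grad=True)    # reverse KL can pick a mode
    opt = torch.optim.Adam([mu, log_sigma], lr=0.02)
    for _ in range(steps):
        q = D.Normal(mu, log_sigma.exp())
        if direction == "forward":
            # KL(p || q) up to a constant: cross-entropy under data samples.
            # Mass-covering: q is penalized wherever p has mass and q does not.
            x = p.sample((512,))
            loss = -q.log_prob(x).mean()
        else:
            # KL(q || p): mode-seeking, penalizes q for putting mass where p is low.
            x = q.rsample((512,)).squeeze(-1)
            loss = (q.log_prob(x) - p.log_prob(x)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return mu.item(), log_sigma.exp().item()

# Forward KL spreads q over both modes (sigma ≈ 2);
# reverse KL typically collapses onto a single mode (sigma ≈ 0.5).
print("forward KL fit:", fit("forward"))
print("reverse KL fit:", fit("reverse"))
```

In this toy setting, the forward direction inflates the variance to cover both modes, mirroring the trust-region behavior FKPD relies on, while the reverse direction latches onto a single mode, which is exactly the support collapse observed in the RKPD ablation.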
6. Extensions and Practical Considerations
SmPO-Diffusion extends to various domains without major architectural changes:
- Plug-and-play for LLMs:
- Compact modules can be trained once and wrap diverse LLMs, enabling policy-agnostic scaling, low-latency inference, and compatibility with black-box APIs (Chen et al., 6 Mar 2025).
- ReNoise Inversion for Diffusion Trajectory Estimation:
- Efficient trajectory-based score approximation in text-to-image alignment combines DDIM-type inversion with a single denoising step for tight preference optimization (Lu et al., 3 Jun 2025).
- Hyperparameter Sensitivity:
- KL strength (μ), temperature (ρ), mixture coefficients (γ,α), and trajectory depth (T) critically determine the mass-covering versus preference-seeking trade-off.
- Empirically, moderate smoothing yields best results while extreme values either underfit or cause OOD collapse.
- Computational Efficiency:
- SmPO-Diffusion requires substantially less training time than classical DPO due to the effective regularization and data utilization enhancements documented in SDXL scaling results (Lu et al., 3 Jun 2025).
7. Limitations and Plausible Implications
A plausible implication is that SmPO-Diffusion’s smoothing may dampen fine-grained alignment to sharp, unambiguous human preferences in scenarios where mass-covering is less critical. Hard mode-seeking objectives (e.g., reverse KL) might outperform on degenerate, unimodal tasks but risk severe OOD failures. Further, as SmPO-Diffusion relies on the calibration of reward models and reference policies, poor initialization or misaligned mixture estimation may bottleneck empirical performance. The extensibility to other generative domains (audio, video, molecular design) is suggested by the architecture-neutral core but requires further validation.
SmPO-Diffusion is distinguished by principled smoothing mechanisms, tractable training, general-purpose applicability, and empirical dominance across alignment, generative modeling, and policy learning benchmarks (Shan et al., 2024, Chen et al., 6 Mar 2025, Lu et al., 3 Jun 2025).