
Inference-Time Alignment for Diffusion Models

Updated 17 January 2026
  • The paper introduces Sampling Demons, a training-free method that steers diffusion model outputs toward higher reward outcomes without modifying model parameters.
  • It employs stochastic, derivative-free optimization over candidate noise vectors at each denoising step, handling both differentiable and non-differentiable rewards.
  • Empirical results on models like Stable Diffusion demonstrate significant gains in aesthetic and prompt adherence metrics, showcasing its plug-and-play integration potential.

Inference-time alignment for diffusion models refers to a family of methodologies that steer the sampling process of a pre-trained diffusion model toward outputs that maximize user-specified reward functions—such as human aesthetic preference, prompt-adherence, or compositional constraints—without retraining or modifying model parameters. Recent advances demonstrate that stochastic, derivative-free optimization over denoising noise trajectories enables training-free, plug-and-play alignment applicable to both differentiable and non-differentiable reward signals, including those from APIs and human evaluation (Yeh et al., 2024). The Sampling Demons framework is the first approach to do this without backpropagation, achieving substantial improvements in reward metrics and providing a generic template for preference alignment in diffusion-based generative models.

1. Conceptual Foundations and Alignment Objective

Traditional diffusion sampling generates outputs from an unconditional or prompt-conditioned prior by iteratively denoising a noise trajectory, typically using a learned score network. The alignment objective in inference-time alignment is reformulated as maximizing the expected reward $R(x_0)$ of the final clean sample $x_0$ with respect to the noise variables injected during the backwards diffusion process:

$$\max_{q} \; \mathbb{E}_{z_1,\ldots,z_T \sim q} \left[ R\big(x_0(z_1,\ldots,z_T)\big) \right]$$

where $T$ is the total number of denoising steps and $q$ denotes the sampling distribution over the noise vectors $z_t$ at each step. Unlike previous approaches that retrain the model or require differentiable rewards, the Sampling Demons strategy performs per-step, derivative-free stochastic optimization of $z_t$, holding the previous state $x_t$ fixed and directly maximizing an estimated reward for the denoising outcome seeded with $z_t$ (Yeh et al., 2024).
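This objective can be illustrated with a toy Monte Carlo sketch. The `rollout` map and `reward` below are simplified stand-ins for the reverse diffusion process $x_0(z_1,\ldots,z_T)$ and $R$, not the paper's actual components; the point is only that the distribution $q$ over the injected noises is the control variable:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 8, 4
target = np.ones(dim)  # hypothetical "high-reward" region

def reward(x0):
    # Black-box reward R(x0); a toy score, not the paper's metric.
    return -np.linalg.norm(x0 - target)

def rollout(noises):
    # Stand-in for the reverse diffusion map x0(z_1, ..., z_T):
    # here simply an average of the injected noises.
    return np.mean(noises, axis=0)

# Baseline: noises drawn from the standard prior q = N(0, I).
base = np.mean([reward(rollout(rng.standard_normal((T, dim))))
                for _ in range(500)])

# A shifted q (mean moved toward the target) raises the expected reward,
# illustrating that the noise distribution steers the terminal sample.
shifted = np.mean([reward(rollout(target + rng.standard_normal((T, dim))))
                   for _ in range(500)])
print(base, shifted)  # shifted > base
```

Sampling Demons performs this steering per step and greedily, rather than optimizing a global $q$ in one shot.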

2. Theoretical Framework: Noise Control and Approximate Reward Prediction

Under the SDE/EDM formalism for diffusion models, each denoising step can be written as a discretized numerical integrator (Heun's method):

$$x_{t-\Delta} = \mathrm{heun}(x_t, z, t, \Delta) = x_t - \frac{1}{2}\left[f(x_t, t) + f(\tilde{x}, t-\Delta)\right]\Delta + \frac{1}{2}\left[g(t) + g(t-\Delta)\right] z \sqrt{\Delta}$$

where $z \sim \mathcal{N}(0, I)$ and $\tilde{x}$ is the Euler predictor. The expected reward at step $t$, $r_B(x_t) = \mathbb{E}_{x_0 | x_t}[R(x_0)]$, can be approximated deterministically using the PF-ODE mapping $c(x_t) \simeq x_0$, yielding the estimated reward $\hat{r}(x_t) := R(c(x_t))$. The key theoretical result is that the change in estimated reward across candidate noises is affine in $z^{(k)}$:

$$\hat{r}\big(x_{t-\Delta}^{(k)}\big) - \hat{r}(x_t) \approx g(t)\left[\nabla_{x_t} r_B(x_t) \cdot z^{(k)}\right] \sqrt{\Delta} + o(\sqrt{\Delta})$$

establishing that controlling $z$ directly modulates the instantaneous reward gain (Yeh et al., 2024).
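A minimal numerical check of this affine relationship, using a toy linear drift $f(x,t) = -x$, a constant $g$, a linear reward, and an identity stand-in for the PF-ODE map $c$ (all of these are illustrative assumptions, not the trained model); with a linear drift and linear reward the gain is exactly affine in $z$:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, t):   # toy linear drift (stand-in for the learned score term)
    return -x

def g(t):      # toy constant diffusion coefficient
    return 1.0

def heun(x_t, z, t, dt):
    # Discretized Heun integrator from the text; x_euler is the
    # Euler predictor x-tilde.
    x_euler = x_t - f(x_t, t) * dt + g(t) * z * np.sqrt(dt)
    return (x_t
            - 0.5 * (f(x_t, t) + f(x_euler, t - dt)) * dt
            + 0.5 * (g(t) + g(t - dt)) * z * np.sqrt(dt))

w = np.array([1.0, -0.5, 2.0])   # gradient of the toy linear reward
def r_hat(x):                    # estimated reward with identity c(x) = x
    return w @ x

x_t, t, dt = np.array([0.3, -0.1, 0.8]), 1.0, 0.01
def gain(z):
    return r_hat(heun(x_t, z, t, dt)) - r_hat(x_t)

z = rng.standard_normal(3)
# Affinity check: gain(2z) - gain(0) should equal 2 * (gain(z) - gain(0)).
lhs = gain(2 * z) - gain(np.zeros(3))
rhs = 2 * (gain(z) - gain(np.zeros(3)))
print(np.isclose(lhs, rhs))  # True
```

For a general nonlinear drift and reward the relation holds only to $o(\sqrt{\Delta})$, as the displayed equation states.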

3. Sampling Demons Algorithm: Greedy Noise Selection

The Sampling Demons algorithm proceeds as follows:

  1. For each reverse diffusion step $t$:
    • Draw $K$ candidate noises $z^{(k)} \sim \mathcal{N}(0, I)$.
    • Propagate each candidate through the Heun update and estimate its terminal reward via $R(c(x_{t-\Delta}^{(k)}))$.
  2. Assign weights to candidates using a stochastic optimizer:
    • Tanh Demon: $b_k = \tanh[(R_k - \bar{R})/\tau]$
    • Boltzmann Demon: $b_k = \exp(R_k/\tau) / \sum_j \exp(R_j/\tau)$
  3. Form an optimal noise vector $z^* = \sum_k b_k z^{(k)}$, reproject it to the sphere, and make the final update.

The process iterates from $t_\mathrm{max}$ to $t_\mathrm{min}$, steering the entire sampling trajectory toward regions of higher terminal reward. No gradients of $R$ are required, enabling compatibility with arbitrary black-box reward functions (including external APIs and human feedback).
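One step of this procedure can be sketched as follows. Here `step` and `reward_fn` are caller-supplied stand-ins for the Heun update and $R(c(\cdot))$ (their names and signatures are assumptions for illustration), while the two demon weightings follow the formulas in step 2:

```python
import numpy as np

rng = np.random.default_rng(2)

def tanh_demon(rewards, tau):
    # Tanh Demon weights: b_k = tanh((R_k - mean R) / tau), in [-1, 1].
    r = np.asarray(rewards)
    return np.tanh((r - r.mean()) / tau)

def boltzmann_demon(rewards, tau):
    # Boltzmann Demon weights: softmax of R_k / tau (max-shifted for
    # numerical stability); weights sum to 1.
    r = np.asarray(rewards) / tau
    e = np.exp(r - r.max())
    return e / e.sum()

def select_noise(x_t, step, reward_fn, K=16, tau=0.1, demon=boltzmann_demon):
    """One greedy noise-selection step (sketch): draw K candidates,
    score each via the estimated terminal reward, combine, reproject."""
    cands = rng.standard_normal((K,) + x_t.shape)
    rewards = [reward_fn(step(x_t, z)) for z in cands]
    b = demon(rewards, tau)
    z_star = np.tensordot(b, cands, axes=1)
    # Reproject the combined noise to the sphere of radius sqrt(dim),
    # matching the norm of a typical Gaussian draw.
    z_star *= np.sqrt(x_t.size) / (np.linalg.norm(z_star) + 1e-12)
    return z_star
```

In an actual sampler, `select_noise` would replace the plain `z ~ N(0, I)` draw at each reverse step, and `step`/`reward_fn` would wrap the Heun integrator and the (possibly black-box) reward.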

4. Integration of Non-Differentiable Rewards and Practical Hyperparameters

Sampling Demons uniquely supports non-differentiable reward sources by evaluating RR only at candidate proposals in each step, eschewing any backpropagation. This allows direct integration of third-party VLM metrics, human-in-the-loop selection, or other opaque quality functions as alignment objectives.

Key hyperparameters include:

  • $K$ (number of candidates per step): trade-off between search quality and linear compute cost.
  • $T$ (number of denoising steps): inherited from standard samplers, typically 20–64.
  • $\beta$ (SDE noise injection parameter): small values ($\beta = 0.1$) for PF-ODE validity.
  • $\tau$ (temperature): adaptive in the Tanh Demon, small in the Boltzmann Demon.
  • Mapping $c(\cdot)$: distilled consistency models can reduce reward-estimation overhead by 3–10× with modest fidelity loss.

Scaling compute ($K \cdot T$ reward queries plus $T$ evaluations of $c(\cdot)$) determines sample throughput and alignment precision.
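Under these definitions, the per-sample query budget can be tallied directly. The millisecond costs below are placeholder assumptions, not measured figures:

```python
def demon_cost_ms(K, T, reward_ms, c_ms):
    """Rough per-sample cost of Sampling Demons in milliseconds:
    K*T reward queries plus T evaluations of the mapping c(.)
    (PF-ODE rollout or a distilled consistency model)."""
    return K * T * reward_ms + T * c_ms

# Example: K=16 candidates, T=16 steps, hypothetical 10 ms per reward
# query and 100 ms per c(.) evaluation.
print(demon_cost_ms(16, 16, reward_ms=10, c_ms=100))  # 4160
```

This makes the trade-offs explicit: raising $K$ grows cost linearly through the reward-query term, while swapping a distilled consistency model into $c(\cdot)$ shrinks only the second term.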

5. Empirical Evaluation: Quantitative Gains and Reward Flexibility

Sampling Demons was benchmarked on Stable Diffusion v1.4 and SDXL v1.0 using LAION Aesthetic Score (0–10 scale), with 20 “animal” prompts. Results indicate:

| Config | Baseline Score | SD Demon Score | Best-of-N Sampling | DOODL (Backprop) |
|---|---|---|---|---|
| SD v1.4 ($K=16$, $T=16$) | 5.34 ± 0.56 | 7.35 ± 0.40 | 6.5 (upper bound) | ~5.6 |
| SDXL v1.0 ($K=16$, $T=16$) | 6.30 ± 0.35 | 7.50 ± 0.31 | 7.0 | n/a |

Notable gains are observed across other metrics (ImageReward, PickScore, HPSv2). For non-differentiable rewards, both VLM APIs (Gemini/GPT-4, with PickScore improved in 14/16 cases) and human-in-the-loop selection demonstrated effective alignment improvements (DINOv2 cosine similarity raised from 0.594 to 0.708).

Sampling time scales with candidate and step counts (e.g., 5 min/image for $K=16$, $T=16$ on an RTX 3090). Using distilled consistency models (Tanh-C) allows increasing $T$ to 64 at comparable computational cost.

6. Limitations, Integration, and Future Extensions

Sampling Demons is not immune to certain practical constraints:

  • Compute and memory scale with reward evaluations ($K \cdot T$) and $T$ calls to $c(\cdot)$.
  • Approximation error in the PF-ODE reward estimate emerges with high-curvature rewards ($\nabla^2 R$ large) or excessive noise ($\beta$ large).
  • For maximum flexibility, Demons can be integrated into any diffusion sampler by replacing the per-step noise draw.
  • Using consistency models provides efficiency at the expense of some fidelity.
  • Ongoing research targets improved distillation for reward predictors, adaptive step sizes, dynamic candidate allocation, and compatibility with classifier-free guidance.

In summary, Sampling Demons represents a plug-and-play, training-free alignment protocol for diffusion models, highly effective for both differentiable and non-differentiable reward functions and validated by strong empirical gains in text-to-image preference metrics, human aesthetic scores, and third-party evaluations—all without the need for backpropagation or model modifications (Yeh et al., 2024).

7. Context Within the Inference-Time Alignment Ecosystem

The Sampling Demons approach is positioned among a broader class of inference-time alignment algorithms, including gradient-based guidance, direct noise optimization, beam and tree search, evolutionary algorithms, and various sampling and selection schemes. Compared to these, the distinctive features of Sampling Demons are its strict training-free operation, universal black-box reward compatibility, and its affine control of denoising noise per step with theoretically justified PF-ODE reward prediction.

While gradient-based optimization approaches may achieve alignment for differentiable rewards (with susceptibility to reward hacking and higher memory cost), and search or population-based methods (e.g., best-of-N, tree and evolutionary search) can yield strong reward alignment at higher compute budgets, Sampling Demons fills a unique niche by supporting arbitrary reward sources and requiring only evaluation (not differentiation), enabling seamless human-in-loop and API integration. Its empirical performance on aesthetics, prompt adherence, and non-standard metrics establishes it as a state-of-the-art inference-time alignment method in contemporary diffusion modeling.
