
Delta Denoising Score (DDS) in Diffusion Models

Updated 9 March 2026
  • Delta Denoising Score (DDS) is a framework for precise text-guided image and 3D scene editing that uses differential denoising scores to enforce minimal yet semantically meaningful changes.
  • It computes latent differences by comparing noise predictions of source and target prompts, ensuring prompt-driven edits while preserving the original structure.
  • DDS demonstrates superior performance over traditional SDS, enhancing stability, detail preservation, and edit controllability in applications like creative design and neural field editing.

Delta Denoising Score (DDS) is a scoring and optimization framework developed for text-based image and neural field editing, grounded in the principles of diffusion models. The core idea is to achieve minimal yet semantically meaningful changes in an input image or 3D scene so that the output conforms to a user-specified target prompt, while preserving key aspects of the original content. DDS builds on and addresses the limitations of Score Distillation Sampling (SDS), offering improved stability, detail preservation, and edit controllability in both 2D and 3D domains (Nam et al., 2023, Le et al., 2024).

1. Theoretical Foundations and Motivation

DDS arises from the need to steer generative models, especially those based on diffusion, to edit existing content according to natural language instructions. Traditional Score Distillation Sampling (SDS), as introduced in frameworks like DreamFusion, provides semantics-aligned guidance by leveraging the gradient of a pre-trained text-to-image diffusion model under a new target prompt. However, empirical observations indicate that SDS often yields blurry, less-detailed, or semantically inconsistent results due to noisy optimization trajectories and insufficient constraints on structural preservation. DDS introduces a differential scoring approach: by measuring and optimizing the difference between the denoising scores (i.e., noise prediction residuals) of the output and input images under conditioning by their respective prompts, DDS enforces precise prompt-driven edits while mitigating undesirable deviations (Nam et al., 2023, Le et al., 2024).

2. Mathematical Formulation

Consider a denoising diffusion probabilistic model (DDPM) parameterized by a U-Net $\epsilon_\phi$ and a generator or neural radiance field $g(\theta)$. For editing, two prompts are specified: the source prompt $\hat y$ and the target prompt $y_{tgt}$. The key steps are:

  • Score Distillation Sampling (SDS): For a noisy latent $z_t(\theta) = a_t z_0(\theta) + b_t \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ and DDPM noise schedule coefficients $(a_t, b_t)$, the SDS loss is:

$$\mathcal{L}_\mathrm{SDS}(\theta; y_{tgt}) = \mathbb{E}_{t, \epsilon} \big\| \epsilon_\phi^\omega(z_t(\theta), y_{tgt}, t) - \epsilon \big\|^2$$

where $\epsilon_\phi^\omega$ denotes the noise prediction with classifier-free guidance at scale $\omega$ (Nam et al., 2023).

  • Delta Denoising Score (DDS) Loss: DDS simultaneously considers a reference latent $\hat z_0$ (from the source image) and its noisy version $\hat z_t = a_t \hat z_0 + b_t \epsilon$. The DDS loss is:

$$\mathcal{L}_\mathrm{DDS}(\theta; y_{tgt}) = \mathbb{E}_{t, \epsilon} \big\| \epsilon_\phi^\omega(z_t(\theta), y_{tgt}, t) - \epsilon_\phi^\omega(\hat z_t, \hat y, t) \big\|^2$$

Its gradient can be written as the difference of SDS gradients computed for the target and source prompt-image pairs:

$$\nabla_\theta \mathcal{L}_\mathrm{DDS} = \nabla_\theta \mathcal{L}_\mathrm{SDS}(z, y_{tgt}) - \nabla_\theta \mathcal{L}_\mathrm{SDS}(\hat z, \hat y)$$

(Nam et al., 2023, Le et al., 2024).

This "delta" operation ensures that the optimization is directed towards satisfying the new prompt while actively countering drift away from the original image's structure and appearance.
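The cancellation behind this identity is easy to check numerically: because the same noise sample $\epsilon$ appears in both SDS gradients, it drops out of their difference. The sketch below uses a toy linear stand-in for the diffusion U-Net and an identity generator $z_0 = \theta$ (both illustrative assumptions, not part of the original method), and follows the usual score-distillation convention of dropping the U-Net Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_pred(z_t, prompt_vec, t):
    # Toy stand-in for the diffusion U-Net eps_phi^omega: a fixed linear
    # map of the noisy latent plus a prompt-dependent bias. Purely
    # illustrative; a real implementation queries a pretrained model.
    return 0.9 * z_t + prompt_vec

# DDPM-style schedule coefficients for one sampled timestep t.
a_t, b_t = 0.8, 0.6

z0 = rng.normal(size=4)      # current (optimized) latent, theta = z0
z0_ref = rng.normal(size=4)  # reference latent from the source image
y_tgt = np.array([1.0, 0.0, 0.0, 0.0])  # target prompt embedding (toy)
y_src = np.array([0.0, 1.0, 0.0, 0.0])  # source prompt embedding (toy)

eps = rng.normal(size=4)     # shared Gaussian noise sample
z_t = a_t * z0 + b_t * eps
z_t_ref = a_t * z0_ref + b_t * eps

# Score-distillation convention: drop the U-Net Jacobian, so
# grad_SDS = a_t * (eps_pred - eps) for the identity generator.
grad_sds_tgt = a_t * (eps_pred(z_t, y_tgt, t=0.5) - eps)
grad_sds_src = a_t * (eps_pred(z_t_ref, y_src, t=0.5) - eps)

# DDS gradient: difference of the two noise predictions; the shared
# eps term cancels, leaving exactly grad_SDS(target) - grad_SDS(source).
grad_dds = a_t * (eps_pred(z_t, y_tgt, t=0.5) - eps_pred(z_t_ref, y_src, t=0.5))
assert np.allclose(grad_dds, grad_sds_tgt - grad_sds_src)
```

Because the noise sample is shared, the stochastic term that makes raw SDS gradients noisy is removed analytically, which is one source of DDS's improved stability.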

3. Algorithmic Workflow

DDS is typically instantiated within an optimization loop that involves the following key steps (Le et al., 2024):

  1. Encode the source content (image or NeRF) and sample corresponding latents.
  2. For each optimization step, sample a time step $t$ and Gaussian noise $\epsilon$.
  3. Generate noisy latents $z_t$ and $\hat z_t$ from the current and reference images.
  4. Compute the denoising scores $\epsilon_\phi^\omega$ under both the target and source prompts.
  5. Evaluate the DDS loss and backpropagate its gradient with respect to the generator parameters $\theta$.
  6. Update generator (and, if necessary, auxiliary network) weights via an optimizer (e.g., Adam).
  7. Optionally incorporate regularization or additional constraints (see below).

Algorithmic pseudocode for NeRF editing with DDS as a base is presented as Algorithm 1 in (Le et al., 2024), where additional steps may include LoRA fine-tuning of the U-Net for explicit identity preservation.
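A minimal version of this loop can be sketched as follows. The denoiser is a toy stand-in for a pretrained U-Net (it assumes the clean latent for prompt $y$ is the prompt embedding itself) and plain gradient descent replaces Adam; both are illustrative assumptions. Under this toy model, the loop converges to the source latent plus the prompt delta, i.e. the edit applies only the semantic change and preserves everything else:

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_pred(z_t, y, a_t, b_t):
    # Toy denoiser: believes the clean latent for prompt y is y itself,
    # so its noise estimate is (z_t - a_t * y) / b_t. A stand-in for a
    # pretrained text-conditioned diffusion U-Net (illustrative only).
    return (z_t - a_t * y) / b_t

z0_src = rng.normal(size=8)  # source image latent (fixed reference)
y_src = rng.normal(size=8)   # source prompt embedding (toy)
y_tgt = y_src + np.array([2.0] + [0.0] * 7)  # target prompt: small semantic delta

z0 = z0_src.copy()           # latent being optimized (theta)
lr = 0.05
for step in range(500):
    # Steps 2-3: sample t, noise, and form both noisy latents with a shared eps.
    a_t = rng.uniform(0.5, 0.99)
    b_t = np.sqrt(1.0 - a_t ** 2)
    eps = rng.normal(size=8)
    z_t = a_t * z0 + b_t * eps
    z_t_src = a_t * z0_src + b_t * eps
    # Steps 4-5: DDS gradient = difference of the two noise predictions.
    grad = a_t * (eps_pred(z_t, y_tgt, a_t, b_t) - eps_pred(z_t_src, y_src, a_t, b_t))
    # Step 6: plain gradient-descent update (Adam in practice).
    z0 -= lr * grad

# Residual distance to the "prompt delta applied to source" optimum.
print(np.abs(z0 - (z0_src + (y_tgt - y_src))).max())
```

The shared noise sample cancels inside the gradient at every step, so the toy dynamics contract deterministically toward the edited latent; with a real U-Net the cancellation is only partial but the same stabilizing effect applies.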

4. Practical Implications and Limitations

DDS significantly improves semantic editing fidelity and visual sharpness over SDS. Quantitative results indicate that DDS outperforms SDS and many contemporaneous methods on metrics such as CLIP similarity, DINO self-similarity, LPIPS, and in user study ratings for structure and overall quality (Nam et al., 2023). For example, on "cat→cow" edits, DDS achieves lower LPIPS and DINO distances than SDS-only methods.

However, DDS is still limited in its ability to preserve fine structural elements, especially for intricate edits involving spatial geometry (e.g., limb orientation, facial identity). Empirically, DDS alone can result in artifacts such as pose flipping or local deformations, and over prolonged optimization may induce undesired color saturation or textural drift (Nam et al., 2023, Le et al., 2024).

5. Advances Beyond DDS: Structural and Identity Preservation

Several extensions address the structural preservation gap in vanilla DDS:

  • Contrastive Denoising Score (CDS): CDS augments DDS with a patch-wise, contrastive loss computed using intermediate self-attention features from the diffusion model's U-Net. This enforces spatial correspondence between input and output, maximizing mutual information between matching patches. The combined loss is:

$$\mathcal{L}_\mathrm{CDS} = \mathcal{L}_\mathrm{DDS} + \lambda_\mathrm{con} \ell_\mathrm{con}$$

where $\ell_\mathrm{con}$ is a PatchNCE-style objective. CDS demonstrates improved CLIP, DINO, and LPIPS metrics, as well as superior user-rated structure preservation (Nam et al., 2023).

  • Variational Score Distillation (VSD) Regularizer: VSD addresses DDS's tendency to lose input-specific detail and oversaturate outputs by introducing a KL-minimization term between the edited and original image distributions. LoRA-adapted diffusion models are used to approximate input and output scores. The regularized optimization step simultaneously pushes toward the target prompt and pulls toward the original content (Le et al., 2024).
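For intuition on the CDS term above, the PatchNCE-style objective $\ell_\mathrm{con}$ can be sketched as a cross-entropy over patch similarities: each output patch should match the source patch at the same spatial location against all other locations as negatives. The feature matrices below are random data standing in for the U-Net self-attention features used by CDS (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

def patch_nce(feat_src, feat_out, tau=0.07):
    # PatchNCE-style contrastive loss over per-location features
    # (rows = spatial patches). Positives are same-location pairs,
    # negatives are all other locations. Sketch of the l_con term only;
    # real CDS extracts features from intermediate U-Net self-attention.
    f_s = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    f_o = feat_out / np.linalg.norm(feat_out, axis=1, keepdims=True)
    logits = f_o @ f_s.T / tau                   # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy to the diagonal

feats = rng.normal(size=(16, 32))  # 16 patches, 32-dim features
aligned = patch_nce(feats, feats + 0.01 * rng.normal(size=(16, 32)))
shuffled = patch_nce(feats, rng.permutation(feats, axis=0))
# Spatially aligned outputs incur a much lower contrastive penalty.
assert aligned < shuffled
```

Minimizing this term alongside the DDS loss penalizes edits that move content to a different spatial location, which is the mechanism behind CDS's improved structure preservation.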

6. Empirical Evaluation

Empirical studies on image and NeRF editing benchmarks demonstrate the effectiveness of DDS and its successors:

Method           CLIP-Acc (↑)   DINO-Dist (↓)   LPIPS (↓)   User Structure Score
DDS              97.9%          0.023           0.080       4.05/5
CDS              97.5%          0.020           0.079       4.65/5
DDS (cat→cow)    99.6%          0.040           0.116       n/a
CDS (cat→cow)    97.9%          0.033           0.112       n/a

DDS outperforms prior methods in stability and quality but is further surpassed by CDS and VSD-regularized variants on both quantitative and human evaluation metrics (Nam et al., 2023, Le et al., 2024).

7. Applications and Extensions

DDS and its derivatives are utilized in:

  • 2D image editing: Text-driven modifications with minimal structural loss.
  • 3D scene editing: Neural radiance field (NeRF) optimization for semantic attribute changes.
  • Zero-shot image-to-image and neural field translation: Direct application without the need for paired training data.

The preservation of geometric and photometric fidelity is essential for real-world deployment in creative design, animation, and visual effects pipelines. Recent works have generalized the DDS concept, replacing the vanilla score subtraction with explicit identity and structure-preserving regularizers, thereby supporting multi-view consistency and more complex edits without additional data curation or pre-training steps (Le et al., 2024).


References:
(Nam et al., 2023) Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing.
(Le et al., 2024) Preserving Identity with Variational Score for General-purpose 3D Editing.
