Delta-Denoising: Zero-Shot Generative Editing

Updated 28 November 2025
  • Delta-Denoising is a set of zero-shot generative editing algorithms that leverage differences in diffusion denoising trajectories to localize content edits.
  • It utilizes delta attribution, mask-guided latent inpainting, prompt refinement, and attention modulation to achieve precise, context-aware modifications without overfitting.
  • The approach applies in both image and audio-visual domains, demonstrating competitive anomaly-detection performance and realistic edit synthesis.

Delta-Denoising (DeltaDeno) encompasses a set of training-free, zero-shot generative editing algorithms that leverage the per-step discrepancies between diffusion model denoising trajectories under different conditioning prompts. By explicitly contrasting the denoising paths induced by semantically minimal prompt variations—such as "a photo of a {object}" versus "a photo of a {object} with {anomaly}"—these frameworks localize and induce content-constrained edits (e.g., defect synthesis, cross-modal transformation) without overfitting to category priors or requiring anomalous samples. Delta-Denoising approaches have been formulated in both image and audio-visual domains and are characterized by their reliance on delta attribution, mask-guided latent inpainting, prompt refinement, and attention modulation for precise, context-aware generation (Xu et al., 21 Nov 2025, Lin et al., 26 Mar 2025).

1. Theoretical Foundation: Delta Attribution in Diffusion

Delta-Denoising rests on synchronizing two reverse diffusion trajectories within a pre-trained generative model. Given a base latent encoding $z_0^n = \mathcal{E}(x^n)$ of the original instance (e.g., an image or a video/audio segment), and two textual prompts $p_n$ (normal) and $p_a$ (anomalous), one obtains their respective embeddings $e_n$ and $e_a$. At each diffusion timestep $t$, the noisy latent $z_t = \alpha_t z_0 + \sigma_t \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, is denoised under both conditions:

$$z_{t-1}^{(n)} \equiv \Phi\big(z_t^{(n)}, \hat{\epsilon}_{\mathrm{cfg}}(z_t^{(n)}, t, e_n)\big), \qquad z_{t-1}^{(a)} \equiv \Phi\big(z_t^{(a)}, \hat{\epsilon}_{\mathrm{cfg}}(z_t^{(a)}, t, e_a)\big)$$

The core signal is the spatial $\ell_2$ distance ("delta"), taken over channels at each spatial coordinate $u$:

$$\Delta_{t-1}(u) = d_{t-1}(u) = \|z_{t-1}^{(n)}(u) - z_{t-1}^{(a)}(u)\|_2$$

Large $\Delta_{t-1}(u)$ values flag regions where the model's generative prior is most sensitive to the semantic difference induced by the prompt change, thereby identifying plausible edit/defect regions (Xu et al., 21 Nov 2025).
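
A minimal sketch of this synchronized two-branch step, assuming a diffusers-style `scheduler.step(...).prev_sample` interface and a hypothetical classifier-free-guided predictor `eps_cfg`; neither is the papers' exact API:

```python
import torch

def delta_map(z_t_n, z_t_a, t, e_n, e_a, eps_cfg, scheduler):
    """One synchronized reverse step under both prompts, returning the
    denoised latents and the per-location L2 delta between them."""
    z_prev_n = scheduler.step(eps_cfg(z_t_n, t, e_n), t, z_t_n).prev_sample
    z_prev_a = scheduler.step(eps_cfg(z_t_a, t, e_a), t, z_t_a).prev_sample
    # L2 norm over channels -> one delta value per spatial coordinate u
    delta = (z_prev_n - z_prev_a).norm(dim=1)  # (B, H, W)
    return z_prev_n, z_prev_a, delta
```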

In cross-modal applications (e.g., AvED), the delta is computed at the level of classifier-free guided noise predictions for both audio and video modalities:

$$\Delta_t = \epsilon^\phi_w(z_t, y_{\mathrm{trg}}, t) - \epsilon^\phi_w(z_t, y_{\mathrm{src}}, t)$$

with $\epsilon^\phi_w$ denoting the noise-prediction function under classifier-free guidance (Lin et al., 26 Mar 2025).
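
A compact sketch of this noise-prediction delta, with `eps` standing in for a modality's denoiser and `w` for the guidance scale (both illustrative assumptions, not the paper's exact interface):

```python
def cfg_delta(eps, z_t, t, y_src, y_trg, y_null, w):
    """Delta between classifier-free-guided noise predictions for the
    target and source conditions."""
    def eps_w(y):
        e_uncond = eps(z_t, t, y_null)
        return e_uncond + w * (eps(z_t, t, y) - e_uncond)  # CFG extrapolation
    return eps_w(y_trg) - eps_w(y_src)  # Δ_t
```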

2. Delta Accumulation and Region Localization

Per-step deltas are accumulated across a scheduled interval of the denoising trajectory, typically from the start time $t_\mathrm{start}$ to a midpoint $t_\mathrm{mid} = \lfloor T/2 \rfloor$. The accumulated map $S_{t-1}(u)$ integrates the deltas at each spatial location:

$$S_{t-1}(u) = S_t(u) + \Delta_{t-1}(u) \quad\text{with}\quad S_{t_\mathrm{start}}(u) = 0$$

At $t_\mathrm{mid}$, $S_{t_\mathrm{mid}}(u)$ is smoothed (e.g., by Gaussian blur) and normalized; thresholding yields a binary mask $M_\mathrm{mid}(u)$ highlighting the regions that experience the highest semantic shift under the prompt intervention:

$$M_\mathrm{mid}(u) = \mathbf{1}\{\hat{S}(u) > \tau_\mathrm{mid}\}$$

This mask is then used to steer subsequent latent inpainting by restricting modifications to identified regions, preserving global context while synthesizing localized anomalies (Xu et al., 21 Nov 2025).
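
The accumulation-and-thresholding step might look as follows; torchvision's `GaussianBlur` stands in for the unspecified smoothing, and `tau_mid`, `kernel_size`, and `sigma` are illustrative values, not the paper's:

```python
import torch
from torchvision.transforms import GaussianBlur

def localization_mask(deltas, tau_mid=0.5, kernel_size=9, sigma=2.0):
    """Accumulate per-step delta maps, smooth, min-max normalize, and
    threshold into a binary mask M_mid."""
    S = torch.stack(deltas, dim=0).sum(dim=0)                # (B, H, W)
    S = GaussianBlur(kernel_size, sigma)(S.unsqueeze(1)).squeeze(1)
    S_hat = (S - S.amin()) / (S.amax() - S.amin() + 1e-8)    # normalize to [0, 1]
    return (S_hat > tau_mid).float()
```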

3. Mask-Guided Latent Inpainting and Editing

Following mask generation, the late denoising steps ($t < t_\mathrm{mid}$) use the localization mask to restrict generative edits. The newly predicted latent $\tilde{z}_{t-1}$ is fused with a forward-noised reference $z_{t-1}^{\mathrm{src}}$ (i.e., the normal image subjected to equivalent noise) as follows:

$$z_{t-1} = M_\mathrm{mid} \odot \tilde{z}_{t-1} + (1 - M_\mathrm{mid}) \odot z^{\mathrm{src}}_{t-1}$$

This inpainting mechanism confines anomalous synthesis to mask regions, ensuring background consistency and supporting the generation of realistic, locally constrained defects (Xu et al., 21 Nov 2025).
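
A sketch of this fusion, with `forward_noise` as a hypothetical helper that re-noises the clean source latent to the current timestep:

```python
def fuse(z_tilde_prev, z0_src, t_prev, M_mid, forward_noise):
    """Mask-guided fusion for t < t_mid: edit inside the mask, keep the
    re-noised source latent everywhere else."""
    z_src_prev = forward_noise(z0_src, t_prev)  # source latent at matching noise level
    M = M_mid.unsqueeze(1)                      # broadcast mask over channels
    return M * z_tilde_prev + (1 - M) * z_src_prev
```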

4. Prompt Refinement and Attention Biasing

To improve attribution fidelity and edit controllability, Delta-Denoising algorithms refine the token embeddings of the anomaly prompt prior to denoising. The anomaly token embeddings $e_a^j$ undergo semantic distillation and context alignment by minimizing:

$$\mathcal{L}_{\mathrm{anom}} = \sum_{j \in \mathcal{A}} \left[\big(1 - \cos(e_a^j, e_{\mathrm{detail}})\big) + \lambda\|e_a^j - e_{\mathrm{detail}}\|_2^2\right]$$

$$\mathcal{L}_{\mathrm{ctx}} = \frac{1}{Z'}\sum_{j \notin \mathcal{A}}\|e_a^j - \bar{e}_{\mathrm{ctx}}\|_2^2$$

Non-anomaly tokens are pulled toward their mean, ensuring the anomaly concept is represented as a distinct, semantically dense embedding (Xu et al., 21 Nov 2025). Additionally, an attention bias is imposed on the U-Net cross-attention blocks to amplify the anomaly token's spatial focus within the predicted defect regions:

$$A^l = \mathrm{softmax}\left(\frac{Q^l (K^l)^\top + \beta M o_a^\top}{\sqrt{d_h}}\right)$$

This biasing further sharpens spatial attribution, resulting in more precise and semantically coherent edits.
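
A sketch of the two refinement losses, under the assumption that prompt-token embeddings arrive as an $(N, d)$ tensor; `e_detail`, `A_idx`, and the weighting `lam` are illustrative stand-ins:

```python
import torch
import torch.nn.functional as F

def refine_losses(e, A_idx, e_detail, lam=0.1):
    """Compute L_anom on anomaly tokens and L_ctx on the remaining tokens."""
    ctx_idx = [j for j in range(e.shape[0]) if j not in set(A_idx)]
    anom, ctx = e[A_idx], e[ctx_idx]
    target = e_detail.expand_as(anom)
    # semantic distillation: cosine alignment plus an L2 anchor to e_detail
    L_anom = ((1 - F.cosine_similarity(anom, target, dim=-1))
              + lam * (anom - target).pow(2).sum(-1)).sum()
    e_ctx_bar = ctx.mean(dim=0, keepdim=True).detach()  # mean context embedding
    L_ctx = (ctx - e_ctx_bar).pow(2).sum(-1).mean()     # pull context tokens to their mean
    return L_anom, L_ctx
```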

5. Cross-Modal Delta Denoising and Zero-Shot Editing

Delta-Denoising extends to cross-modal scenarios, as exemplified by the AvED framework for audio-visual editing (Lin et al., 26 Mar 2025). At each denoising step, deltas $\Delta_t^{(a)}$ and $\Delta_t^{(v)}$ are computed for the audio and video latents, allowing manipulation toward a joint target specified by a text prompt. The DDPM-style reverse posterior is modified to:

$$p_\theta(x_{t-1} \mid x_t, \Delta_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t) + C_t \Delta_t,\, \sigma_t^2 I\right)$$

Objective functions combine Delta Denoising Score (DDS) losses with a cross-modal contrastive loss on attention-derived hidden features:

$$\mathcal{L}_\mathrm{DDS} = \|\Delta^v\|^2 + \|\Delta^a\|^2, \qquad \mathcal{L}_\mathrm{CMD} = \text{patch-level contrastive loss}\big(H_{a,\mathrm{trg}}^+,\, H_{v,\mathrm{src}}^-\big)$$

Gradient descent is applied to latents rather than network weights, producing temporally aligned, semantically matched edits across modalities (Lin et al., 26 Mar 2025).
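
A sketch of this latent-space optimization, reusing the `cfg_delta` helper above; the step size is illustrative and the $\mathcal{L}_\mathrm{CMD}$ term is omitted for brevity:

```python
import torch

def dds_edit(z_v0, z_a0, timesteps, eps_video, eps_audio, add_noise,
             y_src, y_trg, y_null, w, lr=0.1):
    """DDS-style cross-modal editing sketch: the CFG delta is used directly
    as the descent direction for each latent; network weights never change."""
    z_v, z_a = z_v0.clone(), z_a0.clone()
    for t in timesteps:
        with torch.no_grad():
            d_v = cfg_delta(eps_video, add_noise(z_v, t), t, y_src, y_trg, y_null, w)
            d_a = cfg_delta(eps_audio, add_noise(z_a, t), t, y_src, y_trg, y_null, w)
        z_v = z_v - lr * d_v  # update the video latent along its delta
        z_a = z_a - lr * d_a  # update the audio latent along its delta
    return z_v, z_a
```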

6. Quantitative Performance

On the MVTec AD dataset (15 categories), DeltaDeno achieves an Inception Score (IS) of 2.03 (vs. 2.02 for AnomalyAny), IC-LPIPS of 0.36 (vs. 0.33), image-level AUROC of 84.7% (vs. 77.6%), image-level AP of 91.3% (vs. 87.1%), pixel-level AUROC of 84.9% (vs. 83.3%), and pixel-level AP of 30.5% (vs. 26.5%) (Xu et al., 21 Nov 2025). In cross-modal AvED, the DINO score increases from 0.921 to 0.956, LPAPS drops from 5.93 to 5.55, and AV-Align rises from 0.33 to 0.42 with the cross-modal delta compared to single-modality baselines (Lin et al., 26 Mar 2025).

Method          IS     Image AUROC   Pixel AUROC   IC-LPIPS   LPAPS   AV-Align
DeltaDeno       2.03   84.7%         84.9%         0.36       —       —
AnomalyAny      2.02   77.6%         83.3%         0.33       —       —
AvED (single)   —      —             —             —          5.93    0.33
AvED (full)     —      —             —             —          5.55    0.42

These results indicate competitive gains in both edit realism and downstream anomaly-detection utility.

7. Algorithmic Summary

A high-level pseudocode for DeltaDeno anomaly generation (Xu et al., 21 Nov 2025); a Python sketch tying the steps together follows the list:

  1. Extract a foreground mask using a segmenter (e.g., SAM), and encode the image to its latent.
  2. Forward-noise the latent to an intermediate timestep.
  3. Refine the anomaly prompt embedding via anomaly/context loss.
  4. Initialize two diffusion branches: normal and anomaly.
  5. Accumulate step-wise delta differences, updating S.
  6. At the midpoint, smooth, normalize, and threshold S to generate a spatial mask.
  7. In the later diffusion steps, apply mask-guided inpainting, restricting generation to the detected regions.
  8. Decode the final latent to obtain the edited image and its associated mask.
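
Under the same assumptions as the earlier sketches, the loop could be tied together roughly as follows; `delta_map`, `localization_mask`, and `fuse` are reused from above, and every other helper (`segment`, `encode`, `embed`, `refine_prompt`, `forward_noise`, `timesteps_from`, `decode`, plus `t_start` and `t_mid`) is a hypothetical stand-in:

```python
def delta_deno(image, p_n, p_a):
    """Skeleton of the DeltaDeno anomaly-generation loop (steps 1-8)."""
    fg = segment(image)                                 # 1. foreground mask (e.g., SAM)
    z0 = encode(image)                                  # 1. latent encoding
    e_n = embed(p_n)                                    # normal prompt embedding
    e_a = refine_prompt(embed(p_a))                     # 3. refined anomaly embedding
    z_n = z_a = forward_noise(z0, t_start)              # 2./4. shared noisy init, two branches
    deltas, M = [], None
    for t in timesteps_from(t_start):                   # t decreases toward 0
        z_n, z_a, d = delta_map(z_n, z_a, t, e_n, e_a, eps_cfg, scheduler)
        if t >= t_mid:
            deltas.append(d * fg)                       # 5. accumulate foreground deltas
            if t == t_mid:
                M = localization_mask(deltas)           # 6. normalize + threshold
        else:
            z_a = fuse(z_a, z0, t, M, forward_noise)    # 7. mask-guided inpainting
    return decode(z_a), M                               # 8. edited image + mask
```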

For cross-modal editing (AvED) (Lin et al., 26 Mar 2025), the delta is separately computed for audio and video, with cross-attention patch-level contrastive losses enforcing semantic alignment across modalities throughout the editing trajectory.

Delta-Denoising methods combine prompt-differentiated denoising attribution, robust spatial masking, explicit semantic reinforcement in prompt space, and spatially modulated attention to enable training-free, fine-grained edit localization in both single- and multi-modal generative settings.

References (2)

  1. Xu et al., 21 Nov 2025.
  2. Lin et al., 26 Mar 2025.
