Zero-Shot & Diffusion Deblurring
- The topic outlines how diffusion models recast deblurring as a stochastic inverse problem, ensuring data fidelity through guided sampling.
- It details methodologies that integrate physics-based degradation operators and advanced conditioning to address both static and temporal blur.
- Performance benchmarks demonstrate state-of-the-art PSNR/SSIM and accelerated sampling, while challenges remain in computational intensity and generalization.
Zero-shot and diffusion-based deblurring comprises a class of image and video restoration methods that exploit the generative capacity of diffusion models to remove blur under minimal or no task-specific retraining. These methods leverage either pretrained unconditional or conditional diffusion models, or dynamically constructed priors, to solve both non-blind and blind deblurring problems in a highly flexible, data-adaptive, and physically grounded manner. By integrating physics-based degradation operators, advanced sampling and optimization strategies, and—where applicable—self-supervised learning, they set a new standard for model generality, data consistency, and sample realism across static and temporal domains.
1. Foundational Formulations and Diffusion Inverse Problem Frameworks
Zero-shot diffusion-based deblurring recasts classical linear and non-linear deconvolution as a stochastic generative inverse problem, whose solution is obtained by guided sampling from a learned (or dynamically learned) prior.
For non-blind deblurring, the image observation model typically follows

y = A x + n,

where y is the blurry measurement, A is a known degradation (blur) operator (often a convolution, possibly temporally averaging), x is the unknown sharp image (or, for temporal blur, a sequence), and n is additive noise. The corresponding posterior is formulated as

p(x | y) ∝ p(y | x) p(x),

with the prior p(x) modeled by a diffusion process.
For blind deblurring, both x and A (or the blur kernel k) are unknown, increasing the ill-posedness.
Diffusion-based models (e.g., DDPM, DiT, Consistency Models) simulate the forward corruption of real data into noise and train neural networks to reverse this map. The reverse process can be guided for restoration through posterior sampling, which injects a data-fidelity (likelihood) term into the learned score or sampling dynamics:

∇_{x_t} log p(x_t | y) = ∇_{x_t} log p(x_t) + ∇_{x_t} log p(y | x_t).
The precise methodology for handling the measurement operator varies across approaches, from explicit null-space projection to deep encoder conditioning.
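As a concrete illustration, this guidance can be sketched as a single reverse step combining a denoiser's clean-image estimate with a data-fidelity gradient. This is a minimal sketch, not any cited paper's algorithm: the denoiser is a placeholder, the operator A is a toy 1D moving-average blur, and the step size zeta is an illustrative choice.

```python
import numpy as np

def denoise(x_t, t):
    # Placeholder for E[x_0 | x_t] from a pretrained diffusion model.
    return x_t

def guided_step(x_t, y, A, t, zeta=0.1):
    """One posterior-sampling update: prior denoising plus a data-fidelity
    gradient pulling A(x0_hat) toward the measurement y."""
    x0_hat = denoise(x_t, t)
    residual = y - A(x0_hat)
    # For linear A, the gradient of ||y - A x||^2 is -2 A^T r; the centered
    # symmetric blur below satisfies A^T = A.
    grad = -2.0 * A(residual)
    return x0_hat - zeta * grad

def A(x):
    # Toy degradation: 1D moving-average "blur".
    return np.convolve(x, np.ones(5) / 5.0, mode="same")

x_sharp = np.zeros(32); x_sharp[16] = 1.0        # sharp signal (a spike)
y = A(x_sharp)                                   # blurry measurement
x_t = y + 0.05 * np.random.default_rng(0).standard_normal(32)
x_next = guided_step(x_t, y, A, t=10)            # fidelity residual shrinks
```

Each such step trades off the prior (denoising) against measurement consistency, which is the common core behind the methods surveyed below.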
2. Architectures, Guidance, and Network Conditioning in Zero-Shot Deblurring
Zero-shot models can operate with either unmodified pretrained diffusion backbones or with specialized architectural adaptations.
Classical (Unmodified) Posterior Sampling:
Frameworks such as DDNM/Null-space Diffusion (Wang et al., 2022) and DPS employ strict data-consistency projections at each reverse diffusion step, decomposing the image space via the pseudo-inverse A† into range and null-space components, and enforcing A x̂ = y at every iteration.
Advanced Conditioning:
InvFussion (Elata et al., 2 Apr 2025) directly integrates the degradation operator A and the measurement y into the feature space of each network block via a Feature Degradation Layer, concatenating the operator-conditioned features with the measurement embedding before mapping back to the image domain, and feeding these features into the attention stream. This supports both zero-shot flexibility (an arbitrary A at inference) and fast convergence, approaching the PSNR and sample realism of task-specific supervised methods.
Video and Temporal Blur:
For temporally induced blur, architectures like DiTVR (Gao et al., 11 Aug 2025) and VDM-MD (Pang et al., 22 Jan 2025) use Transformers or video-oriented diffusion models, embedding motion and frame correspondence (via patch-based or window-based attention, optical flow, or spatiotemporal neighbor caches) to achieve temporal alignment and coherence, essential for motion deblurring.
Blind Deblurring and Self-Diffusion:
DeblurSDI (Yang et al., 31 Oct 2025) adopts a fully dynamic, self-diffusion process in which two randomly initialized networks (a U-Net denoiser and a kernel generator) are jointly optimized per instance at each noise level. This avoids reliance on external priors, learning an instance-specific prior by coarse-to-fine reverse diffusion driven by data consistency and kernel sparsity.
3. Posterior and Consistency-guided Sampling Algorithms
Diffusion-based deblurring employs a range of posterior sampling and physics-guided iterative methods:
Null-Space and Data-Consistency Projection:
The DDNM approach (Wang et al., 2022) alternates denoising and hard projection steps: predicting the clean image from a noisy sample, projecting range-space components to exactly match the blurred observation, and refining null-space components via diffusion. DDNM adds tunable correction scaling and time-travel strategies to address noise amplification and ensure global harmony, especially for large-kernel blur.
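A minimal numerical sketch of the range/null-space split, assuming a 1D circular blur so the pseudo-inverse reduces to FFT division (the published method handles general linear A; the kernel, dimensions, and the idealized frequency truncation below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
k = np.zeros(n); k[:5] = 1.0 / 5.0   # circular 5-tap boxcar blur
K = np.fft.fft(k)                    # frequency response of the blur
K[np.abs(K) < 0.1] = 0.0             # idealized: exact zeros -> nontrivial null space

def A(x):
    return np.real(np.fft.ifft(np.fft.fft(x) * K))

def A_pinv(y):
    # FFT-domain pseudo-inverse: invert only non-zero frequencies.
    mask = K != 0
    return np.real(np.fft.ifft(
        np.where(mask, np.fft.fft(y) / np.where(mask, K, 1.0), 0.0)))

x_true = rng.standard_normal(n)
y = A(x_true)                        # blurry observation
x0_hat = rng.standard_normal(n)      # denoiser's current estimate

# DDNM-style projection: range-space content comes from the measurement,
# null-space content is kept from the generative estimate.
x_proj = A_pinv(y) + (x0_hat - A_pinv(A(x0_hat)))
```

After the projection, A(x_proj) reproduces y exactly, so the hard data-consistency constraint holds at every iteration while the diffusion prior refines only the unobserved null-space component.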
Likelihood Score Balancing and Fast Sampling:
Zero-Shot Approximate Posterior Sampling (ZAPS) (Alçalar et al., 2024) tunes per-step likelihood weights using a test-time, physics-driven loss that measures measurement consistency at the output, enabling robust and accelerated sampling under arbitrary (including irregular, non-uniform) diffusion schedules. This is critical for practical inference speed and robustness to schedule design.
Bypass and Accelerated Samplers:
The Quick Bypass Mechanism (QBM) (Tai et al., 6 Jul 2025) enables rapid inference by skipping early coarse denoising steps, initializing the process with a noise-augmented pseudo-inverse deblurred image at an intermediate noise level, and using a Revised Reverse Process (RRP) with increased stochasticity to absorb distributional deviations. This dramatically reduces the required number of iterations (a 10×–20× speedup) while matching or exceeding standard zero-shot deblurring quality.
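The initialization idea can be sketched as follows; the DDPM β-schedule and the chosen intermediate step are illustrative assumptions, and the pseudo-inverse estimate is a stand-in array rather than an actual deblurred image.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # standard DDPM linear schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factor

def noisy_init(x_pinv, t):
    """Map a pseudo-inverse estimate to the diffusion state at step t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x_pinv.shape)
    return np.sqrt(alpha_bar[t]) * x_pinv + np.sqrt(1.0 - alpha_bar[t]) * eps

x_pinv = rng.standard_normal(64)          # stand-in for A^+ y
t_mid = 300                               # skip the earliest coarse steps
x_t = noisy_init(x_pinv, t_mid)           # start reverse sampling from here
```

Because the coarse structure is already supplied by the pseudo-inverse image, the reverse chain only needs to run from t_mid down to 0 instead of from T.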
Consistency Models and Few-Step Methods:
CM4IR (Garber et al., 2024) distills the sampling process into a handful of feedforward steps (four) through a consistency model. It combines smart pseudo-inverse initialization, back-projection guidance, and a novel anti-correlated noise injection, achieving high-fidelity deblurring and sharpness with dramatically fewer evaluations than classic DDPM/DDIM samplers.
4. Extensions: Temporal Deblurring and Video Diffusion Priors
Removal of motion blur—arising from temporal averaging rather than spatial convolution—is addressed by leveraging video diffusion priors and modeling the measurement as an average over a sequence of sharp (unknown) frames:
Motion-Blur Temporal Models:
VDM-MD (Pang et al., 22 Jan 2025) formulates single-image motion deblurring as the inversion of a temporal averaging process,

y = (1/N) Σ_{i=1}^{N} x_i,

where x_1, …, x_N are latent sharp video frames. A pre-trained video diffusion transformer (DiT/STDiT) acts as a prior in a VQ-GAN-compressed latent space. Diffusion Posterior Sampling is performed in this latent space, iteratively enforcing data consistency by matching the blurred measurement to the average of the decoded frames, without explicit kernel estimation. Empirical performance demonstrates substantial PSNR/SSIM improvements over leading CNN deblurring baselines under both synthetic and real-world (BAIR, CLEVRER) motion blur.
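A toy version of this temporal-averaging data-consistency term, with synthetic stand-in frames in place of a video prior's latents (frame count, image size, and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, W = 7, 8, 8
frames_true = rng.standard_normal((N, H, W))   # latent sharp frames x_i
y = frames_true.mean(axis=0)                   # blurred measurement = average

def dc_gradient(frames, y):
    """Gradient of ||y - mean(frames)||^2 with respect to each frame:
    every frame receives the same -(2/N) * residual term."""
    residual = y - frames.mean(axis=0)
    return -(2.0 / len(frames)) * np.broadcast_to(residual, frames.shape)

frames = rng.standard_normal((N, H, W))        # current generative estimate
for _ in range(200):                           # plain gradient descent on DC
    frames = frames - 1.0 * dc_gradient(frames, y)
```

In the full method this gradient is interleaved with diffusion denoising of the frame sequence, so the prior resolves the (heavily underdetermined) split of the average into individual sharp frames.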
Transformer-based Temporal Synchronization:
DiTVR (Gao et al., 11 Aug 2025) introduces trajectory-aware attention mechanisms within a transformer framework, aligning tokens along optical flows and injecting data consistency only in low-frequency (wavelet) spectral bands during sampling. This approach substantially improves temporal coherence and mitigates flicker/ghosting artifacts typical of frame-wise denoising, providing perceptually superior video deblurring in a fully zero-shot regime.
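The low-frequency-only consistency idea can be sketched with a one-level Haar transform (a simplified stand-in for the paper's wavelet machinery): only the low-frequency (LL) band is overwritten from the reference, while generated high-frequency detail is preserved.

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar transform (assumes even dimensions).
    Returns (LL, LH, HL, HH) bands."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2
    return a, h, v, d

def ihaar2d(a, h, v, d):
    """Exact inverse of haar2d."""
    x = np.empty((2 * a.shape[0], 2 * a.shape[1]))
    x[0::2, 0::2] = (a + h + v + d) / 2
    x[0::2, 1::2] = (a - h + v - d) / 2
    x[1::2, 0::2] = (a + h - v - d) / 2
    x[1::2, 1::2] = (a - h - v + d) / 2
    return x

def lowfreq_consistency(gen, ref):
    """Swap in the reference's LL band, keep the generator's detail bands."""
    _, h, v, d = haar2d(gen)
    a_ref, _, _, _ = haar2d(ref)
    return ihaar2d(a_ref, h, v, d)
```

Constraining only the LL band ties the sample to the measurement's coarse structure without suppressing the high-frequency detail that the diffusion prior is expected to synthesize.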
Conditional Video Diffusion:
DIVD (Long et al., 2024) implements a conditional video diffusion model with window-based temporal self-attention (WTSA) and multi-frame relative positional encoding. While trained and evaluated in a supervised setting, this architecture demonstrates that windowed alignment and temporally aware conditioning significantly improve perceptual quality (FID, LPIPS) at a modest PSNR cost; ablations of the WTSA components reveal the contribution of each module.
5. Blind, Instance-Specific, and Self-Supervised Deblurring
Blind deconvolution in a zero-shot, diffusion-based context requires instance-adaptive priors and dynamic optimization strategies:
Self-Diffusion with Instance-specific Priors:
DeblurSDI (Yang et al., 31 Oct 2025) eschews any pretrained priors, instead instantiating two neural networks (one for the image, one for the kernel) from scratch for each input, and jointly optimizing them using a noisy reverse diffusion process comprising a likelihood (data-consistency) term and an ℓ1 sparsity penalty on the kernel to encourage realistic blur estimation. A carefully designed noise schedule ensures stable convergence from large-scale, low-frequency reconstruction to fine detail. This achieves stable restoration and highly accurate kernel estimation across a diverse set of benchmarks (Levin, Cho, Köhler, FFHQ).
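The structure of this joint objective (data consistency plus kernel sparsity) can be illustrated with a drastically simplified 1D stand-in that optimizes pixel and kernel variables directly by gradient descent, rather than DeblurSDI's two networks and noise schedule; the kernel size, step sizes, and sparsity weight are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 5
k_true = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # ground-truth blur kernel
x_true = (rng.random(n) > 0.8).astype(float)     # sparse spike train

def cconv(a, bf):
    """Circular convolution; `bf` is the kernel already in the frequency domain."""
    return np.real(np.fft.ifft(np.fft.fft(a) * bf))

y = cconv(x_true, np.fft.fft(k_true, n))         # blurry observation

x = y.copy()                                     # init image at observation
k = np.ones(m) / m                               # init flat kernel
lam, step_x, step_k = 1e-3, 0.1, 2e-3

for _ in range(800):
    Kf = np.fft.fft(k, n)
    r = cconv(x, Kf) - y                         # data-consistency residual
    Rf = np.fft.fft(r)
    gx = np.real(np.fft.ifft(np.conj(Kf) * Rf))  # grad of ||k*x - y||^2 wrt x
    gk = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * Rf))[:m]  # ... wrt kernel taps
    x -= step_x * gx
    k -= step_k * (gk + lam * np.sign(k))        # L1 subgradient on the kernel
    k = np.clip(k, 0.0, None)                    # project to a valid PSF:
    k /= k.sum()                                 # non-negative, sums to 1
```

Alternating updates of this kind drive the residual toward zero while the sparsity and simplex constraints keep the kernel physically plausible; the full method additionally anneals noise from coarse to fine scales.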
Learnable Physics-driven Guidance:
ZAPS (Alçalar et al., 2024) adaptively learns per-step posterior (likelihood) weights and Hessian approximations through a zero-shot, test-time, self-supervised loss that measures only data-consistency, updating the guidance weights via unrolled, differentiable sampling.
6. Performance Benchmarks and Practical Implications
Quantitative metrics across several works consistently indicate that diffusion-based zero-shot methods match or exceed prior state-of-the-art in both traditional (PSNR, SSIM) and perceptual (FID, LPIPS) measures. Key robust strategies include latent-space compression (VDM-MD), per-step weight tuning (ZAPS), null-space refinement (DDNM), and temporal alignment (DiTVR).
The table below summarizes representative results from several foundational works:
| Model | Domain | PSNR | SSIM | FID / LPIPS / C-FID | NFE/Evals | Notable Strength |
|---|---|---|---|---|---|---|
| VDM-MD (Pang et al., 22 Jan 2025) | Single-img | 24.24–30.26 | 0.896–0.914 | – | 1000 | Excels at non-linear, temporal blur removal |
| DeblurSDI (Yang et al., 31 Oct 2025) | Blind, static | 28.73–33.90 | 0.765–0.906 | – | 30×200 | No pretraining, robust blind kernel recovery |
| DDNM (Wang et al., 2022) | Static | 44.93 | 0.9937 | 1.15 (FID) | 100–250 | Null-space consistency, strong sample realism |
| QBM+RRP (Tai et al., 6 Jul 2025) | Static | 45.89 | 0.996 | – | 5–10 | 10x–20x speedup over standard samplers |
| CM4IR (Garber et al., 2024) | Static | 28.85 | – | 0.217 (LPIPS) | 4 | High fidelity in minimal evaluations |
| DiTVR (Gao et al., 11 Aug 2025) | Video | ~32 | 0.7935 | 0.1870 (LPIPS) | 1000 | Temporal consistency, zero-shot video |
| InvFussion (Elata et al., 2 Apr 2025) | Static | 22.59 | – | 4.69 (C-FID) | 63 | Fast convergence, broad generality |
Performance is contingent on the match between the pretrained prior's domain and the true data, on accurate blur model specification (for non-blind cases), and—where applicable—on stabilization strategies for optimization and noise scheduling.
7. Limitations, Open Challenges, and Future Research
Despite their generality, current approaches exhibit several limitations:
- Computational Intensity: Iterative/denoising chains (DDPM, DiT) have substantially higher inference times than direct regression or CNN-based approaches. Strategies like QBM, ZAPS, and CM4IR ameliorate this via accelerated sampling and step-skipping, but the cost remains non-trivial at high resolutions or for video.
- Dependence on Pretrained Priors: Performance for non-dynamic approaches is bounded by the representational capacity and data coverage of the chosen pretrained diffusion model. Rare or out-of-distribution structures may be imperfectly reconstructed or hallucinated.
- Blind Kernel Limitations: DeblurSDI currently models only spatially invariant blur; extension to spatially varying kernels is an open avenue. Treatment of specialized or real-world sensor noise models remains underexplored.
- Temporal Alignment and Extreme Blur: In video, the accuracy of temporal attention and flow-guided alignment is sensitive to optical-flow estimation errors, especially under extreme or non-uniform motion.
- Generalization: Most supervised video diffusion models demonstrate excellent in-domain generalization but lack explicit evidence of strong cross-dataset or true zero-shot capability. Robustness of “zero-shot” paradigms in the wild, under unknown noise/blur statistics, remains a focus for further study (Long et al., 2024).
A plausible implication is that future research will increasingly focus on hybrid architectures that unify dynamic (instance-specific) and static (pretrained) priors, advanced physics-driven guidance, and more efficient samplers that close the remaining gap in distortion and perceptual metrics while maintaining practical inference times and robustness to novel degradations.