
Depth-Guided Image Deblurring

Updated 14 January 2026
  • Depth-guided image deblurring is a technique that leverages explicit scene depth information from sensors like Lidar and stereo systems to accurately model spatially-varying blur.
  • It integrates depth cues with optimization-based and deep learning frameworks to jointly estimate latent sharp images, depth maps, and motion parameters.
  • Empirical results highlight that incorporating depth information significantly improves restoration metrics such as PSNR and SSIM while reducing artifacts in complex imaging scenarios.

Depth-guided image deblurring refers to computational methods that incorporate explicit scene depth information—obtained via active sensors (e.g., Lidar), passive multi-view systems, dual-pixel disparity, or single/multi-frame depth estimation—to enhance the recovery of sharp images from motion-blurred or defocused observations. Unlike conventional deblurring which assumes either a spatially-invariant or unknown blur, depth-guided approaches model the spatially-varying kernel as a function of the underlying scene geometry, thereby enabling accurate inversion of complex, non-uniform blur typical in real imaging scenarios.

1. Physical Modeling of Depth-Dependent Blur

The fundamental principle underlying depth-guided deblurring is that the magnitude and structure of image blur depend not only on the camera or object motion but also on the 3D geometry of the scene. Two main physical cases are distinguished:

A. Motion Blur with Depth Dependency

For camera shake or general 6 DoF motion, each pixel traces a trajectory on the image plane determined by both its depth and the camera path. The observed blur is formed by integrating warped images along the motion trajectory, yielding a spatially-varying kernel $k_x(u)$ at pixel $x$:

$$B(x) = \int k_x(u)\, L(x - u)\, du + z(x)$$

where $L$ is the latent sharp image and $z(x)$ is sensor noise. The kernel $k_x$ is parameterized by the depth map $D(x)$ and the camera motion vector $p$; for each time sample during exposure, 3D points are projected onto the moving sensor via $D(x)$ and $p$ (Pan et al., 2019, Park et al., 2017).
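
A minimal numerical sketch of this forward model is given below, under simplifying assumptions (pinhole intrinsics K, a grayscale latent image, relative camera poses sampled over the exposure, uniform time weighting); the function and variable names are illustrative and not taken from the cited papers.

```python
# Sketch of depth-dependent motion-blur synthesis: warp the latent image along a
# sampled camera trajectory using the depth map, then average the warped frames.
import numpy as np
from scipy.ndimage import map_coordinates

def warp_by_pose(L, D, K, R, t):
    """Sample latent image L (H, W) where each pixel, lifted to 3D via depth D (H, W)
    and intrinsics K, projects under the relative pose (R, t)."""
    H, W = D.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])   # homogeneous pixel coords
    rays = np.linalg.inv(K) @ pix                              # back-project to camera rays
    pts = rays * D.ravel()                                     # 3D points at depth D(x)
    proj = K @ (R @ pts + t[:, None])                          # reproject under the sampled pose
    u, v = proj[0] / proj[2], proj[1] / proj[2]
    return map_coordinates(L, [v.reshape(H, W), u.reshape(H, W)], order=1, mode='nearest')

def synthesize_blur(L, D, K, poses, noise_std=0.01):
    """Average the warped frames over the exposure; `poses` is a list of (R, t) samples."""
    frames = [warp_by_pose(L, D, K, R, t) for R, t in poses]
    B = np.mean(frames, axis=0)
    return B + np.random.normal(0.0, noise_std, B.shape)       # additive sensor noise z(x)
```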

B. Defocus Blur and Depth

For wide-aperture systems, out-of-focus points are imaged as “circles of confusion” whose diameter is a deterministic function of the local depth:

$$b(d) = \frac{|d - f|}{d}\,\frac{f^2}{N c}$$

where $d$ is distance to the object, $f$ focal length, $N$ F-number, and $c$ pixel size. The PSF at each pixel can be modeled as a disk or Gaussian whose width is set by the depth at that location (Nazir et al., 2023, Yang et al., 1 Jul 2025). This spatially-varying PSF is critical for reconstructing an all-in-focus image.
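
A minimal sketch of this idea is shown below: the document's CoC model sets a per-pixel Gaussian width, and the depth map is discretized into layers that are each blurred with one representative PSF. The CoC-to-sigma mapping (b/4), the layer scheme, and the neglect of occlusion effects are simplifying assumptions for illustration.

```python
# Apply depth-dependent defocus by discretising the blur width into layers and
# blurring each layer with a Gaussian PSF of the corresponding width.
import numpy as np
from scipy.ndimage import gaussian_filter

def coc_diameter(d, f, N, c):
    """CoC model from the text: b(d) = |d - f| / d * f^2 / (N c), in pixels."""
    return np.abs(d - f) / d * f**2 / (N * c)

def layered_defocus(sharp, depth, f, N, c, n_layers=8):
    """Blur grayscale image `sharp` (H, W) with a Gaussian whose sigma varies with `depth` (H, W)."""
    sigma = coc_diameter(depth, f, N, c) / 4.0           # assumed CoC-diameter -> Gaussian-sigma mapping
    edges = np.linspace(sigma.min(), sigma.max() + 1e-6, n_layers + 1)
    out = np.zeros_like(sharp, dtype=np.float64)
    for i in range(n_layers):                            # one PSF per discretised blur layer
        mask = (sigma >= edges[i]) & (sigma < edges[i + 1])
        if not mask.any():
            continue
        s = 0.5 * (edges[i] + edges[i + 1])              # representative sigma for this layer
        out[mask] = gaussian_filter(sharp.astype(np.float64), sigma=s)[mask]
    return out                                           # occlusion/boundary effects are ignored
```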

2. Algorithmic Frameworks and Architectures

Depth guidance enters deblurring via various algorithmic paradigms:

A. Optimization-based: Joint Energy Minimization

Explicit modeling of the blur process allows for joint estimation of latent image $L$, depth $D$, pose $p$, and—in some works—super-resolution. This is performed by minimizing a composite energy functional:

$$E(L, D, p) = E_\text{data} + E_\text{motion} + E_\text{image reg.} + E_\text{flow}$$

The data term enforces consistency between the synthesized blur (via $D$ and $p$) and the observation, while regularizers impose smoothness on the image and depth, and penalize implausible motion (Pan et al., 2019, Park et al., 2017). Alternating minimization (e.g., fixing $L$ to optimize $p$, then vice versa) is applied in a coarse-to-fine pyramid, efficiently handling the non-convexity.
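
The toy sketch below illustrates only the alternating (block-coordinate) structure of such solvers at a single pyramid level; the quadratic stand-in energy and the problem dimensions are assumptions for illustration, not the actual functional of Pan et al. (2019) or Park et al. (2017).

```python
# Block-coordinate descent: fix two of (L, D, p), minimise over the third, and cycle.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
B = rng.normal(size=16)                                   # "observed blurred image" (toy vector)

def energy(L, D, p):
    data   = np.sum((L * (1 + 0.1 * D) + 0.05 * p.sum() - B) ** 2)       # toy data term
    motion = 0.1 * np.sum(p ** 2)                                         # motion prior
    reg    = 0.01 * (np.sum(np.diff(L) ** 2) + np.sum(np.diff(D) ** 2))   # smoothness regularizers
    return data + motion + reg

L, D, p = np.zeros(16), np.zeros(16), np.zeros(6)
for it in range(5):                                       # outer alternating loop
    L = minimize(lambda x: energy(x, D, p), L).x          # update latent image, D and p fixed
    D = minimize(lambda x: energy(L, x, p), D).x          # update depth
    p = minimize(lambda x: energy(L, D, x), p).x          # update camera motion
    print(it, energy(L, D, p))                            # energy decreases monotonically per block
```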

B. End-to-End Deep Learning: Depth Fusion and Conditioning

Modern approaches utilize encoder–decoder or transformer-based neural networks that incorporate depth as an additional conditioning signal:

  • Depth–RGB fusion via “adapter” blocks, which modulate image features using upscaled depth features. For instance, spatial attention maps derived from Lidar depth are applied element-wise to the image features at each decoder stage, followed by lightweight transformers (Yi et al., 2024, Yi et al., 7 Jan 2026); a minimal sketch of such a block follows this list.
  • Hard encoder-sharing, where a single backbone encodes the blurred image for both depth estimation and deblurring, enabling implicitly depth-aware features (Nazir et al., 2023).
  • Direct feature fusion using cross-correlation attention for dual-pixel disparity, capturing local disparity (i.e., depth) at multiple scales and guiding the deblurring network (Swami, 16 Feb 2025).
  • Latent diffusion models with depth guidance, where side information from Lidar is injected into all stages of the denoising process, e.g., via ControlNet (Montanaro et al., 11 Sep 2025).
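
As a minimal illustration of the first bullet above, the block below derives a spatial attention map from an aligned depth map and uses it to gate the image decoder features; it is a sketch in the spirit of depth-guided adapters, not the exact architecture of Yi et al.

```python
# Depth-guided attention adapter: depth features gate the image decoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAdapter(nn.Module):
    def __init__(self, img_ch, depth_ch=1):
        super().__init__()
        self.depth_enc = nn.Sequential(
            nn.Conv2d(depth_ch, img_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(img_ch, img_ch, 3, padding=1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(img_ch, img_ch, 1)          # lightweight fusion after modulation

    def forward(self, img_feat, depth):
        # resize the (aligned) depth map to the spatial size of this decoder stage
        depth = F.interpolate(depth, size=img_feat.shape[-2:], mode='bilinear', align_corners=False)
        attn = self.depth_enc(depth)                      # per-pixel attention weights in [0, 1]
        return img_feat + self.fuse(img_feat * attn)      # residual, depth-modulated features

# usage sketch: fused = DepthAdapter(64)(decoder_feats, lidar_depth)
#   with decoder_feats of shape (B, 64, H, W) and lidar_depth of shape (B, 1, h, w)
```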

C. Specialized Domains: Light-Field Deblurring and Simulated Data Synthesis

  • Light-field deblurring entails per-view kernel generation and angular attention, leveraging both the spatial and angular structure of the 4D light-field parameterized by depth (Shen et al., 2023).
  • Synthetic data pipelines simulate realistic, depth-dependent defocus using closed-form PSF models and spatially-varying optical aberrations to enable scalable training of deblurring networks (Yang et al., 1 Jul 2025).

3. Depth Acquisition Modalities and Integration Strategies

A. Active Sensing

Mobile Lidar and time-of-flight sensors directly provide depth maps even under adverse lighting or blur. These are super-resolved and aligned to the RGB image, then fed into the deblurring algorithm—either as direct input channels, or via feature adapters (Yi et al., 2024, Yi et al., 7 Jan 2026, Montanaro et al., 11 Sep 2025).
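
A minimal sketch of the simplest integration style (an assumption, not any specific paper's pipeline): upsample a low-resolution sensor depth map to the RGB resolution, crudely fill missing returns, and stack it as a fourth input channel for a deblurring network.

```python
# Prepare an RGB-D input tensor from an RGB frame and a low-resolution depth map.
import torch
import torch.nn.functional as F

def prepare_rgbd(rgb, depth_lowres):
    """rgb: (B, 3, H, W); depth_lowres: (B, 1, h, w), with 0 marking missing returns."""
    depth = F.interpolate(depth_lowres, size=rgb.shape[-2:], mode='bilinear', align_corners=False)
    valid = depth > 0
    fill = depth[valid].median() if valid.any() else depth.new_tensor(0.0)
    depth = torch.where(valid, depth, fill)                    # crude hole filling for missing returns
    depth = (depth - depth.mean()) / (depth.std() + 1e-6)      # normalise before fusion
    return torch.cat([rgb, depth], dim=1)                      # 4-channel RGB-D network input
```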

B. Stereo, Dual-Pixel, and Multi-View

Passive stereo or dual-pixel sensors recover disparity fields, convertible to depth. Networks such as MCCNet use cross-correlation attention to extract and utilize these disparity cues at multiple encoder scales (Swami, 16 Feb 2025).
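
The block below sketches the general idea of cross-correlation attention between left/right dual-pixel features; it is an illustrative assumption in the spirit of MCCNet, not its actual architecture.

```python
# Correlate dual-pixel feature maps over a small horizontal search range and turn the
# resulting cost volume into disparity-aware attention weights.
import torch
import torch.nn as nn

class DualPixelCorrAttention(nn.Module):
    def __init__(self, channels, max_disp=4):
        super().__init__()
        self.max_disp = max_disp
        self.proj = nn.Conv2d(2 * max_disp + 1, channels, 1)   # cost volume -> attention map

    def forward(self, feat_left, feat_right):
        corrs = []
        for d in range(-self.max_disp, self.max_disp + 1):
            shifted = torch.roll(feat_right, shifts=d, dims=-1)            # horizontal shift by d px
            corrs.append((feat_left * shifted).mean(dim=1, keepdim=True))  # per-pixel correlation
        cost = torch.cat(corrs, dim=1)                 # (B, 2*max_disp+1, H, W); wrap-around ignored
        attn = torch.sigmoid(self.proj(cost))          # disparity-aware gating in [0, 1]
        return feat_left * attn                        # modulated features passed to the deblur branch
```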

C. Monocular Depth from Defocus/Blur

In the absence of a sensor, depth can be estimated via depth-from-defocus algorithms or CNNs trained to infer depth from spatial blur patterns (Nazir et al., 2023). Such depth estimates can then either guide the deblurring or be used in a joint estimation network.
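
As a simple illustration of a classical defocus cue (distinct from the learned estimators cited above), the sketch below uses local Laplacian energy as a sharpness measure whose inverse serves as a crude per-pixel defocus proxy.

```python
# Crude depth-from-defocus cue: low local Laplacian energy indicates strong defocus.
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def defocus_map(gray, window=15):
    """Per-pixel blur proxy in [0, 1] for a grayscale image `gray` with values in [0, 1]."""
    lap = laplace(gray.astype(np.float64))              # second-derivative (edge) response
    sharpness = uniform_filter(lap ** 2, size=window)   # local Laplacian energy = focus measure
    sharpness /= sharpness.max() + 1e-12
    return 1.0 - sharpness                              # high values indicate strong defocus
```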

4. Quantitative and Qualitative Advancements from Depth Guidance

Depth-informed algorithms consistently outperform blind deblurring on both synthetic and real RGB-D datasets.

Model/Scenario              PSNR (dB)   SSIM     LPIPS    Speed (s)   Source
Restormer                   34.52       0.9318   0.1369   46.56       (Yi et al., 2024)
Depth-Restormer (Lidar)     36.62       0.9446   0.1093   55.84       (Yi et al., 2024)
EDIBNet (depth, 32chan)     35.10       0.9681   -        0.40        (Yi et al., 7 Jan 2026)
ZSLDB (diffusion, Lidar)    24.0        0.83     0.1643   -           (Montanaro et al., 11 Sep 2025)
ZSLDB (no depth)            23.2        0.81     0.1821   -           (Montanaro et al., 11 Sep 2025)

Qualitative phenomena enabled by depth guidance include:

  • Restoration of spatially-varying blur across depth discontinuities (sharp boundaries at occlusions, improved fine texture at foreground/background transitions)
  • Reduced artifacts such as ringing or over-smoothing
  • Ability to synthesize sharp video sequences from a single blurry input via 3D-aware warping (Pan et al., 2019)
  • Sharpness recovery even for scenes with complex field-dependent optical aberrations (Yang et al., 1 Jul 2025)
  • Significant efficiency gains using adapters/wavelet-domain computation for resource-constrained devices (Yi et al., 7 Jan 2026)

Ablation studies consistently demonstrate that networks with explicit depth input (Lidar or high-fidelity estimation) provide several decibels of improvement in PSNR, higher SSIM, and lower perceptual-loss scores compared to those using either no depth or depth inferred solely from blur (Yi et al., 2024, Nazir et al., 2023).

5. Specialized Depth-Guided Deblurring Domains

A. Light-Field Imaging

Depth-guided deblurring in light-field cameras is managed via modules such as view-adaptive spatial convolution (VASC) and depth-perception view attention (DPVA), which respectively adapt the convolutional kernel for each sub-aperture view and blend angular information according to local depth. These modules, combined with an angular position embedding, yield consistent epipolar plane images and sharp reconstruction across all LF views (Shen et al., 2023).
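
The sketch below illustrates the general idea of depth-conditioned view (angular) attention for light-field features: per-pixel depth predicts blending weights over the sub-aperture views. It is an illustrative assumption, not the VASC/DPVA modules of Shen et al. (2023).

```python
# Depth-conditioned angular attention over sub-aperture view features.
import torch
import torch.nn as nn

class DepthViewAttention(nn.Module):
    def __init__(self, n_views, channels):
        super().__init__()
        # predict one weight per view from the local depth at each pixel
        self.to_weights = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, n_views, 1),
        )

    def forward(self, view_feats, depth):
        """view_feats: (B, V, C, H, W) sub-aperture features; depth: (B, 1, H, W)."""
        w = torch.softmax(self.to_weights(depth), dim=1)        # (B, V, H, W), sums to 1 over views
        return (view_feats * w.unsqueeze(2)).sum(dim=1)         # depth-weighted angular blend -> (B, C, H, W)
```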

B. Synthetic Data and Simulation

Physically-based data synthesis pipelines that incorporate per-pixel depth-dependent PSFs and spatially-varying lens aberrations enable training of robust networks able to generalize to various camera optics and noise profiles. Depth discretization and “depth splatting” optimize computational efficiency to allow scalable, high-resolution simulations (Yang et al., 1 Jul 2025).

C. Zero-shot and Diffusion-based Deblurring

ControlNet-based conditional diffusion models, such as ZSLDB, operate in a zero-shot and training-free regime by leveraging Lidar depth maps as conditions within the denoising process. This enables competitive performance against trained baselines, especially for mobile-captured real-world blur scenarios (Montanaro et al., 11 Sep 2025).
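
For illustration, the snippet below shows how a depth map can be injected as a ControlNet condition in an off-the-shelf diffusers img2img pipeline; this is a generic assumed setup rather than the ZSLDB method, and the model identifiers are the public SD-1.5 depth ControlNet, used purely as an example.

```python
# Depth-conditioned diffusion: the denoising of a blurry input is guided by a depth map.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth",
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

blurred = Image.open("blurred.png").convert("RGB")
depth = Image.open("lidar_depth.png").convert("RGB")      # aligned depth rendered as an image

restored = pipe(prompt="a sharp, detailed photograph",
                image=blurred,                 # img2img initialisation from the blurry frame
                control_image=depth,           # depth condition injected at every denoising step
                strength=0.5,                  # how far the result may diverge from the input
                num_inference_steps=30).images[0]
restored.save("deblurred.png")
```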

6. Limitations and Open Directions

  • Depth Sensing Limitations: Available mobile Lidar or ToF devices have restricted depth ranges and resolutions; missing returns or noisy data (at transparent/reflective surfaces) require inpainting, which may introduce artifacts (Yi et al., 2024, Yi et al., 7 Jan 2026).
  • Domain Gaps: Synthetic datasets built for learned deblurring often have bias in PSFs and do not capture real-lens, field-dependent aberrations unless carefully modeled (Yang et al., 1 Jul 2025).
  • Textureless Regions: Feature-based fusion with depth provides limited benefit in areas with very low texture (DPDNet, MCCNet), as disparity and classical stereo cues become unreliable (Swami, 16 Feb 2025).
  • Sensitivity to Depth Quality: Quantitative gains are maximal only when accurate depth maps are provided; monocular depth estimation supplementary to RGB can improve regularization but is outperformed by true sensor data (Yi et al., 2024).
  • Future Research Vectors:
    • Joint refinement of depth and image in self-consistent networks
    • Integrating temporal cues from Lidar or event sensors for dynamic scene deblurring
    • Addressing highly non-uniform blur due to fast, complex motion paths
    • Incorporating explicit perceptual or geometric consistency losses
    • Construction of real datasets pairing natural defocus blur with accurate depth and all-in-focus ground truth (Nazir et al., 2023, Montanaro et al., 11 Sep 2025)

Depth-guided deblurring demonstrably outperforms classical maximum a posteriori (MAP) and variational Bayes (VB) approaches, as well as purely 2D deep-learning models, in handling spatially-varying blur kernels, especially under real-world conditions involving significant depth variation. Recent methods, such as universal adapters and lightweight wavelet-domain models, balance high deblurring fidelity with order-of-magnitude reductions in runtime or memory, making real-time deployment feasible (Yi et al., 2024, Yi et al., 7 Jan 2026). Joint frameworks (depth, motion, super-resolution) further emphasize the coupling of geometry and blur physics for maximal restoration accuracy (Park et al., 2017). In multi-view and light-field regimes, depth-aware angular attention is essential for maintaining correspondence and resolving complex blur interactions (Shen et al., 2023).

A plausible implication is that, as depth sensors and disparity cues become more routinely available in mobile devices, future deblurring systems will move towards tightly-coupled, geometry-aware architectures that leverage the full spatiotemporal scene structure for image restoration.
