DeblurNet: All-in-Focus Image Recovery
- DeblurNet is a network that recovers all-in-focus images from defocused inputs by leveraging dual input encoding and transformer-based denoising.
- It employs positional disentanglement and pre-deblur dropout to robustly fuse observed and auxiliary images, significantly enhancing image sharpness.
- Integration with BokehNet enables virtual refocusing and bokeh synthesis, outperforming monolithic deblurring systems on standard benchmarks.
DeblurNet is the first-stage network in the Generative Refocusing framework, designed to recover all-in-focus imagery from input photographs exhibiting optical defocus or bokeh artifacts. It constitutes a core component in a two-step architecture, preceding bokeh synthesis (BokehNet), with the primary objective of extracting sharp signal from spatially-varying blur while remaining robust to imperfect guidance from auxiliary deblurring algorithms (Mu et al., 18 Dec 2025).
1. Problem Formulation and Role in Generative Refocusing
DeblurNet addresses the inherently ill-posed single-image deblurring problem in the context of depth-of-field (DoF) manipulation. Given a defocused or bokeh image $I_d$, possibly accompanied by a "pre-deblur" image $I_{pre}$ generated by an existing deep restoration CNN (e.g., DRBNet), the goal is to reconstruct an all-in-focus estimate $\hat{I}_{aif}$ that preserves spatial detail across a wide range of focal depths.
Unlike conventional blind deblurring systems, which operate solely on a single input, DeblurNet utilizes both the observed defocused input $I_d$ and a proxy image $I_{pre}$ from a non-specialized restoration pipeline. This two-input paradigm is motivated by the need to mitigate hallucination and bias effects when the pre-deblurred image is artifact-prone or misguides the reconstruction process.
DeblurNet, when used together with BokehNet, enables end-to-end virtual refocusing. The all-in-focus output from DeblurNet provides the latent source for photorealistic shallow-DoF rendering, custom aperture effects, and refocusing at arbitrary depths (Mu et al., 18 Dec 2025).
2. Network Architecture and Innovation
DeblurNet builds on the latent-space DiT (Diffusion Transformer) backbone deployed in FLUX models, integrating architecture-level mechanisms to enforce robustness and disentanglement:
- Input Encoding and Positional Disentanglement: Both $I_d$ and $I_{pre}$ are processed via a VAE encoder $\mathcal{E}$ to yield latent tokens. Rather than naïvely concatenating these, DeblurNet tiles the two latent grids with distinct shifts in the positional-encoding space (termed "positional disentanglement"); a minimal sketch follows this list. This procedure counters the risk of the model overfitting to the auxiliary input and forces explicit reasoning about uncertainty.
- Pre-Deblur Dropout: To further suppress dependence on $I_{pre}$ and encourage signal extraction from $I_d$, the system randomly masks out the $I_{pre}$ channel during training, simulating scenarios where auxiliary restoration fails or is absent.
- Transformer-based Denoising: The concatenated latent tokens are fed into a DiT backbone $\epsilon_\theta$, which operates as the denoising engine in the DDPM diffusion process. Conditioned on the positionalized latents, the transformer predicts the noise at each step, from which the next-step latent in the diffusion chain is obtained.
- VAE Decoding: The final latent output is decoded back to image space using a VAE decoder $\mathcal{D}$, producing the all-in-focus estimate $\hat{I}_{aif}$.
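The positional-disentanglement step can be sketched in PyTorch as below. This is a minimal illustration under stated assumptions, not the FLUX implementation: the VAE interface (`vae.encode` returning latent tokens), the latent grid size, and the particular shift applied to the pre-deblur grid are all hypothetical.

```python
import torch

def build_position_ids(h, w, shift=(0, 0)):
    """2-D positional IDs for an h x w latent token grid, offset by `shift`.

    Tiling the two latent grids at distinct offsets ("positional
    disentanglement") lets the transformer tell the observed input's
    tokens apart from the pre-deblur tokens. The offsets are illustrative.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ids = torch.stack([ys + shift[0], xs + shift[1]], dim=-1)  # (h, w, 2)
    return ids.reshape(h * w, 2)

def encode_inputs(vae, img_defocused, img_predeblur, h=64, w=64):
    """Encode both inputs with the shared VAE encoder, then tag each token
    grid with a distinct positional shift before concatenation."""
    z_d = vae.encode(img_defocused)   # (B, h*w, C) latent tokens (assumed API)
    z_p = vae.encode(img_predeblur)
    pos_d = build_position_ids(h, w, shift=(0, 0))
    pos_p = build_position_ids(h, w, shift=(0, w))  # shifted grid for I_pre
    tokens = torch.cat([z_d, z_p], dim=1)           # concatenated token set
    pos = torch.cat([pos_d, pos_p], dim=0)          # disentangled positions
    return tokens, pos
```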
3. Training Objective and Diffusion Process
DeblurNet is trained using the standard DDPM (Denoising Diffusion Probabilistic Model) objective. Specifically:

$$\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{z_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[\big\| \epsilon - \epsilon_\theta(z_t,\, t,\, c_d,\, c_{pre}) \big\|_2^2\Big],$$

where $z_t$ denotes the noisy latent representation at timestep $t$, and $c_d$, $c_{pre}$ are the positionalized latents derived from the input pair $(I_d, I_{pre})$. This stochastic denoising loss is optimized to minimize the distance between the predicted and true noise tensors over multiple diffusion steps.
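As a minimal sketch, one training step under this loss might look as follows. The `dit` and `vae` interfaces, the noise schedule `alphas_cumprod`, and the dropout probability `P_DROP` are placeholders; the paper's exact hyperparameters and component APIs are not specified here.

```python
import torch
import torch.nn.functional as F

P_DROP = 0.1  # assumed pre-deblur dropout rate; not specified in the source

def training_step(dit, vae, img_d, img_pre, img_sharp,
                  alphas_cumprod, num_timesteps=1000):
    """One DDPM training step for DeblurNet (illustrative interfaces).

    The sharp target is encoded, noised to a random timestep, and the DiT
    predicts the injected noise conditioned on both positionalized latents.
    """
    z0 = vae.encode(img_sharp)          # clean target latent (assumed API)
    c_d = vae.encode(img_d)             # latent of the defocused input
    c_pre = vae.encode(img_pre)         # latent of the pre-deblur proxy

    # Pre-deblur dropout: occasionally zero the auxiliary latent so the
    # model learns to extract sharpness from the defocused input alone.
    if torch.rand(()).item() < P_DROP:
        c_pre = torch.zeros_like(c_pre)

    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps  # forward process

    eps_pred = dit(z_t, t, c_d, c_pre)  # conditioned noise prediction
    return F.mse_loss(eps_pred, eps)    # || eps - eps_theta ||^2
```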
Training data comprises paired dual-pixel defocus/clear images (DPDD, RealBokeh_3MP) suitable for sharpness supervision, expanded by diversity-enhancing augmentations such as random cropping and color jitter (Mu et al., 18 Dec 2025).
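Because supervision is paired, geometric augmentations must be replayed identically on the defocused and sharp images to preserve pixel alignment. A sketch using torchvision follows; the crop size and jitter strengths are assumptions, not values from the paper.

```python
import torchvision.transforms.functional as TF
from torchvision import transforms

def augment_pair(img_defocused, img_sharp, crop=512):
    """Replay the same random crop and color jitter on both images of a
    defocus/sharp pair so that supervision stays pixel-aligned."""
    # Shared random crop: sample parameters once, apply to both images.
    i, j, h, w = transforms.RandomCrop.get_params(
        img_defocused, output_size=(crop, crop))
    pair = [TF.crop(im, i, j, h, w) for im in (img_defocused, img_sharp)]

    # Shared color jitter: sample the factors once, replay on both images.
    fn_idx, b, c, s, hue = transforms.ColorJitter.get_params(
        brightness=(0.9, 1.1), contrast=(0.9, 1.1),
        saturation=(0.9, 1.1), hue=(-0.02, 0.02))
    for idx in fn_idx:
        if idx == 0 and b is not None:
            pair = [TF.adjust_brightness(im, b) for im in pair]
        elif idx == 1 and c is not None:
            pair = [TF.adjust_contrast(im, c) for im in pair]
        elif idx == 2 and s is not None:
            pair = [TF.adjust_saturation(im, s) for im in pair]
        elif idx == 3 and hue is not None:
            pair = [TF.adjust_hue(im, hue) for im in pair]
    return pair
```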
4. Robustness Strategies and Input Handling
To safeguard against over-reliance on externally supplied pre-deblurred images, DeblurNet’s positional disentanglement strategy and dropout policy are engineered to isolate useful signal while ignoring distractor features. During inference, both $I_d$ and $I_{pre}$ are provided, but the network is resilient even when $I_{pre}$ is of inferior quality, a plausible implication being improved fault tolerance in practical deployment.
Empirically, this design yields marked improvements over naïve fusion or single-image restoration baselines both in perceptual metrics (LPIPS, FID) and in context-dependent downstream tasks such as bokeh synthesis (Mu et al., 18 Dec 2025).
5. Integration with Downstream Refocusing and Synthesis
Once DeblurNet reconstructs $\hat{I}_{aif}$, it is used as the latent source for BokehNet, which renders new defocus effects, adjustable aperture shapes, and focus planes. Crucially, end-to-end refocusing performance depends on DeblurNet’s ability to generate artifact-free all-in-focus images with high fidelity to the underlying scene.
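A high-level sketch of the resulting two-stage interface is below. The names `deblurnet.sample` and `bokehnet.render`, along with the refocusing parameters, are hypothetical stand-ins for the diffusion sampling loop and BokehNet's conditioning controls.

```python
def refocus(deblurnet, bokehnet, img_defocused, img_pre=None,
            focus_depth=0.5, aperture="circular", f_number=2.0):
    """Two-stage generative refocusing (hypothetical interfaces).

    Stage 1: DeblurNet samples an all-in-focus estimate; the pre-deblur
    proxy is optional because of pre-deblur dropout during training.
    Stage 2: BokehNet re-renders depth of field at the requested focus
    plane and aperture on top of the sharp estimate.
    """
    all_in_focus = deblurnet.sample(img_defocused, pre_deblur=img_pre)
    return bokehnet.render(all_in_focus, focus_depth=focus_depth,
                           aperture=aperture, f_number=f_number)
```

A practical consequence of this decoupling is that a single all-in-focus estimate can be re-rendered under many focus and aperture settings without re-running the deblurring stage.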
Ablation studies confirm that two-stage refocusing (DeblurNet + BokehNet) outperforms monolithic architectures that attempt to jointly deblur and synthesize bokeh, in both LPIPS and FID on RealDOF, and that the decoupled approach facilitates more controllable, physically grounded DoF manipulations (Mu et al., 18 Dec 2025).
6. Evaluation and Limitations
DeblurNet demonstrates state-of-the-art results on standard deblurring and refocusing benchmarks (a sketch for computing these metrics follows the list):
- RealDOF dataset: LPIPS 0.524 → 0.2356, FID 92.93 → 24.73.
- DPDD dataset: LPIPS 0.348 → 0.1598, FID 88.85 → 33.08.
- CLIP-IQA: 0.356 → 0.4575 (RealDOF), 0.4337 → 0.4619 (DPDD).
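Scores of this kind can be reproduced with standard packages; below is a minimal sketch using the `lpips` and `torchmetrics` libraries, with tensor layouts and value ranges following each library's conventions.

```python
import lpips
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def evaluate(pred_batch, gt_batch):
    """pred_batch/gt_batch: float tensors in [0, 1], shape (B, 3, H, W)."""
    # LPIPS expects inputs scaled to [-1, 1]; lower is better.
    lpips_fn = lpips.LPIPS(net="alex")
    lpips_score = lpips_fn(pred_batch * 2 - 1, gt_batch * 2 - 1).mean()

    # FID compares feature statistics of the real vs. generated image
    # sets; torchmetrics' implementation takes uint8 images by default.
    fid = FrechetInceptionDistance(feature=2048)
    fid.update((gt_batch * 255).to(torch.uint8), real=True)
    fid.update((pred_batch * 255).to(torch.uint8), real=False)
    return lpips_score.item(), fid.compute().item()
```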
These improvements establish DeblurNet as a leading approach within multi-stage generative refocusing pipelines.
Limitations center on the reliability of monocular depth estimation (failure modes in depth prediction propagate into mislocalized deblurring) and on the handling of complex or ultra-high-resolution scenes, which may require further architectural adaptation or additional training (Mu et al., 18 Dec 2025).
7. Context and Significance
DeblurNet advances the field of computational photography and generative imaging by operationalizing robust all-in-focus recovery under challenging defocus conditions. Its technical innovations—joint latent encoding, positional disentanglement, and auxiliary dropout—enhance downstream bokeh synthesis and customized refocusing, supporting new applications in both automated bokeh rendering and interaction-driven photographic manipulation.
The method’s semi-supervised training regime, leveraging both synthetic perfect pairs and real unpaired examples through EXIF-based optical calibration, suggests generalization potential to broader hardware and scene configurations. A plausible implication is wide applicability in consumer imaging, scientific microscopy, and any context requiring post-capture refocus with realistic optical properties (Mu et al., 18 Dec 2025).