
Generative Refocusing: Flexible Defocus Control from a Single Image (2512.16923v1)

Published 18 Dec 2025 in cs.CV

Abstract: Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.

Summary

  • The paper introduces a two-stage, diffusion-based architecture that decouples defocus deblurring (DeblurNet) from controllable bokeh synthesis (BokehNet).
  • It demonstrates superior performance on benchmarks with enhanced detail recovery and realistic blur gradients through semi-supervised training on synthetic and real unpaired data.
  • The method offers flexible user control over focus plane, bokeh strength, and aperture shape, paving the way for advanced post-capture depth-of-field editing.

Generative Refocusing: Flexible Defocus Control from a Single Image

Introduction

"Generative Refocusing: Flexible Defocus Control from a Single Image" (2512.16923) presents a two-stage, diffusion-based paradigm for single-image refocusing, overcoming long-standing limitations in control and realism in post-capture depth-of-field (DoF) editing. The framework—comprising DeblurNet and BokehNet modules—enables flexible input handling (arbitrary focus states), comprehensive user control (focus plane, bokeh strength, and aperture shape), and elevated synthesis realism via semi-supervised training on synthetic and real unpaired data. The approach is validated on strong public and new benchmarks, outperforming state-of-the-art methods across deblurring, bokeh synthesis, and refocusing tasks.

Methodology

The system decomposes refocusing into two orthogonal subtasks: defocus deblurring and controllable bokeh synthesis. This decoupling allows specialized models for each operation, modularizing the learning and enabling precise user control (Figure 1).

Figure 1: The two-stage pipeline first recovers an all-in-focus image with DeblurNet and then applies parameterized, controllable bokeh using BokehNet.
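At a high level, inference composes the two modules sequentially. The sketch below illustrates this flow; the function names, signatures, and the disparity-style defocus formula are assumptions for illustration, not the authors' actual interfaces.

```python
from typing import Callable, Optional
import numpy as np

def generative_refocus(
    image: np.ndarray,                                    # input with arbitrary focus state
    deblur_net: Callable[[np.ndarray], np.ndarray],       # stage 1: all-in-focus restoration
    bokeh_net: Callable[..., np.ndarray],                  # stage 2: controllable bokeh synthesis
    estimate_depth: Callable[[np.ndarray], np.ndarray],   # monocular depth estimator
    focus_plane: float,                                    # user-chosen focus distance
    bokeh_level: float,                                    # user-chosen blur strength
    aperture_kernel: Optional[np.ndarray] = None,          # optional aperture-shape kernel
) -> np.ndarray:
    """Two-stage refocusing: restore all-in-focus content, then re-render depth of field."""
    # Stage 1: recover an all-in-focus estimate from the (possibly defocused) input.
    all_in_focus = deblur_net(image)

    # Stage 2: build a defocus map from depth, the chosen focus plane, and blur
    # strength (disparity-difference form), then synthesize the new bokeh.
    depth = estimate_depth(all_in_focus)
    defocus_map = bokeh_level * np.abs(1.0 / np.clip(depth, 1e-6, None) - 1.0 / focus_plane)
    return bokeh_net(all_in_focus, defocus_map, aperture_kernel)
```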

DeblurNet: Defocus Deblurring

DeblurNet targets spatially varying defocus blur. It is conditioned on the potentially defocused input $I_\text{in}$ and, optionally, a pre-deblurred estimate $I_\text{pd}$ from a classical restoration method. Dual-conditioning is positional: $I_\text{in}$ and $I_\text{pd}$ are encoded with distinct spatial grids, and a random dropout regularizes $I_\text{pd}$ to ensure robustness to its artifacts. The diffusion-based prior enables reconstruction of high-frequency detail otherwise collapsed by deterministic methods.
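A minimal sketch of the condition-dropout idea follows. It simplifies the paper's positional dual-conditioning to channel concatenation, and the dropout rate is an assumption; only the mechanism, randomly withholding $I_\text{pd}$ during training so the network stays robust to its artifacts, follows the description above.

```python
from typing import Optional
import torch

def build_deblur_condition(i_in: torch.Tensor,
                           i_pd: Optional[torch.Tensor],
                           p_drop: float = 0.3,
                           training: bool = True) -> torch.Tensor:
    """Assemble conditioning from the defocused input and an optional pre-deblurred
    estimate, randomly dropping the latter during training (sketch, not the paper's code)."""
    if i_pd is None or (training and torch.rand(()).item() < p_drop):
        i_pd = torch.zeros_like(i_in)          # withhold the auxiliary condition
    return torch.cat([i_in, i_pd], dim=1)      # (B, 2C, H, W) conditioning tensor
```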

BokehNet: Controllable Bokeh Synthesis

BokehNet accepts the all-in-focus reconstruction, a (potentially user-edited) defocus map parameterized by estimated or user-provided depth and the focus plane, a bokeh level (strength), and an optional aperture-shape kernel; a thin-lens sketch of this parameterization appears after the list and figure below. This supports control over aperture size, aperture shape, and arbitrary focus position. Semi-supervised learning leverages both:

  • Synthetic paired data (all-in-focus image, depth, focus, and aperture)—providing ground-truth geometry and controlled variation.
  • Unpaired real-world bokeh images with EXIF metadata—capturing real lens/optics characteristics via EXIF-informed parameter regression and expert-guided focus masking (Figure 2).

Figure 2: Training set synthesis leverages both controlled rendering and real-world images with parameter extraction, enabling physically plausible, high-fidelity bokeh learning.
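One standard way to realize the depth-and-focus-plane parameterization of the defocus map is the thin-lens circle-of-confusion model sketched below. The paper's exact parameterization is not spelled out in this summary, so treat the formula and the default constants as assumptions.

```python
import numpy as np

def thin_lens_defocus_map(depth_m: np.ndarray,
                          focus_dist_m: float,
                          focal_length_mm: float = 50.0,
                          f_number: float = 1.8,
                          sensor_width_mm: float = 36.0,
                          image_width_px: int = 1024) -> np.ndarray:
    """Per-pixel circle-of-confusion diameter in pixels from a metric depth map."""
    f = focal_length_mm / 1000.0                       # focal length in meters
    aperture = f / f_number                            # aperture diameter in meters
    z = np.clip(depth_m, 1e-3, None)
    # Thin-lens CoC: A * f / (S1 - f) * |z - S1| / z, with S1 the focus distance.
    coc_m = aperture * (f / (focus_dist_m - f)) * np.abs(z - focus_dist_m) / z
    return coc_m * 1000.0 * (image_width_px / sensor_width_mm)   # sensor meters -> pixels
```

Lowering `f_number` or bringing `focus_dist_m` closer to the subject widens the circle of confusion, which is effectively the knob that the bokeh-level and focus-plane controls expose to the user.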

Experimental Results

Defocus Deblurring

On the DPDD and RealDOF benchmarks, DeblurNet achieves the strongest metrics (LPIPS, FID, CLIP-IQA, etc.), outperforming transformer-based and implicit-representation models. Visual comparisons show superior geometric consistency and fidelity, particularly in text restoration and in challenging, high-variance blur regions (Figure 3).

Figure 3: Qualitative results on deblurring benchmarks show finer detail and geometry consistency compared to top-performing baselines.
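For reference, perceptual scores such as LPIPS can be reproduced with the publicly available `lpips` package; the snippet below is a generic evaluation sketch (the input range and backbone choice follow the package's standard conventions, not details reported in the paper).

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # AlexNet-backed perceptual distance

def lpips_score(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Mean LPIPS between restored and ground-truth batches, both in [0, 1], shape (N, 3, H, W)."""
    with torch.no_grad():
        return loss_fn(pred * 2 - 1, target * 2 - 1).mean().item()  # LPIPS expects [-1, 1]
```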

Bokeh Synthesis

On the new LF-Bokeh benchmark, BokehNet surpasses both physics-based and neural bokeh renderers (LPIPS 0.1047 vs. 0.1228–0.1799). The model better preserves blur gradients and occlusion boundaries, and adheres more closely to real lens behavior, especially when trained with unpaired real data (Figure 4).

Figure 4: Zoomed-in comparisons against multiple baselines highlight more realistic blur placement and intensity scaling.

Refocusing

For the complete refocusing pipeline, GenRefocus outperforms all pairwise combinations of top all-in-focus estimators (e.g., DRBNet, Restormer) with neural or classical bokeh synthesis modules, reflecting the advantage of joint training with real data and modular design.

Ablation Studies

A significant performance gap is found between the two-stage and one-stage (direct-mapping) designs, with the two-stage design notably stronger owing to improved depth control and semi-supervised learning tailored to each subtask. Incorporating real, unpaired bokeh data substantially boosts perceptual and fidelity metrics compared to purely simulated supervision; a schematic of such a mixed training step follows.
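The ablation contrast can be made concrete with a schematic training step that mixes the two data sources. The losses, weighting, and batch layout below are assumptions; the summary only states that synthetic pairs provide full supervision while real bokeh photos, with EXIF-derived parameters, supervise without ground-truth pairs.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(bokeh_model, paired_batch, real_batch, real_weight: float = 0.5):
    """One schematic training step mixing synthetic paired and real unpaired bokeh data."""
    # Synthetic pairs: all-in-focus input, rendered defocus map, rendered bokeh target.
    aif, defocus, target = paired_batch
    loss_syn = F.l1_loss(bokeh_model(aif, defocus), target)

    # Real photos: the captured bokeh image itself is the target; its defocus map is
    # regressed from EXIF (aperture, focal length, focus distance) and a pseudo
    # all-in-focus input stands in for the missing sharp ground truth.
    real_bokeh, exif_defocus, pseudo_aif = real_batch
    loss_real = F.l1_loss(bokeh_model(pseudo_aif, exif_defocus), real_bokeh)

    return loss_syn + real_weight * loss_real
```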

Controllable Aperture and Text-Guided Applications

Shape-aware bokeh synthesis is supported via explicit conditioning on a shape kernel; the model is fine-tuned on point-light training sets, where aperture shape is most visible, to maximize responsiveness to this condition (Figure 5). The system also demonstrates text-guided restoration in DeblurNet, where prompts at inference can correct hallucinated or ambiguous text in reconstructions (Figure 6).
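As a concrete illustration of shape-kernel conditioning, the sketch below rasterizes a regular-polygon aperture mask (e.g., a triangle for sides=3). How GenRefocus normalizes and injects such a kernel is not specified in this summary, so the format here is an assumption.

```python
import numpy as np

def polygon_aperture_kernel(size: int = 63, sides: int = 3, rotation: float = 0.0) -> np.ndarray:
    """Normalized binary mask of a regular polygon, usable as an aperture-shape kernel."""
    yy, xx = np.mgrid[:size, :size]
    c = (size - 1) / 2.0
    x, y = (xx - c) / c, (yy - c) / c                    # pixel coordinates in [-1, 1]
    angle = np.arctan2(y, x) - rotation
    radius = np.hypot(x, y)
    # Distance from center to the polygon edge along each pixel's direction
    # (regular polygon with unit circumradius).
    edge = np.cos(np.pi / sides) / np.cos((angle % (2 * np.pi / sides)) - np.pi / sides)
    kernel = (radius <= edge).astype(np.float32)
    return kernel / kernel.sum()                          # normalize to integrate to 1
```

Heart or star apertures, as shown in Figure 5, would be rasterized the same way from their own silhouettes.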

Figure 5: Example images demonstrating user-specified aperture shape control during bokeh synthesis (triangle, heart, star).

Figure 6: Results on text-guided deblurring—text prompts at inference rectify content that would otherwise be mistranslated due to blur.

Implications and Future Directions

GenRefocus introduces a scalable and modular approach for post-capture focus and bokeh control, bridging artistic and technical requirements in computational photography and rendering. By decoupling restoration from rendering, the architecture accommodates flexible data sources, subtask-specific regularization, and comprehensive control signals. The use of real unpaired bokeh images (with EXIF) marks a crucial advance in capturing physical camera effects previously unattainable with simulator-only pipelines.

Practical implications include:

  • Enhanced post-capture editing for consumers and professionals, with fine-grained DoF and bokeh styling.
  • Data-driven understanding of real camera optics for neural rendering domains.
  • Prompt synergy with vision-language models for informed content disambiguation and editing.

Theoretical implications extend to the integration of modular, semi-supervised pipelines for underdetermined physical phenomena (e.g., vision-conditioned rendering).

A limitation is the reliance on monocular depth estimation, which may degrade under severe blur. Generalization to complex, user-drawn aperture shapes also requires further curated simulation. Future directions involve making depth estimation more robust and expanding the vocabulary of controllable optical effects within diffusion-based refocusing.

Conclusion

Generative Refocusing (2512.16923) establishes a new paradigm for flexible, high-fidelity DoF and bokeh manipulation from single images. The two-stage architecture, underpinned by a semi-supervised strategy and explicit control over optical parameters, achieves superior quantitative and qualitative performance, offering a foundation for physically plausible post-capture image editing and future vision–language-guided pipelines.

