BokehNet: Diffusion-based Bokeh Synthesis
- BokehNet is a diffusion-based neural architecture for digital bokeh synthesis, enabling adjustable focus, blur level, and aperture shape from single all-in-focus images.
- It combines synthetic paired pretraining with semi-supervised fine-tuning on real images to accurately model real-world optical effects and achieve superior DoF control.
- Leveraging a latent diffusion backbone with ControlNet and DiT components, BokehNet outperforms traditional simulators on metrics like LPIPS, DISTS, and CLIP-I.
BokehNet is a diffusion-based neural architecture specializing in digital bokeh synthesis and controllable defocus rendering from single, all-in-focus RGB images, designed as the second stage in the Generative Refocusing pipeline. Its explicit goal is to generate high-fidelity shallow depth-of-field (DoF) images with user-adjustable properties, including focal plane, blur level, and (optionally) the aperture's geometric shape. BokehNet's innovation lies in its ability to harness both synthetic paired data and large-scale unpaired real bokeh photos—leveraged via EXIF-based optical parameter estimation—to accurately model and reproduce real-world bokeh characteristics that surpass classical simulators (Mu et al., 18 Dec 2025).
1. Architectural Overview and Conditioning Signals
BokehNet is built upon a latent diffusion + ControlNet backbone with a DiT (Diffusion Transformer) core and VAE-based encoder/decoder. The conditioning interface comprises three principal modalities:
- All-in-focus image: The VAE-encoded output from the preceding DeblurNet, serving as the sharp scene baseline.
- Defocus map: A per-pixel blur-radius map, computed from the scene's metric depth, a user-specified focus plane S₁, and a global bokeh (aperture) level K.
- Aperture shape: An optional binary mask (point spread function, PSF), e.g., circle, heart, or a user-uploaded kernel, indicating the aperture geometry.
During synthesis, BokehNet accepts arbitrary combinations of these conditioning signals, enabling highly flexible, user-controlled DoF manipulation (Mu et al., 18 Dec 2025).
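The defocus-map conditioning signal above can be sketched as follows. This is a minimal illustration, assuming a simple depth-normalized parameterization (the paper's exact formula is not reproduced here): blur radius grows with each pixel's distance from the focus plane S₁, scaled by the global bokeh level K.

```python
import numpy as np

def defocus_map(depth, s1, k, eps=1e-6):
    """Per-pixel blur radius from metric depth (illustrative assumption):
    radius = K * |depth - S1| / depth, clamped away from zero depth."""
    depth = np.asarray(depth, dtype=np.float64)
    return k * np.abs(depth - s1) / np.maximum(depth, eps)

# Example: a 2x2 metric depth map, focused at 2.0 m, bokeh level K = 8.
d = np.array([[1.0, 2.0],
              [4.0, 8.0]])
r = defocus_map(d, s1=2.0, k=8.0)
# Pixels exactly on the focus plane receive radius 0; blur grows
# monotonically with normalized distance from S1.
```

A map like `r` would then be passed alongside the all-in-focus latent and the optional PSF mask as conditioning.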
2. Training Methodology: Hybrid Synthetic and Semi-supervised Real Optimization
Synthetic Paired Pretraining
Synthetic data is generated by rendering bokeh images from all-in-focus inputs using classical algorithms (e.g., BokehMe or equivalent renderers). Monocular depth, random focus planes, aperture values, and shape kernels yield diverse synthetic pairs of all-in-focus and bokeh images. This phase uses a standard diffusion reconstruction loss, establishing basic DoF control and physical plausibility.
Semi-supervised Fine-tuning on Real Photographs
Unpaired real bokeh images with EXIF metadata (e.g., focal length f, f-number N) are used to capture in-the-wild optical effects absent from simulators. The pipeline:
- Use DeblurNet to generate a proxy all-in-focus image.
- Estimate scene depth via monocular depth prediction.
- Extract the in-focus region mask (BiRefNet + manual filtering) to infer the focus distance S₁.
- Compute pixel-level bokeh strength via the thin-lens circle-of-confusion formula, c(S₂) = (f² / (N·(S₁ − f))) · |S₂ − S₁| / S₂, where S₂ is the pixel's metric depth.
- Assemble the full conditioning set (proxy all-in-focus image, defocus map, and aperture shape).
Supervision is a hybrid of perceptual and diffusion losses, applied without true pixel correspondence, enabling the network to absorb real lens characteristics, spatial variations, and specular effects.
For small paired collections lacking EXIF, a brute-force simulation calibration is used: the blur parameter K is varied over plausible ranges to maximize SSIM between rendered and observed bokeh, providing a supervisory signal for further fine-tuning.
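The brute-force calibration can be sketched as a grid search. Everything here is a stand-in: a separable box blur replaces the real bokeh renderer, and a single-window SSIM replaces the full windowed metric, so only the search structure reflects the text above.

```python
import numpy as np

def box_blur(img, radius):
    """Crude stand-in for a bokeh renderer: separable box blur (assumption)."""
    if radius < 1:
        return img.copy()
    k = np.ones(2 * radius + 1) / (2 * radius + 1)
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, tmp)

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM over the whole image (simplified, not windowed)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def calibrate_k(aif, observed, candidates):
    """Grid-search the blur parameter K that best explains the observed bokeh."""
    best_k, best_s = None, -np.inf
    for k in candidates:
        s = ssim_global(box_blur(aif, k), observed)
        if s > best_s:
            best_k, best_s = k, s
    return best_k

# Synthetic check: the "observed" bokeh is the AIF blurred with radius 3,
# so the search should recover K = 3 exactly.
rng = np.random.default_rng(0)
aif = rng.random((32, 32))
observed = box_blur(aif, 3)
k_hat = calibrate_k(aif, observed, candidates=[0, 1, 2, 3, 4, 5])
```

The recovered `k_hat` then serves as the pseudo-label for fine-tuning on that image.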
3. Controllable Bokeh Generation: User and Interface Capabilities
BokehNet supports comprehensive user-driven refocusing with the following modalities:
- Interactive focus selection: The user chooses a focal plane S₁ via slider or click.
- Variable aperture control: A slider-based bokeh level K ranging from wide open (shallow DoF) to “infinity” (all-in-focus).
- Custom aperture shapes: Arbitrary PSFs are supported; user-provided small masks (e.g., heart, star) are injected through a ControlNet branch that modulates highlight scattering and out-of-focus region style.
- Resolution and aspect ratio: Trained and run as 512×512 patches with overlap tiling at inference to accommodate arbitrary input sizes.
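Overlap tiling at inference can be sketched generically. The feathered-blending scheme below is an assumption (the paper does not specify one); the key property is that overlapping patch outputs are averaged with smooth weights so seams do not appear.

```python
import numpy as np

def tiled_apply(img, fn, tile=512, overlap=64):
    """Run fn on overlapping tiles and blend with a feathered weight window.

    fn maps an HxW patch to an HxW patch (here it would wrap the diffusion
    sampler; the blending scheme itself is an illustrative assumption)."""
    h, w = img.shape[:2]
    out = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    step = tile - overlap
    # Triangular 1-D ramp combined into a 2-D feathering window.
    ramp = np.minimum(np.arange(1, tile + 1), np.arange(tile, 0, -1))
    win = np.minimum.outer(ramp, ramp).astype(np.float64)
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            ys = slice(y, min(y + tile, h))
            xs = slice(x, min(x + tile, w))
            patch = fn(img[ys, xs])
            wpatch = win[: ys.stop - ys.start, : xs.stop - xs.start]
            out[ys, xs] += patch * wpatch
            weight[ys, xs] += wpatch
    return out / np.maximum(weight, 1e-12)

# Sanity check with the identity: the blended mosaic must reproduce the
# input exactly for any image larger than one tile.
img = np.random.default_rng(1).random((700, 900))
rec = tiled_apply(img, lambda p: p)
```

In practice `fn` would be the 512×512 BokehNet forward pass, with the conditioning maps tiled identically.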
4. Quantitative Performance, Benchmarks, and Qualitative Effects
BokehNet, as evaluated on standard datasets (e.g., LF-Bokeh, LF-Refocus), achieves state-of-the-art results for single-image bokeh synthesis and flexible refocusing. Key benchmark outcomes on held-out sets:
| Benchmark | Method | LPIPS ↓ | DISTS ↓ | CLIP-I ↑ |
|---|---|---|---|---|
| LF-Bokeh | BokehNet (ours) | 0.1047 | 0.0611 | 0.9570 |
| LF-Bokeh | BokehMe | 0.1228 | 0.0744 | 0.9511 |
| LF-Refocus | BokehNet (ours) | 0.1458 | 0.0850 | 0.9451 |
Qualitatively, BokehNet demonstrates greater blur-gradient realism, accurate occlusion handling, and the ability to replicate real lens artifacts. Composing DeblurNet with BokehNet enables restoration of lost detail followed by physically plausible refocusing.
5. Data and Optimization Protocols
- Pretraining: 70,000 synthetic (AIF, bokeh) pairs for the diffusion backbone.
- Unpaired real fine-tuning: 27,000 bokeh images with EXIF and further manual curation for shape/diversity.
- Hardware: Multi-GPU (4×RTX A6000), batch size 1 per card with gradient accumulation.
- Two-stage schedule: Synthetic for 40k steps, then real images for additional 40k steps. Heavy spatial data augmentation ensures robustness to focus and scene layout.
Parallel low-rank LoRA adapters (rank 64) are trained in the ControlNet branches for efficient domain adaptation. Full weights in the backbone remain frozen, promoting stable convergence.
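The frozen-backbone-plus-LoRA scheme can be illustrated with a single linear layer. This is a minimal numpy sketch; the names, initialization, and scaling convention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, w, rank=64, alpha=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                   # frozen during fine-tuning
        d_out, d_in = w.shape
        self.a = rng.normal(0.0, 0.01, (rank, d_in)) # trainable down-projection
        self.b = np.zeros((d_out, rank))             # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # y = W x + (alpha/r) * B (A x)
        return self.w @ x + self.scale * (self.b @ (self.a @ x))

# With B zero-initialized, the adapted layer starts as an exact no-op,
# so fine-tuning begins from the pretrained behavior.
w = np.eye(4)
layer = LoRALinear(w, rank=2)
x = np.ones(4)
```

Only `a` and `b` would receive gradients; freezing `w` is what keeps the synthetic-pretrained backbone stable during real-image adaptation.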
6. Limitations and Plausible Implications
Known constraints include:
- Depth estimation errors: Significant monocular depth failures produce misplaced focus and halo artifacts, exposing the pipeline's reliance on the estimated depth map.
- Aperture-shape generalization: While PSF-based control is supported, arbitrary shapes not adequately represented in training (e.g., complex hand-drawn shapes) may yield less plausible results.
- Joint optimization: The two-stage DeblurNet+BokehNet architecture, while modular and robust, does not propagate gradients end-to-end; potential improvements could arise from joint training.
- EXIF quality: EXIF-based K estimation assumes accurate metadata and sensor calibration. Errors in these may propagate into inaccurate bokeh strength during real-image supervision.
This suggests that broadening depth backbones, richer synthetic PSF augmentation, and joint end-to-end optimization are promising future research directions (Mu et al., 18 Dec 2025).
7. Relation to Prior and Contemporary Work
BokehNet distinguishes itself from classical simulators and prior neural methods by:
- Decoupling focus restoration (DeblurNet) from bokeh generation, supporting both all-in-one and two-step workflows.
- Achieving high-fidelity bokeh via semi-supervised learning on real optical effects, not just synthetic renderings.
- Enabling arbitrary PSF shape control and text guidance for deblurring, facilitating expressive and realistic digital DoF adjustability.
- Outperforming “pure” learned and simulation-based pipelines on standard metrics (LPIPS, DISTS, CLIP-I, MUSIQ, NIQE) and generalizing well across both synthetic and real-world evaluation sets (Mu et al., 18 Dec 2025).