DiffCamera: Arbitrary Refocusing on Images (2509.26599v1)

Published 30 Sep 2025 in cs.CV

Abstract: The depth-of-field (DoF) effect, which introduces aesthetically pleasing blur, enhances photographic quality but is fixed and difficult to modify once the image has been created. This becomes problematic when the applied blur is undesirable (e.g., the subject is out of focus). To address this, we propose DiffCamera, a model that enables flexible refocusing of a created image conditioned on an arbitrary new focus point and a blur level. Specifically, we design a diffusion transformer framework for refocusing learning. However, the training requires pairs of data with different focus planes and bokeh levels in the same scene, which are hard to acquire. To overcome this limitation, we develop a simulation-based pipeline to generate large-scale image pairs with varying focus planes and bokeh levels. With the simulated data, we find that training with only a vanilla diffusion objective often leads to incorrect DoF behaviors due to the complexity of the task. This requires a stronger constraint during training. Inspired by the photographic principle that photos of different focus planes can be linearly blended into a multi-focus image, we propose a stacking constraint during training to enforce precise DoF manipulation. This constraint enhances model training by imposing physically grounded refocusing behavior: the focusing results should be faithfully aligned with the scene structure and the camera conditions so that they can be combined into the correct multi-focus image. We also construct a benchmark to evaluate the effectiveness of our refocusing model. Extensive experiments demonstrate that DiffCamera supports stable refocusing across a wide range of scenes, providing unprecedented control over DoF adjustments for photography and generative AI applications.

Summary

  • The paper introduces a diffusion-transformer framework that enables arbitrary post-capture refocusing through simulated bokeh pairs and stacking constraints.
  • It employs depth dropout and a stacking constraint to improve DoF consistency and robustness against inaccurate depth maps, ensuring precise focus and blur control.
  • Extensive evaluations demonstrate superior performance in refocusing, bokeh addition, and deblurring, with state-of-the-art scores on semantic-consistency and error metrics.

DiffCamera: Arbitrary Refocusing on Images

Introduction and Motivation

DiffCamera addresses the longstanding challenge of post-capture depth-of-field (DoF) manipulation in single images. Traditional DoF effects, such as bokeh, are fixed at capture time and difficult to modify, limiting both photographic flexibility and downstream generative applications. Existing computational photography methods for refocusing typically require specialized hardware (e.g., light-field or focal sweep cameras) or are limited to all-in-focus images and heavily reliant on accurate depth maps, which are often unavailable or unreliable. DiffCamera proposes a diffusion-transformer-based framework that enables arbitrary refocusing on any image, conditioned on a user-specified focus point and bokeh level, without the need for specialized capture setups.

Methodology

Data Simulation Pipeline

A core challenge in training a refocusing model is the lack of real-world datasets containing perfectly aligned image pairs with varying focus planes and bokeh levels. DiffCamera circumvents this by simulating such data from all-in-focus images. The pipeline involves:

  • Collecting a diverse set of all-in-focus images from real-world photographs, phone-captured photos, and AI-generated images.
  • Estimating depth maps using Depth Anything V2.
  • Generating multiple bokeh variants per image using BokehMe, systematically varying the focus plane and bokeh level.

This simulation enables the creation of large-scale, perfectly aligned DoF image pairs, facilitating supervised learning of the refocusing task (Figure 1); a sketch of the simulation loop follows below.

Figure 1: Data collection pipeline. Pairs are simulated on all-in-focus images with different focus planes and blur levels.
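
To make the pipeline concrete, the following is a minimal sketch of the pair-simulation loop. The depth estimator and bokeh renderer are passed in as callables, since the exact APIs of Depth Anything V2 and BokehMe are not specified here; all names and signatures are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the pair-simulation loop (illustrative only).
# `estimate_depth` and `render_bokeh` are user-supplied callables wrapping
# a depth estimator (e.g., Depth Anything V2) and a bokeh renderer
# (e.g., BokehMe); their signatures here are assumptions.
import itertools
import random

def simulate_pairs(images, focus_planes, bokeh_levels, estimate_depth, render_bokeh):
    """Yield aligned (source, target, target-conditions) training triples."""
    for image in images:
        depth = estimate_depth(image)  # monocular depth map for this scene
        # Render one bokeh variant per (focus plane, blur level) setting.
        variants = {
            (f, b): render_bokeh(image, depth, focus_plane=f, blur_level=b)
            for f, b in itertools.product(focus_planes, bokeh_levels)
        }
        # Any two variants of the same scene form a supervised pair: the
        # model learns to map one focus/blur setting to another.
        src_key, tgt_key = random.sample(list(variants), 2)
        yield variants[src_key], variants[tgt_key], tgt_key
```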

Diffusion Transformer Architecture

The model architecture is a diffusion transformer that operates in the latent space. The inputs are:

  • Reference image $I_R$ (encoded via a frozen VAE encoder)
  • Target camera conditions: focus point $(f_x, f_y)$ and bokeh level $b$ (projected into a learnable camera token)
  • Predicted depth map $D$ (optionally, see below)
  • Noisy latent $I_T^t$ at diffusion timestep $t$

All inputs are tokenized and concatenated, with full attention modeling the relationships between image content and camera parameters. The model predicts the velocity $v$ in the rectified flow framework, enabling iterative denoising to the target refocused image (Figure 2); a hypothetical training-step sketch follows the caption below.

Figure 2: DiffCamera pipeline. Left: image and camera conditions are tokenized and input to a diffusion transformer. Right: stacking constraint learning objective, with shared weights and frozen VAEs.
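
As one concrete (and necessarily speculative) rendering of this setup, the sketch below shows a camera-token projection and a rectified-flow training step in PyTorch. The shapes, the velocity convention ($x_t = (1-t)x_0 + t\,\epsilon$, $v = \epsilon - x_0$), and the transformer call signature are assumptions, not the paper's exact implementation.

```python
# Hypothetical PyTorch sketch of the conditioning path and rectified-flow
# objective. Shapes, module names, and the `dit(...)` call signature are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraToken(nn.Module):
    """Projects the camera conditions (f_x, f_y, b) into one learnable token."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def forward(self, fx, fy, b):
        cond = torch.stack([fx, fy, b], dim=-1)  # (B, 3)
        return self.proj(cond).unsqueeze(1)      # (B, 1, dim) camera token

def rectified_flow_step(dit, vae, cam_tok, ref_img, tgt_img, depth, cam):
    """One training step: predict the velocity from a noisy latent + conditions."""
    with torch.no_grad():                 # VAE is frozen
        x0 = vae.encode(tgt_img)          # clean latent of the target image
        ref = vae.encode(ref_img)         # latent of the reference image
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    xt = (1 - t) * x0 + t * eps           # straight-line interpolation
    v_target = eps - x0                   # rectified-flow velocity target
    # Inside `dit`, all inputs are tokenized, concatenated, and processed
    # with full attention; the call signature here is invented.
    v_pred = dit(xt, ref, depth, cam_tok(*cam), t)
    return F.mse_loss(v_pred, v_target)
```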

Stacking Constraint

A vanilla diffusion objective is insufficient for precise DoF control, often resulting in incorrect focus or blur placement. DiffCamera introduces a stacking constraint inspired by photographic focus stacking: images focused at different planes can be linearly blended into a multi-focus image using a sharpness-based mask. This constraint is implemented as an additional loss in the latent space, enforcing that the model's predictions for different focus conditions can be composited to yield a physically plausible multi-focus result. This regularization significantly improves DoF consistency and adherence to camera conditions (Figure 3); a toy image-space version of the stacking operation is sketched below.

Figure 3: Illustration of focus stacking. Two images focused at different planes are stacked using a mask derived from Laplacian sharpness maps.
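
In image space, the stacking principle can be illustrated with a few lines of OpenCV. Note that this mirrors Figure 3, not the paper's actual latent-space loss; the kernel sizes and the hard (binary) mask are arbitrary choices.

```python
# Toy, image-space illustration of focus stacking (not the paper's
# latent-space loss). Kernel sizes and the hard mask are arbitrary choices.
import cv2
import numpy as np

def laplacian_sharpness(img_bgr, smooth_ksize=5):
    """Per-pixel sharpness: smoothed magnitude of the Laplacian."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    lap = np.abs(cv2.Laplacian(gray, cv2.CV_32F))
    return cv2.GaussianBlur(lap, (smooth_ksize, smooth_ksize), 0)

def focus_stack(img_a, img_b):
    """Blend two differently focused shots, keeping the sharper pixel of each."""
    s_a, s_b = laplacian_sharpness(img_a), laplacian_sharpness(img_b)
    mask = (s_a >= s_b).astype(np.float32)[..., None]  # 1 where A is sharper
    blended = mask * img_a.astype(np.float32) + (1 - mask) * img_b.astype(np.float32)
    return blended.astype(np.uint8)
```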

Depth Dropout

To mitigate over-reliance on potentially inaccurate depth maps, a depth dropout mechanism is introduced: during training, 50% of depth maps are replaced with zeros. This forces the model to learn to infer DoF effects both with and without explicit depth information, enhancing robustness to depth estimation errors and generalization to real-world images (Figure 4). The mechanism itself is nearly a one-liner, as sketched below.

Figure 4: Qualitative studies on depth dropout. Depth dropout improves robustness to inaccurate depth maps, outperforming BokehMe and the model variant without dropout.
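
A plausible implementation, assuming per-sample depth maps of shape (B, 1, H, W):

```python
# Depth dropout as described above: each sample's entire depth map is zeroed
# with probability p during training. The (B, 1, H, W) layout is an assumption.
import torch

def depth_dropout(depth: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    keep = torch.rand(depth.shape[0], 1, 1, 1, device=depth.device) >= p
    return depth * keep.to(depth.dtype)  # whole maps survive or become zeros
```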

Experimental Evaluation

Benchmark and Metrics

A new benchmark is constructed, comprising 150 scenes (60 camera photos, 30 phone photos, 60 AI-generated images), each with multiple refocusing, bokeh-addition, and deblurring samples. Evaluation metrics include:

  • MAE: pixel-wise error to simulated ground truth
  • LVCorr: Laplacian variance correlation with the target bokeh level (see the sketch after this list)
  • CLIP-I: semantic consistency with the reference image
  • CLIP-IQA: perceptual image quality
  • LPIPS, PSNR: for deblurring
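
The paper's exact LVCorr definition is not reproduced here; one plausible reading, used purely for illustration, is to correlate the variance of the Laplacian (a standard sharpness statistic) of each output with the requested blur level:

```python
# One plausible reading of LVCorr (illustrative; the paper's exact definition
# may differ): correlate per-image Laplacian variance with the requested
# bokeh levels. The sign convention is an assumption.
import cv2
import numpy as np

def laplacian_variance(img_bgr) -> float:
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def lv_corr(images, bokeh_levels):
    lv = np.array([laplacian_variance(im) for im in images])
    levels = np.asarray(bokeh_levels, dtype=np.float64)
    # Sharpness should fall as the requested blur rises, so a well-behaved
    # model yields a strong (negative) Pearson correlation between the two.
    return np.corrcoef(lv, levels)[0, 1]
```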

Comparative Results

DiffCamera is compared against GPT-4o (DALL·E 3) for refocusing and bokeh addition, and against Restormer for deblurring. Notable findings:

  • Refocusing: DiffCamera achieves CLIP-I of 0.954 vs. 0.859 for GPT-4o, and MAE of 0.025 vs. 0.138, indicating superior semantic consistency and lower error.
  • Bokeh Addition: LVCorr of 0.920, demonstrating accurate control over blur level.
  • Deblurring: Outperforms Restormer and GPT-4o in MAE, LPIPS, and CLIP-IQA, generating sharper and more consistent content in previously blurred regions.

Figure 5: Qualitative comparisons on refocusing and adding bokeh. DiffCamera performs refocusing on images with strong defocus blur, setting blur to zero and fixing the focus point at the center.

Figure 6: Qualitative comparisons on bokeh removal (deblurring). DiffCamera is compared with Restormer and GPT-4o.

Figure 7: Further comparisons. DiffCamera is compared with a deconvolution-based refocusing method and a two-stage Restormer+BokehMe pipeline.

Ablation Studies

  • Stacking Constraint: Removing the stacking constraint degrades LVCorr and overall DoF consistency, confirming its necessity for precise control (Figure 8).

    Figure 8: Qualitative ablation on the stacking constraint. Without it, the model fails to focus or blur the correct regions.

  • Depth Dropout: Models trained without depth dropout are more sensitive to depth map errors, leading to artifacts in challenging cases.
  • Data Composition: Adaptive balancing between real and AI-synthesized images during training yields the best performance across all tasks.

Qualitative Results and Generalization

DiffCamera demonstrates robust refocusing on arbitrary focus points and bokeh levels, even in real-world photos with naturally occurring bokeh. The model maintains high scene consistency and generates plausible sharp content in previously blurred regions, despite the ill-posed nature of deblurring (Figures 9, 10, and 11).

Figure 9: More visualizations on refocus. DiffCamera can refocus on arbitrary points with designated blur levels, even if the subject is originally blurry.

Figure 10: Refocusing with different blur levels, maintaining high scene consistency.

Figure 11: Application to real-world photos with natural bokeh.

Limitations and Future Directions

  • Resolution and Aspect Ratio: Current training is limited to 512×512 and 1024×1024 resolutions. Scaling to higher resolutions and diverse aspect ratios is a natural extension.
  • Ill-posed Deblurring: Generating sharp content from heavily blurred regions remains fundamentally ambiguous; incorporating additional reference images could address identity preservation.
  • Finer Bokeh Control: Current conditioning is limited to focus point and blur level. Extending to bokeh shape and style is feasible by augmenting the data and conditioning tokens.
  • Depth Robustness: While depth dropout improves robustness, further augmentation (e.g., noise injection) could enhance performance under severe depth map errors.

Conclusion

DiffCamera establishes a new paradigm for post-capture DoF manipulation in single images, leveraging a diffusion transformer trained on simulated bokeh pairs, a physically grounded stacking constraint, and depth dropout for robustness. The method achieves state-of-the-art performance in arbitrary refocusing, bokeh addition, and deblurring, with strong generalization to real-world and synthetic images. The approach provides a foundation for future research in controllable photographic editing and generative AI, with potential extensions to higher resolutions, richer camera controls, and integration with user-provided references.
