- The paper introduces DifuzCam, which replaces the traditional camera lens with an amplitude mask and a diffusion model to achieve high-quality image reconstruction.
- The method guides a pre-trained diffusion model with a ControlNet and learned separable transformations that map raw sensor measurements into detailed images.
- Evaluation shows state-of-the-art performance, with improved PSNR, SSIM, and LPIPS metrics, setting a new standard for lensless flat camera systems.
DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model
This paper introduces DifuzCam, an approach to computational photography targeting the challenge of lensless flat cameras. The proposed method replaces the traditional camera lens with an amplitude mask acting as a diffuser, significantly reducing size and weight. A pre-trained diffusion model, steered by a ControlNet and learned separable transformations, reconstructs high-quality images from the raw sensor measurements.
Introduction
Flat cameras using amplitude masks enable substantial camera miniaturization but make it hard to reconstruct visually interpretable images, because each sensor pixel records a multiplexed mixture of light from many scene points. Existing methods, whether direct optimization or deep learning, have not achieved satisfactory reconstruction quality. DifuzCam leverages diffusion models as strong priors over natural images, improving reconstruction quality through both image and text guidance.
Figure 1: A compact prototype flat camera designed with this approach.
Methodology
Optical Design
DifuzCam uses a flat camera design incorporating a separable amplitude mask printed on a chrome plate using lithography. Because the mask is separable, its 2D pattern factors into two 1D codes, and the forward model maps the scene to the sensor through a pair of left/right matrices, as illustrated in the sketch below. Each mask feature is fabricated precisely to control how the scene's light projects onto the sensor.
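To make the separable forward model concrete, here is a minimal NumPy sketch. It assumes a FlatCam-style model Y = Phi_L X Phi_R^T; the matrix values and sizes are illustrative stand-ins, not the paper's calibrated optics:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 256  # illustrative scene/sensor resolution

# A separable amplitude mask is the outer product of two 1-D binary codes;
# this 2-D pattern is what would be printed on the chrome plate.
left_code = rng.integers(0, 2, size=H).astype(np.float64)
right_code = rng.integers(0, 2, size=W).astype(np.float64)
mask = np.outer(left_code, right_code)  # values in {0, 1}

# Hypothetical left/right transfer matrices of the separable system.
# In a real camera these are calibrated from the mask; random stand-ins here.
phi_left = rng.standard_normal((H, H)) / np.sqrt(H)
phi_right = rng.standard_normal((W, W)) / np.sqrt(W)

def capture(scene, noise_std=0.01):
    """Simulate a multiplexed measurement: Y = Phi_L @ X @ Phi_R.T + noise."""
    y = phi_left @ scene @ phi_right.T
    return y + noise_std * rng.standard_normal(y.shape)

scene = rng.random((H, W))
y = capture(scene)
print(y.shape)  # (256, 256): every sensor pixel mixes light from many scene points
```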
Figure 2: Visualization of the mask pattern used in the flat camera prototype.
Reconstruction with Diffusion Models
A learned separable linear transformation converts the multiplexed sensor measurements into a pixel-space estimate, which is crucial for guiding the diffusion model, itself trained on vast amounts of natural image data. A ControlNet adapted to this conditioning signal steers the pre-trained UNet during generation; zero convolutions in the control branch ensure that, at the start of training, the added conditioning leaves the original diffusion performance intact. A sketch of both components follows.
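The following PyTorch sketch shows the two key pieces under stated assumptions: a learned separable transform (the module name and shapes are hypothetical) that lifts a measurement to pixel space, and a zero-initialized 1x1 convolution of the kind ControlNet uses to inject control features without perturbing the frozen backbone at initialization:

```python
import torch
import torch.nn as nn

class SeparableTransform(nn.Module):
    """Learned left/right matrices mapping a measurement to pixel space:
    X_hat = W_L @ Y @ W_R.T, applied independently to each channel."""
    def __init__(self, meas_hw: int = 256, img_hw: int = 256):
        super().__init__()
        self.w_left = nn.Parameter(torch.randn(img_hw, meas_hw) / meas_hw ** 0.5)
        self.w_right = nn.Parameter(torch.randn(img_hw, meas_hw) / meas_hw ** 0.5)

    def forward(self, y: torch.Tensor) -> torch.Tensor:  # y: (B, C, Hm, Wm)
        return torch.einsum("ih,bchw,jw->bcij", self.w_left, y, self.w_right)

def zero_conv(channels: int) -> nn.Conv2d:
    """Zero-initialized 1x1 conv (ControlNet-style): it contributes nothing
    at the start of training, so the frozen UNet's behavior is preserved."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Toy check: the conditioning image lives in pixel space, and the zero conv
# initially passes zeros, leaving backbone features untouched.
y = torch.randn(1, 3, 256, 256)        # raw multiplexed measurement
cond = SeparableTransform()(y)         # rough pixel-space estimate, (1, 3, 256, 256)
feat = torch.randn(1, 64, 32, 32)      # some backbone feature map
ctrl = torch.randn(1, 64, 32, 32)      # control-branch features
print(torch.allclose(feat + zero_conv(64)(ctrl), feat))  # True at init
```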
Figure 3: The DifuzCam reconstruction process using ControlNet and diffusion models.
Evaluation
DifuzCam achieves state-of-the-art results, outperforming existing methods such as Tikhonov reconstruction and FlatNet on PSNR, SSIM, and LPIPS, while CLIP scores demonstrate effective image-text adherence. Supplying an optional text description of the scene during reconstruction further improves perceptual quality and textual alignment. A sketch of how the full-reference metrics can be computed follows.
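For reference, here is a minimal sketch of the three full-reference metrics, using scikit-image for PSNR/SSIM and the lpips package for LPIPS; the paper's own evaluation code is not reproduced, and the image arrays below are random stand-ins:

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# LPIPS model with the AlexNet backbone; lower LPIPS means more similar.
lpips_model = lpips.LPIPS(net="alex")

def evaluate_pair(recon: np.ndarray, target: np.ndarray) -> dict:
    """Full-reference metrics for HxWx3 float images in [0, 1]."""
    psnr = peak_signal_noise_ratio(target, recon, data_range=1.0)
    ssim = structural_similarity(target, recon, data_range=1.0, channel_axis=-1)

    # LPIPS expects NCHW torch tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2.0 - 1.0
    with torch.no_grad():
        lp = lpips_model(to_t(recon), to_t(target)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}

recon = np.random.rand(256, 256, 3).astype(np.float32)
target = np.random.rand(256, 256, 3).astype(np.float32)
print(evaluate_pair(recon, target))
```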
Implementation Details
The training data pairs images and their captions from LAION-Aesthetics with raw measurements of those images captured by the DifuzCam prototype. Training ran for 500k steps starting from a pre-trained Stable Diffusion model, with textual guidance used to align reconstructions more closely with the actual scene content. A schematic training step appears below.
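A schematic PyTorch training step under explicit assumptions: the standard epsilon-prediction diffusion objective, with only the ControlNet branch and the separable transform trainable. For brevity the sketch noises images directly, whereas Stable Diffusion operates in a VAE latent space; controlled_unet, sep_transform, and the batch layout are hypothetical names, not the paper's code:

```python
import torch
import torch.nn.functional as F

def training_step(controlled_unet, sep_transform, alphas_cumprod, batch, optimizer):
    """One optimization step of the conditioned denoiser (schematic)."""
    measurement, image, text_emb = batch  # raw capture, target image, caption embedding

    # 1) Map the raw measurement to a pixel-space conditioning image.
    cond = sep_transform(measurement)

    # 2) DDPM forward noising: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    t = torch.randint(0, alphas_cumprod.shape[0], (image.shape[0],), device=image.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(image)
    x_t = abar.sqrt() * image + (1.0 - abar).sqrt() * eps

    # 3) Predict the injected noise, conditioned on the measurement and text.
    eps_pred = controlled_unet(x_t, t, cond=cond, text_emb=text_emb)

    # 4) Epsilon-prediction loss; the frozen UNet weights receive no updates.
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```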
Conclusion
DifuzCam advances image reconstruction for lensless cameras, making flat camera technology more practical for real imaging systems. The approach combines strong diffusion-model priors with optional text guidance, overcoming limitations of prior methods and setting a new quality standard for flat cameras. The methodology can be adapted to other imaging contexts, encouraging further work in computational photography.

The paper suggests that the DifuzCam reconstruction architecture can transfer to other lensless systems, broadening compact-device photography with more versatile and precise image rendering.