
DifuzCam: Diffusion Flat Lensless Camera

Updated 7 November 2025
  • DifuzCam is a flat lensless camera system that replaces conventional lenses with a thin amplitude mask to capture multiplexed sensor measurements.
  • It leverages latent diffusion models integrated with ControlNet for text-guided and semantically robust image reconstruction from ill-posed data.
  • Experimental results show improved PSNR, SSIM, and CLIP scores, validating DifuzCam’s effective reconstruction across 2D and 3D scenes.

DifuzCam is a flat lensless camera system that replaces the conventional optical lens with a thin amplitude mask positioned directly onto the sensor. It utilizes a reconstruction algorithm powered by pre-trained diffusion models, specifically latent diffusion models (LDM), along with separable transforms and a conditional control network. The objective is to achieve state-of-the-art image fidelity and perceptual quality in reconstructions from visually unintelligible raw multiplexed sensor measurements. DifuzCam demonstrates robust semantic and structural image formation, introduces text-guided scene refinement, and generalizes to 2D and 3D scenes and various camera modalities (Yosef et al., 14 Aug 2024).

1. Motivation for Flat Lensless Camera and Diffusion-based Reconstruction

Conventional cameras focus incoming light through lenses onto sensors, yielding sharp images. However, physical constraints limit their miniaturization: lenses add bulk and impose a minimum device thickness. Flat lensless cameras address this by replacing the lens with a lithographically fabricated binary amplitude mask (the outer product of M-sequence signals; 25 μm features), which multiplexes scene information onto the sensor. The resulting sensor data Y does not resemble the original scene, and high-quality recovery of the scene is fundamentally ill-posed and sensitive to hardware imperfections.

Previous approaches, including direct optimization, FlatNet, GANs, and image transformers, suffered from visible artifacts, semantic drift, and sensitivity to non-ideal calibration. DifuzCam advances this paradigm by employing pre-trained generative diffusion priors and conditional adapters to render reconstruction tractable and perceptually robust.

2. Camera Hardware and Mathematical Forward Model

DifuzCam adopts the FlatCam architecture, replacing lenses with a thin separable mask:

  • Mask: Binary amplitude, placed onto CMOS sensor.
  • Imaging Model: Sensor output from a scene X is given by:

Y = \Phi_l X \Phi_r

where \Phi_l and \Phi_r are left/right separable transformations encoded by the mask and the measurement process.

  • Color Channels: Each channel C_k is processed via learned separable matrices:

C_k^o = \phi_l^k\, C_k\, \phi_r^k

The full color measurement is shaped as C^o \in \mathbb{R}^{4 \times H \times W}.
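
As a concrete illustration of this separable forward model, the following NumPy sketch multiplexes a scene channel by channel as Y = \Phi_l X \Phi_r. The matrix dimensions, the random stand-ins for the calibrated transforms, and the use of three RGB channels (the paper works with four Bayer channels) are illustrative assumptions rather than the actual FlatCam/DifuzCam calibration.

```python
import numpy as np

# Illustrative sketch of the separable forward model Y = Phi_l @ X @ Phi_r,
# applied per color channel; sizes and random matrices are assumptions.

H, W = 256, 256      # scene resolution (assumed)
Hs, Ws = 512, 512    # sensor resolution (assumed)

rng = np.random.default_rng(0)

# Stand-ins for the calibrated separable transforms encoded by the mask.
Phi_l = rng.standard_normal((Hs, H)) / np.sqrt(H)
Phi_r = rng.standard_normal((W, Ws)) / np.sqrt(W)

def forward_channel(X):
    """Multiplexed sensor measurement of one scene channel."""
    return Phi_l @ X @ Phi_r

# Simulate a 3-channel scene and its visually unintelligible measurement.
scene = rng.random((3, H, W))
measurement = np.stack([forward_channel(scene[k]) for k in range(3)])
print(measurement.shape)  # (3, 512, 512)
```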

Calibration is essential for realistic measurement: the hardware pipeline must account for imperfect mask fabrication, PSF nonlinearities, diffraction effects, and alignment tolerances. Data acquisition comprises screen projections and 3D real scene photographs.

3. Diffusion Model as Natural Image Prior

Diffusion models, particularly Stable Diffusion (Latent Diffusion Model, LDM), are stochastic generative models that implicitly learn a rich natural image prior by denoising random Gaussian noise into sample images. DifuzCam leverages this denoising prior to regularize the underdetermined inverse problem posed by flat lensless imaging.

ControlNet Integration: Conditional information from the sensor measurements is injected into the diffusion pipeline by ControlNet, a trainable copy of the UNet encoder attached to the frozen LDM backbone. This module learns to map the separably transformed sensor data into features compatible with the learned image prior:

  • ControlNet weights are trained to extract relevant conditional features; the main LDM weights remain fixed.
  • “Zero convolution” initialization ensures no degradation of pretrained performance.
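
For intuition, the following PyTorch sketch shows zero-convolution conditioning: a trainable branch encodes the sensor-derived input and injects it into a frozen feature map through a 1×1 convolution initialized to zero, so the pretrained behavior is untouched at the start of training. The `ControlBranch` module and its layer widths are illustrative assumptions, not the actual ControlNet architecture.

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution whose weights and bias start at zero.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlBranch(nn.Module):
    def __init__(self, cond_channels=4, feat_channels=320):
        super().__init__()
        self.encode = nn.Sequential(          # trainable conditional adapter
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
        )
        self.out = zero_conv(feat_channels)   # zero-initialized injection

    def forward(self, backbone_feat, cond):
        # At initialization self.out outputs zeros, so the frozen backbone
        # features pass through unchanged.
        return backbone_feat + self.out(self.encode(cond))

branch = ControlBranch()
feat = torch.randn(1, 320, 64, 64)   # frozen UNet feature map (assumed size)
cond = torch.randn(1, 4, 64, 64)     # separably transformed sensor data C^o
print(torch.allclose(branch(feat, cond), feat))  # True at initialization
```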

Text-guided Conditionality: DifuzCam extends the LDM’s capacity for text-conditional synthesis by incorporating user-provided scene descriptions as semantic priors. Text tokens are embedded alongside sensor-derived features, enabling the reconstruction process to favor semantically consistent, realistic images even in the presence of ambiguous measurements. In cases of conflict, text prevails, highlighting the primacy of the semantic prior in underconstrained reconstruction.
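
As a sketch of this text-conditioning path, the snippet below embeds a user-provided scene description with the CLIP text encoder used by Stable Diffusion v1; the checkpoint name and prompt are illustrative choices, and the resulting per-token embedding plays the role of the text input y in the losses of the next section.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Text encoder of Stable Diffusion v1 (illustrative checkpoint choice).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photo of a red bicycle leaning against a brick wall"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    # Per-token embeddings used as the text conditioning y for the denoiser.
    y = text_encoder(tokens.input_ids).last_hidden_state
print(y.shape)  # torch.Size([1, 77, 768])
```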

4. Learning and Loss Functions

Training and adaptation in DifuzCam optimize the ControlNet weights and separable transform parameters:

  • Diffusion Training Loss:

l_\mathcal{C} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,I),\, t}\left[\|\epsilon - \epsilon_\theta(z_t, t, y, \mathcal{C}_\psi(C^o))\|_2\right]

where \epsilon_\theta is the LDM denoiser, z_t is the noisy latent at time t, y is the optional text input, and \mathcal{C}_\psi is the ControlNet operator acting on C^o.

  • Separable Reconstruction Loss:

l_{sep} = \|I - f_{conv}(C^o)\|_2

with f_{conv} a learnable 3 \times 3 convolution mapping C^o to RGB and I the target RGB image.

  • Latent Diffusion Loss (optional):

l_{ldm} = \mathbb{E}\left[\|\epsilon - \epsilon_\theta(z_t, t, y)\|\right]

Training uses real data pairing RGB images with flat-lens sensor measurements; simulated data alone yields inferior calibration and generalization.
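
The following PyTorch sketch shows how the diffusion loss and the separable reconstruction loss might be combined during training. The `denoiser`, `controlnet`, `encode_latent`, and `alphas_cumprod` arguments are hypothetical stand-ins for the frozen LDM denoiser, the ControlNet adapter, the VAE encoder, and the noise schedule; it illustrates the objective, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learnable 3x3 convolution mapping the 4-channel measurement C^o to RGB.
f_conv = nn.Conv2d(4, 3, kernel_size=3, padding=1)

def training_losses(denoiser, controlnet, encode_latent, alphas_cumprod,
                    rgb_image, sensor_meas, text_emb):
    # Auxiliary separable loss: the transformed measurement should
    # approximate the ground-truth RGB image.
    l_sep = F.mse_loss(f_conv(sensor_meas), rgb_image)

    # Diffusion loss on the noisy latent, conditioned on text and on the
    # ControlNet features extracted from the sensor measurement.
    z0 = encode_latent(rgb_image)                      # VAE latent E(x)
    t = torch.randint(0, alphas_cumprod.numel(),
                      (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * eps     # forward diffusion
    eps_hat = denoiser(z_t, t, text_emb, controlnet(sensor_meas))
    l_diff = F.mse_loss(eps_hat, eps)

    return l_diff + l_sep
```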

5. Experimental Validation

DifuzCam improves on prior approaches across standard fidelity and perceptual image reconstruction metrics:

Method     PSNR (dB) ↑   SSIM ↑   LPIPS ↓   CLIP Score ↑
FlatNet    18.6          0.51     0.32      21.7
DifuzCam   21.89         0.54     0.276     24.38

Visual reconstructions demonstrate substantial improvements in perceptual quality, color fidelity, and semantic structure. Text-guided reconstruction further enhances scene consistency, allowing ambiguous content to be specified or corrected. Ablation studies show that omitting the separable loss or the textual guidance introduces perceptual artifacts or semantic drift, whereas the jointly optimized model remains robust.

The approach generalizes to both synthetic projections and real 3D scene data.

6. Implications for Computational Imaging and Semantic Scene Reconstruction

DifuzCam establishes that coupling flat lensless hardware with deep generative priors (diffusion models + ControlNet) yields thin, robust, and perceptually accurate imaging devices. Unlike deterministic inversion algorithms, the generative approach regularizes ill-posed reconstructions, exploits semantic priors for ambiguity resolution, and allows dynamic conditioning (e.g., via descriptive text). This methodology is extensible across modalities: any inverse-imaging problem subject to severe multiplexing or data loss can, in principle, benefit from a learned generative prior and a controlling conditional adapter.

A plausible implication is that next-generation computational cameras will rely not only on novel optics but on deep integration with generative modeling frameworks, expanding the domain of possible form-factors and functionalities well beyond traditional lens-based paradigms.

7. Broader Context within Lensless and Diffusion-Based Imaging

DifuzCam continues a sequence of advances originating with FlatCam [ICCVW 2015], Spectral DiffuserCam (Monakhova et al., 2020), and compressive diffuser cameras including single-pixel paradigms (Liu et al., 2021). Its integration of a pre-trained diffusion model and text-conditioned ControlNet for image formation marks a notable progression from prior feed-forward or GAN-based recovery, shifting the focus to data-driven semantic reconstruction.

While systems like Diffuser-mCam (Zheng et al., 18 Jul 2025) extend the lensless paradigm to full multi-modal 5D imaging (spatial, spectral, polarization, temporal), DifuzCam prioritizes spatial image quality and semantic interpretability via powerful priors. This suggests a bifurcation in lensless computational imaging: modality expansion vs. perceptual/prior enhancement—both avenues likely to coalesce in future systems as computational resources and modeling capacities scale.


References cited in source data: FlatCam (ICCVW 2015); FlatNet (ICCV 2019); Stable Diffusion (Rombach et al., CVPR 2022); ControlNet (Zhang et al., ICCV 2023); Spectral DiffuserCam (Monakhova et al., 2020).
