
DifuzCam: Diffusion Flat Lensless Camera

Updated 7 November 2025
  • DifuzCam is a flat lensless camera system that replaces conventional lenses with a thin amplitude mask to capture multiplexed sensor measurements.
  • It leverages latent diffusion models integrated with ControlNet for text-guided and semantically robust image reconstruction from ill-posed data.
  • Experimental results show improved PSNR, SSIM, and CLIP scores, validating DifuzCam’s effective reconstruction across 2D and 3D scenes.

DifuzCam is a flat lensless camera system that replaces the conventional optical lens with a thin amplitude mask positioned directly onto the sensor. It utilizes a reconstruction algorithm powered by pre-trained diffusion models, specifically latent diffusion models (LDM), along with separable transforms and a conditional control network. The objective is to achieve state-of-the-art image fidelity and perceptual quality in reconstructions from visually unintelligible raw multiplexed sensor measurements. DifuzCam demonstrates robust semantic and structural image formation, introduces text-guided scene refinement, and generalizes to 2D and 3D scenes and various camera modalities (Yosef et al., 14 Aug 2024).

1. Motivation for Flat Lensless Camera and Diffusion-based Reconstruction

Conventional cameras focus incoming light through lenses onto sensors, yielding sharp images. However, physical constraints limit their miniaturization: lenses add bulk and impose a minimum device thickness. Flat lensless cameras address this by replacing the lens with a lithographically fabricated binary amplitude mask (the outer product of M-sequence signals; 25 μm features), which multiplexes scene information onto the sensor. The resulting sensor data Y does not resemble the original scene, and high-quality recovery of the scene is fundamentally ill-posed and sensitive to hardware imperfections.

Previous approaches, including direct optimization, FlatNet, GANs, and image transformers, suffered from visible artifacts, semantic drift, and sensitivity to non-ideal calibration. DifuzCam advances this paradigm by employing pre-trained generative diffusion priors and conditional adapters to render reconstruction tractable and perceptually robust.

2. Camera Hardware and Mathematical Forward Model

DifuzCam adopts the FlatCam architecture, replacing lenses with a thin separable mask:

  • Mask: Binary amplitude, placed onto CMOS sensor.
  • Imaging Model: Sensor output from a scene X is given by:

Y = \Phi_l X \Phi_r

where \Phi_l and \Phi_r are left/right separable transformations encoded by the mask and the measurement process.

  • Color Channels: Each channel C_k is processed via learned separable matrices:

C_k^o = \phi_l^k\, C_k\, \phi_r^k

The full color measurement is shaped as C^o \in \mathbb{R}^{4 \times H \times W}.
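
As a concrete illustration of this separable forward model, the following NumPy sketch multiplexes a scene channel by channel as Y = \Phi_l X \Phi_r. The matrix dimensions, the random stand-ins for the calibrated transforms, and the use of three RGB channels (the paper works with four Bayer channels) are illustrative assumptions rather than the actual FlatCam/DifuzCam calibration.

```python
import numpy as np

# Illustrative sketch of the separable forward model Y = Phi_l @ X @ Phi_r,
# applied per color channel; sizes and random matrices are assumptions.

H, W = 256, 256      # scene resolution (assumed)
Hs, Ws = 512, 512    # sensor resolution (assumed)

rng = np.random.default_rng(0)

# Stand-ins for the calibrated separable transforms encoded by the mask.
Phi_l = rng.standard_normal((Hs, H)) / np.sqrt(H)
Phi_r = rng.standard_normal((W, Ws)) / np.sqrt(W)

def forward_channel(X):
    """Multiplexed sensor measurement of one scene channel."""
    return Phi_l @ X @ Phi_r

# Simulate a 3-channel scene and its visually unintelligible measurement.
scene = rng.random((3, H, W))
measurement = np.stack([forward_channel(scene[k]) for k in range(3)])
print(measurement.shape)  # (3, 512, 512)
```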

Calibration is essential for realistic measurement: the hardware pipeline must account for imperfect mask fabrication, PSF nonlinearities, diffraction effects, and alignment tolerances. Data acquisition comprises screen projections and 3D real scene photographs.

3. Diffusion Model as Natural Image Prior

Diffusion models, particularly Stable Diffusion (Latent Diffusion Model, LDM), are stochastic generative models that implicitly learn a rich natural image prior by denoising random Gaussian noise into sample images. DifuzCam leverages this denoising prior to regularize the underdetermined inverse problem posed by flat lensless imaging.

ControlNet Integration: Conditional information from the sensor measurements is injected into the diffusion pipeline by ControlNet, a trainable copy of the UNet encoder attached to the frozen LDM backbone. This module learns to map the separably transformed sensor data into features compatible with the learned image prior:

  • ControlNet weights are trained to extract relevant conditional features; the main LDM weights remain fixed.
  • “Zero convolution” initialization ensures no degradation of pretrained performance.
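
For intuition, the following PyTorch sketch shows zero-convolution conditioning: a trainable branch encodes the sensor-derived input and injects it into a frozen feature map through a 1×1 convolution initialized to zero, so the pretrained behavior is untouched at the start of training. The `ControlBranch` module and its layer widths are illustrative assumptions, not the actual ControlNet architecture.

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution whose weights and bias start at zero.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlBranch(nn.Module):
    def __init__(self, cond_channels=4, feat_channels=320):
        super().__init__()
        self.encode = nn.Sequential(          # trainable conditional adapter
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
        )
        self.out = zero_conv(feat_channels)   # zero-initialized injection

    def forward(self, backbone_feat, cond):
        # At initialization self.out outputs zeros, so the frozen backbone
        # features pass through unchanged.
        return backbone_feat + self.out(self.encode(cond))

branch = ControlBranch()
feat = torch.randn(1, 320, 64, 64)   # frozen UNet feature map (assumed size)
cond = torch.randn(1, 4, 64, 64)     # separably transformed sensor data C^o
print(torch.allclose(branch(feat, cond), feat))  # True at initialization
```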

Text-guided Conditionality: DifuzCam extends the LDM’s capacity for text-conditional synthesis by incorporating user-provided scene descriptions as semantic priors. Text tokens are embedded alongside sensor-derived features, enabling the reconstruction process to favor semantically consistent, realistic images even in the presence of ambiguous measurements. In cases of conflict, text prevails, highlighting the primacy of the semantic prior in underconstrained reconstruction.
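
As a sketch of this text-conditioning path, the snippet below embeds a user-provided scene description with the CLIP text encoder used by Stable Diffusion v1; the checkpoint name and prompt are illustrative choices, and the resulting per-token embedding plays the role of the text input y in the losses of the next section.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Text encoder of Stable Diffusion v1 (illustrative checkpoint choice).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photo of a red bicycle leaning against a brick wall"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    # Per-token embeddings used as the text conditioning y for the denoiser.
    y = text_encoder(tokens.input_ids).last_hidden_state
print(y.shape)  # torch.Size([1, 77, 768])
```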

4. Learning and Loss Functions

Training and adaptation in DifuzCam optimize the ControlNet weights and separable transform parameters:

  • Diffusion Training Loss:

l_\mathcal{C} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,I),\, t}\left[\|\epsilon - \epsilon_\theta(z_t, t, y, \mathcal{C}_\psi(C^o))\|_2\right]

where \epsilon_\theta is the LDM denoiser, z_t is the noisy latent at time t, y is the optional text input, and \mathcal{C}_\psi is the ControlNet operator acting on C^o.

  • Separable Reconstruction Loss:

l_{sep} = \|I - f_{conv}(C^o)\|_2

with f_{conv} a learnable 3 \times 3 convolution mapping C^o to RGB and I the target RGB image.

  • Latent Diffusion Loss (optional):

l_{ldm} = \mathbb{E}\left[\|\epsilon - \epsilon_\theta(z_t, t, y)\|\right]

Training uses real data pairing RGB images with flat-lens sensor measurements; simulated data alone yields inferior calibration and generalization.
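
The following PyTorch sketch shows how the diffusion loss and the separable reconstruction loss might be combined during training. The `denoiser`, `controlnet`, `encode_latent`, and `alphas_cumprod` arguments are hypothetical stand-ins for the frozen LDM denoiser, the ControlNet adapter, the VAE encoder, and the noise schedule; it illustrates the objective, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learnable 3x3 convolution mapping the 4-channel measurement C^o to RGB.
f_conv = nn.Conv2d(4, 3, kernel_size=3, padding=1)

def training_losses(denoiser, controlnet, encode_latent, alphas_cumprod,
                    rgb_image, sensor_meas, text_emb):
    # Auxiliary separable loss: the transformed measurement should
    # approximate the ground-truth RGB image.
    l_sep = F.mse_loss(f_conv(sensor_meas), rgb_image)

    # Diffusion loss on the noisy latent, conditioned on text and on the
    # ControlNet features extracted from the sensor measurement.
    z0 = encode_latent(rgb_image)                      # VAE latent E(x)
    t = torch.randint(0, alphas_cumprod.numel(),
                      (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z0 + (1 - a_t).sqrt() * eps     # forward diffusion
    eps_hat = denoiser(z_t, t, text_emb, controlnet(sensor_meas))
    l_diff = F.mse_loss(eps_hat, eps)

    return l_diff + l_sep
```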

5. Experimental Validation

DifuzCam improves on prior approaches across standard fidelity and perceptual image reconstruction metrics:

Method     PSNR (dB) ↑   SSIM ↑   LPIPS ↓   CLIP Score ↑
FlatNet    18.6          0.51     0.32      21.7
DifuzCam   21.89         0.54     0.276     24.38

Visual reconstructions demonstrate substantial improvements in perceptual quality, color fidelity, and semantic structure. Text-guided reconstruction further enhances scene consistency, allowing ambiguous content to be specified or corrected. Ablation studies show that omitting the separable loss or the textual guidance introduces perceptual artifacts or semantic drift, whereas the jointly optimized model remains robust.

The approach generalizes to both synthetic projections and real 3D scene data.

6. Implications for Computational Imaging and Semantic Scene Reconstruction

DifuzCam establishes that coupling flat lensless hardware with deep generative priors (diffusion models + ControlNet) yields thin, robust, and perceptually accurate imaging devices. Unlike deterministic inversion algorithms, the generative approach regularizes ill-posed reconstructions, exploits semantic priors for ambiguity resolution, and allows dynamic conditioning (e.g., via descriptive text). This methodology is extensible across modalities: any inverse-imaging problem subject to severe multiplexing or data loss can, in principle, benefit from a learned generative prior and a controlling conditional adapter.

A plausible implication is that next-generation computational cameras will rely not only on novel optics but on deep integration with generative modeling frameworks, expanding the domain of possible form-factors and functionalities well beyond traditional lens-based paradigms.

7. Broader Context within Lensless and Diffusion-Based Imaging

DifuzCam continues a sequence of advances originating with FlatCam [ICCVW 2015], Spectral DiffuserCam (Monakhova et al., 2020), and compressive diffuser cameras including single-pixel paradigms (Liu et al., 2021). Its integration of a pre-trained diffusion model and text-conditioned ControlNet for image formation marks a notable progression from prior feed-forward or GAN-based recovery, shifting the focus to data-driven semantic reconstruction.

While systems like Diffuser-mCam (Zheng et al., 18 Jul 2025) extend the lensless paradigm to full multi-modal 5D imaging (spatial, spectral, polarization, temporal), DifuzCam prioritizes spatial image quality and semantic interpretability via powerful priors. This suggests a bifurcation in lensless computational imaging: modality expansion vs. perceptual/prior enhancement—both avenues likely to coalesce in future systems as computational resources and modeling capacities scale.


References cited in source data: FlatCam (ICCVW 2015); FlatNet (ICCV 2019); Stable Diffusion (Rombach et al., CVPR 2022); ControlNet (Zhang et al., ICCV 2023); Spectral DiffuserCam (Monakhova et al., 2020).
