Photometric Stereo in 3D Vision
- Photometric stereo is a technique that estimates per-pixel surface normals and depth by analyzing images under variable illumination, enabling precise 3D reconstruction.
- Classical methods assume Lambertian reflectance while modern approaches incorporate non-Lambertian, near-field, and learning-based models to handle real-world complexities.
- Recent advances fuse multi-modal, event-based, and deep learning techniques to enhance robustness and real-time performance in applications like robotics and biomedical imaging.
Photometric stereo is a foundational computer vision technique for recovering per-pixel surface normals (and, via integration, depth) of a static object by analyzing images captured from a fixed viewpoint under varying illumination. The method exploits the relationship between surface orientation and measured intensity, leveraging controlled or measured changes in lighting to infer 3D shape. Over the past four decades, photometric stereo has undergone substantial evolution, now encompassing diverse illumination scenarios, reflectance models, sensor modalities, and learning-based architectures.
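Since depth recovery from the estimated normal field is itself a standard step, a minimal sketch of one classical choice, Frankot–Chellappa Fourier-domain integration, follows (the orthographic-camera and periodic-boundary assumptions are simplifications):

```python
import numpy as np

def integrate_normals(normals):
    """Depth from a (H, W, 3) unit-normal map via Frankot-Chellappa integration."""
    nx, ny = normals[..., 0], normals[..., 1]
    nz = np.maximum(normals[..., 2], 1e-6)
    p, q = -nx / nz, -ny / nz                      # surface gradients dz/dx, dz/dy
    H, W = p.shape
    wx = 2 * np.pi * np.fft.fftfreq(W)[None, :]    # angular frequencies
    wy = 2 * np.pi * np.fft.fftfreq(H)[:, None]
    denom = wx**2 + wy**2
    denom[0, 0] = 1.0                              # avoid divide-by-zero at DC
    Z = (-1j * wx * np.fft.fft2(p) - 1j * wy * np.fft.fft2(q)) / denom
    Z[0, 0] = 0.0                                  # depth is defined only up to a constant
    return np.real(np.fft.ifft2(Z))
```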
1. Mathematical Principles and Image Formation
Classical photometric stereo models the image intensity at each pixel under the assumption of orthographic projection, directional lighting, and Lambertian reflectance. For a pixel $p$, the intensity under the $j$-th light is

$$I_j(p) = \rho(p)\, \mathbf{l}_j^{\top} \mathbf{n}(p),$$

where $\rho(p)$ is the albedo, $\mathbf{l}_j$ is the known unit light direction, and $\mathbf{n}(p)$ the unit surface normal. Collecting $m$ images under different illuminations yields a linear system $\mathbf{i}(p) = L\,\mathbf{g}(p)$ with $L \in \mathbb{R}^{m \times 3}$ and $\mathbf{g}(p) = \rho(p)\,\mathbf{n}(p)$, solvable when $m \geq 3$ and the light directions are non-coplanar; the albedo is then recovered as $\|\mathbf{g}\|$ and the normal as $\mathbf{g}/\|\mathbf{g}\|$.
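A minimal NumPy sketch of this calibrated Lambertian solve (array shapes and the least-squares call are illustrative choices):

```python
import numpy as np

def lambertian_ps(I, L):
    """Classical calibrated photometric stereo.
    I: (M, P) image stack (M lights, P pixels); L: (M, 3) unit light directions."""
    G, *_ = np.linalg.lstsq(L, I, rcond=None)  # solve L @ G = I for G = rho * n, shape (3, P)
    rho = np.linalg.norm(G, axis=0)            # per-pixel albedo
    N = G / (rho + 1e-12)                      # unit surface normals
    return N.T, rho                            # (P, 3) normals, (P,) albedos
```

With $m > 3$ lights the least-squares solution also averages out sensor noise, which is why practical rigs use many more than three illuminations.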
In reality, non-Lambertian effects (specularity, shadows, spatially or spectrally varying BRDFs), perspective, and near-field effects violate this idealization. The general model introduces a local or spatially-varying BRDF $\rho(p, \mathbf{n}, \mathbf{l}, \mathbf{v})$ and, for color or near-field lighting, explicit dependence on spectral and geometric parameters. Image formation may therefore be written as

$$I_j(p) = \rho\big(p, \mathbf{n}(p), \mathbf{l}_j, \mathbf{v}\big)\,\max\!\big(\mathbf{n}(p)^{\top}\mathbf{l}_j,\, 0\big) + \epsilon_j(p),$$
and can be extended to account for ambient light, participating media, near-point sources, or combinations thereof (Chen et al., 2018, Li et al., 19 Jan 2026, Fujimura et al., 2018).
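A hedged one-pixel forward model illustrating these terms (the Blinn–Phong specular lobe, the ambient constant, and all parameter names are illustrative choices, not a specific paper's model):

```python
import numpy as np

def render_pixel(n, l, v, rho_d, rho_s, shininess, ambient=0.05):
    """Render one pixel under one light with a diffuse + Blinn-Phong BRDF,
    an attached-shadow clamp, and an ambient term.
    n, l, v: unit normal, light, and view vectors."""
    ndotl = max(np.dot(n, l), 0.0)              # attached shadow: no light below horizon
    h = (l + v) / np.linalg.norm(l + v)         # Blinn-Phong half-vector
    spec = rho_s * max(np.dot(n, h), 0.0) ** shininess
    return ambient + (rho_d + spec) * ndotl
```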
2. Model Extensions: Non-Lambertian and Near-Field Regimes
Lambertian methods are fragile in the presence of specularities, shadows, and non-ideal material properties. Non-Lambertian photometric stereo approaches can be grouped into:
- Analytic reflectance modeling: Employing parametric BRDFs (e.g., Blinn–Phong) or incorporating perspective (Khanian et al., 2017).
- Data-driven learning: Deep convolutional and transformer-based networks that regress normals from intensities and known (or estimated) light directions, bypassing explicit BRDF parameterization (Chen et al., 2020, Ju et al., 2022).
- Inverse rendering frameworks: Optimizing geometry and appearance parameters (e.g., via neural implicit representations or differentiable renderers) guided by the physical image formation equations, potentially unsupervised (Taniai et al., 2018, Ducastel et al., 9 Jul 2025).
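As a concrete instance of the inverse-rendering idea, the following sketch fits per-pixel normals and albedo by gradient descent on the Lambertian re-rendering loss; the initialization, step size, and manual gradients are illustrative simplifications of the cited neural approaches:

```python
import numpy as np

def inverse_rendering_ps(I, L, iters=500, lr=0.01):
    """Toy unsupervised photometric stereo: minimize ||render - observed||^2.
    I: (P, M) intensities for P pixels under M lights; L: (M, 3) unit light dirs."""
    P, M = I.shape
    N = np.tile([0.0, 0.0, 1.0], (P, 1))        # init: normals facing the camera
    rho = I.mean(axis=1, keepdims=True)         # init: albedo from mean intensity
    for _ in range(iters):
        shading = N @ L.T                       # (P, M) values of n.l
        lit = shading > 0                       # attached-shadow mask
        err = rho * np.maximum(shading, 0.0) - I
        gN = (err * lit * rho) @ L              # gradient of 0.5*loss w.r.t. N
        grho = (err * np.maximum(shading, 0.0)).sum(1, keepdims=True)
        N -= lr * gN
        rho -= lr * grho
        N /= np.linalg.norm(N, axis=1, keepdims=True) + 1e-12  # project to unit sphere
    return N, rho
```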
Near-field photometric stereo accurately models lighting attenuation, anisotropy, and geometric variation:

$$I(p) = a(\mathbf{x}, \mathbf{s})\,\rho\big(\mathbf{n}, \mathbf{l}, \mathbf{v}\big)\,\max\!\big(\mathbf{n}^{\top}\mathbf{l}(\mathbf{x}),\, 0\big), \qquad \mathbf{l}(\mathbf{x}) = \frac{\mathbf{s} - \mathbf{x}}{\|\mathbf{s} - \mathbf{x}\|},$$

where $a(\mathbf{x}, \mathbf{s})$ describes intensity attenuation as a function of the light position/direction (typically inverse-square radial falloff combined with an angular anisotropy term), $\mathbf{x}$ is the surface point imaged at $p$, and $\rho$ is the BRDF (Lichy et al., 2022, Li et al., 19 Jan 2026). Solutions employ recursive multi-resolution architectures or neural implicit representations to achieve robustness and real-time inference (Lichy et al., 2022, Li et al., 19 Jan 2026).
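A per-point sketch of this near-field shading model (the anisotropy exponent `mu` and source intensity `Phi` are assumed parameters of a common point-source model, not taken from a specific paper):

```python
import numpy as np

def near_field_shading(x, n, s, d, mu=1.0, Phi=1.0, rho=1.0):
    """Lambertian shading under an anisotropic near-field point light.
    x: surface point; n: unit normal; s: light position; d: unit principal
    direction of the light; mu: angular-falloff exponent; Phi: intensity."""
    to_light = s - x
    r = np.linalg.norm(to_light)
    l = to_light / r                                  # light direction varies per point
    a = Phi * max(np.dot(d, -l), 0.0) ** mu / r**2    # anisotropy * inverse-square falloff
    return a * rho * max(np.dot(n, l), 0.0)
```

Because $\mathbf{l}(\mathbf{x})$ depends on the unknown geometry, near-field recovery is inherently nonlinear and is typically solved iteratively over depth and normals.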
3. Learning-Based Architectures and Fusion Strategies
Modern photometric stereo exploits learning-based fusion at multiple levels. Principal schemes include:
- Per-pixel fusion via observation maps: Each pixel's intensity under all lights is projected (possibly with the corresponding light direction) into a feature grid for per-pixel processing (Ju et al., 2022); a construction sketch follows this list.
- Global feature aggregation: Siamese CNNs extract per-image features, then channel-wise pooling (e.g., max/attention/self-attention) aggregates across illuminations, supporting order-agnostic, variable-length input (Chen et al., 2020, Hardy et al., 2022, Ju et al., 2022); see the pooling sketch after the table below.
- Multi-scale and hybrid approaches: Networks refine initial coarse predictions with finer-scale spatial context, or combine inter-frame (across images) and intra-frame (spatial) features for improved accuracy, handling arbitrarily sized inputs (Hardy et al., 2022, Cao et al., 2020).
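A minimal observation-map construction for a single pixel (the grid size, disc projection, and max-based normalization are illustrative choices in the style of per-pixel methods):

```python
import numpy as np

def observation_map(intensities, light_dirs, size=32):
    """Project one pixel's observations onto a 2D map indexed by light direction.
    intensities: (M,) values under M lights; light_dirs: (M, 3) unit directions, lz > 0."""
    obs = np.zeros((size, size))
    u = ((light_dirs[:, 0] + 1) / 2 * (size - 1)).round().astype(int)  # x -> column
    v = ((light_dirs[:, 1] + 1) / 2 * (size - 1)).round().astype(int)  # y -> row
    vals = intensities / (intensities.max() + 1e-12)     # scale-invariant normalization
    for j in range(len(vals)):
        obs[v[j], u[j]] = max(obs[v[j], u[j]], vals[j])  # keep brightest per cell
    return obs  # fed to a per-pixel network that regresses the normal
```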
Tabular summary of key learning-based paradigms:
| Fusion Approach | Network Module | Key Papers |
|---|---|---|
| Per-pixel (obs-map) | DenseNet, Attention | (Ju et al., 2022) |
| Feature-pooling | Siamese CNN, Max-pool | (Chen et al., 2018, Chen et al., 2020) |
| Transformer/global | Set/Aggregate Attention | (Ikehata, 2022) |
| Multi-scale | Pyramid CNNs | (Hardy et al., 2022) |
These architectures are often trained on large physically grounded synthetic datasets covering diverse shapes and BRDFs to enable generalization to real-world surfaces (Hardy et al., 2022).
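The order-agnostic pooling step at the heart of feature-aggregation networks reduces to a per-channel maximum over the image axis; a sketch with illustrative feature shapes:

```python
import numpy as np

def fuse_features(per_image_feats):
    """Fuse a variable number of per-illumination feature maps.
    per_image_feats: list of (C, H, W) arrays, one per light."""
    stack = np.stack(per_image_feats, axis=0)  # (M, C, H, W)
    return stack.max(axis=0)                   # (C, H, W); invariant to image order and count
```

Max-pooling is a deliberate design choice: it keeps, per feature channel, the strongest response across illuminations, making the fused representation independent of how many images were captured and in what order.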
4. Robustness: Dictionary Learning, Unsupervised Learning, and Participating Media
Robust photometric stereo methods address non-idealities in both reflectance and capture environments:
- Dictionary learning: Regularizes either the image stack or the normal vector field to be locally sparse in a learned basis, suppressing noise, shadows, and outliers without direct BRDF modeling (Wagenmaker et al., 2017).
- Unsupervised and self-supervised learning: Unrolls test-time optimization of both normals and a parameterized or learned BRDF, minimizing reconstruction error between observed and re-rendered images without ground truth normals (Taniai et al., 2018).
- Participating media: Analytical models for forward/backward scatter are used to invert the dense radiance-mixing matrix via sparse approximations, supporting 3D recovery inside turbid or underwater environments (Fujimura et al., 2018).
Each approach provides specific robustness against corruptions, lack of supervision, or physically adverse imaging conditions.
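As a simple illustration of the corruption problem these methods address, a classical trimmed baseline (not any of the cited methods) discards the per-pixel observations most likely affected by shadows or specular highlights before the Lambertian solve; the trimming fractions are arbitrary:

```python
import numpy as np

def trimmed_lambertian_ps(I, L, drop_low=0.2, drop_high=0.1):
    """Per pixel, drop the darkest (shadow-prone) and brightest (specular-prone)
    observations, then solve Lambertian photometric stereo on the rest.
    I: (P, M) intensities; L: (M, 3) unit light directions."""
    P, M = I.shape
    lo, hi = int(M * drop_low), M - int(M * drop_high)
    order = np.argsort(I, axis=1)              # per-pixel intensity ranking
    normals = np.zeros((P, 3))
    for p in range(P):
        keep = order[p, lo:hi]                 # retained observation indices
        g, *_ = np.linalg.lstsq(L[keep], I[p, keep], rcond=None)  # g = rho * n
        normals[p] = g / (np.linalg.norm(g) + 1e-12)
    return normals
```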
5. Event-Based and Multi-Modal Approaches
Emerging modalities exploit hardware characteristics:
- Event-based photometric stereo: An event camera, in combination with a single rotating light source, provides high dynamic range, high temporal resolution, and robustness to ambient light. Surface normals are recovered from per-pixel binned event histograms using lightweight MLPs (see the binning sketch after this list). Analytical and learned methods are combined, with learning yielding improved angular error (12.24° vs. baselines up to 16.91° on DiLiGenT-EV) and strong robustness under high-dynamic-range conditions (Kim et al., 11 Mar 2026).
- Event fusion with RGB: Fusion networks jointly process event signals and per-frame RGB data, leveraging the sparsity and temporal precision of events to complement intensity maps, outperforming RGB-only networks under challenging ambient conditions (Ryoo et al., 2023).
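A sketch of the per-pixel event binning that produces the MLP input (the field names, the polarity encoding as ±1, and the fixed rotation period are assumptions for illustration):

```python
import numpy as np

def event_histograms(events, H, W, n_bins=16, period=1.0):
    """Bin events by the phase of the rotating light.
    events: structured array with fields x, y, t (seconds), polarity (+1/-1)."""
    hist = np.zeros((H, W, n_bins))
    phase = (events["t"] % period) / period                 # light's angular position
    b = np.minimum((phase * n_bins).astype(int), n_bins - 1)
    np.add.at(hist, (events["y"], events["x"], b), events["polarity"])
    return hist  # each pixel's signed event count per phase bin feeds the MLP
```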
6. Multi-View and Outdoor Scenarios
Classical photometric stereo assumes a static viewpoint and controlled lighting. Recent research advances toward:
- Multi-view photometric stereo: Combines per-view normal estimation (via iso-depth contours or per-view photometric stereo) with multi-view geometric fusion (e.g., via SfM and Poisson meshing), supporting recovery of complete shape and spatially-varying isotropic BRDFs even with perspective and near-field lighting (Li et al., 2020).
- Outdoor and universal lighting: Conditioning analyses show that calibrated outdoor photometric stereo is fundamentally ill-posed under clear-sky illumination, whereas mixed or partially cloudy conditions "randomize" the lighting subspace and render the problem solvable by classical algorithms. Weakly-calibrated networks can combine photometric cues with data-driven priors for single-day shape estimation (Hold-Geoffroy et al., 2018). Universal photometric stereo dispenses with any parametric lighting model, learning global lighting contexts via transformer-based fusion and achieving significant MAE improvements under arbitrary, spatially-varying illumination (Ikehata, 2022).
7. Quantitative Benchmarks, Limitations, and Future Directions
Benchmark evaluations on DiLiGenT and related datasets show continued reductions in mean angular error (MAE) for surface normals as methods evolve from analytic to deep learning and hybrid approaches. For example, PS-FCN achieves an average MAE of 8.39° (Chen et al., 2018), while multi-scale variants and transformer-based architectures now reach roughly 6.3° MAE or below (Hardy et al., 2022, Ju et al., 2022).
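The metric itself is the per-pixel angle between estimated and ground-truth normals, averaged over the evaluation mask; a sketch:

```python
import numpy as np

def mean_angular_error(n_est, n_gt, mask=None):
    """MAE in degrees between (H, W, 3) unit-normal maps, optionally masked."""
    dot = np.clip((n_est * n_gt).sum(axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(dot))           # per-pixel angular error
    return ang[mask].mean() if mask is not None else ang.mean()
```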
Key limitations and open challenges include:
- Sensitivity of classical pipelines to extreme BRDFs, cast shadows, and spatially complex illumination.
- Dependence on synthetic data for supervised training and limited real data for fine-tuning or domain transfer.
- Handling of cast shadows and interreflections remains suboptimal, especially in explicit or analytic frameworks (Ducastel et al., 9 Jul 2025).
- Requirement for accurate light calibration or self-calibration, especially for near-field/color PS and single-shot systems (Chen et al., 2019, Li et al., 19 Jan 2026).
- Real-time or energy-constrained inference, particularly for edge and robotic applications, addressed by event-based and lightweight recursive networks (Kim et al., 11 Mar 2026, Lichy et al., 2022).
Future directions are anticipated in hybrid physics-learning models, scalable self-supervised pipelines, universal lighting representations, and integration with neural rendering and NeRF, enabling seamless multi-view/appearance fusion (Ju et al., 2022, Ducastel et al., 9 Jul 2025).
Photometric stereo continues to be an active and richly cross-disciplinary domain, combining principles from radiometry, geometry, optimization, and deep learning, with applications ranging from robotic perception and cultural heritage digitization to in-the-wild object scanning and biomedical imaging.