
LiNO-UniPS: Unified Photometric Stereo

Updated 28 January 2026
  • The paper introduces LiNO-UniPS, a deep learning framework that unifies lighting-invariant feature encoding with wavelet-based detail recovery for robust surface normal estimation.
  • It employs novel light-register tokens and interleaved multi-image attention to decouple illumination effects from intricate geometric details in multi-illumination images.
  • The approach integrates cascaded attention blocks with wavelet-aware upsampling, achieving state-of-the-art performance on standard universal photometric stereo benchmarks.

LiNO-UniPS denotes "Light of Normals: Unified Feature Representation for Universal Photometric Stereo," a deep learning framework targeting robust surface normal estimation from multi-illumination image sets without lighting, reflectance, or segmentation assumptions. It advances the universal photometric stereo (UPS) paradigm by unifying lighting-invariant feature encoding, detail-rich geometric recovery, and multi-domain learning in a single model architecture, validated on diverse benchmarks and synthetic datasets (Li et al., 23 Jun 2025).

1. Universal Photometric Stereo: Formulation and Challenges

Universal photometric stereo seeks to estimate the per-pixel normal map $N \in \mathbb{R}^{H\times W\times 3}$ of an object given a fixed-view collection $\{I_f\}_{f=1}^F$ of images under unknown and potentially complex lighting, without presuming known light directions or simple reflectance. The underlying image-formation process is modeled as

$$I_f(p) = \mathcal{F}\big(n(p), L_f, m(p)\big)$$

where $p=(x,y)$ is the image pixel, $n(p)$ is the surface normal, $L_f$ aggregates lighting parameters for the $f$-th image (environment, point, area), and $m(p)$ describes the material/BRDF. Unlike classical calibrated or Lambertian PS, both $L_f$ and $m(\cdot)$ are unknown and may include non-Lambertian and spatially varying characteristics.

The practical goal is to learn a feed-forward mapping $\{I_f\}_{f=1}^F \to N$ that is robust to lighting and reflectance ambiguity. The main technical obstacles are: (1) disentangling illumination from surface-normal features where intensity variation is ambiguous; and (2) preserving high-frequency detail in surface geometry when image cues are intricate or degraded by shadow, interreflection, and non-ideal lighting (Li et al., 23 Jun 2025).
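
Concretely, the learned mapping consumes a stack of $F$ fixed-view images and emits a single unit-normal map. A minimal shape-contract sketch in PyTorch (the class name and placeholder body are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalPS(nn.Module):
    """Illustrative input/output contract for universal photometric stereo:
    F photographs of one view under unknown lighting -> one normal map."""
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (F, 3, H, W); F may vary per scene (as few as 3 in practice)
        num_frames, _, H, W = images.shape
        normals = torch.randn(3, H, W)      # placeholder for the real prediction
        return F.normalize(normals, dim=0)  # unit-length normals, (3, H, W)
```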

2. Lighting-Invariant, Detail-Rich Feature Encoding

LiNO-UniPS introduces a multi-component encoder achieving both lighting invariance and fine geometric fidelity through three primary innovations:

  • Light-Register Tokens: Three learnable tokens ($x_{\rm hdri}, x_{\rm point}, x_{\rm area} \in \mathbb{R}^D$) are prepended to every per-image token stream, acting as attention gatherers for environment (HDRI), point, and area lighting cues across all frames; self-attention aggregates global illumination context through these tokens (see the sketch after this list).
  • Interleaved Multi-Image Attention: After ViT-style patch embedding, tokens are processed through four cascaded attention blocks (Frame, LightAxis, Global, LightAxis), alternating between intra-image, inter-image, and global contextualization. This enables information flow that disentangles lighting effects from geometry and fosters global spatial awareness.
  • Wavelet-Aware Down/Up Sampling: To maintain high-frequency geometry, each $I_f$ is decomposed into a bilinearly downsampled stream ($I^d_f$) and four discrete Haar-type wavelet bands ($I^{ll}_f, I^{lh}_f, I^{hl}_f, I^{hh}_f$), which are separately tokenized and jointly encoded. A WaveUpSampler later performs an inverse DWT followed by summation and smoothing to reconstruct the full-detail encoder output $F_{\rm enc}$.
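
To make the first two bullets concrete, here is a minimal PyTorch sketch of light-register tokens plus frame/light-axis attention; the dimensions, head count, and the use of `nn.MultiheadAttention` are simplifying assumptions rather than the paper's exact blocks:

```python
import torch
import torch.nn as nn

class LightRegisterEncoder(nn.Module):
    """Sketch: prepend three learnable light tokens to each image's patch
    tokens, then alternate attention within an image ("Frame") and across
    images at the same token position ("LightAxis")."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        # one register each for HDRI/environment, point, and area lighting
        self.light_tokens = nn.Parameter(torch.randn(3, dim))
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.light_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                 # tokens: (F, N, D), F images
        num_frames = tokens.shape[0]
        reg = self.light_tokens.unsqueeze(0).expand(num_frames, -1, -1)
        x = torch.cat([reg, tokens], dim=1)    # (F, 3+N, D), registers first
        x = x + self.frame_attn(x, x, x)[0]    # intra-image ("Frame") attention
        x = x.transpose(0, 1)                  # (3+N, F, D): same token, all images
        x = x + self.light_attn(x, x, x)[0]    # inter-image ("LightAxis") attention
        return x.transpose(0, 1)[:, 3:]        # (F, N, D), registers dropped
```

LayerNorms, MLPs, and the "Global" block (attention over all $F\times(3{+}N)$ tokens flattened into one sequence) are omitted here for brevity.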

The fused encoder representation $F_{\rm enc} \in \mathbb{R}^{F\times H\times W\times C}$ is explicitly designed to be invariant to lighting changes and rich in normal-sensitive, high-spatial-frequency content.
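
The wavelet stream above can be illustrated with a single-level 2D Haar transform; since the paper's exact filters and normalization are not given here, orthonormal Haar is an assumption:

```python
import torch

def haar_dwt(x):
    """One-level 2D Haar DWT. x: (..., H, W) with even H, W.
    Returns (ll, lh, hl, hh), each of shape (..., H/2, W/2)."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-pass band: coarse structure
    lh = (a + b - c - d) / 2   # detail band
    hl = (a - b + c - d) / 2   # detail band
    hh = (a - b - c + d) / 2   # diagonal detail band
    return ll, lh, hl, hh

def haar_idwt(ll, lh, hl, hh):
    """Inverse of haar_dwt: exact reconstruction of the input image."""
    H2, W2 = ll.shape[-2:]
    x = ll.new_zeros(*ll.shape[:-2], H2 * 2, W2 * 2)
    x[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[..., 0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[..., 1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x
```

The WaveUpSampler's inverse step corresponds to `haar_idwt`: because the transform is exactly invertible, high-frequency content that plain bilinear downsampling would discard survives through the encoder.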

3. Network Architecture and Processing Pipeline

LiNO-UniPS adopts an encoder–decoder paradigm, integrating specialized stages for illumination-context integration and gradient-preserving detail recovery:

  • Input Preprocessing: The $F$ images are decomposed into low-pass and high-frequency wavelet bands, with each stream receiving the three light tokens.
  • Patch Embedding & Transformer Backbone: Each stream is patch-embedded (patch size $P=8$, embedding dimension $D=384$), producing tokens processed by a DINOv2-initialized ViT backbone.
  • Cascaded Contextual Attention: Four interleaved attention blocks reinforce lighting–normal decoupling and spatial detail aggregation.
  • Multi-Scale Feature Fusion: A DPT-based fusion module forms a four-level spatial pyramid, fused via residual convolution blocks to ensure feature consistency and spatial resolution.
  • Wavelet-Based Detail Synthesis: The WaveUpSampler upsamples and reconstructs spatial features, using inverse DWT and smoothing for final FencF_{\rm enc}.
  • Pixel-Sampling Decoder: $m$ pixel locations per scene are randomly sampled. For each pixel $x_i$, the corresponding $F^f_{\rm enc}(x_i)$ and a high-dimensional projection of $I_f(x_i)$ form the input to pooling-by-multihead-attention and frame/light-axis self-attention, followed by a 2-layer MLP that predicts the normal $\tilde n(x_i)$. Full maps are assembled by interpolation (see the sketch after this list).
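
A condensed sketch of the decoder's pooling-and-regression step; the shapes and the single learned pooling query are assumptions, and the real decoder also mixes in projected pixel intensities and frame/light-axis self-attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelSamplingDecoder(nn.Module):
    """Sketch: for each of m sampled pixels, pool its F per-frame feature
    vectors with a learned attention query, then regress a unit normal."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # pooling query
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

    def forward(self, feats):                          # feats: (m, F, D)
        q = self.query.expand(feats.shape[0], -1, -1)  # (m, 1, D)
        pooled, _ = self.pool(q, feats, feats)         # pool across the F frames
        n = self.mlp(pooled.squeeze(1))                # (m, 3) raw normals
        return F.normalize(n, dim=-1)                  # unit-length predictions
```

At inference, per-pixel features are gathered from $F_{\rm enc}$ at the sampled coordinates, and the full-resolution normal map is assembled by interpolating between sampled predictions, as described above.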

This architecture allows selective attention to both pixel-level and scene-global information, yielding robust reconstruction in complex lighting and material regimes.

4. Supervision: Loss Functions and Training

The training objective combines several loss components targeting both lighting disentanglement and accurate normal regression:

  • Lighting Alignment Losses: For synthetic data with ground-truth lighting, each lighting parameter (HDRI, point, area) is mapped to a $D$-dimensional vector via an MLP and encouraged to align with the corresponding learned light token via cosine similarity:

$$\mathcal{L}_{\rm hdri} = 1 - \langle \ell^h_{\rm hdri},\, x^h_{\rm hdri} \rangle,$$

and analogously for point and area lighting.

  • Normal Regression and Gradient Loss: Predictions $\tilde N$ and their gradients $\tilde G = \nabla \tilde N$ are compared to ground truth via a confidence-weighted quadratic loss and an explicit gradient loss:

$$\mathcal{L}_{\rm conf} = \sum_p C(p)\,\|\tilde N(p) - N(p)\|^2, \qquad \mathcal{L}_{\rm grad} = \sum_p \|\tilde G(p) - G(p)\|^2,$$

where $C(p) = \exp(\|\tilde G(p)\|)$.

  • Total Loss:

$$\mathcal{L}_{\rm total} = \lambda_1\mathcal{L}_{\rm hdri} + \lambda_2\mathcal{L}_{\rm point} + \lambda_3\mathcal{L}_{\rm area} + \lambda_4\mathcal{L}_{\rm conf} + \lambda_5\mathcal{L}_{\rm grad}.$$

The weights use $\lambda_4 = 1$, with the others set adaptively to keep each auxiliary term at roughly $0.1\times$ the main loss.
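
A minimal sketch of these supervision terms, assuming finite-difference gradients and a non-differentiable confidence weight (both assumptions; the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def light_alignment_loss(light_vec, token):
    """L = 1 - <l, x>: cosine alignment between an MLP-embedded lighting
    parameter and its learned light token (hdri, point, or area)."""
    return 1 - F.cosine_similarity(light_vec, token, dim=-1).mean()

def normal_losses(n_pred, n_gt):
    """Confidence-weighted normal loss and gradient loss.
    n_pred, n_gt: (B, 3, H, W) unit normal maps."""
    def grad(n):  # finite-difference spatial gradients, cropped to align
        gx = n[..., 1:, 1:] - n[..., 1:, :-1]   # horizontal differences
        gy = n[..., 1:, 1:] - n[..., :-1, 1:]   # vertical differences
        return torch.cat([gx, gy], dim=1)       # (B, 6, H-1, W-1)

    g_pred, g_gt = grad(n_pred), grad(n_gt)
    # C = exp(||grad of prediction||); detached so the weight itself
    # does not receive gradients (an assumption of this sketch)
    conf = torch.exp(g_pred.norm(dim=1, keepdim=True)).detach()
    l_conf = (conf * (n_pred[..., 1:, 1:] - n_gt[..., 1:, 1:]).pow(2)).mean()
    l_grad = (g_pred - g_gt).pow(2).mean()
    return l_conf, l_grad
```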

Training uses the PS-Verse synthetic set (100K rendered scenes over five complexity/material/illumination levels), AdamW optimization, and progressive fine-tuning from simple to texture-rich scenes.

5. Benchmark Performance and Empirical Analysis

LiNO-UniPS achieves state-of-the-art normal estimation across standard universal PS benchmarks.

Ablation analyses on PS-Verse show that each model component (light tokens, global attention, loss terms, wavelets, normal-gradient loss) cumulatively reduces mean angular error (MAE, by up to $3.9^\circ$), with steady improvements in encoder-feature SSIM/CSIM.

Quantitative results:

  • DiLiGenT: LiNO-UniPS reaches MAE $4.74^\circ$ (vs. Uni MS-PS $4.97^\circ$, SDM-UniPS $5.80^\circ$, UniPS $14.70^\circ$), state of the art on most objects.
  • LUCES: MAE $9.48^\circ$ (vs. Uni MS-PS $11.10^\circ$).
  • DiLiGenT$^2$: Error matrices uniformly lower than prior art.
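
For reference, mean angular error as typically computed for these benchmarks (a standard definition, not code from the paper):

```python
import torch

def mean_angular_error(n_pred, n_gt):
    """Mean angular error in degrees between unit normal maps.
    n_pred, n_gt: (..., 3) unit vectors."""
    cos = (n_pred * n_gt).sum(dim=-1).clamp(-1.0, 1.0)  # per-pixel cosine
    return torch.rad2deg(torch.acos(cos)).mean()
```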

Feature–accuracy correlation: Encoder feature CSIM/SSIM tightly predicts per-pixel normal accuracy, confirming the efficacy of the lighting-invariant, detail-rich representation.

High-resolution, real-world validation: 2K–4K, mask-free, complex-object scenes demonstrate superior fidelity to commercial 3D scanner outputs.

6. Strengths, Limitations, and Future Development

Strengths:

  • Decoupling of lighting and geometric features using light tokens and global attention.
  • Preservation of high-frequency detail by wavelet-based sampling and gradient-aware supervision.
  • Strong cross-domain generalization (materials, lighting, sparse input counts; robust results for as few as $K=3$ lights).
  • CSIM/SSIM-based consistency of encoder features closely tracks output accuracy.

Limitations and directions for extension:

  • Global attention stages pose high computational cost; sparser mechanisms may lower resource use.
  • Occasional normal-flip ambiguity on near-planar regions in absence of explicit lighting cues.
  • Current approach is single-view only; extension to multi-view could exploit geometric consistency.
  • No explicit BRDF/material recovery; joint estimation with normals could enhance scene understanding.

7. Summary and Context

LiNO-UniPS represents a unification of illumination-invariant feature learning, detail-sensitive geometry representation, and scaled synthetic-supervised training for universal photometric stereo. The approach leverages light-register tokenization, interleaved attention, and wavelet-aware processing to overcome prevailing ambiguities in surface normal estimation under arbitrary lighting, reliably outperforming prior models on a broad spectrum of synthetic and real benchmarks (Li et al., 23 Jun 2025). The architecture and training strategies serve as a reference point for future research in photometric geometric learning and domain-agnostic scene reconstruction.

References

  1. Li et al., "Light of Normals: Unified Feature Representation for Universal Photometric Stereo," 23 Jun 2025.
