LiNO-UniPS: Unified Photometric Stereo
- The paper introduces LiNO-UniPS, a deep learning framework that unifies lighting-invariant feature encoding with wavelet-based detail recovery for robust surface normal estimation.
- It employs novel light-register tokens and interleaved multi-image attention to decouple illumination effects from intricate geometric details in multi-illumination images.
- The approach integrates cascaded attention blocks with wavelet-aware upsampling, achieving state-of-the-art performance on standard universal photometric stereo benchmarks.
LiNO-UniPS denotes "Light of Normals: Unified Feature Representation for Universal Photometric Stereo," a deep learning framework targeting robust surface normal estimation from multi-illumination image sets without lighting, reflectance, or segmentation assumptions. It advances the universal photometric stereo (UPS) paradigm by unifying lighting-invariant feature encoding, detail-rich geometric recovery, and multi-domain learning in a single model architecture, validated on diverse benchmarks and synthetic datasets (Li et al., 23 Jun 2025).
1. Universal Photometric Stereo: Formulation and Challenges
Universal photometric stereo seeks to estimate the per-pixel normal map of an object given a fixed-view collection $\{I_k\}_{k=1}^{K}$ of images under unknown and potentially complex lighting, without presuming known light directions or simple reflectance. The underlying image-formation process is modeled as

$$I_k(\mathbf{x}) = \Phi\big(\mathbf{n}(\mathbf{x}),\, L_k,\, \rho(\mathbf{x})\big),$$

where $I_k(\mathbf{x})$ is the image pixel, $\mathbf{n}(\mathbf{x})$ is the surface normal, $L_k$ aggregates the lighting parameters for the $k$-th view (environment, point, area), and $\rho(\mathbf{x})$ describes the material/BRDF. Unlike classical calibrated or Lambertian PS, both $L_k$ and $\rho$ are unknown and may include non-Lambertian and spatially-varying characteristics.
The practical goal is to learn a feed-forward mapping robust to lighting and reflectance ambiguity. The main technical obstacles are: (1) disentangling illumination from surface-normal features where intensity variation is ambiguous; and (2) preserving high-frequency detail in surface geometry when image cues are intricate or degraded by shadow, interreflection, and non-ideal lighting (Li et al., 23 Jun 2025).
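To make the desired mapping concrete, the following is a minimal, hypothetical sketch of the input/output tensor contract; `estimate_normals` and the batching convention are illustrative assumptions, not the paper's API:

```python
import torch
import torch.nn.functional as F

def estimate_normals(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Universal PS as a feed-forward mapping (illustrative shapes only).

    images: (K, 3, H, W) -- K fixed-view observations under unknown lighting.
    returns: (3, H, W)   -- unit-length per-pixel normal map.
    """
    with torch.no_grad():
        normals = model(images.unsqueeze(0))[0]   # hypothetical batched call
    return F.normalize(normals, dim=0)            # enforce unit-length normals
```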
2. Lighting-Invariant, Detail-Rich Feature Encoding
LiNO-UniPS introduces a multi-component encoder achieving both lighting invariance and fine geometric fidelity through three primary innovations:
- Light-Register Tokens: Three learnable tokens, one each for environment (HDRI), point, and area lighting, are prepended to every per-image token stream, acting as attention gatherers for illumination cues across all frames. Self-attention aggregates global illumination context through these tokens (a minimal sketch appears after this list).
- Interleaved Multi-Image Attention: After ViT-style patch embedding, tokens are processed through four cascaded attention blocks (Frame, LightAxis, Global, LightAxis), alternating between intra-image, inter-image, and global contextualization. This enables information flow that disentangles lighting effects from geometry and fosters global spatial awareness.
- Wavelet-Aware Down/Up Sampling: To maintain high-frequency geometry, each input image is decomposed into a bilinearly downsampled low-pass stream and four discrete Haar-type wavelet sub-bands, which are separately tokenized and jointly encoded. A WaveUpSampler later performs an inverse DWT followed by summation and smoothing to reconstruct the full-detail encoder output (see the wavelet sketch below).
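A minimal sketch of the first two ideas, assuming a PyTorch-style implementation; the class name, dimensions, and exact block ordering are assumptions rather than the paper's released code:

```python
import torch
import torch.nn as nn

class LightRegisterEncoder(nn.Module):
    """Illustrative sketch: prepend three light-register tokens per image and
    alternate attention over the frame axis and the light (image) axis.
    Names and sizes are assumptions, not the paper's exact architecture."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.light_tokens = nn.Parameter(torch.randn(3, dim))  # env / point / area
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.light_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (K, N, D) -- N patch tokens for each of K differently lit images
        K, N, D = tokens.shape
        reg = self.light_tokens.unsqueeze(0).expand(K, -1, -1)   # (K, 3, D)
        x = torch.cat([reg, tokens], dim=1)                      # prepend registers

        # Frame attention: tokens within each image attend to each other.
        x = x + self.frame_attn(x, x, x)[0]

        # Light-axis attention: the same token position attends across the K images.
        y = x.transpose(0, 1)                                     # (3+N, K, D)
        y = y + self.light_attn(y, y, y)[0]
        return y.transpose(0, 1)                                  # (K, 3+N, D)
```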
The fused encoder representation is explicitly designed to be invariant to lighting changes and rich in normal-sensitive, high-spatial-frequency content.
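The wavelet-aware decomposition can be illustrated as follows; this is a sketch of a single-level Haar DWT plus a bilinear low-pass stream, with band naming and normalization conventions assumed rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def haar_decompose(img: torch.Tensor):
    """Single-level Haar DWT plus a bilinear low-pass stream (illustrative only).

    img: (C, H, W) with even H, W.
    returns: the bilinear low-pass stream and four Haar sub-bands,
    each of shape (C, H/2, W/2).
    """
    a = img[:, 0::2, 0::2]   # top-left pixels of each 2x2 block
    b = img[:, 0::2, 1::2]   # top-right
    c = img[:, 1::2, 0::2]   # bottom-left
    d = img[:, 1::2, 1::2]   # bottom-right

    ll = (a + b + c + d) / 2.0   # approximation band
    lh = (a - b + c - d) / 2.0   # detail band
    hl = (a + b - c - d) / 2.0   # detail band
    hh = (a - b - c + d) / 2.0   # diagonal detail band

    low = F.interpolate(img.unsqueeze(0), scale_factor=0.5,
                        mode="bilinear", align_corners=False)[0]
    return low, (ll, lh, hl, hh)
```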
3. Network Architecture and Processing Pipeline
LiNO-UniPS adopts an encoder–decoder paradigm, integrating specialized stages for illumination-context integration and gradient-preserving detail recovery:
- Input Preprocessing: The input images are decomposed into low-pass and high-frequency wavelet bands, with each stream receiving the three light-register tokens.
- Patch Embedding & Transformer Backbone: Each stream is patch-embedded, producing tokens processed by a DINOv2-initialized ViT backbone.
- Cascaded Contextual Attention: Four interleaved attention blocks reinforce lighting–normal decoupling and spatial detail aggregation.
- Multi-Scale Feature Fusion: A DPT-based fusion module forms a four-level spatial pyramid, fused via residual convolution blocks to ensure feature consistency and spatial resolution.
- Wavelet-Based Detail Synthesis: The WaveUpSampler upsamples and reconstructs spatial features, using an inverse DWT and smoothing to produce the final high-resolution feature map.
- Pixel-Sampling Decoder: A fixed number of pixel locations per scene is randomly sampled. For each pixel, the corresponding encoder features and a high-dimensional projection of the per-pixel observations form the input to pooling-by-multihead-attention and frame/light-axis self-attention, followed by a 2-layer MLP that predicts the normal. Full normal maps are assembled by interpolation (a decoder sketch follows the next paragraph).
This architecture allows selective attention to both pixel-level and scene-global information, yielding robust reconstruction in complex lighting and material regimes.
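A minimal sketch of such a pixel-sampling decoder, assuming PyTorch; the pooling query, layer sizes, and gathering logic are illustrative assumptions, and the frame/light-axis self-attention stage is omitted for brevity:

```python
import torch
import torch.nn as nn

class PixelSamplingDecoder(nn.Module):
    """Illustrative decoder sketch: gather per-pixel features across the K images,
    pool them with multi-head attention, and regress a unit normal with an MLP."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))     # learned pooling query
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats:  (K, D, H, W) encoder features for K illuminations
        # coords: (P, 2) integer (y, x) locations of P sampled pixels
        K, D, H, W = feats.shape
        per_pixel = feats[:, :, coords[:, 0], coords[:, 1]]    # (K, D, P)
        per_pixel = per_pixel.permute(2, 0, 1)                 # (P, K, D)

        q = self.query.expand(per_pixel.shape[0], -1, -1)      # (P, 1, D)
        pooled = self.pool(q, per_pixel, per_pixel)[0][:, 0]   # (P, D)

        normals = self.mlp(pooled)                              # (P, 3)
        return nn.functional.normalize(normals, dim=-1)         # unit-length normals
```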
4. Supervision: Loss Functions and Training
The training objective combines several loss components targeting both lighting disentanglement and accurate normal regression:
- Lighting Alignment Losses: For synthetic data with ground-truth lighting, each lighting parameter (HDRI, point, area) is mapped to a fixed-dimensional embedding via an MLP and encouraged to align with the corresponding learned light-register token through a cosine-similarity loss.
- Normal Regression and Gradient Loss: Predicted normals and their spatial gradients are compared to ground truth via a confidence-weighted quadratic loss and an explicit gradient loss, supervising both per-pixel orientation and local geometric detail.
- Total Loss: A weighted sum of the normal, gradient, and lighting-alignment terms, with the auxiliary weights set adaptively so that the auxiliary terms remain a small fraction of the main normal-regression loss (a hedged sketch follows this list).
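A minimal sketch of how such a combined objective might be assembled; the confidence weighting, finite-difference gradient, and loss weights are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_n, gt_n, conf, light_emb, light_tokens,
               w_grad: float = 1.0, w_light: float = 0.1):
    """Hedged sketch of a combined objective.

    pred_n, gt_n: (B, 3, H, W) predicted / ground-truth unit normal maps.
    conf:         (B, 1, H, W) per-pixel confidence weights (assumed form).
    light_emb:    (B, 3, D) MLP embeddings of GT lighting (HDRI, point, area).
    light_tokens: (B, 3, D) learned light-register tokens from the encoder.
    """
    # Confidence-weighted quadratic normal loss.
    l_normal = (conf * (pred_n - gt_n).pow(2)).mean()

    # Explicit gradient loss on finite differences of the normal map.
    def grad(x):
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    (pdx, pdy), (gdx, gdy) = grad(pred_n), grad(gt_n)
    l_grad = (pdx - gdx).abs().mean() + (pdy - gdy).abs().mean()

    # Cosine-similarity alignment between light tokens and GT lighting embeddings.
    l_light = (1.0 - F.cosine_similarity(light_tokens, light_emb, dim=-1)).mean()

    return l_normal + w_grad * l_grad + w_light * l_light
```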
Training uses the PS-Verse synthetic set (100K rendered scenes over five complexity/material/illumination levels), AdamW optimization, and progressive fine-tuning from simple to texture-rich scenes.
5. Benchmark Performance and Empirical Analysis
LiNO-UniPS achieves state-of-the-art normal estimation across standard universal PS benchmarks:
Ablation analyses on PS-Verse show that each model component (light tokens, global attention, loss terms, wavelets, normal-gradient loss) cumulatively reduces mean angular error (MAE), with steady improvements in encoder feature SSIM/CSIM.
Quantitative results:
- DiLiGenT: LiNO-UniPS attains a lower MAE than Uni MS-PS, SDM-UniPS, and UniPS, achieving state of the art on most objects.
- LUCES: lower MAE than Uni MS-PS.
- DiLiGenT10²: error matrices uniformly lower than prior art.
Feature–accuracy correlation: Encoder feature CSIM/SSIM tightly predicts per-pixel normal accuracy, confirming the efficacy of the lighting-invariant, detail-rich representation.
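As an illustration of the kind of feature-consistency measure involved, the following is a minimal sketch of a per-pixel cosine-similarity (CSIM) map between encoder features extracted under two different illuminations; the paper's exact metric definition is not reproduced here:

```python
import torch
import torch.nn.functional as F

def csim_map(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Per-pixel cosine similarity between two feature maps.

    feat_a, feat_b: (D, H, W) encoder features of the same scene under two
    different lightings; values near 1 indicate lighting-invariant features.
    """
    return F.cosine_similarity(feat_a, feat_b, dim=0)  # (H, W)
```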
High-resolution, real-world validation: 2K–4K, mask-free, complex-object scenes demonstrate superior fidelity to commercial 3D scanner outputs.
6. Strengths, Limitations, and Future Development
Strengths:
- Decoupling of lighting and geometric features using light tokens and global attention.
- Preservation of high-frequency detail by wavelet-based sampling and gradient-aware supervision.
- Strong cross-domain generalization (materials, lighting, sparse input counts; robust results even with very few input lights).
- CSIM/SSIM-based consistency of encoder features closely tracks output accuracy.
Limitations and directions for extension:
- Global attention stages pose high computational cost; sparser mechanisms may lower resource use.
- Occasional normal-flip ambiguity on near-planar regions in absence of explicit lighting cues.
- Current approach is single-view only; extension to multi-view could exploit geometric consistency.
- No explicit BRDF/material recovery; joint estimation with normals could enhance scene understanding.
7. Summary and Context
LiNO-UniPS represents a unification of illumination-invariant feature learning, detail-sensitive geometry representation, and scaled synthetic-supervised training for universal photometric stereo. The approach leverages light-register tokenization, interleaved attention, and wavelet-aware processing to overcome prevailing ambiguities in surface normal estimation under arbitrary lighting, reliably outperforming prior models on a broad spectrum of synthetic and real benchmarks (Li et al., 23 Jun 2025). The architecture and training strategies serve as a reference point for future research in photometric geometric learning and domain-agnostic scene reconstruction.