MS-PS: Multi-Scale Network for Photometric Stereo

Updated 1 July 2025

MS-PS is a deep learning framework for photometric stereo that employs a multi-scale network architecture and is trained on a large, diverse synthetic dataset.
The network architecture handles a variable number of input images and arbitrary image sizes, allowing robust surface normal estimation for objects with complex materials such as metals and glass.
MS-PS achieved state-of-the-art accuracy on benchmarks like DiLiGenT, demonstrating improved generalization and practical applicability in fields such as industrial inspection and AR/VR.

Uni MS-PS refers to "MS-PS: A Multi-Scale Network for Photometric Stereo," a deep learning framework for photometric stereo that employs a multi-scale architecture and is trained on a comprehensive new dataset. The method addresses the challenge of inferring accurate 3D surface normal fields from images under varying illumination, with a focus on flexibility, scalability, and robustness to complex materials and high-resolution imaging scenarios.

1. Photometric Stereo and the Multi-Scale Problem

Photometric stereo (PS) is a computer vision technique that reconstructs the surface normals of a 3D object from multiple 2D images acquired from a fixed viewpoint under different lighting directions. Classic approaches either require strict calibration or are limited by their inability to handle non-Lambertian reflectance, high-frequency geometric details, or varying numbers and sizes of input images.
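For context, the classic calibrated baseline can be sketched in a few lines. This is a minimal illustration of Lambertian least-squares photometric stereo, not part of MS-PS; it assumes known unit lighting directions, no shadows or specular highlights, and the function name `lambertian_photometric_stereo` is hypothetical.

```python
import numpy as np

def lambertian_photometric_stereo(images, light_dirs):
    """Classic calibrated photometric stereo under a Lambertian assumption.

    images:     (K, H, W) array of intensities, one image per light.
    light_dirs: (K, 3) array of known unit lighting directions.
    Returns unit normals of shape (H, W, 3) and albedo of shape (H, W).
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)                          # (K, H*W)
    # Lambertian model: I = L @ G, where each column of G is albedo * normal.
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)  # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-12)            # avoid division by zero
    return normals.T.reshape(H, W, 3), albedo.reshape(H, W)
```

The sketch breaks down exactly where the text says classic methods do: specular or translucent materials violate the linear model, and the lighting directions must be calibrated in advance.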

The multi-scale problem in PS refers to capturing fine surface details (such as microgeometry or sharp edges) while also reasoning about global object structure. Real-world surfaces exhibit spatial variation at both coarse and fine scales, and materials such as metal or glass introduce anisotropy and reflectance non-uniformity, further complicating the normal estimation.

2. Multi-Scale Network Architecture

The MS-PS method introduces a hierarchical, multi-scale convolutional neural network for photometric stereo. Its architecture accepts a variable number of input images and arbitrary image sizes without sacrificing performance.

Stage-wise Operation: The network processes the input at several spatial resolutions, from coarse to fine. At each stage:
- Input images are downsampled to the current resolution.
- Surface normals are predicted at the given scale.
- The predicted normals are upsampled and concatenated as an auxiliary input for finer-scale processing.
Weight Sharing: All refinement stages beyond the first share the same network parameters, ensuring both efficiency and architectural consistency regardless of the number of scales used.
Pooling for Variable Inputs: The design includes a pooling operation that aggregates features across a variable number of input images, supporting use cases in which different numbers of differently lit images are available.
Arbitrary Image Sizes: The network is fully convolutional (except for the input pooling), so it can accept images of any spatial resolution, enabling inference on images up to and beyond 1000×1000 pixels.
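The coarse-to-fine loop, the shared refinement weights, and the pooling over input images can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_stage` is a hypothetical stand-in for the learned convolutional network (a single linear map per pixel), max pooling is assumed because the text does not name the pooling type, and nearest-neighbour resizing and the flat initial normal guess are chosen only for brevity.

```python
import numpy as np

def resize_nearest(x, size):
    """Nearest-neighbour resize of an (H, W, C) array to (size[0], size[1], C)."""
    h, w = x.shape[:2]
    rows = (np.arange(size[0]) * h) // size[0]
    cols = (np.arange(size[1]) * w) // size[1]
    return x[rows][:, cols]

def normalize(v, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)

def toy_stage(images, prev_normals, weights):
    """Hypothetical stand-in for one learned stage.

    images: (K, h, w); prev_normals: (h, w, 3); weights: (4, 3).
    Per-image features are max-pooled over the K images, so the result
    does not depend on the number or order of the inputs.
    """
    K, h, w = images.shape
    prev = np.broadcast_to(prev_normals, (K, h, w, 3))
    feats = np.concatenate([images[..., None], prev], axis=-1) @ weights  # (K, h, w, 3)
    pooled = feats.max(axis=0)                                            # (h, w, 3)
    return normalize(pooled)

def coarse_to_fine(images, scales, first_w, shared_w):
    """Predict normals from coarse to fine; stages after the first share shared_w."""
    normals = None
    for (h, w) in scales:
        # Downsample all K input images to the current resolution.
        imgs_s = resize_nearest(images.transpose(1, 2, 0), (h, w)).transpose(2, 0, 1)
        if normals is None:
            prev = np.zeros((h, w, 3))
            prev[..., 2] = 1.0            # flat initial guess (assumption)
            weights = first_w
        else:
            # Upsample the previous prediction and feed it in as auxiliary input.
            prev = normalize(resize_nearest(normals, (h, w)))
            weights = shared_w
        normals = toy_stage(imgs_s, prev, weights)
    return normals
```

Because the same `shared_w` is reused at every refinement stage, the number of scales can be changed at inference time without adding parameters, which is the property the weight-sharing design is meant to provide.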

The training objective minimizes the cosine (angular) error between the estimated normal vectors and the ground truth: $l_{normal} = 1 - \sum_{ij} N_{ij}^\top \hat{N}_{ij}$, where $N_{ij}$ denotes the ground truth normal and $\hat{N}_{ij}$ the predicted normal at pixel $(i, j)$.
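A minimal NumPy sketch of such a cosine loss is given below. It makes one assumption that differs from the formula as written: the dot products are averaged over pixels rather than summed, so the value stays in the range [0, 2] regardless of image size. The function name `cosine_normal_loss` is hypothetical.

```python
import numpy as np

def cosine_normal_loss(n_true, n_pred, eps=1e-12):
    """One minus the mean cosine similarity between two normal maps.

    n_true, n_pred: arrays of shape (H, W, 3). Both are re-normalized
    so that the dot product equals the cosine of the angular error.
    """
    n_true = n_true / np.maximum(np.linalg.norm(n_true, axis=-1, keepdims=True), eps)
    n_pred = n_pred / np.maximum(np.linalg.norm(n_pred, axis=-1, keepdims=True), eps)
    cos = np.sum(n_true * n_pred, axis=-1)   # (H, W) per-pixel cosine
    return 1.0 - float(cos.mean())           # mean over pixels (assumption)
```

The loss is 0 when every predicted normal matches the ground truth and grows as the angle between them increases.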

3. Dataset for Training and Benchmarking

To enable generalization and robust performance, MS-PS uses a synthetic dataset that is substantially more comprehensive than previously available datasets.

Geometric Diversity: 3,000 smoothed “blobby” objects and 76 detailed, realistic 3D meshes are included, ensuring both abstract and real-world geometric features.
Material Diversity: Over 1,100 scanned real materials (ambientCG) and physically-based procedural materials (parametrized with Disney's BSDF model) are used, introducing a wide range of specularities, anisotropies, spatially varying reflectances, and transparency effects.
Lighting Variations: Each object is rendered under 100 randomly sampled lighting directions, with intensity and direction distributions matched to real-world scenarios.
Scale: In total, the dataset comprises 60,000 unique objects, each rendered under 100 lighting conditions.
Accessibility: The dataset and codebase are made available to support benchmarking and reproducibility in the field.
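To make the lighting variation concrete, the sketch below draws 100 random lighting directions for one object. The text does not specify the sampling distribution, so uniform sampling over a spherical cap facing the camera is an assumption made purely for illustration; the function name and the `max_zenith_deg` parameter are hypothetical.

```python
import numpy as np

def sample_light_directions(n_lights=100, max_zenith_deg=90.0, seed=None):
    """Sample unit lighting directions uniformly on a spherical cap around +z.

    max_zenith_deg is the largest allowed angle from the +z axis, taken here
    to point toward the camera (assumption). Returns an array of shape (n_lights, 3).
    """
    rng = np.random.default_rng(seed)
    # Uniform area sampling on a cap: cos(theta) is uniform in [cos(max), 1].
    cos_min = np.cos(np.radians(max_zenith_deg))
    cos_t = rng.uniform(cos_min, 1.0, size=n_lights)
    sin_t = np.sqrt(np.clip(1.0 - cos_t**2, 0.0, 1.0))
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n_lights)
    return np.stack([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t], axis=1)
```

A renderer would then be called once per sampled direction to produce the image stack for the object.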

4. Experimental Performance and Comparative Analysis

MS-PS is evaluated on the DiLiGenT and DiLiGenT10² benchmarks, which comprise real objects with diverse material properties and lighting conditions.

Accuracy: The multi-scale architecture, trained on the new dataset, achieves mean angular errors of 5.84° on DiLiGenT and 11.33° on DiLiGenT10², outperforming previous state-of-the-art methods such as PX-NET and OB-CNN.
Robustness to Material Properties: The framework is particularly effective at handling objects with anisotropic or translucent materials (e.g., metals, glass, acrylic), where earlier methods exhibited significant degradation.
Resolution and Flexibility: The model retains accuracy at resolutions not seen during training and with varying input set sizes, demonstrating robust generalization.
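The mean angular error reported above can be computed as in the following sketch. The optional object mask and the function name `mean_angular_error_deg` are assumptions for illustration; the exact evaluation protocol of each benchmark should be taken from its own documentation.

```python
import numpy as np

def mean_angular_error_deg(n_true, n_pred, mask=None, eps=1e-12):
    """Mean angle in degrees between ground-truth and predicted normals.

    n_true, n_pred: arrays of shape (H, W, 3).
    mask: optional boolean (H, W) array selecting object pixels.
    """
    n_true = n_true / np.maximum(np.linalg.norm(n_true, axis=-1, keepdims=True), eps)
    n_pred = n_pred / np.maximum(np.linalg.norm(n_pred, axis=-1, keepdims=True), eps)
    # Clip to guard arccos against rounding slightly outside [-1, 1].
    cos = np.clip(np.sum(n_true * n_pred, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if mask is not None:
        err = err[mask]
    return float(err.mean())
```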

| Aspect | Key Feature | Result/Impact |
| --- | --- | --- |
| Architecture | Multi-scale, weight sharing, variable input/image sizes | State-of-the-art flexibility |
| Loss Function | Angular error (cosine similarity) | Accurate normal estimation |
| Training Dataset | 60K objects, 1100+ materials, variable geometry | Broad generalization |
| Benchmarks | DiLiGenT, DiLiGenT10² | Top mean angular accuracy |
| Applications | Industry, cultural heritage, robotics, AR/VR | Broad real-world applicability |

5. Practical Applications and Implications

The advancements introduced by MS-PS have implications across several domains:

Industrial Inspection: Surface anomaly detection on reflective or metallic products where surface detail and complex lighting confound classical methods.
Cultural Heritage Preservation: High-fidelity 3D digitization of artifacts, including those with weathered, heterogeneous material appearance.
Robotics & Manipulation: Reliable perception of object surfaces under unknown lighting, supporting grasp planning and manipulation in unstructured settings.
Augmented/Virtual Reality: Generation of photorealistic 3D assets that maintain geometric and material fidelity across simulated and real lighting scenarios.
Medical Imaging: Extraction of fine anatomical surface details from images under varied acquisition protocols.

The scalable and flexible design of MS-PS, particularly its ability to process varying numbers of images at varying sizes, paves the way for integration into practical systems where input conditions are neither constrained nor calibrated.

6. Limitations and Outlook

Despite its strengths, MS-PS still struggles with extremely complex lighting and material scenarios: extreme anisotropy or translucency can induce errors in edge regions or in reconstructed highlights. It achieves leading results on public benchmarks, but subsequent approaches that disentangle lighting from geometry, for example through attention-based disentanglement and light tokens, have begun to outpace it, particularly in universal photometric stereo settings where true invariance to illumination is critical.

Ongoing research is extending these innovations by combining multi-scale refinement with explicit lighting-geometry decoupling and by developing more physically plausible synthetic datasets. The principles introduced in MS-PS, notably its architecture design and dataset construction, continue to shape the direction of photometric stereo research and its broader adoption.

7. Summary and Significance

Uni MS-PS established a new empirical benchmark for deep photometric stereo by synthesizing a scalable multi-scale architecture and a comprehensive synthetic dataset, yielding robust and accurate surface normal estimation on diverse and challenging materials. Its methodology underpins many subsequent developments in PS, offering a foundation for the handling of high-resolution, real-world scenarios, and catalyzing further innovations in disentanglement and universal inference.
