
Photometric Fusion Stereo Neural Networks

Updated 25 December 2025
  • Photometric Fusion Stereo Neural Networks (PFSNNs) are advanced deep learning architectures that merge photometric, spatial, and event modalities to accurately recover per-pixel surface normals under varied illumination.
  • They employ dual-branch designs, multi-scale attention fusion, and innovative regression techniques to capture both fine textures and global structural cues.
  • Evaluated on synthetic and real datasets, these networks achieve state-of-the-art performance in challenging scenarios such as sparse-light, non-Lambertian, and ambient-lit environments.

Photometric Fusion Stereo Neural Networks (PFSNNs) are advanced deep learning architectures designed to recover per-pixel surface normals of objects observed under varying illumination. These networks integrate multi-image photometric observations, spatial image features, and, depending on design, auxiliary modalities such as events or multi-view cues. State-of-the-art PFSNNs combine transformer-inspired attention mechanisms, multi-scale fusion modules, modality coupling, and novel output representations. These systems are evaluated on both synthetic and real datasets, achieving superior accuracy in sparse-light, non-Lambertian, and ambient-lit scenarios.

1. Architectural Foundations and Feature Fusion

PFSNNs employ a variety of architectural designs to leverage photometric and spatial signals.

  • Dual-Branch and Attention Designs: PS-Transformer applies two parallel branches (pixel-wise features and image-wise spatial features) fused via learnable self-attention. In the pixel-wise branch, at location $i$, features $x^1_{j,i} = [I_{j,i}, \ell_j] \in \mathbb{R}^{c+3}$ are aggregated by stacked multi-head self-attention encoders; the image-wise branch encodes $x^2_{j,i} = [\phi(I_j, M)_i, \ell_j] \in \mathbb{R}^{67}$, where $\phi$ is a shallow CNN over the image and mask, and again applies cross-image transformer attention. Features $f^1_i$ and $f^2_i$ are concatenated for normal regression via a shallow CNN (Ikehata, 2022). A minimal sketch of this dual-branch fusion appears after this list.
  • Multi-Scale Attention Fusion: RMAFF-PSN uses separate shallow (texture-focused) and deep (contour-focused) feature pathways, each transformed by residual multi-scale attention feature fusion (MAFF) modules. MAFF implements parallel asymmetric convolutions, followed by channel and spatial attention, $g_c$ and $g_s$, then merges the branches via double-branch enhancement (DBE) and order-agnostic aggregation (max-pooling over images) (Luo et al., 2024). The result is a fused representation that retains high-frequency texture and low-frequency structural cues, well suited to regions of high reflectance or geometric complexity.
  • Spatio-Photometric Context via 4D Convolutions: Another approach leverages separable 4D convolutions over local spatial patches ($5 \times 5$) and per-pixel photometric grids ($48 \times 48$) (Honzátko et al., 2021). This method directly fuses photometric and spatial signals, enabling robust handling of inter-reflections and cast shadows without explicit physics-based modeling.
  • Modality Fusion with Event Cameras: EFPS-Net introduces cross-modal fusion by interpolating sparse, high-dynamic-range event observation maps into the RGB-derived observation space. Channel-wise gated fusion via $1 \times 1$ convolutions and sigmoidal activations ensures complementary contributions from the event and RGB modalities, particularly in ambient-light environments (Ryoo et al., 2023).

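The dual-branch idea can be illustrated with a minimal PyTorch sketch. This is not the PS-Transformer implementation: the observation dimensionality, layer widths, the max-pooling over images, and the pre-computed image-wise feature input are illustrative assumptions standing in for the paper's specific choices.

```python
# Minimal sketch (not the authors' code): dual-branch fusion in the spirit of
# PS-Transformer. Per-pixel observations [I_{j,i}, l_j] from m images are
# aggregated with multi-head self-attention; an image-wise spatial feature is
# concatenated before normal regression. Widths and depths are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelWiseBranch(nn.Module):
    """Self-attention over the m per-pixel observations (order-agnostic)."""
    def __init__(self, obs_dim=6, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)           # [I_{j,i}, l_j] -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128,
                                           activation="gelu",
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, obs):                                # obs: (N, m, obs_dim)
        tokens = self.encoder(self.embed(obs))             # (N, m, d_model)
        return tokens.max(dim=1).values                    # pool over the m images

class DualBranchNormalNet(nn.Module):
    def __init__(self, obs_dim=6, img_feat_dim=64, d_model=64):
        super().__init__()
        self.pixel_branch = PixelWiseBranch(obs_dim, d_model)
        self.head = nn.Sequential(
            nn.Linear(d_model + img_feat_dim, 128), nn.GELU(),
            nn.Linear(128, 3))                             # unnormalized normal

    def forward(self, obs, img_feat):
        # obs:      (N, m, obs_dim)  per-pixel intensities + light directions
        # img_feat: (N, img_feat_dim) spatial feature from an image-wise CNN
        f1 = self.pixel_branch(obs)
        n = self.head(torch.cat([f1, img_feat], dim=-1))
        return F.normalize(n, dim=-1)                      # unit surface normal

# Usage: 4096 pixels observed under m = 10 lights (RGB intensity + 3D light direction).
net = DualBranchNormalNet()
normals = net(torch.randn(4096, 10, 6), torch.randn(4096, 64))
print(normals.shape)  # torch.Size([4096, 3])
```
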
2. Mathematical Mechanisms: Attention, Fusion, and Regression

Feature aggregation and fusion in PFSNNs rely on explicit mathematical constructs.

  • Self-Attention Encoding: For PS-Transformer, per-pixel features across $m$ images are encoded as $F_i^{(0)} \in \mathbb{R}^{m \times d}$ and projected to queries ($Q$), keys ($K$), and values ($V$) for multi-head attention, $A(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$, followed by residual connections and feed-forward layers with GeLU activations (Ikehata, 2022).
  • Multi-Scale Residual Fusion: In RMAFF-PSN, MAFF modules fuse asymmetric-branch features as $F_{\text{fused}}(x,y) = \sum_{s \in \{\text{shallow},\,\text{deep}\}} \alpha_s F^{(s)}(x,y)$, with the weights $\alpha_s$ learned globally (Luo et al., 2024). The channel and spatial attention functions $g_c$ and $g_s$ apply weighted sigmoidal activations to average-pooled and max-pooled statistics.
  • Gaussian Heat-map Regression: Separable 4D convolutional methods regress surface-normal directions as 2D Gaussian heat-maps in the photometric grid: the ground-truth normal is projected to $(u_0, v_0)$ and the target map is $M^n_{u,v} = \frac{1}{2\pi\sigma} \exp\left(-\frac{(u-u_0)^2+(v-v_0)^2}{2\sigma^2}\right)$ (Honzátko et al., 2021); see the sketch after this list.
  • Event Map Formation: In EFPS-Net, polarity-separated voxel grids $V \in \mathbb{R}^{H \times W \times B \times 2}$ are temporally binned, scaled, and merged to yield sparse event maps. These are interpolated via deep ResBlocks, outputting $\tilde{O}_e$, a dense event observation map for fusion (Ryoo et al., 2023).
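
A short sketch of the heat-map target construction follows. The mapping of the normal's $(n_x, n_y)$ components onto the grid is an illustrative orthographic projection; the exact parameterization used by Honzátko et al. may differ.

```python
# Sketch of the Gaussian heat-map regression target M^n_{u,v} used in place of
# direct normal-vector regression. The projection of the ground-truth normal to
# grid coordinates (u0, v0) below is an assumption (simple orthographic mapping).
import numpy as np

def normal_to_heatmap(n, grid=48, sigma=1.5):
    """Project unit normal n onto a grid x grid map and place a 2D Gaussian."""
    n = n / np.linalg.norm(n)
    # Map (n_x, n_y) in [-1, 1] to grid coordinates (illustrative assumption).
    u0 = (n[0] + 1.0) * 0.5 * (grid - 1)
    v0 = (n[1] + 1.0) * 0.5 * (grid - 1)
    u, v = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    M = np.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2.0 * sigma ** 2))
    return M / (2.0 * np.pi * sigma)   # normalization as written in the target formula

# At inference, the predicted heat-map's peak (or a soft-argmax) is mapped back to a normal.
target = normal_to_heatmap(np.array([0.2, -0.3, 0.93]))
print(target.shape, target.argmax())
```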

3. Training Protocols, Datasets, and Evaluation Strategies

State-of-the-art PFSNNs are trained and evaluated on large-scale synthetic and real datasets.

  • Synthetic Data: The CyclesPS+ dataset extends the Disney principled BSDF/Blender Cycles renders of the original CyclesPS set from 15 to 25 objects, applying spatially varying BRDFs (SVBRDFs) and realistic global illumination (area occlusions, indirect light, shadows) (Ikehata, 2022).
  • Training and Benchmark Data: RMAFF-PSN trains on the Blobby and Sculpture synthetic datasets (over 5M images) and is validated on public benchmarks including DiLiGenT, Apple & Gourd, and a new Simple PS dataset for real-world, sparse-light evaluation (Luo et al., 2024).
  • Cross-Modal Data: EFPS-Net constructs RGB–event paired datasets under ambient illumination, with ground-truth normals obtained via 3D-printed models and synthetic rendering. The DiLiGenT RGB–event set (10 objects) evaluates mean angular error (MAE) (Ryoo et al., 2023).
  • Implementation and Augmentation: Rotational invariance is encouraged via K-fold rotational augmentation of the light directions (e.g., $K=10$ rotations per sample on DiLiGenT subsets), as sketched below (Honzátko et al., 2021, Ryoo et al., 2023).
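
The augmentation idea can be sketched as follows: the image plane and the in-plane components of the light directions are rotated by the same angle about the viewing axis, giving $K$ evenly spaced copies of each sample. The rotation convention, interpolation, and test-time averaging are assumptions here, not the cited papers' exact recipe.

```python
# Sketch of K-fold rotational augmentation about the viewing (z) axis.
# Ground-truth normal maps must be rotated consistently as well, both
# spatially and as vectors; that step is omitted for brevity.
import numpy as np
from scipy.ndimage import rotate as rotate_image

def rotate_sample(images, lights, angle_deg):
    """images: (m, H, W, C); lights: (m, 3) unit directions in camera coordinates."""
    a = np.deg2rad(angle_deg)
    Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
    images_r = np.stack([rotate_image(im, angle_deg, axes=(0, 1), reshape=False)
                         for im in images])
    return images_r, lights @ Rz.T

def k_fold_augment(images, lights, K=10):
    """Return K rotated copies of an (images, lights) sample."""
    return [rotate_sample(images, lights, 360.0 * k / K) for k in range(K)]

# Example: m = 10 images of size 32x32 with 3 channels.
copies = k_fold_augment(np.random.rand(10, 32, 32, 3), np.random.randn(10, 3), K=10)
print(len(copies), copies[0][0].shape)   # 10 (10, 32, 32, 3)
```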

4. Quantitative Results and Benchmarks

PFSNNs achieve state-of-the-art results on multiple benchmarks. Representative metrics:

| Method | DiLiGenT Avg MAE (°) | DiLiGenT-MV Avg MAE (°) | Event-RGB DiLiGenT Avg MAE (°) |
| --- | --- | --- | --- |
| PS-Transformer | 7.9 @ $m=10$ (Ikehata, 2022) | 19.0 @ $m=10$ (Ikehata, 2022) | N/A |
| RMAFF-PSN | 6.89 @ 96 lights (Luo et al., 2024) | N/A | N/A |
| Heat-map 4D Conv | 6.37 @ $K_{\text{test}}=12$ (Honzátko et al., 2021) | N/A | N/A |
| EFPS-Net | N/A | N/A | 17.71 @ $K=10$ (Ryoo et al., 2023) |

PS-Transformer produces cleaner edge maps and lower angular errors than CNN-PS, PS-FCN+, and GPS-Net in the sparse regime ($m \leq 10$). RMAFF-PSN improves MAE especially on highly non-convex and shadowed regions. EFPS-Net reduces error in ambient lighting by over 1.5° compared to baseline RGB-only deep methods. Separable 4D convolutional networks achieve competitive accuracy with an order of magnitude fewer MACs and parameters.

5. Design Insights and Best Practices

Several architectural and implementation insights are established:

  • Dual-scale (shallow/deep) feature fusion is crucial for preserving textural and structural cues in complex regions (Luo et al., 2024).
  • Residual structures and attention modules stabilize gradients and focus capacity on critical channels and spatial regions.
  • Max-pooling across the illumination dimension provides order-agnostic, efficient feature aggregation without complex fusion weights (see the one-line sketch after this list) (Luo et al., 2024).
  • Gaussian heat-map regression mitigates instability and improves convergence over direct vector regression (Honzátko et al., 2021).
  • Event camera fusion enables robust performance under realistic illumination, overcoming dynamic range limitations in conventional RGB-only designs (Ryoo et al., 2023).
  • Training protocols favor heavy data augmentation, isotropy enforcement, and lightweight architectures for high-throughput inference.
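
A one-line illustration of why max-pooling over the illumination dimension is order-agnostic: the aggregated feature is unchanged under any permutation of the input lights. Shapes are illustrative.

```python
# Order-agnostic aggregation over the illumination dimension: regardless of the
# number or order of input lights, max-pooling yields a fixed-size feature.
import torch

feats = torch.randn(8, 10, 128, 32, 32)   # (batch, m lights, channels, H, W)
fused, _ = feats.max(dim=1)               # (batch, channels, H, W)
assert torch.equal(fused, feats[:, torch.randperm(10)].max(dim=1)[0])  # permutation-invariant
```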

6. Modalities and Extensions: Multi-View and Event Coupling

  • NeRF-based Fusion: Multi-view photometric stereo networks inject per-pixel normal fields from photometric stereo subnetworks into NeRF-style MLPs. Conditioning the rendered color as $c_i = f_\theta(\gamma(x_i), \gamma(n_i^{ps}), \gamma(d))$ enables sharp, globally consistent mesh recovery without multi-stage pipeline complexity (Kaya et al., 2021); a minimal sketch of this rendering head follows this list.
  • Event Camera Extension: EFPS-Net utilizes asynchronous event data for dynamic scenes and ambient-light recovery (Ryoo et al., 2023).
  • This suggests future PFSNNs may further incorporate temporal consistency, multi-modal signal coupling, and geometry-aware rendering head adaptations.
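
A minimal sketch of the normal-conditioned rendering head $c_i = f_\theta(\gamma(x_i), \gamma(n_i^{ps}), \gamma(d))$ follows. The encoding frequencies, layer widths, and sigmoid output are assumptions; the density branch and volume rendering of the full pipeline are omitted.

```python
# Sketch of a NeRF-style color head conditioned on the photometric-stereo normal.
# gamma is a sinusoidal positional encoding; f_theta is a small MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gamma(x, n_freqs=6):
    """Sinusoidal positional encoding applied elementwise to x of shape (..., D)."""
    freqs = 2.0 ** torch.arange(n_freqs, dtype=x.dtype, device=x.device)
    angles = x[..., None] * freqs                       # (..., D, n_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                    # (..., D * 2 * n_freqs)

class RenderHead(nn.Module):
    def __init__(self, n_freqs=6, width=128):
        super().__init__()
        d = 3 * 2 * n_freqs                             # encoded size of one 3-vector
        self.mlp = nn.Sequential(
            nn.Linear(3 * d, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 3), nn.Sigmoid())          # RGB in [0, 1]

    def forward(self, x, n_ps, d):
        # x: 3D point, n_ps: photometric-stereo normal, d: viewing direction
        z = torch.cat([gamma(x), gamma(n_ps), gamma(d)], dim=-1)
        return self.mlp(z)

head = RenderHead()
rgb = head(torch.rand(1024, 3),
           F.normalize(torch.randn(1024, 3), dim=-1),
           F.normalize(torch.randn(1024, 3), dim=-1))
print(rgb.shape)  # torch.Size([1024, 3])
```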

7. Limitations and Controversies

  • PS-Transformer’s advantage in the dense regime is limited unless the network is retrained with substantially larger $m$ (Ikehata, 2022).
  • Increasing MAFF depth (the number of asymmetric branches) yields diminishing returns beyond four and must be balanced against compute cost and channel bottlenecks (Luo et al., 2024).
  • No explicit physics-based modeling of non-Lambertian and global illumination effects is employed, relying instead on dataset realism and capacity to learn robust mappings.
  • In multi-stage fusion frameworks, more elaborate coupling may improve some objects but comes at higher complexity and diminished scalability.

References

  • "PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism" (Ikehata, 2022).
  • "RMAFF-PSN: A Residual Multi-Scale Attention Feature Fusion Photometric Stereo Network" (Luo et al., 2024).
  • "Neural Radiance Fields Approach to Deep Multi-View Photometric Stereo" (Kaya et al., 2021).
  • "Leveraging Spatial and Photometric Context for Calibrated Non-Lambertian Photometric Stereo" (Honzátko et al., 2021).
  • "Event Fusion Photometric Stereo Network" (Ryoo et al., 2023).
