Shadow-informed Pose Feature (SiPF)

Updated 18 November 2025
  • Recent papers demonstrate that SiPFs robustly encode pose information from shadow cues, achieving state-of-the-art 3D reconstruction accuracy and rotation invariance.
  • SiPF is a feature family that fuses shadow geometry with learned embeddings to deliver task-adaptive performance across hand pose, 3D point clouds, and spacecraft imagery.
  • It utilizes techniques like differentiable rendering, global rotation regularization, and ViT-based saliency weighting to optimize pose estimation under challenging illumination conditions.

A Shadow-informed Pose Feature (SiPF) is a family of task-adaptive learned or constructed features that encode pose-discriminative information from shadow, silhouette, or illumination-induced signals for robust 2D-to-3D pose estimation, rotation-invariant 3D learning, or geometric matching. SiPF methodologies balance global pose awareness and nuisance invariance through the fusion of shadow geometry, learned embedding, and pose-consistent regressors or descriptors. Recent work operationalizes SiPFs across several domains: estimating hand pose and shape from binary shadow masks via neural regressors and differentiable rendering (Chang et al., 2022), achieving exact rotation invariance in 3D point cloud learning with global pose reference augmentation (Guo et al., 11 Nov 2025), real-time pose estimation under nonstationary shadowing in asteroid imagery (Zimmermann et al., 5 Aug 2025), and bimanual hand pose inference for inverse shadow art using vision transformers and saliency weighting (Xu et al., 11 May 2025).

1. SiPFs in Monocular Hand Pose and Shape Estimation

In Mask2Hand (Chang et al., 2022), SiPFs are defined as the intermediate 512-dimensional feature vector $F_\text{shadow}$ produced by a ResNet-18 encoder acting on a single-channel $256 \times 256$ hand silhouette mask $S_{gt}$. The architecture processes $S_{gt}$ through standard convolutional/residual blocks:

$$F_\text{shadow}^{(l)} = \sigma\left(\mathrm{Conv}^{(l)}\left(F_\text{shadow}^{(l-1)}\right)\right), \quad F_\text{shadow}^{(0)} = S_{gt}$$

$$F_\text{shadow} = \mathrm{GAP}\left(F_\text{shadow}^{(L)}\right) \in \mathbb{R}^{512},$$

where $L$ is typically 18, $\sigma$ is ReLU, and $\mathrm{GAP}$ denotes global average pooling. This core SiPF is subsequently input to four MLP regression heads that predict the low-dimensional MANO pose principal components $\theta$ (6 or 45D), shape coefficients (10D), global rotation (3D axis-angle), and translation (3D), with each head outputting a learned offset added to the corresponding mean parameter. The SiPF thus encapsulates all pose- and shape-relevant evidence extracted from the shadow.
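The pooling-and-heads stage can be sketched in NumPy. The 512-channel feature map, global average pooling, and the four MANO parameter groups follow the text above; the encoder output and head weights below are random placeholders rather than trained values, and the mean-offset terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def gap(feature_map):
    """Global average pooling: (C, H, W) -> (C,)."""
    return feature_map.mean(axis=(1, 2))

def mlp_head(x, w, b):
    """A single linear layer standing in for each MLP regression head."""
    return w @ x + b

# Hypothetical encoder output: 512 channels over an 8x8 spatial grid.
f_map = rng.standard_normal((512, 8, 8))
f_shadow = gap(f_map)  # the 512-D SiPF

# One illustrative linear head per MANO parameter group.
heads = {
    "theta": (rng.standard_normal((45, 512)) * 0.01, np.zeros(45)),  # pose PCs
    "beta":  (rng.standard_normal((10, 512)) * 0.01, np.zeros(10)),  # shape
    "rot":   (rng.standard_normal((3, 512)) * 0.01, np.zeros(3)),    # axis-angle
    "trans": (rng.standard_normal((3, 512)) * 0.01, np.zeros(3)),    # translation
}
params = {k: mlp_head(f_shadow, w, b) for k, (w, b) in heads.items()}
```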

Subsequent 3D hand mesh reconstruction is achieved through the MANO layer, and a differentiable renderer projects the estimated mesh back into silhouette space. Learning is supervised by a combination of silhouette L1 and cross-entropy losses, with optional Chamfer or joint/vertex losses, allowing for self-supervised training via proxy mesh labels when no 3D ground truth is available. The SiPF approach achieves competitive single-view 3D accuracy using binary masks alone.
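A toy version of the silhouette supervision can be written as a weighted sum of per-pixel L1 and binary cross-entropy between a rendered soft silhouette and the target mask; the weighting scheme below is an illustrative assumption, not the paper's exact loss.

```python
import numpy as np

def silhouette_loss(pred, target, eps=1e-7, w_l1=1.0, w_ce=1.0):
    """Weighted sum of per-pixel L1 and binary cross-entropy between a
    rendered soft silhouette `pred` in [0, 1] and a binary target mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    l1 = np.abs(pred - target).mean()
    ce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()
    return w_l1 * l1 + w_ce * ce

# Synthetic target: a square hand-shadow blob.
target = np.zeros((256, 256))
target[64:192, 64:192] = 1.0

good = np.clip(target + 0.01, 0.0, 1.0)  # near-perfect rendering
bad = 1.0 - target                       # inverted rendering
loss_good = silhouette_loss(good, target)
loss_bad = silhouette_loss(bad, target)
```

A well-aligned rendering yields a much smaller loss than a misaligned one, which is the gradient signal that drives the proxy-mesh self-supervision described above.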

2. SiPFs for Rotation-Invariant 3D Point Cloud Analysis

"Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms" (Guo et al., 11 Nov 2025) introduces SiPFs as composite descriptors preserving both local geometric invariance and global pose reference. For a point cloud 256×256256 \times 2567, a shared global rotation matrix 256×256256 \times 2568 (parameterized by quaternion 256×256256 \times 2569) is learned via a Bingham distribution prior:

SgtS_{gt}0
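The (unnormalized) Bingham log-density over unit quaternions can be evaluated directly; the $M$ and $Z$ below are illustrative values, and the paper's exact parameterization may differ.

```python
import numpy as np

def bingham_log_density_unnorm(q, M, Z):
    """Unnormalized Bingham log-density q^T M Z M^T q over unit quaternions.
    M: 4x4 orthogonal matrix; Z: diagonal concentration matrix with entries
    z_i <= 0 and one zero entry, whose paired column of M is the mode."""
    q = q / np.linalg.norm(q)
    return q @ M @ Z @ M.T @ q

M = np.eye(4)                              # mode at the last basis quaternion
Z = np.diag([-50.0, -50.0, -50.0, 0.0])    # sharp concentration around the mode
mode = M[:, 3]
other = np.array([1.0, 0.0, 0.0, 0.0])
```

The log-density is maximal (zero) at the mode and increasingly negative away from it, which is what lets a Bingham loss pull the learned global rotation toward a sharp, consistent orientation.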

For each local point pair $(p_i, p_j)$ with normals $(n_i, n_j)$, the SiPF augments the classical Point-Pair Feature,

$$\mathrm{PPF}(p_i, p_j) = \left(\|d\|_2,\ \angle(n_i, d),\ \angle(n_j, d),\ \angle(n_i, n_j)\right), \quad d = p_j - p_i,$$

with angles measured against the learned global reference direction, yielding an 8-dimensional descriptor. This design achieves true $SO(3)$ invariance while encoding a globally consistent reference direction (the "shadow"). SiPF descriptors feed into a Rotation-Invariant Attention Convolution (RIAttnConv) layer, where they modulate dynamic kernels in attention-based neighborhood feature aggregation.
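The classical Point-Pair Feature and its global-direction augmentation can be sketched as follows. The `sipf` function here is a hypothetical 7-D stand-in (the paper's 8-D construction differs in detail), but it demonstrates the key property: the descriptor is unchanged when the point pair, normals, and global direction are rotated together.

```python
import numpy as np

def angle(u, v):
    """Angle between two 3-D vectors, in radians."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

def ppf(p_i, n_i, p_j, n_j):
    """Classical 4-D Point-Pair Feature: distance plus three angles."""
    d = p_j - p_i
    return np.array([np.linalg.norm(d),
                     angle(n_i, d), angle(n_j, d), angle(n_i, n_j)])

def sipf(p_i, n_i, p_j, n_j, g):
    """PPF augmented with angles to a shared global direction g --
    a hypothetical stand-in for the paper's 8-D descriptor."""
    d = p_j - p_i
    extra = np.array([angle(n_i, g), angle(n_j, g), angle(d, g)])
    return np.concatenate([ppf(p_i, n_i, p_j, n_j), extra])

rng = np.random.default_rng(1)
p_i, p_j = rng.standard_normal(3), rng.standard_normal(3)
n_i = rng.standard_normal(3); n_i /= np.linalg.norm(n_i)
n_j = rng.standard_normal(3); n_j /= np.linalg.norm(n_j)
g = rng.standard_normal(3); g /= np.linalg.norm(g)

# Random rotation via QR decomposition, with determinant fixed to +1.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1

f0 = sipf(p_i, n_i, p_j, n_j, g)
f1 = sipf(Q @ p_i, Q @ n_i, Q @ p_j, Q @ n_j, Q @ g)  # co-rotated input
```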

The end-to-end network jointly optimizes the global rotation parameter $q$ with a Bingham loss regularizer, ensuring that the global shadow orientation serves the downstream classification/segmentation objective.

This approach outperforms prior rotation-invariant networks on both ModelNet40 and ShapeNetPart, achieving state-of-the-art results under arbitrary test-frame $SO(3)$ rotations.

3. SiPFs in Real-Time Spacecraft Pose from Shadowed Imagery

COFFEE (Zimmermann et al., 5 Aug 2025) frames SiPFs as robust keypoint features and descriptors grounded in shadow geometry for asteroid pose estimation under severe self-cast shadow conditions. Given the sun phase angle, the 3D sun direction is mapped through the known spacecraft attitude and camera intrinsics to a vanishing point on the image plane, toward which all cast shadows converge.

Each image is scanned by tracing lines from candidate pixels toward this vanishing point, with shadow/lit edge features extracted at strong negative/positive intensity gradients along this direction. The SiPF at each keypoint thus comprises location and shadow-width information, providing invariance to both asteroid rotation and shadow movement.
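The scanline idea can be illustrated on a synthetic frame: sample intensities along the ray from a pixel toward the sun vanishing point, mark the first strong negative gradient (lit-to-shadow edge) and the following positive gradient (shadow-to-lit edge), and take their separation as the shadow width. This is a geometric sketch with assumed step sizes and thresholds, not the COFFEE implementation.

```python
import numpy as np

def scan_shadow_edges(img, p, v, n_steps=100, thresh=0.3):
    """Sample `img` along the ray from pixel p=(x, y) toward the sun
    vanishing point v=(x, y); return (entry, exit, width) of the first
    shadow interval, detected as a strong negative then positive step."""
    p = np.asarray(p, float)
    v = np.asarray(v, float)
    direction = (v - p) / np.linalg.norm(v - p)
    ts = np.arange(n_steps)
    pts = p[None, :] + ts[:, None] * direction          # unit-pixel steps
    cols = np.clip(pts[:, 0].round().astype(int), 0, img.shape[1] - 1)
    rows = np.clip(pts[:, 1].round().astype(int), 0, img.shape[0] - 1)
    profile = img[rows, cols]
    grad = np.diff(profile)
    neg = np.flatnonzero(grad < -thresh)                # lit -> shadow edges
    if neg.size == 0:
        return None
    entry = neg[0]
    pos = np.flatnonzero(grad[entry + 1:] > thresh)     # shadow -> lit edges
    exit_ = entry + 1 + pos[0] if pos.size else n_steps - 1
    return entry, exit_, float(exit_ - entry)

# Synthetic frame: lit surface (1.0) with a vertical shadow band 20 px wide.
img = np.ones((128, 128))
img[:, 60:80] = 0.0
res = scan_shadow_edges(img, p=(10, 64), v=(1000, 64))  # scan toward +x
```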

The resulting sparse set of features is embedded with a Submanifold Sparse CNN (17 layers, 256D descriptors), then matched between consecutive frames with an attention-based GNN operating on bipartite graphs. SiPF correspondences are robust to illumination effects, with sub-pixel median reprojection error (about 0.3 px) and unbiased pose estimation (bias 0.00 rad, spread of about 0.02 rad using 100 features), surpassing both classical (SIFT, ORB, AKAZE) and deep-learned (SuperPoint, ContextDesc, DISK, R2D2) baselines in the accuracy–speed trade-off.

4. SiPFs in Bimanual Hand Inverse Shadow Synthesis

Hand-Shadow Poser (Xu et al., 11 May 2025) adopts SiPFs as DINOv2 Vision Transformer feature embeddings and saliency maps computed from 2D input shadow masks. The shadow feature map is derived as follows:

  • The binary shadow mask is resized and tiled to 3 channels, then processed through DINOv2 ViT-B/14.
  • For saliency, the $\ell_2$ norm of each patch token $t_i$ is computed,

$$s_i = \|t_i\|_2,$$

producing, after normalization, a $[0,1]$ map guiding importance weighting.
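Token-norm saliency reduces to a few lines over the patch-token matrix; the min–max normalization below is an assumption, and the tokens here are random stand-ins rather than actual DINOv2 outputs.

```python
import numpy as np

def token_saliency(tokens):
    """Per-patch saliency from patch-token l2 norms, scaled to [0, 1].
    tokens: (N_patches, D) array of ViT patch embeddings."""
    norms = np.linalg.norm(tokens, axis=1)
    lo, hi = norms.min(), norms.max()
    return (norms - lo) / (hi - lo + 1e-8)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 768))  # e.g. a 16x16 patch grid, ViT-B width
tokens[42] *= 10.0                        # one strongly activated patch
sal = token_saliency(tokens)
```

Reshaping `sal` back to the patch grid gives the spatial importance map used for weighting.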

These ViT-based SiPFs are used twice in the pipeline:

  • Stage 2: Scoring 3D bimanual pose hypotheses via a blend of LPIPS perceptual similarity and DINOv2 global token cosine similarity between rendered masks and the target.
  • Stage 3: Weighting the mask similarity loss in shadow-feature-aware refinement, focusing optimization on distinctive regions.
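The saliency-weighted refinement objective can be sketched as a weighted per-pixel L1 between the rendered and target masks; this is an illustration of the weighting idea, not the paper's exact formula.

```python
import numpy as np

def weighted_mask_loss(rendered, target, saliency):
    """Saliency-weighted per-pixel L1 between a rendered and a target mask,
    with weights normalized to sum to one."""
    w = saliency / (saliency.sum() + 1e-8)
    return float((w * np.abs(rendered - target)).sum())

target = np.zeros((64, 64))
target[20:40, 20:40] = 1.0               # the shadow shape to match

sal = np.full((64, 64), 0.1)
sal[20:40, 20:40] = 1.0                  # the shadow region is most salient

pred_miss = np.zeros((64, 64))           # misses the salient region entirely
pred_bg = target.copy()
pred_bg[:5, :] = 1.0                     # errs only in low-saliency background

loss_miss = weighted_mask_loss(pred_miss, target, sal)
loss_bg = weighted_mask_loss(pred_bg, target, sal)
```

Errors inside the salient shadow region are penalized far more heavily than equal-sized errors in the background, which is what steers the optimizer toward distinctive regions.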

This pipeline achieves successful inverse bimanual shadow matching on 85% of a 210-example benchmark.

5. Losses, Self-Supervision, and Optimization in SiPF Frameworks

SiPF-based architectures frequently incorporate differentiable renderers and custom loss compositions for supervision, enabling both direct and proxy/self-supervised training modes. In the case of Mask2Hand:

  • Fully supervised regimes optimize joint, vertex, and mesh silhouette losses when 3D ground truth is available.
  • Self-supervised regimes leverage silhouette consistency and mesh regularization, iteratively refining pose and shape estimates to align rendered and observed silhouettes over training epochs.

In point cloud SiPF models (Guo et al., 11 Nov 2025), Bingham-distributed global pose representations are optimized jointly with task and statistical losses to ensure sharp, functionally meaningful global shadow orientation. In COFFEE, the entire detection–description–matching–pose pipeline trains on correspondences reliably anchored in physically interpretable features, with shadow geometry computed in closed form given sun-sensor input.

6. Quantitative Benchmarks and Comparative Performance

Recent SiPF architectures have established new Pareto-optimal frontiers in their respective domains. The table below summarizes core quantitative results across selected works:

| Domain | SiPF Implementation | Key Metric & Value | Baseline (Best Non-SiPF) |
|---|---|---|---|
| Hand Pose (Mask2Hand) | 512D ResNet embedding | MPJPE: 3.56 cm (unaligned), PA-MPJPE: 0.68 cm | CMR (MPJPE: 4.31 cm), I2L (0.74 cm) (Chang et al., 2022) |
| Rot.-Inv. Point Cloud | 8D SiPF + RIAttnConv | ModelNet40 SO(3): 91.8%; ShapeNetPart mIoU: 85.1% | PaRot: 90.8%, PaRI-Conv: 84.6% (Guo et al., 11 Nov 2025) |
| Asteroid Pose (COFFEE) | Scanline keypoints + Sparse CNN | PR-AUC: 0.85, Opt. F1: 0.87, 0.00 rad bias | SuperPoint: PR-AUC 0.68, F1 0.71 (Zimmermann et al., 5 Aug 2025) |
| Shadow Art (Hand-Shadow Poser) | DINOv2 ViT saliency | 85% successful pose inversion | Not specified (Xu et al., 11 May 2025) |

These results show that SiPFs transfer shadow/occlusion cues into global- or task-invariant pose evidence, enabling state-of-the-art accuracy in settings where traditional appearance, texture, or depth signals are unreliable or unavailable.

The unifying principle of SiPFs is the systematic conversion of scene shadow, silhouette, or global illumination context into pose-informative representations, often designed to neutralize nuisance variability (rotation, lighting, occlusion) and to supply invariance or reference for discriminative learning. This framework is highly adaptable: its instantiations span direct end-to-end embedding, scanline geometric measurement, and transformer-based semantic feature extraction. A plausible implication is the extensibility of SiPF strategies to broader problems—articulated pose under active or variable lighting, robust geometric matching, and multi-agent 3D perception—where global context and local evidence must be co-represented for accurate inference and self-supervised calibration.
