Shadow-informed Pose Feature (SiPF)

Updated 18 November 2025
  • Recent papers demonstrate that SiPFs robustly encode pose information from shadow cues, achieving state-of-the-art 3D reconstruction accuracy and rotation invariance.
  • SiPF is a feature family that fuses shadow geometry with learned embeddings to deliver task-adaptive performance across hand pose, 3D point clouds, and spacecraft imagery.
  • It utilizes techniques like differentiable rendering, global rotation regularization, and ViT-based saliency weighting to optimize pose estimation under challenging illumination conditions.

A Shadow-informed Pose Feature (SiPF) is a family of task-adaptive learned or constructed features that encode pose-discriminative information from shadow, silhouette, or illumination-induced signals for robust 2D-to-3D pose estimation, rotation-invariant 3D learning, or geometric matching. SiPF methodologies balance global pose awareness and nuisance invariance through the fusion of shadow geometry, learned embedding, and pose-consistent regressors or descriptors. Recent work operationalizes SiPFs across several domains: estimating hand pose and shape from binary shadow masks via neural regressors and differentiable rendering (Chang et al., 2022), achieving exact rotation invariance in 3D point cloud learning with global pose reference augmentation (Guo et al., 11 Nov 2025), real-time pose estimation under nonstationary shadowing in asteroid imagery (Zimmermann et al., 5 Aug 2025), and bimanual hand pose inference for inverse shadow art using vision transformers and saliency weighting (Xu et al., 11 May 2025).

1. SiPFs in Monocular Hand Pose and Shape Estimation

In Mask2Hand (Chang et al., 2022), SiPFs are defined as the intermediate 512-dimensional feature vector $F_\text{shadow}$ produced by a ResNet-18 encoder acting on a single-channel $256 \times 256$ hand silhouette mask $S_{gt}$. The architecture processes $S_{gt}$ through standard convolutional/residual blocks:

$$F_\text{shadow}^{(l)} = \sigma\left(\mathrm{Conv}^{(l)}\left(F_\text{shadow}^{(l-1)}\right)\right), \quad F_\text{shadow}^{(0)} = S_{gt}$$

$$F_\text{shadow} = \mathrm{GAP}\left(F_\text{shadow}^{(L)}\right) \in \mathbb{R}^{512},$$

where $L$ is typically 18, $\sigma$ is ReLU, and $\mathrm{GAP}$ denotes global average pooling. This core SiPF is subsequently input to four MLP regression heads that predict the low-dimensional MANO pose principal components $\theta$ (6 or 45D), shape coefficients (10D), global rotation (3D axis-angle), and translation (3D), with each head outputting a learned offset added to the corresponding mean parameter. The SiPF thus encapsulates all pose- and shape-relevant evidence extracted from the shadow.
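The pooling-and-heads stage can be sketched in NumPy. The 512-channel feature map, global average pooling, and the four MANO parameter groups follow the text above; the encoder output and head weights below are random placeholders rather than trained values, and the mean-offset terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def gap(feature_map):
    """Global average pooling: (C, H, W) -> (C,)."""
    return feature_map.mean(axis=(1, 2))

def mlp_head(x, w, b):
    """A single linear layer standing in for each MLP regression head."""
    return w @ x + b

# Hypothetical encoder output: 512 channels over an 8x8 spatial grid.
f_map = rng.standard_normal((512, 8, 8))
f_shadow = gap(f_map)  # the 512-D SiPF

# One illustrative linear head per MANO parameter group.
heads = {
    "theta": (rng.standard_normal((45, 512)) * 0.01, np.zeros(45)),  # pose PCs
    "beta":  (rng.standard_normal((10, 512)) * 0.01, np.zeros(10)),  # shape
    "rot":   (rng.standard_normal((3, 512)) * 0.01, np.zeros(3)),    # axis-angle
    "trans": (rng.standard_normal((3, 512)) * 0.01, np.zeros(3)),    # translation
}
params = {k: mlp_head(f_shadow, w, b) for k, (w, b) in heads.items()}
```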

Subsequent 3D hand mesh reconstruction is achieved through the MANO layer, and a differentiable renderer projects the estimated mesh back into silhouette space. Learning is supervised by a combination of silhouette L1 and cross-entropy losses, with optional Chamfer or joint/vertex losses, allowing for self-supervised training via proxy mesh labels when no 3D ground truth is available. The SiPF approach achieves competitive single-view 3D accuracy using binary masks alone.
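A toy version of the silhouette supervision can be written as a weighted sum of per-pixel L1 and binary cross-entropy between a rendered soft silhouette and the target mask; the weighting scheme below is an illustrative assumption, not the paper's exact loss.

```python
import numpy as np

def silhouette_loss(pred, target, eps=1e-7, w_l1=1.0, w_ce=1.0):
    """Weighted sum of per-pixel L1 and binary cross-entropy between a
    rendered soft silhouette `pred` in [0, 1] and a binary target mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    l1 = np.abs(pred - target).mean()
    ce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()
    return w_l1 * l1 + w_ce * ce

# Synthetic target: a square hand-shadow blob.
target = np.zeros((256, 256))
target[64:192, 64:192] = 1.0

good = np.clip(target + 0.01, 0.0, 1.0)  # near-perfect rendering
bad = 1.0 - target                       # inverted rendering
loss_good = silhouette_loss(good, target)
loss_bad = silhouette_loss(bad, target)
```

A well-aligned rendering yields a much smaller loss than a misaligned one, which is the gradient signal that drives the proxy-mesh self-supervision described above.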

2. SiPFs for Rotation-Invariant 3D Point Cloud Analysis

"Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms" (Guo et al., 11 Nov 2025) introduces SiPFs as composite descriptors preserving both local geometric invariance and global pose reference. For a point cloud 256×256256 \times 2567, a shared global rotation matrix 256×256256 \times 2568 (parameterized by quaternion 256×256256 \times 2569) is learned via a Bingham distribution prior:

SgtS_{gt}0
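The (unnormalized) Bingham log-density over unit quaternions can be evaluated directly; the $M$ and $Z$ below are illustrative values, and the paper's exact parameterization may differ.

```python
import numpy as np

def bingham_log_density_unnorm(q, M, Z):
    """Unnormalized Bingham log-density q^T M Z M^T q over unit quaternions.
    M: 4x4 orthogonal matrix; Z: diagonal concentration matrix with entries
    z_i <= 0 and one zero entry, whose paired column of M is the mode."""
    q = q / np.linalg.norm(q)
    return q @ M @ Z @ M.T @ q

M = np.eye(4)                              # mode at the last basis quaternion
Z = np.diag([-50.0, -50.0, -50.0, 0.0])    # sharp concentration around the mode
mode = M[:, 3]
other = np.array([1.0, 0.0, 0.0, 0.0])
```

The log-density is maximal (zero) at the mode and increasingly negative away from it, which is what lets a Bingham loss pull the learned global rotation toward a sharp, consistent orientation.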

For each local point pair $(p_i, p_j)$ with normals $(n_i, n_j)$, the SiPF augments the classical Point-Pair Feature,

$$\mathrm{PPF}(p_i, p_j) = \left(\|d\|_2,\ \angle(n_i, d),\ \angle(n_j, d),\ \angle(n_i, n_j)\right), \quad d = p_j - p_i,$$

with angles measured against the learned global reference direction, yielding an 8-dimensional descriptor. This design achieves true $SO(3)$ invariance while encoding a globally consistent reference direction (the "shadow"). SiPF descriptors feed into a Rotation-Invariant Attention Convolution (RIAttnConv) layer, where they modulate dynamic kernels in attention-based neighborhood feature aggregation.
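The classical Point-Pair Feature and its global-direction augmentation can be sketched as follows. The `sipf` function here is a hypothetical 7-D stand-in (the paper's 8-D construction differs in detail), but it demonstrates the key property: the descriptor is unchanged when the point pair, normals, and global direction are rotated together.

```python
import numpy as np

def angle(u, v):
    """Angle between two 3-D vectors, in radians."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

def ppf(p_i, n_i, p_j, n_j):
    """Classical 4-D Point-Pair Feature: distance plus three angles."""
    d = p_j - p_i
    return np.array([np.linalg.norm(d),
                     angle(n_i, d), angle(n_j, d), angle(n_i, n_j)])

def sipf(p_i, n_i, p_j, n_j, g):
    """PPF augmented with angles to a shared global direction g --
    a hypothetical stand-in for the paper's 8-D descriptor."""
    d = p_j - p_i
    extra = np.array([angle(n_i, g), angle(n_j, g), angle(d, g)])
    return np.concatenate([ppf(p_i, n_i, p_j, n_j), extra])

rng = np.random.default_rng(1)
p_i, p_j = rng.standard_normal(3), rng.standard_normal(3)
n_i = rng.standard_normal(3); n_i /= np.linalg.norm(n_i)
n_j = rng.standard_normal(3); n_j /= np.linalg.norm(n_j)
g = rng.standard_normal(3); g /= np.linalg.norm(g)

# Random rotation via QR decomposition, with determinant fixed to +1.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1

f0 = sipf(p_i, n_i, p_j, n_j, g)
f1 = sipf(Q @ p_i, Q @ n_i, Q @ p_j, Q @ n_j, Q @ g)  # co-rotated input
```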

The end-to-end network jointly optimizes the global rotation parameter $q$ with a Bingham loss regularizer, ensuring that the global shadow orientation serves the downstream classification/segmentation objective.

This approach outperforms prior rotation-invariant networks on both ModelNet40 and ShapeNetPart, achieving state-of-the-art results under arbitrary test-frame $SO(3)$ rotations.

3. SiPFs in Real-Time Spacecraft Pose from Shadowed Imagery

COFFEE (Zimmermann et al., 5 Aug 2025) frames SiPFs as robust keypoint features and descriptors grounded in shadow geometry for asteroid pose estimation under severe self-cast shadow conditions. Given the sun phase angle, the 3D sun direction is mapped through the known spacecraft attitude and camera intrinsics to a vanishing point on the image plane, toward which all cast shadows converge.

Each image is scanned by tracing lines from candidate pixels toward this vanishing point, with shadow/lit edge features extracted at strong negative/positive intensity gradients along this direction. The SiPF at each keypoint thus comprises location and shadow-width information, providing invariance to both asteroid rotation and shadow movement.
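The scanline idea can be illustrated on a synthetic frame: sample intensities along the ray from a pixel toward the sun vanishing point, mark the first strong negative gradient (lit-to-shadow edge) and the following positive gradient (shadow-to-lit edge), and take their separation as the shadow width. This is a geometric sketch with assumed step sizes and thresholds, not the COFFEE implementation.

```python
import numpy as np

def scan_shadow_edges(img, p, v, n_steps=100, thresh=0.3):
    """Sample `img` along the ray from pixel p=(x, y) toward the sun
    vanishing point v=(x, y); return (entry, exit, width) of the first
    shadow interval, detected as a strong negative then positive step."""
    p = np.asarray(p, float)
    v = np.asarray(v, float)
    direction = (v - p) / np.linalg.norm(v - p)
    ts = np.arange(n_steps)
    pts = p[None, :] + ts[:, None] * direction          # unit-pixel steps
    cols = np.clip(pts[:, 0].round().astype(int), 0, img.shape[1] - 1)
    rows = np.clip(pts[:, 1].round().astype(int), 0, img.shape[0] - 1)
    profile = img[rows, cols]
    grad = np.diff(profile)
    neg = np.flatnonzero(grad < -thresh)                # lit -> shadow edges
    if neg.size == 0:
        return None
    entry = neg[0]
    pos = np.flatnonzero(grad[entry + 1:] > thresh)     # shadow -> lit edges
    exit_ = entry + 1 + pos[0] if pos.size else n_steps - 1
    return entry, exit_, float(exit_ - entry)

# Synthetic frame: lit surface (1.0) with a vertical shadow band 20 px wide.
img = np.ones((128, 128))
img[:, 60:80] = 0.0
res = scan_shadow_edges(img, p=(10, 64), v=(1000, 64))  # scan toward +x
```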

The resulting sparse set of features is embedded with a Submanifold Sparse CNN (17 layers, 256D descriptors), then matched between consecutive frames with an attention-based GNN operating on bipartite graphs. SiPF correspondences are robust to illumination effects, with sub-pixel median reprojection error (about 0.3 px) and unbiased pose estimation (bias 0.00 rad, spread of about 0.02 rad using 100 features), surpassing both classical (SIFT, ORB, AKAZE) and deep-learned (SuperPoint, ContextDesc, DISK, R2D2) baselines in the accuracy–speed trade-off.

4. SiPFs in Bimanual Hand Inverse Shadow Synthesis

Hand-Shadow Poser (Xu et al., 11 May 2025) adopts SiPFs as DINOv2 Vision Transformer feature embeddings and saliency maps computed from 2D input shadow masks. The shadow feature map is derived as follows:

  • The binary shadow mask is resized and tiled to 3 channels, then processed through DINOv2 ViT-B/14.
  • For saliency, the $\ell_2$ norm of each patch token $t_i$ is computed,

$$s_i = \|t_i\|_2,$$

producing, after normalization, a $[0,1]$ map guiding importance weighting.
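Token-norm saliency reduces to a few lines over the patch-token matrix; the min–max normalization below is an assumption, and the tokens here are random stand-ins rather than actual DINOv2 outputs.

```python
import numpy as np

def token_saliency(tokens):
    """Per-patch saliency from patch-token l2 norms, scaled to [0, 1].
    tokens: (N_patches, D) array of ViT patch embeddings."""
    norms = np.linalg.norm(tokens, axis=1)
    lo, hi = norms.min(), norms.max()
    return (norms - lo) / (hi - lo + 1e-8)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 768))  # e.g. a 16x16 patch grid, ViT-B width
tokens[42] *= 10.0                        # one strongly activated patch
sal = token_saliency(tokens)
```

Reshaping `sal` back to the patch grid gives the spatial importance map used for weighting.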

These ViT-based SiPFs are used twice in the pipeline:

  • Stage 2: Scoring 3D bimanual pose hypotheses via a blend of LPIPS perceptual similarity and DINOv2 global token cosine similarity between rendered masks and the target.
  • Stage 3: Weighting the mask similarity loss in shadow-feature-aware refinement, focusing optimization on distinctive regions.
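The saliency-weighted refinement objective can be sketched as a weighted per-pixel L1 between the rendered and target masks; this is an illustration of the weighting idea, not the paper's exact formula.

```python
import numpy as np

def weighted_mask_loss(rendered, target, saliency):
    """Saliency-weighted per-pixel L1 between a rendered and a target mask,
    with weights normalized to sum to one."""
    w = saliency / (saliency.sum() + 1e-8)
    return float((w * np.abs(rendered - target)).sum())

target = np.zeros((64, 64))
target[20:40, 20:40] = 1.0               # the shadow shape to match

sal = np.full((64, 64), 0.1)
sal[20:40, 20:40] = 1.0                  # the shadow region is most salient

pred_miss = np.zeros((64, 64))           # misses the salient region entirely
pred_bg = target.copy()
pred_bg[:5, :] = 1.0                     # errs only in low-saliency background

loss_miss = weighted_mask_loss(pred_miss, target, sal)
loss_bg = weighted_mask_loss(pred_bg, target, sal)
```

Errors inside the salient shadow region are penalized far more heavily than equal-sized errors in the background, which is what steers the optimizer toward distinctive regions.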

This pipeline achieves successful inverse bimanual shadow matching on 85% of a 210-example benchmark.

5. Losses, Self-Supervision, and Optimization in SiPF Frameworks

SiPF-based architectures frequently incorporate differentiable renderers and custom loss compositions for supervision, enabling both direct and proxy/self-supervised training modes. In the case of Mask2Hand:

  • Fully supervised regimes optimize joint, vertex, and mesh silhouette losses when 3D ground truth is available.
  • Self-supervised regimes leverage silhouette consistency and mesh regularization, iteratively refining pose and shape estimates to align rendered and observed silhouettes over training epochs.

In point cloud SiPF models (Guo et al., 11 Nov 2025), Bingham-distributed global pose representations are optimized jointly with task and statistical losses to ensure sharp, functionally meaningful global shadow orientation. In COFFEE, the entire detection–description–matching–pose pipeline trains on correspondences reliably anchored in physically interpretable features, with shadow geometry computed in closed form given sun-sensor input.

6. Quantitative Benchmarks and Comparative Performance

Recent SiPF architectures have established new Pareto-optimal frontiers in their respective domains. The table below summarizes core quantitative results across selected works:

| Domain | SiPF Implementation | Key Metric & Value | Baseline (Best Non-SiPF) |
|---|---|---|---|
| Hand Pose (Mask2Hand) | 512D ResNet embedding | MPJPE: 3.56 cm (unaligned), PA-MPJPE: 0.68 cm | CMR (MPJPE: 4.31 cm), I2L (0.74 cm) (Chang et al., 2022) |
| Rot.-Inv. Point Cloud | 8D SiPF + RIAttnConv | ModelNet40 SO(3): 91.8%; ShapeNetPart mIoU: 85.1% | PaRot: 90.8%, PaRI-Conv: 84.6% (Guo et al., 11 Nov 2025) |
| Asteroid Pose (COFFEE) | Scanline keypoints + Sparse CNN | PR-AUC: 0.85, Opt. F1: 0.87, 0.00 rad bias | SuperPoint: PR-AUC 0.68, F1 0.71 (Zimmermann et al., 5 Aug 2025) |
| Shadow Art (Hand-Shadow Poser) | DINOv2 ViT saliency | 85% successful pose inversion | Not specified (Xu et al., 11 May 2025) |

These results show that SiPFs transfer shadow/occlusion cues into global- or task-invariant pose evidence, enabling state-of-the-art accuracy in settings where traditional appearance, texture, or depth signals are unreliable or unavailable.

The unifying principle of SiPFs is the systematic conversion of scene shadow, silhouette, or global illumination context into pose-informative representations, often designed to neutralize nuisance variability (rotation, lighting, occlusion) and to supply invariance or reference for discriminative learning. This framework is highly adaptable: its instantiations span direct end-to-end embedding, scanline geometric measurement, and transformer-based semantic feature extraction. A plausible implication is the extensibility of SiPF strategies to broader problems—articulated pose under active or variable lighting, robust geometric matching, and multi-agent 3D perception—where global context and local evidence must be co-represented for accurate inference and self-supervised calibration.
