SkelSplat: Geometric 3D Pose Estimation
- SkelSplat is a framework for multi-view 3D pose estimation that optimizes joint positions using differentiable Gaussian rendering with geometric supervision.
- It models each joint as an anisotropic Gaussian and minimizes a reprojection loss between rendered heatmaps and 2D detector outputs for enhanced robustness.
- SkelSplat generalizes to novel camera setups and occlusion conditions without relying on 3D ground truth, outperforming prior learning-based methods on benchmarks.
SkelSplat is a framework for multi‐view 3D human pose estimation that utilizes differentiable Gaussian rendering on human skeletal joints. Departing from traditional learned fusion networks reliant on annotated 3D ground‐truth data, SkelSplat models the skeleton as a set of anisotropic Gaussian distributions, optimizing joint locations by projecting them into image space and matching rendered heatmaps to detector outputs. This geometric approach supports robust performance under occlusions and generalizes to novel camera arrangements and pose distributions without retraining, outperforming previous non‐3D‐GT methods on benchmark datasets.
1. Methodological Motivation and Departure from Prior Art
Multi‐view pose estimation conventionally operates in two stages: 2D joint detection across cameras, then 3D fusion—using either classical triangulation or learned fusion networks. State‐of‐the‐art methods such as Epipolar Transformers, TransFusion, AdaFuse, and Geometry‐biased Transformers employ end‐to‐end training on large datasets (e.g., Human3.6M, CMU Panoptic), binding their fusion strategies to specific camera setups and occlusion patterns present in the training data. When deployed in altered environments, these methods frequently exhibit dramatic performance degradation, necessitating re‐training or fine‐tuning.
SkelSplat’s novelty lies in modeling each skeletal joint as an independent 3D anisotropic Gaussian whose mean and covariance are optimized using only geometric supervision—minimizing a reprojection loss between differentiable renderings of the skeleton and 2D joint heatmaps. This technique avoids any reliance on 3D ground‐truth labels or image-based photometric losses, thereby enabling out‐of‐domain generalization without retraining. A plausible implication is that its operational independence from dataset-specific biases renders SkelSplat robust to camera network variability and occlusions.
2. Skeleton Representation and One‐Hot Encoding
SkelSplat represents a human skeleton with $J$ joints, indexed by $j \in \{1, \dots, J\}$. Each joint $j$ is modeled as an anisotropic Gaussian with mean $\mu_j \in \mathbb{R}^3$ and covariance $\Sigma_j$, parameterized as

$$\Sigma_j = R_j S_j S_j^\top R_j^\top,$$

where $R_j$ is a rotation and $S_j = \mathrm{diag}(s_j)$ is built from a positive scale vector $s_j$. Initialization of all $\mu_j$ may be sourced from a rough 3D pose guess, such as algebraic triangulation or fused monocular estimates, and each $\Sigma_j$ is set to an isotropic starting covariance.
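The rotation–scale parameterization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; a z-axis rotation stands in for a general SO(3) parameterization, and the log-scale trick is one common way to keep the scales positive:

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis (stand-in for a general SO(3) parameterization)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def joint_covariance(R, log_scale):
    """Sigma = R S S^T R^T with S = diag(exp(log_scale)), so scales stay positive."""
    S = np.diag(np.exp(np.asarray(log_scale)))
    return R @ S @ S.T @ R.T
```

By construction the result is symmetric positive semi-definite, so it is always a valid covariance regardless of the unconstrained parameters.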
The one-hot joint-ID encoding replaces the RGB color of classical Gaussian splatting for scene reconstruction with a $J$-dimensional indicator vector $c_j$ for each joint $j$:

$$c_j[k] = \begin{cases} 1, & k = j \\ 0, & k \neq j \end{cases}$$

Thus, when splatted, each Gaussian contributes solely to its own channel, disentangling overlapping joints in image space.
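A toy sketch of this channel disentanglement (array shapes, joint count, and the splatting helper are all illustrative, not the paper's renderer): each Gaussian writes only into its own channel, so two joints that overlap in image space never mix.

```python
import numpy as np

def splat(heatmaps, j, center, cov):
    """Accumulate a 2D Gaussian for joint j into channel j only.

    heatmaps: (J, H, W) array, one channel per joint
    center:   (x, y) pixel coordinates of the projected joint
    cov:      (2, 2) projected covariance
    """
    ys, xs = np.mgrid[0:heatmaps.shape[1], 0:heatmaps.shape[2]]
    d = np.stack([xs - center[0], ys - center[1]], axis=-1)   # (H, W, 2)
    prec = np.linalg.inv(cov)
    maha = np.einsum('hwi,ij,hwj->hw', d, prec, d)            # Mahalanobis distance
    heatmaps[j] += np.exp(-0.5 * maha)                        # one-hot "color": channel j

hm = np.zeros((17, 64, 64))                                   # toy sizes
splat(hm, 3, center=(20.0, 30.0), cov=np.diag([4.0, 9.0]))
splat(hm, 5, center=(21.0, 30.0), cov=np.diag([4.0, 4.0]))
# Nearly coincident in image space, but cleanly separated across channels.
```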
3. Differentiable Rendering and Optimization Pipeline
SkelSplat executes the following pipeline for each view $v$:
- Camera extrinsics $W_v$ and the Jacobian $J_{v,j}$ of the projection, evaluated at joint $j$, yield the projected 2D covariance: $\Sigma'_{v,j} = J_{v,j} W_v \Sigma_j W_v^\top J_{v,j}^\top$.
- The 2D projected joint center is obtained by perspective projection of the mean: $\mu'_{v,j} = \pi_v(\mu_j)$.
- Splatting Gaussian $j$ yields a heatmap for channel $j$:
$$H_{v,j}(x) = \alpha_j \exp\!\left(-\tfrac{1}{2}\,(x - \mu'_{v,j})^\top (\Sigma'_{v,j})^{-1} (x - \mu'_{v,j})\right),$$
where $\alpha_j$ is a learned opacity, often set to 1.
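The projection step can be sketched as follows, under a minimal pinhole model with a single focal length `f` (the function name and simplified intrinsics are illustrative assumptions, not the paper's API):

```python
import numpy as np

def project_gaussian(mu_world, Sigma_world, R_cam, t_cam, f):
    """Project a 3D Gaussian (mu, Sigma) into one camera as a 2D Gaussian.

    R_cam, t_cam: world-to-camera extrinsics; f: focal length in pixels.
    Returns the 2D center and the projected covariance J W Sigma W^T J^T.
    """
    mu_cam = R_cam @ mu_world + t_cam              # world -> camera frame
    x, y, z = mu_cam
    mu_2d = np.array([f * x / z, f * y / z])       # perspective projection
    # Jacobian of the projection, linearized at the joint center (EWA splatting)
    Jac = np.array([[f / z, 0.0, -f * x / z**2],
                    [0.0, f / z, -f * y / z**2]])
    Sigma_2d = Jac @ R_cam @ Sigma_world @ R_cam.T @ Jac.T
    return mu_2d, Sigma_2d
```

The linearization at the joint center is what keeps the rendered heatmap a true 2D Gaussian, and hence the whole pipeline differentiable in $\mu_j$ and $\Sigma_j$.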
For each view $v$ and joint $j$, a pseudo-ground-truth heatmap $\hat{H}_{v,j}$ is constructed by centering a Gaussian at the 2D detector output $\hat{x}_{v,j}$ with the rendered covariance $\Sigma'_{v,j}$. A masked heatmap distance, summed over views and joints, defines the reprojection loss:

$$\mathcal{L}_{\text{2D}} = \sum_{v} \sum_{j} m_{v,j}\, \big\lVert H_{v,j} - \hat{H}_{v,j} \big\rVert,$$

where $m_{v,j}$ masks out joints with unreliable detections in view $v$.
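A sketch of the masked heatmap loss, assuming a per-pixel L1 distance (the specific distance is a choice made here for illustration, not quoted from the paper):

```python
import numpy as np

def reprojection_loss(rendered, pseudo_gt, mask):
    """Masked per-pixel L1 distance between rendered and pseudo-GT heatmaps.

    rendered, pseudo_gt: (V, J, H, W) heatmap stacks over views and joints
    mask:                (V, J), 1.0 where the detector output is trusted
    """
    per_joint = np.abs(rendered - pseudo_gt).sum(axis=(2, 3))  # (V, J)
    return float((mask * per_joint).sum())
```

Because both heatmaps share the rendered covariance, the loss is driven by the offset between the rendered center and the detector peak rather than by shape mismatch.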
A 3D symmetry regularizer enforces approximate equality of symmetric limb lengths. For symmetric limb pairs $(a, b)$ (e.g., left and right elbow–wrist), with $\ell_a$ and $\ell_b$ the corresponding limb lengths computed from the Gaussian means:

$$\mathcal{L}_{\text{sym}} = \sum_{(a, b)} \left( \ell_a - \ell_b \right)^2$$

The total loss becomes:

$$\mathcal{L} = \mathcal{L}_{\text{2D}} + \lambda\, \mathcal{L}_{\text{sym}},$$

with a weighting coefficient $\lambda$.
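The symmetry term and the combined objective can be sketched as follows (the limb-pair indexing convention and the weight value are illustrative assumptions; the caller supplies joint indices for its own skeleton layout):

```python
import numpy as np

def symmetry_loss(mu, limb_pairs):
    """Penalize differences between symmetric limb lengths.

    mu:         (J, 3) Gaussian means (joint positions)
    limb_pairs: list of ((a0, a1), (b0, b1)) joint-index pairs for symmetric
                limbs, e.g. left elbow-wrist vs. right elbow-wrist
    """
    loss = 0.0
    for (a0, a1), (b0, b1) in limb_pairs:
        len_a = np.linalg.norm(mu[a0] - mu[a1])
        len_b = np.linalg.norm(mu[b0] - mu[b1])
        loss += (len_a - len_b) ** 2
    return loss

def total_loss(l_2d, mu, limb_pairs, lam=0.1):   # lam is an assumed weight
    """Reprojection loss plus weighted symmetry regularizer."""
    return l_2d + lam * symmetry_loss(mu, limb_pairs)
```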
Optimization employs Adam, with gradients accumulated over all views before each update, limited to 125 iterations with early stopping.
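The update scheme can be sketched as a toy loop over a single joint: gradients from every view are summed before each Adam step, with the 125-iteration cap and early stopping from the text. A point-reprojection surrogate stands in for the heatmap loss, and all names and hyperparameters other than the iteration budget are assumptions:

```python
import numpy as np

def proj_and_jac(mu_cam, f):
    """Pinhole projection of a camera-frame point, plus its 2x3 Jacobian."""
    x, y, z = mu_cam
    proj = np.array([f * x / z, f * y / z])
    Jac = np.array([[f / z, 0.0, -f * x / z**2],
                    [0.0, f / z, -f * y / z**2]])
    return proj, Jac

def optimize_joint(mu0, views, targets, iters=125, lr=1e-2, patience=10):
    """Adam on one 3D joint; gradients accumulated across all views per step."""
    mu = mu0.astype(float).copy()
    m, v = np.zeros(3), np.zeros(3)
    b1, b2, eps = 0.9, 0.999, 1e-8
    best, stall = np.inf, 0
    for step in range(1, iters + 1):
        loss, grad = 0.0, np.zeros(3)
        for (R, tvec, f), tgt in zip(views, targets):      # cross-view accumulation
            cam = R @ mu + tvec
            proj, Jac = proj_and_jac(cam, f)
            r = proj - tgt
            loss += float(r @ r)
            grad += R.T @ (2.0 * Jac.T @ r)                # chain rule back to mu
        m = b1 * m + (1 - b1) * grad                       # Adam moment updates
        v = b2 * v + (1 - b2) * grad**2
        mu -= lr * (m / (1 - b1**step)) / (np.sqrt(v / (1 - b2**step)) + eps)
        if loss < best - 1e-9:
            best, stall = loss, 0
        else:
            stall += 1
            if stall >= patience:                          # early stopping
                break
    return mu, best
```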
4. Training Protocol and Absence of 3D Ground‐Truth
Training is agnostic to 3D ground‐truth. Initialization consists of any plausible 3D joint guess (algebraic triangulation or multi-view fusion). The pipeline only leverages 2D joint detector outputs to form pseudo-heatmaps, with optimization driven by multi‐view reprojection loss. Empirical results indicate that SkelSplat’s final accuracy is robust to the choice of initialization, suggesting insensitivity to initial roughness. Cross-view gradient accumulation ensures stable, coherent parameter updates, contrasting with per‐view updates in dense-splatting for scene reconstruction.
5. Empirical Performance Metrics
Absolute mean per‐joint position error (MPJPE), without root alignment, quantifies performance:
| Evaluation Scenario | SkelSplat MPJPE (mm) | Best Prior / Baseline |
|---|---|---|
| Human3.6M (4 cams, ResNet-152 2D) | 20.3 | UPose3D: 26.2; TRL: 25.8; MV-PoseFusion: 25.6; Iskakov et al. (3D-GT): 17.7 |
| CMU→H36M (cross-domain) | 20.3 | Prior: 31–39 |
| CMU Panoptic (4 cams, MeTRAbs) | 20.9 | Alg. Triang.: 21.3 |
| H36M-Occ (Occ-2/Occ-3/Occ-3-Hard) | 24.6/27.0/34.8 | AdaFuse: 27.9/31.2; MV-PoseFusion: 37.8 |
| Occlusion-Person (8 cams/4 cams) | 30.5/40.3 | Non-target baselines: +5–6 mm |
Cross-dataset experiments reveal SkelSplat reduces error by up to 47.8% compared to learning-based methods lacking target‐domain data.
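The metric used throughout the table is simply the mean Euclidean distance per joint, with no root alignment; a minimal sketch:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the input (e.g. mm).

    pred, gt: (N, J, 3) predicted and ground-truth joint positions.
    No root alignment is applied, matching the absolute metric above.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```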
6. Robustness and Generalization Characteristics
SkelSplat retains accuracy under occlusion: covariances of occluded joints are scaled by 1.25× at initialization, giving the optimizer added flexibility and a better accuracy trade-off. Under cross-domain transfer (CMU→H36M), no performance drop is observed. Optimization is stable to Gaussian perturbations of the initialization up to ~40 mm, with errors rising only 4–8%; substantially stronger perturbations (>60 mm) degrade accuracy, but the degradation is graceful.
Performance scales with view count: MPJPE improves from single‐view (>30 mm) to four views (~20 mm) and eight views (~15.6 mm) on CMU Panoptic.
7. Limitations and Future Work
Inference is currently limited by per-channel rendering, requiring ∼3 s per frame for four views; compact representations, e.g., learned joint embeddings, may offer speed gains. Only single-person scenes have been addressed; multi-person estimation would require per-instance initialization and view-wise association. The optimized covariances provide calibrated uncertainty estimates, with ~98% of ground-truth joints falling within 3σ on H36M, opening the possibility of uncertainty-aware downstream tasks. Integrating 3D surface reconstruction (e.g., Gaussians-on-mesh) with joint estimation presents opportunities for improved recovery under strong occlusion or partial visibility.
In summary, SkelSplat advocates for direct geometric optimization using differentiable Gaussian skeletons, requiring only off‐the‐shelf 2D heatmaps and camera calibrations. This geometric, rather than data-driven, fusion framework yields accuracy matching or exceeding prior learned methods, with distinct generalization and robustness benefits.