
SkelSplat: Geometric 3D Pose Estimation

Updated 17 November 2025
  • SkelSplat is a framework for multi-view 3D pose estimation that optimizes joint positions using differentiable Gaussian rendering with geometric supervision.
  • It models each joint as an anisotropic Gaussian and minimizes a reprojection loss between rendered heatmaps and 2D detector outputs for enhanced robustness.
  • SkelSplat generalizes to novel camera setups and occlusion conditions without relying on 3D ground truth, outperforming prior learning-based methods on benchmarks.

SkelSplat is a framework for multi‐view 3D human pose estimation that utilizes differentiable Gaussian rendering on human skeletal joints. Departing from traditional learned fusion networks reliant on annotated 3D ground‐truth data, SkelSplat models the skeleton as a set of anisotropic Gaussian distributions, optimizing joint locations by projecting them into image space and matching rendered heatmaps to detector outputs. This geometric approach supports robust performance under occlusions and generalizes to novel camera arrangements and pose distributions without retraining, outperforming previous non‐3D‐GT methods on benchmark datasets.

1. Methodological Motivation and Departure from Prior Art

Multi‐view pose estimation conventionally operates in two stages: 2D joint detection across cameras, then 3D fusion—using either classical triangulation or learned fusion networks. State‐of‐the‐art methods such as Epipolar Transformers, TransFusion, AdaFuse, and Geometry‐biased Transformers employ end‐to‐end training on large datasets (e.g., Human3.6M, CMU Panoptic), binding their fusion strategies to specific camera setups and occlusion patterns present in the training data. When deployed in altered environments, these methods frequently exhibit dramatic performance degradation, necessitating re‐training or fine‐tuning.

SkelSplat’s novelty lies in modeling each skeletal joint as an independent 3D anisotropic Gaussian whose mean and covariance are optimized using only geometric supervision—minimizing a reprojection loss between differentiable renderings of the skeleton and 2D joint heatmaps. This technique avoids any reliance on 3D ground‐truth labels or image-based photometric losses, thereby enabling out‐of‐domain generalization without retraining. A plausible implication is that its operational independence from dataset-specific biases renders SkelSplat robust to camera network variability and occlusions.

2. Skeleton Representation and One‐Hot Encoding

SkelSplat represents a human skeleton with $N$ joints, indexed by $j = 1, \ldots, N$. Each joint $j$ is modeled as an anisotropic Gaussian $g_j$ with mean $\mu_j \in \mathbb{R}^3$ and covariance $\Sigma_j \in \mathbb{R}^{3\times3}$, parameterized as:

$$\Sigma_j = R_j\,\mathrm{diag}(S_j^2)\,R_j^\top$$

where $R_j \in \mathrm{SO}(3)$ is a rotation and $S_j \in \mathbb{R}_+^3$ is a positive scale vector. All $\mu_j$ may be initialized from a rough 3D pose guess, such as algebraic triangulation or fused monocular estimates, and each $\Sigma_j$ is set to $3\,I_3$.
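The covariance parameterization above can be sketched in a few lines of NumPy (the helper name is ours):

```python
import numpy as np

def joint_covariance(R, S):
    """Sigma_j = R_j diag(S_j^2) R_j^T for one joint.

    R: (3, 3) rotation matrix R_j; S: (3,) positive scale vector S_j.
    """
    return R @ np.diag(S ** 2) @ R.T

# The paper's isotropic initialization Sigma_j = 3 * I_3 corresponds to
# an identity rotation with per-axis scale sqrt(3).
Sigma0 = joint_covariance(np.eye(3), np.full(3, np.sqrt(3.0)))
```

Because $S_j$ enters squared and $R_j$ is a rotation, any such $\Sigma_j$ is automatically symmetric positive semi-definite, which keeps the optimization well posed.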

The one-hot joint-ID encoding replaces the RGB color of classical Gaussian splatting for scene reconstruction with an $N$-dimensional vector $c_j[k]$, $k = 1, \ldots, N$:

$$c_j[k] = \begin{cases} 1 & k = j \\ 0 & k \neq j \end{cases}$$

Thus, when splatted, each Gaussian contributes solely to its own channel, disentangling overlapping joints in image space.
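A minimal sketch of this channel separation (joint count, heatmap size, and joint positions here are illustrative, not from the paper):

```python
import numpy as np

N, H, W = 17, 8, 8          # illustrative joint count and heatmap size
heatmaps = np.zeros((N, H, W))
ys, xs = np.mgrid[0:H, 0:W]

# Two joints that nearly overlap in image space: with one-hot colors
# c_j[k] = 1 iff k == j, each splat writes only into its own channel j.
for j, (cx, cy) in [(2, (3.0, 3.0)), (5, (3.4, 3.1))]:
    heatmaps[j] += np.exp(-0.5 * ((xs - cx) ** 2 + (ys - cy) ** 2))
```

Even though the two Gaussians overlap heavily in pixel space, channels 2 and 5 remain separate signals, so neither joint's gradient is corrupted by the other.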

3. Differentiable Rendering and Optimization Pipeline

SkelSplat executes the following pipeline for each view ii:

  • Camera extrinsics $W_i \in \mathbb{R}^{4\times4}$ and the projection Jacobian $J_i \in \mathbb{R}^{2\times3}$ evaluated at joint mean $\mu_j$ yield the projected covariance:

$$\Sigma^{2D}_{ij} = J_i\,W_i\,\Sigma_j\,W_i^\top\,J_i^\top$$

  • The 2D projected joint center is $u_{ij} = \pi(K_i,\,W_i\mu_j)$.
  • Splatting Gaussian $g_j$ yields a heatmap for channel $j$:

$$H^{\mathrm{render}}_{ij}(u) = \alpha_j\,\exp\!\left(-\tfrac{1}{2}\,(u-u_{ij})^\top\left(\Sigma_{ij}^{2D}\right)^{-1}(u-u_{ij})\right)$$

where $\alpha_j$ is a learned opacity, often set to 1.
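The projection and rasterization steps can be sketched in NumPy as follows; this is a simplified EWA-style splat, it applies the $3\times3$ rotation part of the extrinsics when transforming the covariance, and all helper names are ours:

```python
import numpy as np

def project_joint(K, W, mu, Sigma):
    """Project one joint Gaussian into a camera: returns u_ij and Sigma^2D_ij.

    K: (3, 3) intrinsics; W: (4, 4) extrinsics; mu: (3,) mean; Sigma: (3, 3).
    """
    x_cam = (W @ np.append(mu, 1.0))[:3]       # world -> camera coordinates
    u = (K @ x_cam)[:2] / x_cam[2]             # u_ij = pi(K_i, W_i mu_j)
    fx, fy = K[0, 0], K[1, 1]
    X, Y, Z = x_cam
    # Jacobian J_i of the perspective projection, evaluated at the joint mean
    J = np.array([[fx / Z, 0.0, -fx * X / Z ** 2],
                  [0.0, fy / Z, -fy * Y / Z ** 2]])
    Rw = W[:3, :3]                             # rotation part of the extrinsics
    Sigma2d = J @ Rw @ Sigma @ Rw.T @ J.T      # 2x2 projected covariance
    return u, Sigma2d

def render_channel(u, Sigma2d, shape, alpha=1.0):
    """Rasterize H^render_ij: an anisotropic 2D Gaussian heatmap for one joint."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d = np.stack([xs - u[0], ys - u[1]], axis=-1)
    P = np.linalg.inv(Sigma2d)                 # precision matrix (Sigma^2D)^-1
    m = np.einsum('...i,ij,...j->...', d, P, d)
    return alpha * np.exp(-0.5 * m)

# Toy usage: a joint 5 m in front of an identity-pose camera.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
u, S2 = project_joint(K, np.eye(4), np.array([0.0, 0.0, 5.0]), 0.01 * np.eye(3))
heatmap = render_channel(u, S2, (64, 64))      # peaks at pixel (32, 32)
```

Because the whole chain is a composition of differentiable operations, gradients of a heatmap loss flow back to $\mu_j$ and $\Sigma_j$; here NumPy only illustrates the forward pass.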

For each view and joint, a pseudo-ground-truth heatmap $H^{\mathrm{pseudo}}_{ij}$ is constructed at the 2D detector output $\hat{u}_{ij}$, using the rendered covariance $\Sigma_{ij}^{2D}$. A masked $\ell_2$ distance over $M$ views then defines the reprojection loss:

$$\mathcal{L}_{\mathrm{render}} = \sum_{i=1}^{M} \sum_{j=1}^{N} \left\| M_{ij} \odot \left( H^{\mathrm{render}}_{ij} - H^{\mathrm{pseudo}}_{ij} \right) \right\|_2^2$$

A 3D symmetry regularizer enforces approximate equality of symmetric limb lengths. For symmetric limb pairs $(l, r) \in \mathcal{S}$ (e.g., left and right elbow–wrist), with $p_l^1, p_l^2$ denoting the two endpoint joints of limb $l$:

$$\mathcal{L}_{\mathrm{sym}} = \sum_{(l, r)\in\mathcal{S}} \left( \|p_l^1 - p_l^2\|_2 - \|p_r^1 - p_r^2\|_2 \right)^2$$

The total loss becomes:

$$\mathcal{L} = \mathcal{L}_{\mathrm{render}} + \lambda_{\mathrm{sym}}\,\mathcal{L}_{\mathrm{sym}}, \qquad \lambda_{\mathrm{sym}} = 1\times10^{-5}$$
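A NumPy sketch of the two loss terms and their weighted sum (array shapes and helper names are our assumptions; the real pipeline differentiates these losses through the renderer):

```python
import numpy as np

def render_loss(H_render, H_pseudo, mask):
    """Masked l2 reprojection loss, summed over views i and joints j.

    All arrays have shape (M, N, h, w): M views, N joints, h x w heatmaps.
    """
    diff = mask * (H_render - H_pseudo)
    return float(np.sum(diff ** 2))

def symmetry_loss(joints, pairs):
    """Squared difference of symmetric limb lengths.

    joints: (N, 3) joint positions; pairs: list of ((l1, l2), (r1, r2))
    index tuples giving the endpoints of a left limb and its right mirror.
    """
    total = 0.0
    for (l1, l2), (r1, r2) in pairs:
        dl = np.linalg.norm(joints[l1] - joints[l2])
        dr = np.linalg.norm(joints[r1] - joints[r2])
        total += (dl - dr) ** 2
    return total

def total_loss(H_render, H_pseudo, mask, joints, pairs, lam_sym=1e-5):
    return render_loss(H_render, H_pseudo, mask) + lam_sym * symmetry_loss(joints, pairs)
```

The tiny weight $\lambda_{\mathrm{sym}} = 10^{-5}$ means the symmetry term acts only as a tie-breaker; the reprojection term dominates the optimization.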

Optimization employs Adam, with gradients accumulated over all views, and is capped at 125 iterations with early stopping.

4. Training Protocol and Absence of 3D Ground‐Truth

Training is agnostic to 3D ground truth. Initialization consists of any plausible 3D joint guess (algebraic triangulation or multi-view fusion); the pipeline then uses only 2D joint detector outputs to form pseudo-heatmaps, with optimization driven by the multi-view reprojection loss. Empirical results indicate that SkelSplat's final accuracy is largely insensitive to the choice and roughness of the initialization. Cross-view gradient accumulation ensures stable, coherent parameter updates, in contrast to the per-view updates common in dense splatting for scene reconstruction.
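To make "geometry-only supervision" concrete, here is a heavily simplified toy: a single 3D point is recovered purely from its 2D detections in two calibrated views, by gradient descent on reprojection error. Finite-difference gradients stand in for the differentiable renderer, and plain gradient descent for Adam; everything here is illustrative, not the paper's implementation:

```python
import numpy as np

def project(K, W, x):
    """pi(K, W x): pinhole projection of a 3D point into pixel coordinates."""
    xc = (W @ np.append(x, 1.0))[:3]
    return (K @ xc)[:2] / xc[2]

def reproj_error(x, cams, dets):
    """Sum of squared 2D reprojection errors of point x over all views."""
    return sum(np.sum((project(K, W, x) - u_hat) ** 2)
               for (K, W), u_hat in zip(cams, dets))

def optimize_point(x0, cams, dets, lr=1e-3, iters=125, tol=1e-10, h=1e-5):
    """Gradient descent with finite-difference gradients and early stopping."""
    x = np.asarray(x0, dtype=float).copy()
    prev = np.inf
    for _ in range(iters):
        e0 = reproj_error(x, cams, dets)
        if prev - e0 < tol:            # early stopping once the loss plateaus
            break
        prev = e0
        g = np.array([(reproj_error(x + h * np.eye(3)[k], cams, dets) - e0) / h
                      for k in range(3)])
        x -= lr * g
    return x

# Two translated cameras observe one point; "detections" are its exact
# projections, and optimization starts from a perturbed initial guess.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
W1, W2 = np.eye(4), np.eye(4)
W2[0, 3] = -2.0                        # 2 m horizontal baseline
p_true = np.array([0.1, -0.2, 5.0])
cams = [(K, W1), (K, W2)]
dets = [project(K, W, p_true) for K, W in cams]
p_est = optimize_point(p_true + np.array([0.3, -0.2, 0.5]), cams, dets)
```

No 3D label appears anywhere: the multi-view geometry alone pins down the 3D estimate, which is the same supervision principle SkelSplat applies to Gaussian heatmaps instead of points.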

5. Empirical Performance Metrics

Absolute mean per‐joint position error (MPJPE), without root alignment, quantifies performance:

| Evaluation Scenario | SkelSplat MPJPE (mm) | Best Prior / Baseline |
|---|---|---|
| Human3.6M (4 cams, ResNet-152 2D) | 20.3 | UPose3D: 26.2; TRL: 25.8; MV-PoseFusion: 25.6; Iskakov et al. (3D-GT): 17.7 |
| CMU→H36M (cross-domain) | 20.3 | Prior: 31–39 |
| CMU Panoptic (4 cams, MeTRAbs) | 20.9 | Algebraic triangulation: 21.3 |
| H36M-Occ (Occ-2 / Occ-3 / Occ-3-Hard) | 24.6 / 27.0 / 34.8 | AdaFuse: 27.9 / 31.2; MV-PoseFusion: 37.8 |
| Occlusion-Person (8 cams / 4 cams) | 30.5 / 40.3 | Non-target baselines: +5–6 mm |

Cross-dataset experiments reveal SkelSplat reduces error by up to 47.8% compared to learning-based methods lacking target‐domain data.
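The metric itself is simple to state; a sketch (array shapes are our assumption):

```python
import numpy as np

def mpjpe(pred, gt):
    """Absolute MPJPE: mean Euclidean distance over joints, with no
    root-joint alignment. pred, gt: (N, 3) joint positions in millimetres."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

# A prediction offset by a 3-4-5 triangle at every joint gives 5 mm MPJPE.
gt = np.zeros((17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
```

Because no root alignment is applied, the numbers above penalize global translation errors as well as pose errors, which makes them a stricter metric than root-relative MPJPE.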

6. Robustness and Generalization Characteristics

SkelSplat retains accuracy under occlusion: covariances of occluded joints are scaled by 1.25× at initialization, giving those joints extra flexibility during optimization. Under cross-domain transfer (CMU→H36M), no performance drop is observed. Accuracy is also stable under Gaussian perturbations of the initialization of up to ~40 mm, with errors rising only 4–8%; substantially stronger perturbations (>60 mm) degrade accuracy, but the degradation remains graceful.

Performance scales with view count: MPJPE improves from single‐view (>30 mm) to four views (~20 mm) and eight views (~15.6 mm) on CMU Panoptic.

7. Limitations and Future Work

Inference is currently limited by per-channel rendering, requiring ~3 s per frame for four views; more compact representations, e.g., learned joint embeddings, may offer speed gains. Only single-person scenes have been addressed; multi-person estimation requires per-instance initialization and cross-view association. The optimized covariances provide calibrated uncertainty estimates, with ~98% of ground-truth joints falling within 3σ on H36M, opening the possibility of uncertainty-aware downstream tasks. Integrating 3D surface reconstruction (e.g., Gaussians-on-mesh) with joint estimation presents opportunities for improved recovery under strong occlusion or partial visibility.
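One way to read the 3σ statistic (our interpretation; the paper may compute coverage differently) is as a Mahalanobis-distance test of the ground-truth joint against the optimized Gaussian:

```python
import numpy as np

def within_3sigma(gt, mu, Sigma):
    """True if ground-truth joint gt lies inside the 3-sigma ellipsoid of
    the Gaussian N(mu, Sigma), i.e., Mahalanobis distance <= 3."""
    d = np.asarray(gt, dtype=float) - mu
    return float(np.sqrt(d @ np.linalg.inv(Sigma) @ d)) <= 3.0
```

Under this reading, a ~98% hit rate means the optimized covariances are roughly calibrated confidence regions, not just rendering parameters.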

In summary, SkelSplat advocates for direct geometric optimization using differentiable Gaussian skeletons, requiring only off‐the‐shelf 2D heatmaps and camera calibrations. This geometric, rather than data-driven, fusion framework yields accuracy matching or exceeding prior learned methods, with distinct generalization and robustness benefits.
