Shape-Aware Inertial Poser (SAIP)
- The paper introduces a novel inertial shape estimation approach that decomposes IMU signals into pose and shape components, achieving a 35% reduction in mesh error for non-standard bodies.
- SAIP employs dual MLP regression networks: one retargets raw subject-specific accelerations onto a canonical template for precise pose inference, and the other remaps the estimated joint velocities back to the subject's own body shape.
- The framework integrates shape-aware physical optimization with a dedicated 60-frame estimation window and a diverse benchmark dataset to validate its state-of-the-art performance.
Shape-aware Inertial Poser (SAIP) refers to a class of human motion capture methods using sparse inertial sensors (IMUs) that explicitly model and compensate for body shape differences to enable accurate 3D tracking across a wide range of individuals, including children and adults. Unlike prior approaches that presume a fixed template shape, SAIP decomposes sensor measurements into shape and pose components, applies regression networks for domain adaptation, and incorporates shape-aware physical optimization. It also introduces the first inertial shape estimation strategy and provides a benchmark dataset spanning diverse human shapes (Yin et al., 20 Oct 2025).
1. Motivation and Problem Definition
Traditional inertial motion capture pipelines—such as Sparse Inertial Poser (SIP) (Marcard et al., 2017), Deep Inertial Poser (Huang et al., 2018), and Transformer Inertial Poser (Jiang et al., 2022)—assume a fixed adult body template when mapping IMU readings to motion. This approach restricts generalization: for subjects with non-standard shapes (such as children or individuals with extreme limb proportions), the IMU-measured acceleration and derived pose estimates become inaccurate, since acceleration magnitude is strongly shape-dependent and kinematic reach varies with bone geometry. The central problem addressed by SAIP is robust pose inference in the presence of significant subject-specific anatomical variation.
2. Signal Decomposition and Regression-based Retargeting
SAIP models the observed IMU signals as the sum of two components: signals intrinsic to motion (pose) and those modulated by body shape. This separation is operationalized by:
- IMU Acceleration Retargeting (R₍acc₎): A regression network (typically an MLP) maps raw shape-conditioned IMU accelerations (A_R, measured on the subject of shape β) onto the canonical template domain, producing A_T, synthetic accelerations corresponding to a standardized adult mesh. The shape vector β comprises the SMPL shape parameters, which govern attributes such as height, limb thickness, and bone lengths.
- Joint Velocity Remapping (R₍vel₎): After pose retrieval using state-of-the-art methods (e.g., PNP), joint velocities in template space are mapped back to the actual subject's body shape using a second regression network. This compensates for the change in kinematic coefficients resulting from anatomical differences.
This two-stage domain adaptation scheme enables re-use of existing high-performance pose estimators in a shape-agnostic fashion while maintaining fidelity to the actual body.
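A minimal sketch of this two-stage domain adaptation follows, assuming plain ReLU MLPs operating on per-frame features concatenated with the SMPL shape vector β; the layer sizes, feature layouts, and single-frame formulation are illustrative assumptions, not the authors' released architecture.

```python
# Hedged sketch (not the paper's code): two MLP regressors corresponding to the
# acceleration-retargeting (R_acc) and velocity-remapping (R_vel) stages.
import torch
import torch.nn as nn

NUM_IMUS = 6     # sparse sensor configuration
SHAPE_DIM = 10   # SMPL beta dimensionality (assumed)
NUM_JOINTS = 24  # SMPL kinematic tree

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# R_acc: subject-space accelerations A_R (conditioned on beta) -> template-space
# accelerations A_T consumed by an off-the-shelf pose estimator.
r_acc = MLP(in_dim=NUM_IMUS * 3 + SHAPE_DIM, out_dim=NUM_IMUS * 3)

# R_vel: template-space joint velocities -> velocities remapped to the subject's shape.
r_vel = MLP(in_dim=NUM_JOINTS * 3 + SHAPE_DIM, out_dim=NUM_JOINTS * 3)

# Toy forward pass on a single frame.
a_subject = torch.randn(1, NUM_IMUS * 3)    # measured accelerations A_R
beta = torch.randn(1, SHAPE_DIM)            # current shape estimate
a_template = r_acc(torch.cat([a_subject, beta], dim=-1))   # A_T

v_template = torch.randn(1, NUM_JOINTS * 3)  # velocities in template space
v_subject = r_vel(torch.cat([v_template, beta], dim=-1))
```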
3. Shape-aware Physical Optimization and Pose Estimation
Motion reconstruction is further refined by a shape-aware physical optimization module. The system adapts the physical parameters governing motion synthesis (center of mass, moment of inertia, and root joint translation) using the shape estimate β. The root joint translation in template space is advanced frame by frame by integrating a velocity term, switching between the velocity derived from the SMPL model and the joint velocity remapped with the shape parameters according to whether foot-ground contact is detected or the subject is airborne (e.g., jumping); a minimal sketch of this switching update is given below. Dual PD controllers use mass and inertia tailored to the individual's body.
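The following sketch illustrates the contact-gated root-translation update, assuming simple per-frame Euler integration at 60 Hz and an assumed assignment of which velocity is used in which contact regime; it is not the paper's exact formulation.

```python
# Illustrative sketch only: variable names, the Euler step, and the choice of
# which velocity is used during contact are assumptions.
import numpy as np

DT = 1.0 / 60.0  # assumed frame interval (60 Hz IMU stream)

def update_root_translation(tran_prev, v_smpl, v_remap, foot_contact):
    """Advance the root translation by one frame.

    tran_prev    : (3,) previous root translation in template space
    v_smpl       : (3,) velocity derived from the SMPL model
    v_remap      : (3,) joint velocity remapped with the shape parameters beta
    foot_contact : True when foot-ground contact is detected
    """
    v = v_smpl if foot_contact else v_remap  # assumed assignment of regimes
    return tran_prev + v * DT

# Toy usage for a single frame with a detected contact.
tran = update_root_translation(np.zeros(3),
                               v_smpl=np.array([0.0, 0.0, 0.10]),
                               v_remap=np.array([0.0, 0.0, 0.30]),
                               foot_contact=True)
```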
The motion sequence undergoes a global optimization pass that incorporates these physical constraints and leverages shape-informed dynamic modeling to produce a physically plausible result for each subject.
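The dual PD controllers mentioned above act through physical parameters scaled to the subject. The sketch below shows a generic PD torque step whose output is scaled by a shape-dependent inertia; the single-joint form, gain values, and scaling are illustrative assumptions rather than the paper's controller design.

```python
# Generic PD torque step with shape-dependent inertia; gains and the per-axis
# scaling are illustrative assumptions.
import numpy as np

def pd_torque(q_des, q, qdot, inertia, kp=2400.0, kd=60.0):
    """Joint-space PD torque, scaled element-wise by subject-specific inertia."""
    return inertia * (kp * (q_des - q) - kd * qdot)

# A smaller body (estimated via beta) yields smaller inertia terms, producing
# gentler corrective torques for the same pose error.
inertia = np.array([0.8, 0.8, 0.8])  # assumed per-axis inertia from the shape estimate
tau = pd_torque(q_des=np.array([0.2, 0.0, -0.1]),
                q=np.zeros(3),
                qdot=np.zeros(3),
                inertia=inertia)
```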
4. Inertial Shape Estimation
SAIP introduces the first inertial shape estimation scheme, which leverages a 60-frame temporal window (about one second of motion) to model the correlation between IMU signals and underlying body morphology. This is achieved by:
- Inputting IMU signals and initial pose estimates to an MLP-based network that regresses the SMPL shape parameters β.
- Using the subject’s height (relative to the template) to bootstrap the shape estimation.
- Refining β over time, ensuring adaptation to individual differences as more motion is observed.
This continuous, online shape refinement facilitates accurate inertial mesh reconstruction and enables adaptive physical modeling for each subject.
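A minimal sketch of such a windowed shape estimator follows, assuming an MLP over flattened 60-frame IMU-plus-pose features and a simple running-average refinement of β; the feature layout, network size, and momentum blending are illustrative assumptions.

```python
# Hedged sketch of a windowed shape estimator regressing SMPL beta from ~1 s of motion.
import torch
import torch.nn as nn

WINDOW = 60          # ~1 s of motion at 60 Hz (assumed rate)
NUM_IMUS = 6
IMU_FEAT = 12        # per-IMU features, e.g., 3-D acceleration + 3x3 orientation (assumed)
POSE_FEAT = 24 * 6   # per-frame pose features, e.g., 6-D rotation per SMPL joint (assumed)
SHAPE_DIM = 10

shape_net = nn.Sequential(
    nn.Linear(WINDOW * (NUM_IMUS * IMU_FEAT + POSE_FEAT), 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, SHAPE_DIM),
)

def refine_beta(beta_prev, window_features, momentum=0.9):
    """Regress beta from the current window and blend it with the running
    estimate so the shape adapts smoothly as more motion is observed."""
    beta_new = shape_net(window_features.flatten(start_dim=1))
    return momentum * beta_prev + (1.0 - momentum) * beta_new

# Toy usage: one window of features and a previous shape estimate.
feats = torch.randn(1, WINDOW, NUM_IMUS * IMU_FEAT + POSE_FEAT)
beta = torch.zeros(1, SHAPE_DIM)
beta = refine_beta(beta, feats)
```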
5. Dataset Design and Experimental Validation
The Multi-shape Inertial MoCap Dataset (MID) is introduced to benchmark SAIP. It encompasses 20 subjects (10 children, 10 adults) with heights ranging from 110 cm to 190 cm and more than 400 minutes of paired IMU-motion data. The Noitom PN Studio system provides 17 reference IMUs, while the sparse configuration uses 6 sensors for reconstruction. MID covers freestyle and sports actions, ensuring diversity in both shape and motion.
Experimental findings indicate that SAIP achieves:
- Significantly lower pose error (SIP global orientation error, joint angle error, joint position error, and mesh reconstruction error) for subjects with non-template shapes than previous state-of-the-art methods.
- 35% reduction in mesh error over the next-best method, especially for children and non-standard body shapes.
- Robustness in handling shape-induced acceleration variation and kinematic distortion.
Ablation studies confirm the necessity and complementary contribution of each module: acceleration retargeting (R₍acc₎), velocity mapping (R₍vel₎), inertial shape estimator, and shape-aware physical optimizer.
6. Broader Context and Comparative Framework
Earlier frameworks (SIP (Marcard et al., 2017), Deep Inertial Poser (Huang et al., 2018), Transformer Inertial Poser (Jiang et al., 2022), Ultra Inertial Poser (Armani et al., 30 Apr 2024)) rely on fixed anthropometric priors and do not adjust for individual shape when mapping sensor signals to pose. Diffusion-based, garment-aware frameworks (Ilic et al., 18 Jun 2025) and hand-shape calibration frameworks (Li et al., 25 Sep 2025) expand on the shape-aware paradigm but do not provide a solution for general whole-body motion capture across diverse body sizes.
SAIP is the first framework to systematically decompose sensor signals, adapt them to template space, and reconstruct subject-specific mesh and pose, with MLP-based estimation of the shape parameters β, validated on large-scale, diverse data.
7. Limitations and Future Directions
Current limitations include difficulty in handling motions with complex non-standard contacts (e.g., crawling), susceptibility to magnetic field perturbations, and potential loss of precision in extreme body shapes or pathological anatomy. Future work is proposed to:
- Further improve robustness to environmental noise and contact events.
- Enhance shape estimation accuracy to approach optical motion capture fidelity.
- Extend shape-aware processing to subjects with disabilities or those interacting with terrain or objects.
A plausible implication is that methods similar to SAIP can be generalized to advanced inertial motion capture systems (including hand-specific and garment-aware capture) by explicitly modeling the complex interactions between body shape and sensor signal geometry. This also suggests that fusion frameworks for IMU and depth/LiDAR sensors could benefit from shape-aware adaptation for improved accuracy in large, unconstrained environments.
In summary, Shape-aware Inertial Poser (SAIP) presents a comprehensive solution for generalizing sparse IMU-based motion capture to individuals of widely varying shape. By learning and applying body shape conditioning at multiple points in the reconstruction pipeline, SAIP achieves state-of-the-art accuracy and robustness, demonstrated by quantitative improvements on the first inertial motion capture dataset to span both adults and children (Yin et al., 20 Oct 2025).