Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors (2505.05336v1)

Published 8 May 2025 in cs.CV

Abstract: The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.

Summary

The paper presents ProgIP, a method that estimates 3D full-body pose in real time from only three IMU sensors worn on the head and wrists, using neural networks that progressively estimate joints along the kinematic chain.
Experimental results on public datasets show that ProgIP outperforms state-of-the-art methods that use the same three-sensor input and is comparable to recent methods that use six IMUs, on metrics such as MJRE and MJPE.
This approach has practical implications for VR and other domains: it reduces hardware complexity and enables cost-effective, real-time motion capture with accuracy close to that of systems using more sensors.

A Technical Review of "Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors"

The paper "Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors" introduces a human pose estimation technique termed Progressive Inertial Poser (ProgIP). Aimed at virtual reality (VR) applications, the method uses only three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, positioning itself as a more portable alternative to motion capture solutions that require additional sensors on the pelvis and lower body or external visual sensors.

Methodology and Innovation

The core contribution of the paper is the novel ProgIP method, which combines advanced neural network architectures and human dynamics modeling to achieve precise full-body motion estimation. The method is distinctive due to its minimal reliance on hardware, utilizing only three IMU sensors, thereby significantly reducing system complexity.

The proposed pose estimation architecture consists of an encoder that combines a Transformer Encoder with a bidirectional LSTM (TE-biLSTM) to capture temporal dependencies in the inertial sequence, and a decoder built on multi-layer perceptrons (MLPs) that maps the encoded features to Skinned Multi-Person Linear (SMPL) model parameters. Key to this approach is the hierarchical structure of the kinematic chain, which drives a multi-stage progressive estimation. The body is divided into four regions, and joint poses are estimated sequentially along the depth of the kinematic chain, which reduces error accumulation and helps produce realistic joint movement.
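The notion of "depth along the kinematic chain" can be made concrete with the standard SMPL 24-joint parent table. The sketch below computes each joint's distance from the root and groups joints by that depth. The function names are hypothetical, and the grouping only illustrates depth ordering; it is not the paper's four-region partition, which this review does not spell out.

```python
# Standard SMPL 24-joint parent table (index = joint, value = parent; -1 marks the root).
SMPL_PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8,
                9, 9, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21]


def joint_depths(parents):
    """Return, for each joint, the number of edges between it and the root."""
    depths = []
    for parent in parents:
        # Parents always precede their children in this table,
        # so a single left-to-right pass is enough.
        depths.append(0 if parent < 0 else depths[parent] + 1)
    return depths


def group_by_depth(parents):
    """Map depth -> list of joint indices at that depth, shallowest first."""
    groups = {}
    for joint, depth in enumerate(joint_depths(parents)):
        groups.setdefault(depth, []).append(joint)
    return dict(sorted(groups.items()))
```

For example, `group_by_depth(SMPL_PARENTS)[1]` lists the two hips and the first spine joint (indices 1, 2, 3), while the wrists (indices 20 and 21) sit at depth 7. A progressive estimator could process such groups in sequence, feeding earlier estimates into later stages.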

A salient feature of the method is a joint position consistency loss, computed via forward kinematics and added to the training objective to limit the accumulation of rotational error along the kinematic chain. According to the paper, this helps ProgIP reach accuracy comparable to recent methods that use six IMU sensors. The constraint matters for keeping motion natural, especially during dynamic and complex full-body movements.
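To illustrate why a position-based term helps, the following minimal sketch runs forward kinematics on a kinematic tree and compares the joint positions produced by predicted and reference rotations. The mean squared distance used here is an assumption; the paper's exact loss form is not given in this review, and all function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def mat_vec(a, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def forward_kinematics(parents, offsets, local_rots):
    """Return global joint positions from local rotations and bone offsets.

    parents[j] is the parent index of joint j (-1 for the root), and
    offsets[j] is joint j's rest-pose offset from its parent.
    Parents must appear before their children.
    """
    global_rots, positions = [], []
    for j, p in enumerate(parents):
        if p < 0:
            global_rots.append(local_rots[j])
            positions.append(list(offsets[j]))
        else:
            global_rots.append(mat_mul(global_rots[p], local_rots[j]))
            moved = mat_vec(global_rots[p], offsets[j])
            positions.append([positions[p][i] + moved[i] for i in range(3)])
    return positions


def position_consistency_loss(parents, offsets, pred_rots, gt_rots):
    """Mean squared distance between predicted and reference joint positions."""
    pred = forward_kinematics(parents, offsets, pred_rots)
    gt = forward_kinematics(parents, offsets, gt_rots)
    total = sum((p[i] - g[i]) ** 2 for p, g in zip(pred, gt) for i in range(3))
    return total / len(parents)
```

In a three-joint chain with unit-length bones, a rotation error at the root displaces the second joint less than the third, so the loss penalizes errors near the root more heavily than a per-joint rotation loss would. That is the accumulation effect the forward-kinematics term is meant to counter.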

Experimental Evaluation

The paper evaluates ProgIP on several public datasets, including AMASS, DIP-IMU, and TotalCapture. The results show that ProgIP outperforms existing state-of-the-art methods that use the same three-sensor input, in both quantitative and qualitative terms, and that it is comparable to recent approaches using six IMU sensors.

The work reports metrics such as mean joint rotation error (MJRE), mean joint position error (MJPE), and mesh error (ME). ProgIP achieves lower errors on these metrics than prior methods with the same inputs, indicating robust and accurate estimation. The evaluations also indicate that ProgIP runs within the real-time constraints demanded by VR applications.
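As a rough guide to what the first two metrics measure, the sketch below computes a rotation error as the mean geodesic angle between rotation matrices and a position error as the mean Euclidean distance between joints. These are common definitions and are assumptions here; the paper's exact conventions (units, joint subset, alignment) are not stated in this review, and the function names are hypothetical.

```python
import math


def rotation_angle_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def mean_joint_rotation_error(pred_rots, gt_rots):
    """Average geodesic angle over all joints (assumed MJRE definition)."""
    angles = [rotation_angle_deg(p, g) for p, g in zip(pred_rots, gt_rots)]
    return sum(angles) / len(angles)


def mean_joint_position_error(pred_pos, gt_pos):
    """Average Euclidean distance over all joints (assumed MJPE definition)."""
    dists = [math.sqrt(sum((p[i] - g[i]) ** 2 for i in range(3)))
             for p, g in zip(pred_pos, gt_pos)]
    return sum(dists) / len(dists)
```

Both functions take one frame's worth of joints; averaging their outputs over all frames of a sequence would give a per-sequence score.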

Implications and Future Directions

The reduction in hardware complexity, together with the reported accuracy, has practical implications for VR and for other domains that rely on human motion analysis, such as sports science and healthcare monitoring. The findings suggest that systems with only a few IMU sensors could become a viable alternative in cost-sensitive applications, with limited loss of performance.

From a theoretical perspective, this work exemplifies the potential for combining sequence modeling networks (like TE-biLSTM) with domain-specific constraints (like human kinematic models) to enhance motion capture reliability and efficiency. Future research could explore the adaptability of this approach across diverse movements and its integration with other sensory inputs to expand its applicability across broader contexts.

Conclusion

In summary, this paper presents a compelling approach to real-time 3D full-body pose estimation from only three IMU sensors. The authors report promising results on public benchmarks and point to potential real-world applications. As VR and related fields continue to advance, methods such as ProgIP could play an important role in enabling more practical and widely accessible motion capture systems.