- The paper introduces a dual coordinate fusion strategy to integrate visual and IMU data, significantly reducing occlusion and drift errors in motion capture.
- It employs a hidden state feedback mechanism that dynamically corrects estimation errors, and it is evaluated on challenging datasets such as AIST++ and TotalCapture.
- The results highlight the method's potential for real-time applications in consumer electronics and virtual reality by enhancing accuracy and robustness.
Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture
This paper addresses the ongoing challenges in human motion capture (mocap) by introducing a method that combines monocular images with sparse Inertial Measurement Unit (IMU) data. Historically, mocap systems have relied on either visual data (RGB images) or inertial sensor data for motion estimation. Each modality has strengths and limitations: visual mocap suffers from occlusions, poor lighting, and out-of-view subjects, while IMU-based mocap is prone to global drift due to cumulative sensor errors. This method leverages the complementary nature of both data types to enhance mocap robustness and accuracy.
The core contribution of the paper is a dual coordinate strategy designed to effectively integrate the two input modalities. This strategy involves two branches: one transforms IMU data into the camera coordinate system to pair with image-derived information, and the other processes IMU signals within the human root coordinate system to improve body pose estimation. This design exploits the complementary strengths of the two modalities: camera data provides accurate global position and local pose when the subject is visible, while IMU data remains reliable under occlusion or insufficient lighting.
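As a rough illustration of how such a dual coordinate strategy can be set up, the sketch below rotates world-frame IMU measurements into the camera frame and into the human root frame before they are fed to two separate branches. The function names, tensor shapes, and calibration rotations (`R_wc`, `R_wr`) are assumptions made for exposition, not the paper's exact formulation.

```python
import torch

def to_camera_frame(imu_rot_w, imu_acc_w, R_wc):
    """Express world-frame IMU orientations (N, 3, 3) and accelerations (N, 3)
    in the camera frame; R_wc (3, 3) rotates world coordinates into camera coordinates."""
    rot_c = torch.einsum('ij,njk->nik', R_wc, imu_rot_w)
    acc_c = torch.einsum('ij,nj->ni', R_wc, imu_acc_w)
    return rot_c, acc_c

def to_root_frame(imu_rot_w, imu_acc_w, R_wr):
    """Express the same measurements relative to the human root (pelvis) frame;
    R_wr (3, 3) rotates world coordinates into the root frame (e.g. from a pelvis IMU)."""
    rot_r = torch.einsum('ij,njk->nik', R_wr, imu_rot_w)
    acc_r = torch.einsum('ij,nj->ni', R_wr, imu_acc_w)
    return rot_r, acc_r

# Hypothetical usage: camera-frame IMU features are concatenated with image
# features to estimate global translation, while root-frame features drive
# local body-pose estimation in a second branch.
```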
A noteworthy aspect of this method is the introduction of a hidden state feedback mechanism. This mechanism dynamically adjusts the internal state of both branches, using the more reliable information from each branch to correct estimation errors in the other, such as drift in global motion estimation. The feedback loop keeps the estimates stable over time, particularly when one modality's input is unreliable or unavailable.
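A minimal sketch of what hidden state feedback between two recurrent branches could look like is shown below. The GRU cells, the gating layer, and all dimensions are hypothetical placeholders rather than the authors' architecture; the point is only that each branch's hidden state is corrected using the other branch at every time step.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Illustrative two-branch recurrent model with cross-branch hidden-state feedback."""
    def __init__(self, d_cam, d_root, d_hidden):
        super().__init__()
        self.cam_branch = nn.GRUCell(d_cam, d_hidden)    # image + camera-frame IMU features
        self.root_branch = nn.GRUCell(d_root, d_hidden)  # root-frame IMU features
        # Feedback: each branch's new hidden state is a gated mix of both branches.
        self.gate = nn.Linear(2 * d_hidden, d_hidden)

    def forward(self, x_cam, x_root, h_cam, h_root):
        h_cam = self.cam_branch(x_cam, h_cam)
        h_root = self.root_branch(x_root, h_root)
        g = torch.sigmoid(self.gate(torch.cat([h_cam, h_root], dim=-1)))
        h_cam_fb = g * h_cam + (1 - g) * h_root   # corrected hidden states
        h_root_fb = g * h_root + (1 - g) * h_cam
        return h_cam_fb, h_root_fb
```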
The authors rigorously evaluate their method against state-of-the-art mocap techniques, demonstrating superior accuracy in both global orientation and local pose estimation across diverse datasets. These datasets include the challenging AIST++ (featuring highly dynamic motions) and TotalCapture (which includes out-of-view scenarios), thereby highlighting the practicality of this integrated approach in real-world applications.
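For context, accuracy in such evaluations is commonly reported with a mean per-joint position error for local pose and a geodesic angle error for the global (root) orientation. The snippet below computes both under assumed array shapes; it reflects standard mocap metrics, not necessarily the paper's exact evaluation protocol.

```python
import torch

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over frames and joints.
    Both tensors have shape (T, J, 3)."""
    return (pred_joints - gt_joints).norm(dim=-1).mean()

def global_angle_error_deg(pred_R, gt_R):
    """Mean geodesic angle (degrees) between predicted and ground-truth root rotations (T, 3, 3)."""
    rel = pred_R.transpose(1, 2) @ gt_R
    trace = rel.diagonal(dim1=1, dim2=2).sum(-1)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()
```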
The findings suggest significant implications for the development of real-time mocap systems in consumer electronics, such as smartphones or AR glasses, which increasingly incorporate both camera and IMU sensors. Additionally, the fusion method could influence advancements in virtual reality, where precise, low-latency mocap enhances immersion and interactivity.
Future research directions may explore the fusion approach under more diverse conditions, such as long-duration capture where accumulated IMU drift dominates the error. Extending the method to handle variable body shapes could also refine motion estimation accuracy, broadening the system's applicability to a wider range of users.
In summary, this paper contributes a robust solution for real-time human motion capture by intelligently combining complementary data sources. The improvements in accuracy and robustness over existing models establish this fusion technique as a promising approach for next-generation mocap systems.