- The paper introduces a dual coordinate fusion strategy to integrate visual and IMU data, significantly reducing occlusion and drift errors in motion capture.
- It employs a hidden state feedback mechanism that dynamically corrects estimation errors, and it is evaluated on challenging datasets such as AIST++ and TotalCapture.
- The results highlight the method's potential for real-time applications in consumer electronics and virtual reality by enhancing accuracy and robustness.
Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture
This paper addresses the ongoing challenges in human motion capture (mocap) by introducing a method that combines monocular images with sparse Inertial Measurement Unit (IMU) data. Historically, mocap systems have relied on either visual data (RGB images) or inertial sensor data for motion estimation. Each modality has strengths and limitations: visual mocap suffers from occlusions, poor lighting, and out-of-view subjects, while IMU-based mocap is prone to global drift due to cumulative sensor errors. This method leverages the complementary nature of both data types to enhance mocap robustness and accuracy.
The core contribution of the paper is a dual coordinate strategy designed to effectively integrate the two input modalities. This strategy involves two branches: one transforms IMU data into the camera coordinate system to pair with image-derived information, and the other processes IMU signals within the human root coordinate system to improve body pose estimation. This design exploits the complementary strengths of the two modalities: camera data provides accurate global position and local pose when the subject is visible, while IMU data remains reliable under occlusion or insufficient lighting.
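As a rough illustration of how such a dual coordinate strategy can be set up, the sketch below rotates world-frame IMU measurements into the camera frame and into the human root frame before they are fed to two separate branches. The function names, tensor shapes, and calibration rotations (`R_wc`, `R_wr`) are assumptions made for exposition, not the paper's exact formulation.

```python
import torch

def to_camera_frame(imu_rot_w, imu_acc_w, R_wc):
    """Express world-frame IMU orientations (N, 3, 3) and accelerations (N, 3)
    in the camera frame; R_wc (3, 3) rotates world coordinates into camera coordinates."""
    rot_c = torch.einsum('ij,njk->nik', R_wc, imu_rot_w)
    acc_c = torch.einsum('ij,nj->ni', R_wc, imu_acc_w)
    return rot_c, acc_c

def to_root_frame(imu_rot_w, imu_acc_w, R_wr):
    """Express the same measurements relative to the human root (pelvis) frame;
    R_wr (3, 3) rotates world coordinates into the root frame (e.g. from a pelvis IMU)."""
    rot_r = torch.einsum('ij,njk->nik', R_wr, imu_rot_w)
    acc_r = torch.einsum('ij,nj->ni', R_wr, imu_acc_w)
    return rot_r, acc_r

# Hypothetical usage: camera-frame IMU features are concatenated with image
# features to estimate global translation, while root-frame features drive
# local body-pose estimation in a second branch.
```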
A noteworthy aspect of this method is the introduction of a hidden state feedback mechanism. This mechanism dynamically adjusts the internal state of both branches, using the more reliable information from each branch to correct estimation errors in the other, such as drift in global motion estimation. The feedback loop keeps the estimates stable over time, particularly when one modality's input is unreliable or unavailable.
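A minimal sketch of what hidden state feedback between two recurrent branches could look like is shown below. The GRU cells, the gating layer, and all dimensions are hypothetical placeholders rather than the authors' architecture; the point is only that each branch's hidden state is corrected using the other branch at every time step.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Illustrative two-branch recurrent model with cross-branch hidden-state feedback."""
    def __init__(self, d_cam, d_root, d_hidden):
        super().__init__()
        self.cam_branch = nn.GRUCell(d_cam, d_hidden)    # image + camera-frame IMU features
        self.root_branch = nn.GRUCell(d_root, d_hidden)  # root-frame IMU features
        # Feedback: each branch's new hidden state is a gated mix of both branches.
        self.gate = nn.Linear(2 * d_hidden, d_hidden)

    def forward(self, x_cam, x_root, h_cam, h_root):
        h_cam = self.cam_branch(x_cam, h_cam)
        h_root = self.root_branch(x_root, h_root)
        g = torch.sigmoid(self.gate(torch.cat([h_cam, h_root], dim=-1)))
        h_cam_fb = g * h_cam + (1 - g) * h_root   # corrected hidden states
        h_root_fb = g * h_root + (1 - g) * h_cam
        return h_cam_fb, h_root_fb
```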
The authors rigorously evaluate their method against state-of-the-art mocap techniques, demonstrating superior accuracy in both global orientation and local pose estimation across diverse datasets. These datasets include the challenging AIST++ (featuring highly dynamic motions) and TotalCapture (which includes out-of-view scenarios), thereby highlighting the practicality of this integrated approach in real-world applications.
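For context, accuracy in such evaluations is commonly reported with a mean per-joint position error for local pose and a geodesic angle error for the global (root) orientation. The snippet below computes both under assumed array shapes; it reflects standard mocap metrics, not necessarily the paper's exact evaluation protocol.

```python
import torch

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over frames and joints.
    Both tensors have shape (T, J, 3)."""
    return (pred_joints - gt_joints).norm(dim=-1).mean()

def global_angle_error_deg(pred_R, gt_R):
    """Mean geodesic angle (degrees) between predicted and ground-truth root rotations (T, 3, 3)."""
    rel = pred_R.transpose(1, 2) @ gt_R
    trace = rel.diagonal(dim1=1, dim2=2).sum(-1)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()
```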
The findings suggest significant implications for the development of real-time mocap systems in consumer electronics, such as smartphones or AR glasses, which increasingly incorporate both camera and IMU sensors. Additionally, the fusion method could influence advancements in virtual reality, where precise, low-latency mocap enhances immersion and interactivity.
Future research directions may explore the fusion approach under more diverse conditions, such as long-duration capture where accumulated IMU drift dominates the error. Extending the method to handle variable body shapes could also refine motion estimation accuracy, broadening the system's applicability to a wider range of users.
In summary, this paper contributes a robust solution for real-time human motion capture by intelligently combining complementary data sources. The improvements in accuracy and robustness over existing models establish this fusion technique as a promising approach for next-generation mocap systems.