- The paper introduces Deep Inertial Poser, a deep learning approach using RNNs to reconstruct 3D human pose in real-time from only six sparse IMU sensors.
- Achieves competitive pose accuracy (approximately 15.85° mean joint angle error on TotalCapture) compared to state-of-the-art offline methods like SIP, while operating in real time.
- Enables robust, real-time human pose estimation in diverse environments, paving the way for infrastructure-free motion capture in VR/AR and HCI applications.
Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time
The paper "Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time" introduces a novel approach to human pose estimation using only six Inertial Measurement Units (IMUs). The proposed method leverages a deep neural network architecture designed to predict 3D human body pose in real time, a significant departure from previous offline optimization approaches that rely on dense IMU placement or camera-based systems requiring controlled setups.
Methodology Overview
The authors address the inherent challenge of reconstructing human pose from sparse IMUs by employing a deep learning approach based on recurrent neural networks (RNNs). The key innovation lies in synthesizing training data using the SMPL body model from existing Mocap datasets, allowing the network to learn a mapping from IMU signals to full body poses. A bidirectional RNN architecture incorporating long short-term memory (LSTM) units is utilized to capture temporal dependencies, leveraging both past and future information during training. At test time, the network operates in a sliding window mode to maintain real-time performance.
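The sliding-window inference described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation; the window sizes (`past`, `future`) are illustrative assumptions, and the actual network consumes these windows frame by frame.

```python
import numpy as np

def sliding_window_frames(imu_stream, past=20, future=5):
    """Yield (window, frame_index) pairs for online inference.

    imu_stream: (T, D) array of per-frame IMU features.
    Each window holds `past` preceding frames, the current frame,
    and `future` upcoming frames; sequence edges are padded by
    repeating the first/last frame. The prediction for frame t is
    therefore delayed by `future` frames at runtime.
    """
    T = imu_stream.shape[0]
    padded = np.concatenate([
        np.repeat(imu_stream[:1], past, axis=0),    # pad start
        imu_stream,
        np.repeat(imu_stream[-1:], future, axis=0)  # pad end
    ], axis=0)
    for t in range(T):
        yield padded[t : t + past + 1 + future], t
```

The current frame always sits at index `past` inside each window, so a bidirectional model sees a short look-ahead while keeping latency bounded by `future` frames.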
Dataset and Training
To compensate for the limited availability of real IMU datasets, the authors synthesize IMU data from extensive Mocap datasets (AMASS), generating virtual IMU readings. A supplementary real IMU dataset, DIP-IMU, was recorded to address discrepancies between synthetic and real data. It consists of recordings from 10 subjects performing a variety of motions, structured to broadly cover motion types and reduce the domain gap between synthetic training data and real-world deployment.
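Virtual accelerations of this kind can be synthesized from the trajectory of a virtual sensor position on the posed body mesh via finite differences; the sketch below shows the core idea, though the exact frame spacing and smoothing the authors use may differ.

```python
import numpy as np

def virtual_acceleration(positions, fps=60.0):
    """Approximate sensor acceleration from a virtual sensor's
    3D position trajectory using a central second difference.

    positions: (T, 3) positions of the virtual sensor over time.
    Returns: (T-2, 3) accelerations for the interior frames.
    """
    dt = 1.0 / fps
    return (positions[:-2] - 2.0 * positions[1:-1] + positions[2:]) / dt**2
```

Real IMUs additionally measure gravity and exhibit sensor noise and drift, which a synthetic pipeline must model for the domain gap to stay small.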
The model is trained with a log-likelihood loss on the predicted pose parameters, plus an additional loss term for reconstructing the input IMU accelerations. This auxiliary task encourages the network to propagate acceleration information through its layers rather than discard it, improving pose estimation accuracy.
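A hedged sketch of such a combined objective, assuming a Gaussian log-likelihood over pose parameters and a mean-squared acceleration reconstruction term; the weight `w_acc` and the Gaussian formulation are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def gaussian_nll(mean, log_var, target):
    """Per-dimension Gaussian negative log-likelihood (constants dropped)."""
    return 0.5 * np.mean(log_var + (target - mean) ** 2 / np.exp(log_var))

def dip_style_loss(pose_mean, pose_log_var, pose_gt,
                   acc_recon, acc_input, w_acc=0.1):
    """Pose log-likelihood loss plus an auxiliary acceleration
    reconstruction loss, combined as a weighted sum."""
    pose_term = gaussian_nll(pose_mean, pose_log_var, pose_gt)
    acc_term = np.mean((acc_recon - acc_input) ** 2)  # auxiliary task
    return pose_term + w_acc * acc_term
```

With a perfect pose prediction and unit variance the pose term vanishes, so any remaining loss comes from the acceleration reconstruction, which is what forces the network to keep acceleration information alive internally.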
Results and Implications
The proposed Deep Inertial Poser achieves competitive performance in both offline and online evaluations compared to the state-of-the-art SIP method, while running in real time. Quantitative evaluation on TotalCapture yields a joint angle error of approximately 15.85° with the bidirectional RNN, outperforming SIP, which optimizes poses offline. Fine-tuning on the DIP-IMU dataset further improves the model, particularly for complex and varied motion types.
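Joint angle errors like the one quoted above are typically computed as the geodesic distance between predicted and ground-truth joint rotations, averaged over joints; a minimal sketch of that standard metric (the paper's exact joint set and averaging may differ):

```python
import numpy as np

def joint_angle_error_deg(R_pred, R_gt):
    """Mean geodesic angle in degrees between two sets of joint
    rotation matrices, each of shape (J, 3, 3)."""
    # Relative rotation per joint: R_rel = R_pred^T @ R_gt
    R_rel = np.einsum('jba,jbc->jac', R_pred, R_gt)
    # Rotation angle from the trace: cos(theta) = (tr(R_rel) - 1) / 2
    tr = np.trace(R_rel, axis1=1, axis2=2)
    cos = np.clip((tr - 1.0) / 2.0, -1.0, 1.0)  # guard numeric drift
    return np.degrees(np.mean(np.arccos(cos)))
```

The clipping guards against `arccos` receiving values slightly outside [-1, 1] due to floating-point error in near-perfect predictions.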
Qualitative assessments demonstrate the model's robust reconstruction across different datasets, including unseen real-world scenarios, and highlight its ability to operate in everyday settings without extensive instrumentation.
Future Directions
The research presents several avenues for future development:
- Multi-Person Interaction: Extending the model to capture interactions between multiple individuals and with objects poses an exciting challenge, potentially integrating vision-based or additional sensory inputs.
- Global Position Estimation: Current methods omit global translation, a limitation for some applications. Incorporating GPS data or predicting global translation directly from the IMU signals could address this limitation.
- Acceleration Robustness: Improved modeling of acceleration data and noise could further refine pose prediction accuracy, especially for challenging motions such as leg raises.
Overall, the paper contributes significantly to the field of real-time human pose estimation from sparse sensors, with implications for VR/AR applications and human-computer interaction frameworks. The demonstrated ability to operate in diverse environments positions Deep Inertial Poser as a pivotal step toward infrastructure-free motion capture solutions.