- The paper introduces Deep Inertial Poser, a deep learning approach using RNNs to reconstruct 3D human pose in real-time from only six sparse IMU sensors.
- Achieves competitive pose accuracy (approximately 15.85° mean joint angle error on TotalCapture) compared to state-of-the-art offline methods like SIP, while operating in real time.
- Enables robust, real-time human pose estimation in diverse environments, paving the way for infrastructure-free motion capture in VR/AR and HCI applications.
Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time
The paper "Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time" introduces a novel approach to human pose estimation using only six Inertial Measurement Units (IMUs). The proposed method leverages a deep neural network architecture designed to predict 3D human body pose in real time, a significant departure from previous offline optimization approaches that rely on dense IMU placement or camera-based systems requiring controlled setups.
Methodology Overview
The authors address the inherent challenge of reconstructing human pose from sparse IMUs by employing a deep learning approach based on recurrent neural networks (RNNs). The key innovation lies in synthesizing training data using the SMPL body model from existing Mocap datasets, allowing the network to learn a mapping from IMU signals to full body poses. A bidirectional RNN architecture incorporating long short-term memory (LSTM) units is utilized to capture temporal dependencies, leveraging both past and future information during training. At test time, the network operates in a sliding window mode to maintain real-time performance.
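The sliding-window inference described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation; the window sizes (`past`, `future`) are illustrative assumptions, and the actual network consumes these windows frame by frame.

```python
import numpy as np

def sliding_window_frames(imu_stream, past=20, future=5):
    """Yield (window, frame_index) pairs for online inference.

    imu_stream: (T, D) array of per-frame IMU features.
    Each window holds `past` preceding frames, the current frame,
    and `future` upcoming frames; sequence edges are padded by
    repeating the first/last frame. The prediction for frame t is
    therefore delayed by `future` frames at runtime.
    """
    T = imu_stream.shape[0]
    padded = np.concatenate([
        np.repeat(imu_stream[:1], past, axis=0),    # pad start
        imu_stream,
        np.repeat(imu_stream[-1:], future, axis=0)  # pad end
    ], axis=0)
    for t in range(T):
        yield padded[t : t + past + 1 + future], t
```

The current frame always sits at index `past` inside each window, so a bidirectional model sees a short look-ahead while keeping latency bounded by `future` frames.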
Dataset and Training
To compensate for the limited availability of real IMU datasets, the authors synthesize IMU data from extensive Mocap datasets (AMASS), generating virtual IMU readings. A supplementary real IMU dataset, DIP-IMU, was recorded to address discrepancies between synthetic and real data. It consists of recordings from 10 subjects performing a variety of motions, structured to broadly cover motion types and reduce the domain gap between synthetic training data and real-world deployment.
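Virtual accelerations of this kind can be synthesized from the trajectory of a virtual sensor position on the posed body mesh via finite differences; the sketch below shows the core idea, though the exact frame spacing and smoothing the authors use may differ.

```python
import numpy as np

def virtual_acceleration(positions, fps=60.0):
    """Approximate sensor acceleration from a virtual sensor's
    3D position trajectory using a central second difference.

    positions: (T, 3) positions of the virtual sensor over time.
    Returns: (T-2, 3) accelerations for the interior frames.
    """
    dt = 1.0 / fps
    return (positions[:-2] - 2.0 * positions[1:-1] + positions[2:]) / dt**2
```

Real IMUs additionally measure gravity and exhibit sensor noise and drift, which a synthetic pipeline must model for the domain gap to stay small.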
The model is trained with a log-likelihood loss on the predicted pose parameters, plus an additional loss term for reconstructing the input IMU accelerations. This auxiliary task encourages the network to propagate acceleration information through its layers rather than discard it, improving pose estimation accuracy.
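A hedged sketch of such a combined objective, assuming a Gaussian log-likelihood over pose parameters and a mean-squared acceleration reconstruction term; the weight `w_acc` and the Gaussian formulation are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def gaussian_nll(mean, log_var, target):
    """Per-dimension Gaussian negative log-likelihood (constants dropped)."""
    return 0.5 * np.mean(log_var + (target - mean) ** 2 / np.exp(log_var))

def dip_style_loss(pose_mean, pose_log_var, pose_gt,
                   acc_recon, acc_input, w_acc=0.1):
    """Pose log-likelihood loss plus an auxiliary acceleration
    reconstruction loss, combined as a weighted sum."""
    pose_term = gaussian_nll(pose_mean, pose_log_var, pose_gt)
    acc_term = np.mean((acc_recon - acc_input) ** 2)  # auxiliary task
    return pose_term + w_acc * acc_term
```

With a perfect pose prediction and unit variance the pose term vanishes, so any remaining loss comes from the acceleration reconstruction, which is what forces the network to keep acceleration information alive internally.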
Results and Implications
The proposed Deep Inertial Poser achieves competitive performance in both offline and online evaluations compared to the state-of-the-art SIP method, while running in real time. Quantitative evaluation on TotalCapture yields a joint angle error of approximately 15.85° with the bidirectional RNN, outperforming SIP, which optimizes poses offline. Fine-tuning on the DIP-IMU dataset further improves the model, particularly for complex and varied motion types.
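Joint angle errors like the one quoted above are typically computed as the geodesic distance between predicted and ground-truth joint rotations, averaged over joints; a minimal sketch of that standard metric (the paper's exact joint set and averaging may differ):

```python
import numpy as np

def joint_angle_error_deg(R_pred, R_gt):
    """Mean geodesic angle in degrees between two sets of joint
    rotation matrices, each of shape (J, 3, 3)."""
    # Relative rotation per joint: R_rel = R_pred^T @ R_gt
    R_rel = np.einsum('jba,jbc->jac', R_pred, R_gt)
    # Rotation angle from the trace: cos(theta) = (tr(R_rel) - 1) / 2
    tr = np.trace(R_rel, axis1=1, axis2=2)
    cos = np.clip((tr - 1.0) / 2.0, -1.0, 1.0)  # guard numeric drift
    return np.degrees(np.mean(np.arccos(cos)))
```

The clipping guards against `arccos` receiving values slightly outside [-1, 1] due to floating-point error in near-perfect predictions.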
Qualitative assessments demonstrate the model's robust reconstruction across different datasets, including unseen real-world scenarios, and highlight its ability to operate in everyday settings without extensive instrumentation.
Future Directions
The research presents several avenues for future development:
- Multi-Person Interaction: Extending the model to capture interactions between multiple individuals and with objects poses an exciting challenge, potentially integrating vision-based or additional sensory inputs.
- Global Position Estimation: Current methods omit global translation, a limitation for some applications. Incorporating GPS data or predicting global translation directly from the IMU signals could address this limitation.
- Acceleration Robustness: Improved modeling of acceleration data and noise could further refine pose prediction accuracy, especially for challenging motions such as leg raises.
Overall, the paper contributes significantly to the field of real-time human pose estimation from sparse sensors, with implications for VR/AR applications and human-computer interaction frameworks. The demonstrated ability to operate in diverse environments positions Deep Inertial Poser as a pivotal step toward infrastructure-free motion capture solutions.