- The paper presents a preclinical benchmark comparing four deep learning-based monocular video 3D human pose estimators with IMU-derived joint angles.
- The study uses the VIDIMU dataset (54 subjects, of whom 16 have synchronized video and IMU recordings) performing 13 daily activities, and evaluates the models with RMSE, MAE, NRMSE, Pearson correlation, and R².
- Results highlight MotionAGFormer’s superior accuracy while emphasizing trade-offs in inference speed and transparency for real-world deployment.
Benchmarking Monocular Video-Based 3D Human Pose Estimation Against IMUs for Kinematic Assessment
Introduction
This paper presents a rigorous preclinical benchmark of state-of-the-art monocular video-based 3D human pose estimation (HPE) models against inertial measurement unit (IMU)-derived joint angles for daily living activities. The motivation is rooted in the need for accessible, cost-effective, and accurate kinematic assessment tools for telemedicine, rehabilitation, and sports science, particularly outside controlled laboratory environments. The benchmark leverages the VIDIMU dataset, which provides synchronized video and IMU recordings of healthy adults performing a diverse set of clinically relevant activities. The evaluation focuses on joint angle estimation accuracy, using IMU-based OpenSim inverse kinematics as the reference, and provides actionable insights for deployment in real-world, out-of-the-lab scenarios.
Methods
Dataset and Acquisition Protocol
The VIDIMU dataset comprises 54 healthy adults (16 with both video and IMU data) performing 13 daily activities, spanning both upper- and lower-limb tasks. Video was captured at 30 Hz using commodity cameras, while five custom IMUs sampled at 50 Hz were placed on key body segments. The protocol emphasizes camera placement and subject orientation tailored to the activity, as these factors directly impact HPE accuracy.
Evaluated Models
Four representative monocular 3D HPE models were benchmarked:
- MotionAGFormer: A hybrid Transformer-GCN architecture that fuses global spatio-temporal attention with local joint dependencies via AGFormer blocks. It predicts full pose sequences and is optimized for both accuracy and computational efficiency.
- MotionBERT: Employs a dual-stream spatio-temporal Transformer (DSTFormer) and self-supervised pretraining on corrupted 2D skeletons to learn robust motion priors, with transferability to multiple human-centric tasks.
- MMPose (3-stage 2D-to-3D lifting): A modular pipeline combining RTMDet for detection, RTMPose for 2D keypoint estimation, and VideoPoseLift (TCN) for temporal 2D-to-3D lifting, leveraging domain-specific pretraining (a minimal inference sketch follows this list).
- NVIDIA BodyTrack: A proprietary, closed-source SDK for end-to-end 3D pose estimation from RGB video, included for continuity with the original VIDIMU pipeline and relevance to applied AR/VR scenarios.
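For orientation, the sketch below shows how a 3D pose sequence might be obtained from a recorded activity video with MMPose's high-level inferencer. The `MMPoseInferencer` class and the `'human3d'` preset follow the MMPose 1.x API as we understand it; the video path is a placeholder rather than an actual VIDIMU file, and the exact structure of the returned predictions should be checked against the installed MMPose version.

```python
# Hedged sketch: monocular 3D pose inference with MMPose's high-level API (MMPose 1.x assumed).
# The 'human3d' preset is expected to chain person detection, 2D keypoints, and 2D-to-3D lifting.
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer(pose3d="human3d")

frames_3d = []
for result in inferencer("subject_walk_forward.mp4", show=False):  # placeholder video path
    # Each yielded result carries per-frame person instances with estimated 3D keypoints.
    frames_3d.append(result["predictions"])
```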
Data Processing and Harmonization
All models' outputs were harmonized to the Human3.6M 17-joint format. Joint angles were computed via vector-based methods using the dot product and arccosine, with filtering (median and moving average) and temporal alignment to IMU signals. IMU data were processed through OpenSim inverse kinematics, downsampled, filtered, and synchronized to the video-derived signals.
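As a concrete illustration of this step, the sketch below computes a joint angle from three 3D keypoints with the dot-product/arccosine formulation and applies median plus moving-average smoothing and 50 Hz-to-30 Hz resampling as described above. Function names, filter window sizes, and the Human3.6M joint indices in the usage comment are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.signal import medfilt, resample_poly

def joint_angle_deg(a, b, c):
    """Angle at joint b (degrees) formed by 3D keypoints a-b-c, via dot product and arccosine."""
    v1, v2 = a - b, c - b
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def smooth_angles(angles, median_size=5, window=5):
    """Median filter followed by a moving average (window sizes are illustrative)."""
    med = medfilt(angles, kernel_size=median_size)
    return np.convolve(med, np.ones(window) / window, mode="same")

def downsample_imu(imu_angles, imu_hz=50, video_hz=30):
    """Rational resampling of a 50 Hz IMU angle trace to the 30 Hz video rate."""
    return resample_poly(imu_angles, up=video_hz, down=imu_hz)

# Example: right-knee angle per frame from a (T, 17, 3) pose array in Human3.6M format
# (indices 1/2/3 = right hip/knee/ankle in the common H36M ordering; verify against the joint map used).
# knee = smooth_angles(np.array([joint_angle_deg(p[1], p[2], p[3]) for p in poses]))
```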
Evaluation Metrics
Performance was quantified using:
- RMSE and MAE (degrees): Absolute angular error, directly interpretable for clinical kinematics.
- NRMSE: RMSE normalized by the range of ground truth angles, enabling cross-joint/activity comparison.
- Pearson correlation: Temporal agreement between predicted and reference angle trajectories.
- R²: Proportion of variance in IMU signals explained by the model.
Metrics were computed per subject, per activity, and aggregated across the dataset.
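A minimal sketch of these agreement metrics for a single predicted versus IMU-reference angle trajectory is shown below; it assumes both signals are already temporally aligned, sampled at the same rate, and expressed in degrees.

```python
import numpy as np

def agreement_metrics(pred, ref):
    """RMSE, MAE, NRMSE, Pearson r, and R^2 between a predicted and an IMU-reference angle trace."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    err = pred - ref
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nrmse = rmse / np.ptp(ref)                                      # normalized by reference angle range
    r = np.corrcoef(pred, ref)[0, 1]                                # temporal agreement (Pearson)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((ref - ref.mean()) ** 2)   # variance in the IMU signal explained
    return {"RMSE": rmse, "MAE": mae, "NRMSE": nrmse, "r": r, "R2": r2}
```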
Results
| Model | RMSE (°) | MAE (°) | NRMSE | Correlation | R² |
|---|---|---|---|---|---|
| MotionAGFormer | 9.27 ± 4.80 | 7.86 ± 4.18 | 0.14 ± 0.06 | 0.86 ± 0.15 | 0.67 ± 0.28 |
| MMPose | 11.04 ± 4.17 | 9.35 ± 3.61 | 0.17 ± 0.05 | 0.84 ± 0.10 | 0.58 ± 0.26 |
| BodyTrack | 10.89 ± 3.67 | 9.00 ± 3.12 | 0.17 ± 0.05 | 0.78 ± 0.12 | 0.44 ± 0.31 |
| MotionBERT | 12.28 ± 4.59 | 10.15 ± 3.86 | 0.20 ± 0.06 | 0.79 ± 0.11 | 0.16 ± 0.50 |
MotionAGFormer consistently outperformed the other models across all metrics, with the lowest RMSE and MAE and the highest correlation and R². MotionBERT exhibited the highest errors and lowest explained variance, while MMPose and BodyTrack provided intermediate results.
Activity-Specific Analysis
- Lower-limb activities (e.g., walking, sit-to-stand): All models achieved comparatively low errors, with MMPose and BodyTrack leading in specific tasks (e.g., walk_forward RMSE: 6.54° for MMPose; walk_backward RMSE: 4.72° for BodyTrack).
- Upper-limb and complex bimanual activities: Performance differences were more pronounced. MotionAGFormer and MMPose maintained moderate errors and robust correlations, while MotionBERT's performance deteriorated, especially in tasks requiring fine-grained kinematics or under occlusion.
- Temporal agreement: Correlation coefficients were generally high for lower-limb tasks and decreased for upper-limb or occluded activities, reflecting the increased challenge of these scenarios.
Trade-offs and Deployment Considerations
- MotionAGFormer: Highest accuracy, but with increased inference time, making it less suitable for real-time applications without further optimization.
- MotionBERT: Fastest inference, but less robust for fine-grained or dynamic tasks; pretraining on corrupted skeletons improves generalization to noise but not to all activity types.
- MMPose: Modular and efficient, with strong performance in lower-limb/gait tasks; performance degrades for upper-limb or prolonged, non-repetitive motions due to limited temporal windowing.
- BodyTrack: Balanced performance and real-time capability, but closed-source nature limits transparency, adaptability, and academic scrutiny.
Discussion
The benchmark demonstrates that monocular video-based 3D HPE models are viable alternatives to IMUs for out-of-the-lab kinematic assessment, with MotionAGFormer providing the best overall accuracy. MotionAGFormer's hybrid Transformer-GCN architecture, which integrates global and local spatio-temporal dependencies, is particularly effective for complex, dynamic activities. However, the increased computational cost and inference time must be considered for real-time or resource-constrained deployments.
The results also highlight that no single model is universally optimal. Model selection should be guided by the specific clinical or application context, considering the trade-offs between accuracy, inference speed, hardware requirements, and ease of integration. For example, MMPose is preferable for gait analysis in telehealth, while MotionAGFormer is better suited for high-precision, complex movement analysis in sports or advanced rehabilitation.
The paper's limitations include the use of only healthy adults, a limited number of activities, and the absence of optical motion capture as a gold standard. IMU calibration and placement errors, as well as the closed-source nature of BodyTrack, introduce additional sources of bias and limit reproducibility.
Implications and Future Directions
The findings support the adoption of video-based 3D HPE for scalable, non-intrusive kinematic assessment in telemedicine and remote monitoring. Future research should focus on:
- Hybrid video-IMU systems: Combining modalities to leverage complementary strengths and improve robustness in real-world conditions.
- Domain adaptation and fine-tuning: Tailoring models to clinical populations and diverse environments via supervised and self-supervised strategies.
- Dataset expansion: Including pathological cohorts and challenging scenarios (lighting, occlusion, multi-person) to enhance generalizability.
- Clinical outcome integration: Mapping objective kinematic features to established clinical scales for actionable decision support.
Parameter-efficient adaptation, cross-modal distillation, and realistic data augmentation are promising avenues for improving model robustness and deployment on edge devices.
Conclusion
This benchmark establishes a transparent, reproducible pipeline for evaluating monocular video-based 3D HPE models against IMU-derived joint angles in daily living activities. MotionAGFormer achieves the best overall accuracy, but model selection should be context-dependent, balancing precision, speed, and operational constraints. The paper provides a foundation for future clinical validation and the development of robust, accessible kinematic assessment tools for telehealth and beyond.