- The paper presents a preclinical benchmark comparing four deep learning-based monocular video 3D human pose estimators with IMU-derived joint angles.
- The study uses the VIDIMU dataset (54 subjects, of whom 16 have synchronized video and IMU recordings) performing 13 daily activities, and evaluates the models with RMSE, MAE, NRMSE, Pearson correlation, and R².
- Results highlight MotionAGFormer’s superior accuracy while emphasizing trade-offs in inference speed and transparency for real-world deployment.
Benchmarking Monocular Video-Based 3D Human Pose Estimation Against IMUs for Kinematic Assessment
Introduction
This paper presents a rigorous preclinical benchmark of state-of-the-art monocular video-based 3D human pose estimation (HPE) models against inertial measurement unit (IMU)-derived joint angles for daily living activities. The motivation is rooted in the need for accessible, cost-effective, and accurate kinematic assessment tools for telemedicine, rehabilitation, and sports science, particularly outside controlled laboratory environments. The benchmark leverages the VIDIMU dataset, which provides synchronized video and IMU recordings of healthy adults performing a diverse set of clinically relevant activities. The evaluation focuses on joint angle estimation accuracy, using IMU-based OpenSim inverse kinematics as the reference, and provides actionable insights for deployment in real-world, out-of-the-lab scenarios.
Methods
Dataset and Acquisition Protocol
The VIDIMU dataset comprises 54 healthy adults (16 with both video and IMU data) performing 13 daily activities, spanning both upper- and lower-limb tasks. Video was captured at 30 Hz using commodity cameras, while five custom IMUs sampled at 50 Hz were placed on key body segments. The protocol emphasizes camera placement and subject orientation tailored to the activity, as these factors directly impact HPE accuracy.
Evaluated Models
Four representative monocular 3D HPE models were benchmarked:
- MotionAGFormer: A hybrid Transformer-GCN architecture that fuses global spatio-temporal attention with local joint dependencies via AGFormer blocks. It predicts full pose sequences and is optimized for both accuracy and computational efficiency.
- MotionBERT: Employs a dual-stream spatio-temporal Transformer (DSTFormer) and self-supervised pretraining on corrupted 2D skeletons to learn robust motion priors, with transferability to multiple human-centric tasks.
- MMPose (3-stage 2D-to-3D lifting): A modular pipeline combining RTMDet for detection, RTMPose for 2D keypoint estimation, and VideoPoseLift (TCN) for temporal 2D-to-3D lifting, leveraging domain-specific pretraining (a minimal inference sketch follows this list).
- NVIDIA BodyTrack: A proprietary, closed-source SDK for end-to-end 3D pose estimation from RGB video, included for continuity with the original VIDIMU pipeline and relevance to applied AR/VR scenarios.
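For orientation, the sketch below shows how a 3D pose sequence might be obtained from a recorded activity video with MMPose's high-level inferencer. The `MMPoseInferencer` class and the `'human3d'` preset follow the MMPose 1.x API as we understand it; the video path is a placeholder rather than an actual VIDIMU file, and the exact structure of the returned predictions should be checked against the installed MMPose version.

```python
# Hedged sketch: monocular 3D pose inference with MMPose's high-level API (MMPose 1.x assumed).
# The 'human3d' preset is expected to chain person detection, 2D keypoints, and 2D-to-3D lifting.
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer(pose3d="human3d")

frames_3d = []
for result in inferencer("subject_walk_forward.mp4", show=False):  # placeholder video path
    # Each yielded result carries per-frame person instances with estimated 3D keypoints.
    frames_3d.append(result["predictions"])
```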
Data Processing and Harmonization
All models' outputs were harmonized to the Human3.6M 17-joint format. Joint angles were computed via vector-based methods using the dot product and arccosine, with filtering (median and moving average) and temporal alignment to IMU signals. IMU data were processed through OpenSim inverse kinematics, downsampled, filtered, and synchronized to the video-derived signals.
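As a concrete illustration of this step, the sketch below computes a joint angle from three 3D keypoints with the dot-product/arccosine formulation and applies median plus moving-average smoothing and 50 Hz-to-30 Hz resampling as described above. Function names, filter window sizes, and the Human3.6M joint indices in the usage comment are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.signal import medfilt, resample_poly

def joint_angle_deg(a, b, c):
    """Angle at joint b (degrees) formed by 3D keypoints a-b-c, via dot product and arccosine."""
    v1, v2 = a - b, c - b
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def smooth_angles(angles, median_size=5, window=5):
    """Median filter followed by a moving average (window sizes are illustrative)."""
    med = medfilt(angles, kernel_size=median_size)
    return np.convolve(med, np.ones(window) / window, mode="same")

def downsample_imu(imu_angles, imu_hz=50, video_hz=30):
    """Rational resampling of a 50 Hz IMU angle trace to the 30 Hz video rate."""
    return resample_poly(imu_angles, up=video_hz, down=imu_hz)

# Example: right-knee angle per frame from a (T, 17, 3) pose array in Human3.6M format
# (indices 1/2/3 = right hip/knee/ankle in the common H36M ordering; verify against the joint map used).
# knee = smooth_angles(np.array([joint_angle_deg(p[1], p[2], p[3]) for p in poses]))
```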
Evaluation Metrics
Performance was quantified using:
- RMSE and MAE (degrees): Absolute angular error, directly interpretable for clinical kinematics.
- NRMSE: RMSE normalized by the range of ground truth angles, enabling cross-joint/activity comparison.
- Pearson correlation: Temporal agreement between predicted and reference angle trajectories.
- R²: Proportion of variance in IMU signals explained by the model.
Metrics were computed per subject, per activity, and aggregated across the dataset.
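A minimal sketch of these agreement metrics for a single predicted versus IMU-reference angle trajectory is shown below; it assumes both signals are already temporally aligned, sampled at the same rate, and expressed in degrees.

```python
import numpy as np

def agreement_metrics(pred, ref):
    """RMSE, MAE, NRMSE, Pearson r, and R^2 between a predicted and an IMU-reference angle trace."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    err = pred - ref
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nrmse = rmse / np.ptp(ref)                                      # normalized by reference angle range
    r = np.corrcoef(pred, ref)[0, 1]                                # temporal agreement (Pearson)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((ref - ref.mean()) ** 2)   # variance in the IMU signal explained
    return {"RMSE": rmse, "MAE": mae, "NRMSE": nrmse, "r": r, "R2": r2}
```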
Results
| Model | RMSE (°) | MAE (°) | NRMSE | Correlation | R² |
|---|---|---|---|---|---|
| MotionAGFormer | 9.27 ± 4.80 | 7.86 ± 4.18 | 0.14 ± 0.06 | 0.86 ± 0.15 | 0.67 ± 0.28 |
| MMPose | 11.04 ± 4.17 | 9.35 ± 3.61 | 0.17 ± 0.05 | 0.84 ± 0.10 | 0.58 ± 0.26 |
| BodyTrack | 10.89 ± 3.67 | 9.00 ± 3.12 | 0.17 ± 0.05 | 0.78 ± 0.12 | 0.44 ± 0.31 |
| MotionBERT | 12.28 ± 4.59 | 10.15 ± 3.86 | 0.20 ± 0.06 | 0.79 ± 0.11 | 0.16 ± 0.50 |
MotionAGFormer consistently outperformed the other models across all metrics, with the lowest RMSE and MAE and the highest correlation and R². MotionBERT exhibited the highest errors and lowest explained variance, while MMPose and BodyTrack provided intermediate results.
Activity-Specific Analysis
- Lower-limb activities (e.g., walking, sit-to-stand): All models achieved comparatively low errors, with MMPose and BodyTrack leading in specific tasks (e.g., walk_forward RMSE: 6.54° for MMPose; walk_backward RMSE: 4.72° for BodyTrack).
- Upper-limb and complex bimanual activities: Performance differences were more pronounced. MotionAGFormer and MMPose maintained moderate errors and robust correlations, while MotionBERT's performance deteriorated, especially in tasks requiring fine-grained kinematics or under occlusion.
- Temporal agreement: Correlation coefficients were generally high for lower-limb tasks and decreased for upper-limb or occluded activities, reflecting the increased challenge of these scenarios.
Trade-offs and Deployment Considerations
- MotionAGFormer: Highest accuracy, but with increased inference time, making it less suitable for real-time applications without further optimization.
- MotionBERT: Fastest inference, but less robust for fine-grained or dynamic tasks; pretraining on corrupted skeletons improves generalization to noise but not to all activity types.
- MMPose: Modular and efficient, with strong performance in lower-limb/gait tasks; performance degrades for upper-limb or prolonged, non-repetitive motions due to limited temporal windowing.
- BodyTrack: Balanced performance and real-time capability, but closed-source nature limits transparency, adaptability, and academic scrutiny.
Discussion
The benchmark demonstrates that monocular video-based 3D HPE models are viable alternatives to IMUs for out-of-the-lab kinematic assessment, with MotionAGFormer providing the best overall accuracy. MotionAGFormer's hybrid Transformer-GCN architecture, which integrates global and local spatio-temporal dependencies, is particularly effective for complex, dynamic activities. However, the increased computational cost and inference time must be considered for real-time or resource-constrained deployments.
The results also highlight that no single model is universally optimal. Model selection should be guided by the specific clinical or application context, considering the trade-offs between accuracy, inference speed, hardware requirements, and ease of integration. For example, MMPose is preferable for gait analysis in telehealth, while MotionAGFormer is better suited for high-precision, complex movement analysis in sports or advanced rehabilitation.
The paper's limitations include the use of only healthy adults, a limited number of activities, and the absence of optical motion capture as a gold standard. IMU calibration and placement errors, as well as the closed-source nature of BodyTrack, introduce additional sources of bias and limit reproducibility.
Implications and Future Directions
The findings support the adoption of video-based 3D HPE for scalable, non-intrusive kinematic assessment in telemedicine and remote monitoring. Future research should focus on:
- Hybrid video-IMU systems: Combining modalities to leverage complementary strengths and improve robustness in real-world conditions.
- Domain adaptation and fine-tuning: Tailoring models to clinical populations and diverse environments via supervised and self-supervised strategies.
- Dataset expansion: Including pathological cohorts and challenging scenarios (lighting, occlusion, multi-person) to enhance generalizability.
- Clinical outcome integration: Mapping objective kinematic features to established clinical scales for actionable decision support.
Parameter-efficient adaptation, cross-modal distillation, and realistic data augmentation are promising avenues for improving model robustness and deployment on edge devices.
Conclusion
This benchmark establishes a transparent, reproducible pipeline for evaluating monocular video-based 3D HPE models against IMU-derived joint angles in daily living activities. MotionAGFormer achieves the best overall accuracy, but model selection should be context-dependent, balancing precision, speed, and operational constraints. The paper provides a foundation for future clinical validation and the development of robust, accessible kinematic assessment tools for telehealth and beyond.