VIDIMU: Multimodal Kinematic Benchmark
- VIDIMU is a multimodal kinematic benchmark containing synchronized video (30 Hz) and custom IMU (50 Hz) data for 13 daily activities.
- The dataset supports clinically relevant movement analysis and telerehabilitation by providing validated joint angles and raw motion signals.
- VIDIMU enables benchmarking of human activity recognition and 3D pose estimation, offering quantifiable metrics such as RMSE and Pearson correlation.
The VIDIMU dataset is a multimodal kinematic benchmark designed for the analysis and recognition of daily life activities in out-of-the-lab settings. Composed of synchronized video and inertial sensor recordings, VIDIMU is specifically curated to enable clinically relevant movement assessment for telerehabilitation and biomechanics research. The resource comprises annotated video at 30 Hz and custom IMU data at 50 Hz (downsampled as needed), with validated protocols emphasizing minimal disturbance to natural movement. VIDIMU provides reference-quality joint angles and raw motion signals for 13 activities performed by 54 healthy adults, and supports evaluation and development of human activity recognition algorithms, 3D pose estimation frameworks, and inverse kinematics methods.
1. Dataset Composition and Structure
The VIDIMU dataset consists of annotated recordings of 13 daily life activities, selected for clinical relevance and coverage of both lower- and upper-limb movements. Lower-limb tasks include “walk forward”, “walk backward”, “walk along a line”, and “sit-to-stand”, while upper-limb and bimanual tasks include “move a bottle from side to side”, “drink from a bottle”, “assemble/disassemble a LEGO tower”, “throw up and catch a ball”, “reach up for a bottle”, and “tear paper, make a ball, and throw it”.
Recordings were obtained from 54 healthy subjects (age 25.0 ± 5.4 years; 36 males, 18 females; 46 right-handed, 8 left-handed), with all participants captured using a commodity camera (Microsoft LifeCam Studio, 640×480px, 30 Hz). A subgroup of 16 subjects performed activities with simultaneous IMU recordings, utilizing five custom sensors (quaternion output at 50 Hz) positioned on relevant anatomical segments. The dataset is structured to facilitate comparison between monocular video-based joint tracking and wearable IMU-based kinematic reconstruction (Martínez-Zarzuela et al., 2023, Medrano-Paredes et al., 2 Oct 2025).
| Modality | Devices/Tools | Temporal Resolution | Subjects (N) | Activities (N) |
|---|---|---|---|---|
| Video | Microsoft LifeCam Studio, BodyTrack | 30 Hz | 54 | 13 |
| IMU | Custom sensors, OpenSim IK | 50 Hz | 16 | 13 |
2. Recording Protocols and Acquisition Methods
Data acquisition involved a clinically motivated procedure to ensure high-quality ground truth with minimal interference. Each session began with a calibrated neutral “N-pose” to initialize sensor orientation and facilitate IMU-to-body segment registration. IMUs were placed on upper or lower limbs (and the trunk for calibration) according to the activity performed.
Cameras captured RGB video at 30 Hz, while each IMU recorded quaternions (w, x, y, z) at 50 Hz. The dual-modal protocol enables synchronous analysis of gross motor function and precise kinematics, capturing movement dynamics in real-life settings without reliance on expensive laboratory infrastructure. The entire acquisition sought to minimize invasiveness and subject discomfort, supporting telehealth and home-based assessment (Martínez-Zarzuela et al., 2023).
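To make the N-pose registration step concrete, the sketch below (illustrative code, not tooling shipped with the dataset) re-expresses each recorded quaternion relative to the initial calibration pose using SciPy's `Rotation` class; the (w, x, y, z) ordering follows the dataset description above.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def calibrate_to_npose(quats_wxyz: np.ndarray) -> np.ndarray:
    """Re-express IMU orientations relative to the initial N-pose.

    quats_wxyz: (T, 4) array of quaternions in (w, x, y, z) order,
    as recorded by the custom IMUs at 50 Hz.
    Returns relative rotations as (T, 4) quaternions, (w, x, y, z).
    """
    # SciPy expects scalar-last (x, y, z, w); reorder before converting.
    rots = R.from_quat(quats_wxyz[:, [1, 2, 3, 0]])
    npose = rots[0]                # first sample taken as the calibrated N-pose
    rel = npose.inv() * rots       # orientation with respect to the N-pose
    xyzw = rel.as_quat()
    return xyzw[:, [3, 0, 1, 2]]   # back to (w, x, y, z)
```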
3. Data Processing Pipelines
The dataset leverages state-of-the-art tools for extracting clinically relevant joint angles from both video and inertial modalities. Video streams are processed with NVIDIA BodyTrack (Maxine-AR-SDK) to extract 3D joint positions. Joint angles are computed through vector analysis:

$$\theta = \arccos\left(\frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}\right)$$

where $\mathbf{u}$ and $\mathbf{v}$ denote adjacent limb segment vectors.
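A minimal NumPy implementation of this formula, assuming the 3D keypoints are available as arrays (the function name is illustrative):

```python
import numpy as np

def joint_angle(proximal: np.ndarray, joint: np.ndarray, distal: np.ndarray) -> float:
    """Angle (degrees) at `joint` between adjacent limb segments.

    Each argument is a 3D position; e.g., passing shoulder, elbow, and
    wrist positions yields the elbow flexion angle.
    """
    u = proximal - joint   # first segment vector
    v = distal - joint     # second segment vector
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```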
IMU streams are converted from raw quaternions using a full-body musculoskeletal model (adapted from Rajagopal et al.) in OpenSim, with an “IMU placer” tool for alignment and inverse kinematics (IK) reconstruction. Weighting schemes reduce drift effects, especially for peripheral sensors.
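The OpenSim side of this pipeline can be scripted through its Python bindings; the sketch below follows the generic OpenSense workflow (the setup-file names are placeholders, and the exact model and weighting configuration used for VIDIMU are defined in the dataset's own tooling):

```python
import opensim as osim

# Register IMUs to body segments using the recorded N-pose
# (setup XML file names are placeholders).
placer = osim.IMUPlacer("imu_placer_setup.xml")
placer.run()
model = placer.getCalibratedModel()

# Reconstruct joint angles from orientation tracks via inverse kinematics.
ik = osim.IMUInverseKinematicsTool("imu_ik_setup.xml")
ik.run()
```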
To enable multimodal fusion, signals are temporally resampled (IMU 50 Hz → 30 Hz), smoothed (moving average, window=5), and globally synchronized using an RMSE-minimizing shift based on initial samples (typically 180, ≈6s). This ensures high-fidelity comparison and integration of video- and IMU-derived angle trajectories (Martínez-Zarzuela et al., 2023, Medrano-Paredes et al., 2 Oct 2025).
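A sketch of this alignment logic under the stated parameters (the function names and the ±shift search range are assumptions):

```python
import numpy as np

def moving_average(x: np.ndarray, window: int = 5) -> np.ndarray:
    """Centered moving average, as applied before synchronization."""
    return np.convolve(x, np.ones(window) / window, mode="same")

def resample_to_30hz(signal_50hz: np.ndarray) -> np.ndarray:
    """Linearly resample a 50 Hz IMU angle track onto the 30 Hz video grid."""
    t_src = np.arange(len(signal_50hz)) / 50.0
    t_dst = np.arange(int(len(signal_50hz) * 30 / 50)) / 30.0
    return np.interp(t_dst, t_src, signal_50hz)

def best_shift(video: np.ndarray, imu: np.ndarray,
               n_init: int = 180, max_shift: int = 60) -> int:
    """Global shift (in samples) minimizing RMSE over the first n_init samples."""
    best, best_rmse = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        a = video[max(0, s):max(0, s) + n_init]
        b = imu[max(0, -s):max(0, -s) + n_init]
        n = min(len(a), len(b))
        rmse = np.sqrt(np.mean((a[:n] - b[:n]) ** 2))
        if rmse < best_rmse:
            best, best_rmse = s, rmse
    return best
```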
4. Validation, Quality Control, and Benchmarking
Dataset validation incorporated visual and algorithmic checks. For video, joint position outputs and corresponding angle signals were scrutinized for anatomical plausibility; anomalous flat-line signals indicated tracking failures. For IMU data, quaternion curves and IK outputs were checked for consistency within and across activities.
Expert qualitative assessment compared OpenSim reconstructions and video records for congruence. Integration with benchmark studies, notably (Medrano-Paredes et al., 2 Oct 2025), provides quantifiable performance metrics for downstream models:
- RMSE (degrees): Measures overall error magnitude; MotionAGFormer achieved the lowest values in the benchmark.
- MAE (degrees): Average absolute error; likewise lowest for MotionAGFormer.
- Pearson correlation: Quantifies temporal alignment between estimated and reference trajectories.
- $R^2$ (coefficient of determination): Proportion of variance in the reference signal explained by the estimate.
Such validations confirm the dataset’s suitability for pose estimation, telehealth assessment, and biomechanics (Martínez-Zarzuela et al., 2023, Medrano-Paredes et al., 2 Oct 2025).
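For reference, the four metrics can be computed per joint-angle trajectory as follows; this is a standard formulation, not the benchmark's exact code:

```python
import numpy as np
from scipy.stats import pearsonr

def angle_metrics(ref: np.ndarray, est: np.ndarray) -> dict:
    """RMSE, MAE (degrees), Pearson r, and R^2 between a reference
    (IMU/OpenSim) and an estimated (video) joint-angle trajectory."""
    err = est - ref
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "pearson_r": float(pearsonr(ref, est)[0]),
        "r2": float(1.0 - ss_res / ss_tot),
    }
```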
5. Analysis Methods and Benchmarking Models
VIDIMU has been used as a reference for benchmarking deep learning-based video pose estimators against IMU-based ground truth. Key models assessed include MotionAGFormer, MotionBERT, MMPose’s three-stage 2D-to-3D lifting pipeline, and NVIDIA BodyTrack. All models produced 3D joint coordinates harmonized to the Human3.6M keypoint set (17 joints), with joint angles computed via the dot-product formula above.
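For orientation, one common Human3.6M 17-joint indexing is shown below; the exact index assignment is an assumption, as conventions vary across pipelines:

```python
# One widely used Human3.6M 17-joint ordering (an assumed convention;
# specific pipelines may differ in index assignment):
H36M_17 = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "neck", "head",
    "l_shoulder", "l_elbow", "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]

# Example: right-knee angle from a (17, 3) pose array, reusing joint_angle()
# from Section 3: joint_angle(pose[1], pose[2], pose[3])  # hip, knee, ankle
```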
Signal denoising involved median and moving-average filtering. Outputs were normalized and synchronized for pointwise comparison with IMU/OpenSim angles (Medrano-Paredes et al., 2 Oct 2025).
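A minimal denoising sketch in this spirit (kernel sizes are illustrative):

```python
import numpy as np
from scipy.signal import medfilt

def denoise(angles: np.ndarray, median_k: int = 5, ma_window: int = 5) -> np.ndarray:
    """Median filter to suppress tracking spikes, then a moving average
    to smooth residual jitter."""
    x = medfilt(angles, kernel_size=median_k)
    return np.convolve(x, np.ones(ma_window) / ma_window, mode="same")
```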
Performance metrics highlighted distinct strengths: MotionAGFormer outperformed peers (lowest RMSE and MAE, highest correlation and $R^2$), whereas MotionBERT lagged in both error magnitude and variance explanation (higher RMSE and MAE, lower $R^2$). Results demonstrate that video-based frameworks are approaching clinical viability in healthy adults, but gaps remain in precision and robustness, particularly for rapid or occluded movements.
6. Use Cases and Applications
Derived from real-world clinical and daily activities, VIDIMU is targeted at:
- Telerehabilitation: Facilitating affordable patient monitoring and adaptive exercise supervision outside laboratory environments.
- Human Activity Recognition: Supporting the development and validation of algorithms for classifying motor tasks via multimodal input signals.
- Biomechanics and Movement Forecasting: Enabling research in kinematics, musculoskeletal modeling, and predictive assessment of joint angles and ranges.
A plausible implication is the dataset’s utility for developing scalable, cost-effective assessment systems—an essential requirement in telehealth and remote patient care. Its multimodal design permits algorithmic fusion, enhancing robustness and specificity beyond single-modality datasets (Martínez-Zarzuela et al., 2023, Medrano-Paredes et al., 2 Oct 2025).
7. Limitations and Prospective Directions
All recordings in VIDIMU come from healthy subjects; as emphasized in (Medrano-Paredes et al., 2 Oct 2025), generalization to pathological cohorts is not established. A plausible implication is the need to extend dataset coverage to clinical populations with movement disorders or impairments.
Comparative results between video and IMU approaches reveal key trade-offs (“costs, accessibility, and precision”): video methods leverage ubiquitous technology but are sensitive to occlusions and motion blur, while IMU methods demand careful sensor placement and may experience drift.
Future directions may involve expanding activity repertoires, refining integration protocols, incorporating additional sensing modalities, and enabling real-time feedback in telemedicine deployments. The open-access nature and use of standard benchmarking metrics position VIDIMU as a key resource for further methodological innovation in movement science (Martínez-Zarzuela et al., 2023, Medrano-Paredes et al., 2 Oct 2025).