TNT15: Benchmark for Sparse IMU Pose Estimation

Updated 3 February 2026

The TNT15 dataset is a benchmark for evaluating human pose estimation from sparse inertial measurements with SMPL-based calibration.
On this benchmark, the Sparse Inertial Poser (SIP), which combines six body-mounted IMUs with joint optimization, reduces orientation error by ~32% and position error by ~46% relative to the orientation-only SOP baseline.
The dataset provides controlled multi-IMU recordings with explicit protocols for error quantification, enabling standardized assessment of pose reconstruction pipelines.

The TNT15 dataset is a benchmark for evaluating human pose estimation algorithms using sparse inertial measurement units (IMUs). It was employed in the context of the Sparse Inertial Poser (SIP) work to advance the practical use of human motion capture with only six body-mounted IMUs and without any external video. TNT15 provides carefully controlled multi-IMU recordings, ground-truth reference poses, and explicit protocols for error quantification and benchmarking, focusing on unconstrained, naturalistic actions (Marcard et al., 2017).

1. Dataset Composition and Acquisition

TNT15 comprises data from four healthy adult subjects, each performing five distinct actions; examples include jumping-jacks, skiing stance, boxing (warm-up), walking/running, and free boxing. Each motion sequence lasts approximately 20–40 seconds, yielding a total of 20 sequences (4 subjects × 5 actions). The IMU hardware consists of Xsens MTw units, each incorporating a 3-axis accelerometer, gyroscope, and magnetometer, which record raw sensor streams at 60 Hz. Each IMU's onboard Kalman filter supplies a per-sensor orientation $R^{GS}(t)\in SO(3)$ relative to a global inertial frame $F_I$, and specific-force measurements $a^{S}(t)\in\mathbb{R}^3$ in the local sensor frame $F_S$. The conversion to global acceleration is $a^G(t) = R^{GS}(t)\, a^S(t) - g$, with $g$ the gravity vector.
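The conversion to global acceleration can be sketched in NumPy. This is a minimal illustration rather than the dataset's own loader: the function name `global_acceleration`, the array shapes, and the z-up frame with `G = (0, 0, 9.81)` are assumptions, chosen so that a resting sensor maps to zero acceleration under the formula as written. Axis and sign conventions should be checked against the actual recordings.

```python
import numpy as np

# Assumption: z-up inertial frame. G is chosen so that a resting sensor, which
# reads about +9.81 m/s^2 along the global up axis, yields zero acceleration.
G = np.array([0.0, 0.0, 9.81])

def global_acceleration(R_gs, a_s, g=G):
    """R_gs: (T, 3, 3) sensor-to-global rotations; a_s: (T, 3) sensor-frame readings."""
    # Rotate every sensor-frame sample into the global frame, then subtract g.
    return np.einsum("tij,tj->ti", R_gs, a_s) - g

# Resting sensor whose axes coincide with the global frame.
R_rest = np.tile(np.eye(3), (4, 1, 1))
a_rest = np.tile(G, (4, 1))
a_global = global_acceleration(R_rest, a_rest)

# Resting sensor rotated 90 degrees about x: its y axis now points up.
R_x90 = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
a_tilted = global_acceleration(R_x90[None], np.array([[0.0, 9.81, 0.0]]))
```

Both test cases describe a sensor at rest, so the recovered global acceleration should vanish regardless of how the sensor is oriented.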

Ground-truth joint poses are synthesized from all ten IMUs using the full Xsens commercial pipeline. No optical system is used; instead, virtual markers (hips, knees, ankles, shoulders, elbows, wrists, neck) are instantiated on each subject's statistically parameterized SMPL model. Subject-specific shape parameters $\beta$ are either fitted from 3D laser scans or, where scans are unavailable, estimated with the “bodies from words” approach. A unique SMPL mesh vertex is manually chosen per sensor location and kept consistent across all participants. Constant sensor-to-bone offsets are solved from the upright T-pose $P_0$ as $G_{BS}=G_{BG}(0)\cdot G_{GS}(0)$.
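The offset computation can be illustrated with 4×4 homogeneous transforms. The sketch below assumes that $G_{BG}(0)$ denotes the inverse of the bone's pose in the tracking frame at the T-pose and that $G_{GS}(0)$ is the sensor's pose in the same frame; the helper names and the example numbers are hypothetical.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 rigid transform from a 3x3 rotation and a translation."""
    G = np.eye(4)
    G[:3, :3] = R
    G[:3, 3] = t
    return G

def invert_transform(G):
    """Closed-form inverse of a rigid transform."""
    R, t = G[:3, :3], G[:3, 3]
    return make_transform(R.T, -R.T @ t)

def sensor_to_bone_offset(G_gb0, G_gs0):
    """G_BS = G_BG(0) @ G_GS(0), where G_BG(0) is the inverse of the bone pose G_GB(0)."""
    return invert_transform(G_gb0) @ G_gs0

# Hypothetical T-pose values: bone at (0, 1, 0); sensor 5 cm away, rotated 90 deg about z.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
G_gb0 = make_transform(np.eye(3), [0.0, 1.0, 0.0])
G_gs0 = make_transform(Rz90, [0.05, 1.0, 0.0])
G_bs = sensor_to_bone_offset(G_gb0, G_gs0)
```

Once `G_bs` is fixed, the sensor pose implied by any later bone pose follows as `G_gb_t @ G_bs`, which is what makes a constant offset useful during tracking.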

2. Sensor Configuration and Data Format

Each TNT15 sequence contains ten IMU streams: six “tracking” sensors (lower legs, lower arms, waist, chest) are input to algorithms, while four “validation” sensors (thighs, upper arms) remain unused for prediction but provide independent error measurements. Each sequence includes:

Per-sensor orientation streams $R^{GS}_n(t)$ for $n=1\dots 10$
Per-sensor accelerations $a^S_n(t)$
SMPL parameters $\beta$ and joint limits

Coordinate frames comprise $F_S$ (sensor), $F_I$ (inertial), and $F_G$ (tracking/SMPL), with a known global tracking-inertial transform $G_{GI}$. The sensor-to-mesh mapping and calibration are consistent across all records.
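One way to represent a sequence in memory and separate the two sensor groups is sketched below. The container `Sequence`, the sensor names, the left/right naming, and the 10-dimensional `betas` vector are all illustrative assumptions; the released files may be organized differently.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical sensor names following the placement described above.
TRACKING = ("left_lower_leg", "right_lower_leg",
            "left_lower_arm", "right_lower_arm", "waist", "chest")
VALIDATION = ("left_thigh", "right_thigh", "left_upper_arm", "right_upper_arm")

@dataclass
class Sequence:
    orientations: dict   # sensor name -> (T, 3, 3) rotation matrices R^{GS}_n(t)
    accelerations: dict  # sensor name -> (T, 3) sensor-frame readings a^S_n(t)
    betas: np.ndarray    # SMPL shape parameters (length 10 is an assumption)

def split_streams(seq):
    """Return algorithm inputs (tracking sensors) and held-out orientations."""
    inputs = {n: (seq.orientations[n], seq.accelerations[n]) for n in TRACKING}
    held_out = {n: seq.orientations[n] for n in VALIDATION}
    return inputs, held_out

# Dummy three-frame sequence with identity orientations and zero accelerations.
T = 3
names = TRACKING + VALIDATION
seq = Sequence(
    orientations={n: np.tile(np.eye(3), (T, 1, 1)) for n in names},
    accelerations={n: np.zeros((T, 3)) for n in names},
    betas=np.zeros(10),
)
inputs, held_out = split_streams(seq)
```

Keeping the validation streams in a separate dictionary makes it harder to feed them to an estimator by accident.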

3. Evaluation Protocols and Metrics

TNT15 imposes rigorous benchmarking conventions. There is no train/test split; all 20 sequences are processed in full. For each sequence, an algorithm receives orientation and acceleration from the six tracking IMUs only; the four validation IMUs are withheld and used solely for evaluation.

Orientation error at the held-out validation IMUs ($N_v=4$) is defined as:

$d_{ori} = \frac{1}{T N_v} \sum_{t=1}^T \sum_{n=1}^{N_v} \|\mathbf{e}_{ori,n}(t)\|^2$

with:

$\mathbf{e}_{ori,n}(t) = \log\bigl(\hat{R}^{GS}_n(x_t)\, R^{GS}_n(t)^{-1}\bigr)^\vee \in \mathbb{R}^3$

where $\hat{R}^{GS}_n(x_t)$ is the orientation of validation sensor $n$ predicted from the estimated pose $x_t$, and $R^{GS}_n(t)$ is the measured orientation of that sensor.
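The orientation metric can be sketched with a hand-written SO(3) logarithm. The function names are hypothetical, the residuals are in radians, and the simple log map used here is not robust for rotations close to 180°. The sketch follows the formula exactly as written (mean squared norm); how this relates to the degree-valued statistics in the results table should be checked against the paper.

```python
import numpy as np

def log_so3(R):
    """Rotation vector (axis times angle, in radians) of a 3x3 rotation matrix."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Vee of the skew-symmetric part (R - R^T).
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return 0.5 * w  # first-order approximation near the identity
    return theta / (2.0 * np.sin(theta)) * w  # unstable as theta approaches pi

def orientation_error(R_est, R_meas):
    """R_est, R_meas: (T, N_v, 3, 3). Mean squared norm of the log-map residuals."""
    T, N = R_est.shape[:2]
    total = 0.0
    for t in range(T):
        for n in range(N):
            # For rotation matrices the inverse equals the transpose.
            e = log_so3(R_est[t, n] @ R_meas[t, n].T)
            total += float(e @ e)
    return total / (T * N)

# Every estimate is off by 0.1 rad about z, so d_ori should be 0.1**2.
c, s = np.cos(0.1), np.sin(0.1)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
R_meas = np.tile(np.eye(3), (5, 4, 1, 1))
R_est = np.tile(Rz, (5, 4, 1, 1))
d_ori = orientation_error(R_est, R_meas)
```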

Position error for $N_m=13$ virtual markers is:

$d_{pos} = \frac{1}{T N_m} \sum_{t=1}^T \sum_{m=1}^{N_m} \|\mathbf{p}_m^{gt}(t) - \mathbf{p}_m^{est}(t)\|^2$

with $\mathbf{p}_m^{gt}$ and $\mathbf{p}_m^{est}$ denoting marker positions from the ground-truth and estimated SMPL pose.
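The position metric reduces to a few array operations. In this sketch the function name and the `(T, N_m, 3)` layout in metres are assumptions; it computes the mean squared marker distance exactly as the formula is written.

```python
import numpy as np

def position_error(p_gt, p_est):
    """p_gt, p_est: (T, N_m, 3) marker positions. Mean squared Euclidean distance."""
    diff = p_gt - p_est
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

# 13 virtual markers over 5 frames, every estimate displaced by a 3-4-5 cm offset.
p_gt = np.zeros((5, 13, 3))
p_est = p_gt + np.array([0.03, 0.0, 0.04])
d_pos = position_error(p_gt, p_est)  # each marker is 0.05 m off, so 0.05**2
```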

4. Baseline Methods and Results

Experiments on TNT15 compare multiple pipeline variants for pose reconstruction:

Approach	$\mu_{ang}$ (°)	$\sigma_{ang}$ (°)	$\mu_{pos}$ (m)	$\sigma_{pos}$ (m)
SOP	19.64	17.35	0.072	0.089
SIP-M	18.24	15.82	0.060	0.053
SIP	13.32	10.13	0.039	0.040
SIP-BW	13.45	9.94	0.042	0.040
SIP-110	13.67	10.38	0.046	0.045
SIP-120	14.27	10.60	0.056	0.053

SOP (“Sparse Orientation Poser”): uses only orientation data and an anthropometric prior.
SIP-M: SIP pipeline with a manually rigged 31-parameter body model.
SIP: Joint optimization over orientation, acceleration, and SMPL anthropometrics.
SIP-BW/SIP-110/SIP-120: Body-shape variants.

Motion-specific analysis indicates that dynamic activities such as jumping-jacks and leg-intensive motions benefit markedly from explicit acceleration integration. SOP frequently fails to reconstruct “foot stamp” events, while SIP-M reduces ambiguity but is hampered by limited model expressiveness, particularly for torso flexion. Compared to SOP, SIP reduces mean orientation error by approximately 32% (19.64° → 13.32°) and mean position error by roughly 46% (7.2 cm → 3.9 cm).
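The quoted percentages follow directly from the mean values in the table. The short check below uses only those numbers; the helper name `relative_reduction` is illustrative.

```python
# Mean errors taken from the results table: (orientation in degrees, position in metres).
results = {"SOP": (19.64, 0.072), "SIP": (13.32, 0.039)}

def relative_reduction(baseline, value):
    """Percentage by which `value` improves on `baseline`."""
    return 100.0 * (baseline - value) / baseline

ang = relative_reduction(results["SOP"][0], results["SIP"][0])
pos = relative_reduction(results["SOP"][1], results["SIP"][1])
print(f"orientation: {ang:.1f}%, position: {pos:.1f}%")
# prints "orientation: 32.2%, position: 45.8%"
```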

5. SMPL Body Model Integration and Subject Calibration

The SMPL model used in TNT15 encodes the human body as a mesh of 6890 vertices and 24 joints, supporting 75 degrees of freedom per pose. For each subject, the body shape parameter $\beta$ is personalized via 3D laser scans or, if scans are unavailable, via the text-based “bodies from words” protocol. Sensor registration is kept consistent by manually selecting the same mesh vertex for each sensor location across subjects. Initial alignment exploits the upright T-pose to estimate per-subject, per-sensor offsets.

6. Strengths, Limitations, and Recommendations

TNT15 demonstrates that, when combined with the SMPL model and a joint orientation-acceleration optimization as in SIP, six IMUs yield mean errors of ≈4 cm (position) and ≈13° (orientation) across a range of motions. However, several limitations persist:

Global translation tends to drift over long sequences due to the absence of explicit ground constraints.
Wrist and ankle degrees of freedom are under-constrained by the current sensor set and rely heavily on anthropometric priors.

Recommended directions for future TNT15 utilization include: implementing foot-contact or center-of-mass constraints to mitigate drift, expanding sensor layouts (e.g., feet, hands) for improved end-effector fidelity, augmenting sensor setups with small rigid-body markers or pressure insoles to detect contacts, leveraging “bodies from words” for rapid model acquisition, and using the four held-out IMUs for robust cross-validation or leave-one-subject-out studies. Extending the activity range to encompass social or non-periodic daily activities (e.g., sitting, carrying objects, interpersonal scenarios) is advised (Marcard et al., 2017).
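For learning-based methods, a leave-one-subject-out study over the 20 sequences could be organized as sketched below. The subject and action identifiers are placeholders, not the dataset's file names, and this split is an addition to the standard protocol, which evaluates all sequences without a train/test division.

```python
from itertools import product

# Placeholder identifiers for the 4 subjects and 5 actions.
SUBJECTS = [f"S{i}" for i in range(1, 5)]
ACTIONS = [f"A{j}" for j in range(1, 6)]

def leave_one_subject_out(subjects=SUBJECTS, actions=ACTIONS):
    """Yield (held_out_subject, train_sequences, test_sequences) for each fold."""
    sequences = list(product(subjects, actions))
    for held_out in subjects:
        train = [s for s in sequences if s[0] != held_out]
        test = [s for s in sequences if s[0] == held_out]
        yield held_out, train, test

folds = list(leave_one_subject_out())
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Each fold trains on 15 sequences from three subjects and tests on the 5 sequences of the remaining subject, so no subject's body shape appears on both sides of a split.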

7. Context and Role within Sparse IMU Pose Estimation

TNT15 plays a pivotal role in the development and evaluation of algorithms that reconstruct full-body pose from sparse, body-mounted IMU data. It establishes a benchmark for comparing diverse methodological advancements, under well-documented conditions, using precise and reproducible metrics. Its structure and protocols enable systematic assessment of pose estimation frameworks, informing the trade-offs inherent in sensor placement, model complexity, and error sources specific to real-world, non-laboratory dynamics.

References (1)

von Marcard, T., Rosenhahn, B., Black, M. J., and Pons-Moll, G. (2017). Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs.
