D-MAPS: Metric-Aware Pose and Shape Estimation

Updated 4 July 2026

The paper introduces D-MAPS, which fuses RGB and depth features to compute initial human pose, shape, and calibrated bone lengths, ensuring metric consistency and reducing scale drift.
It leverages depth confidences and sequence-level bone calibration to overcome depth ambiguity and maintain temporal and anthropometric stability in both monocular and stereo-inertial setups.
Empirical results demonstrate improved MPJPE, PA-MPJPE, and drift-free translation, highlighting its potential for robotics, AR/VR, and biomechanical applications.

Depth-guided Metric-Aware Pose and Shape (D-MAPS) is a depth-guided formulation for human mesh recovery in which pose and shape are estimated under explicit metric constraints rather than only up to arbitrary monocular scale. In the monocular video setting, D-MAPS appears as the estimator that uses fused RGB–depth features, depth confidences, and depth-calibrated bone statistics to produce scale-consistent initial pose and shape before temporal refinement (Cen et al., 4 Feb 2026). A broader interpretation is suggested by stereo-inertial motion capture systems that combine metric 3D cues, parametric body models, and kinematic constraints so that global translation, local articulation, and anthropometry remain mutually consistent (Tang et al., 2 Mar 2026).

1. Problem setting and motivating constraints

The problem addressed by D-MAPS-style methods is the recovery of a temporally consistent, metrically correct 3D human mesh from monocular RGB video or related visual-inertial inputs. The central difficulty is that monocular RGB is fundamentally ill-posed: many 3D configurations project to the same 2D evidence, depth along the viewing ray is under-constrained, and body scale can drift across frames as the model trades off body size against camera distance. The literature identifies four recurrent failure modes: depth ambiguity, metric uncertainty and scale drift, incorrect depth ordering under occlusion, and temporal instability (Cen et al., 4 Feb 2026).

A parallel formulation in visual-inertial motion capture exposes the same deficiencies from a different angle. Monocular visual-(inertial) systems suffer from depth ambiguity, produce non-metric trajectories, often remain shape-agnostic in local motion estimation, and accumulate drift when relying solely on IMUs. In that setting, the desired outputs are not only local articulation but also root translation in a metric world coordinate and body shape parameters that control anthropometry. The resulting motivation is explicitly tied to robotics, AR/VR, and clinical or biomechanical uses, where metric scale, shape consistency, and drift-free global position are operational requirements rather than optional refinements (Tang et al., 2 Mar 2026).

Within this framing, D-MAPS is not merely a temporal smoother. The cited work states that temporal smoothing mitigates jitter but does not fix wrong depth or wrong scale, and excessive smoothing can suppress fast motion. The design goal is therefore stronger: to inject depth-derived geometry early enough that pose, shape, and translation remain metrically coherent throughout the pipeline rather than being corrected only after the fact (Cen et al., 4 Feb 2026).

2. D-MAPS as a depth-calibrated metric initializer

In the framework where the term is introduced explicitly, D-MAPS is the middle stage of a three-part pipeline: RGB/depth $\rightarrow$ Fusion $\rightarrow$ D-MAPS $\rightarrow$ MoDAR. The upstream module produces a fused per-frame feature

$\tilde{F}_t = \phi\!\big([\mathbf{q}_r \odot F_r \| \mathbf{q}_d \odot F_d]\big),$

where $F_r$ and $F_d$ are the RGB and depth feature streams, $\mathbf{q}_r,\mathbf{q}_d$ are channel-wise gates, $\odot$ is elementwise multiplication, $\|$ denotes concatenation, and $\phi$ is the fusion mapping. D-MAPS then uses $\tilde{F}_t$, local depth patches, framewise depth confidence, lifted 3D joints, and template bone-length statistics to estimate initial pose $p_{\text{init}}$, initial shape $s_{\text{init}}$, and calibrated bone lengths $\hat{\ell}_b$ (Cen et al., 4 Feb 2026).
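As an illustration of the gated fusion, the following minimal sketch applies channel-wise gates to two feature vectors, concatenates them, and passes the result through a fusion mapping. The choice of a single linear layer with ReLU for $\phi$, the function name `gated_fusion`, and all dimensions are assumptions made for the example; the cited work does not specify them here.

```python
import numpy as np

def gated_fusion(F_r, F_d, q_r, q_d, W, b):
    """Fuse RGB and depth features with channel-wise gates.

    F_r, F_d : (C,) RGB and depth feature vectors
    q_r, q_d : (C,) channel-wise gates, assumed to lie in [0, 1]
    W, b     : parameters of the ASSUMED fusion mapping phi, shapes (C_out, 2C) and (C_out,)
    """
    gated = np.concatenate([q_r * F_r, q_d * F_d])  # [q_r * F_r || q_d * F_d]
    return np.maximum(W @ gated + b, 0.0)           # phi as linear + ReLU (assumption)

rng = np.random.default_rng(0)
C, C_out = 4, 3
F_r, F_d = rng.normal(size=C), rng.normal(size=C)
W, b = rng.normal(size=(C_out, 2 * C)), np.zeros(C_out)

# With the depth gate fully closed, the fused feature ignores the depth stream.
fused_rgb_only = gated_fusion(F_r, F_d, np.ones(C), np.zeros(C), W, b)
fused_other_depth = gated_fusion(F_r, 10.0 * F_d, np.ones(C), np.zeros(C), W, b)
```

Closing the depth gate makes the output independent of the depth features, which is the behavior one would want when depth is unreliable.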

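Its pose construction is kinematic. For each bone $b$ connecting a parent joint $J_{p(b)}$ to a child joint $J_{c(b)}$ among the lifted 3D joints, normalized bone directions are computed as

$\mathbf{d}_b = \dfrac{J_{c(b)} - J_{p(b)}}{\lVert J_{c(b)} - J_{p(b)} \rVert}.$

A swing rotation $R^{\text{swing}}_b$ is analytically constructed from these depth-aligned directions, while a twist rotation $R^{\text{twist}}_b$ is regressed from $\tilde{F}_t$ and local depth patches. The final joint rotation is the composition

$R_b = R^{\text{swing}}_b \, R^{\text{twist}}_b.$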

This decomposition assigns the depth-derived 3D geometry the role of constraining bone direction, while the learned component resolves rotation around the bone axis, especially under occlusion and depth-ordering ambiguity (Cen et al., 4 Feb 2026).
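The swing-twist construction can be sketched as follows. This is a generic swing-twist composition under stated assumptions, not the cited implementation: the swing is taken as the smallest rotation aligning a rest-pose bone direction with the depth-derived direction, the twist is a rotation about the rest-pose bone axis, and the twist angle (which the method regresses) is simply passed in as a number. All function names are hypothetical.

```python
import numpy as np

def normalize(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about the unit vector `axis`."""
    x, y, z = axis
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def swing_rotation(rest_dir, target_dir):
    """Smallest rotation taking unit vector rest_dir onto unit vector target_dir."""
    axis = np.cross(rest_dir, target_dir)
    s, c = np.linalg.norm(axis), float(np.dot(rest_dir, target_dir))
    if s < 1e-8:
        if c > 0.0:
            return np.eye(3)  # already aligned
        # Antiparallel: rotate by pi about any axis perpendicular to rest_dir.
        helper = np.array([1.0, 0.0, 0.0]) if abs(rest_dir[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        return axis_angle_to_matrix(normalize(np.cross(rest_dir, helper)), np.pi)
    return axis_angle_to_matrix(axis / s, np.arctan2(s, c))

def joint_rotation(parent_joint, child_joint, rest_dir, twist_angle):
    """Compose R = R_swing @ R_twist; twist_angle stands in for the regressed twist."""
    d = normalize(child_joint - parent_joint)              # depth-aligned bone direction
    R_swing = swing_rotation(rest_dir, d)                  # fixes where the bone points
    R_twist = axis_angle_to_matrix(rest_dir, twist_angle)  # spins about the bone axis
    return R_swing @ R_twist, d

parent = np.array([0.0, 1.0, 2.0])
child = np.array([0.2, 0.6, 2.1])
rest_dir = np.array([0.0, -1.0, 0.0])
R, d = joint_rotation(parent, child, rest_dir, twist_angle=0.4)
R_other_twist, _ = joint_rotation(parent, child, rest_dir, twist_angle=1.3)
```

Because the twist leaves the rest-pose bone axis unchanged, the composed rotation maps that axis onto the depth-derived direction for any twist angle. The bone direction is therefore determined entirely by the geometric term, and the learned term only chooses the roll about it.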

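The metric aspect is concentrated in sequence-level bone calibration. A temporal depth-confidence weight $w_t$ is derived for each frame $t$ from the framewise depth confidence, and depth-based bone lengths $\ell_{b,t}$ are aggregated over time as a confidence-weighted average,

$\bar{\ell}_b = \dfrac{\sum_t w_t\, \ell_{b,t}}{\sum_t w_t}.$

A sequence-level gate $g \in [0, 1]$, reflecting how reliable depth is over the sequence, then fuses the sequence estimate with a template prior $\ell^{\text{tmpl}}_b$:

$\hat{\ell}_b = g\, \bar{\ell}_b + (1 - g)\, \ell^{\text{tmpl}}_b.$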

When depth is reliable, subject-specific metric bone lengths dominate; when depth is unreliable, the method falls back toward template statistics. These calibrated lengths rescale the rest-pose skeleton along the kinematic tree and initialize a shape regressor described as an analytically initialized MLP, thereby creating a subject-specific metric skeleton that remains consistent over time (Cen et al., 4 Feb 2026).
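The calibration logic can be sketched numerically. The functional forms below are assumptions chosen for illustration: frame weights are taken as normalized confidences, and the sequence gate is a sigmoid of the mean confidence with hypothetical `gate_threshold` and `gate_slope` parameters. Only the overall structure (confidence-weighted aggregation, then a gated blend with the template) follows the description above.

```python
import numpy as np

def calibrate_bone_lengths(lengths, conf, template, gate_threshold=0.5, gate_slope=10.0):
    """Blend sequence-level depth-based bone lengths with a template prior.

    lengths  : (T, B) per-frame depth-based bone lengths in metres
    conf     : (T,) framewise depth confidence in [0, 1]
    template : (B,) template bone lengths in metres
    gate_threshold, gate_slope : HYPOTHETICAL gate parameters (not from the source)
    """
    w = conf / (conf.sum() + 1e-8)                       # assumed: normalized confidence
    seq_lengths = (w[:, None] * lengths).sum(axis=0)     # confidence-weighted aggregation
    g = 1.0 / (1.0 + np.exp(-gate_slope * (conf.mean() - gate_threshold)))  # assumed gate
    return g * seq_lengths + (1.0 - g) * template, g

T = 30
template = np.array([0.40, 0.25])           # e.g. thigh and forearm, illustrative values
observed = np.tile([0.45, 0.28], (T, 1))    # this subject's limbs are longer than the template

reliable, g_hi = calibrate_bone_lengths(observed, np.full(T, 0.95), template)
unreliable, g_lo = calibrate_bone_lengths(observed, np.full(T, 0.05), template)
```

With high confidence the result stays close to the observed subject-specific lengths, and with low confidence it falls back toward the template, mirroring the behavior described in the text.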

The output of D-MAPS is therefore best understood as a scale-consistent anchor. MoDAR subsequently refines articulation and temporal coherence, but the sequence-level metric scale is already fixed through the calibrated bone lengths $\hat{\ell}_b$, which is why the method is reported to reduce scale drift rather than merely smooth it (Cen et al., 4 Feb 2026).

3. Metric depth and subject-specific shape in stereo-inertial realizations

A concrete realization of D-MAPS principles appears in stereo-inertial motion capture. The system takes a single fixed stereo camera with known intrinsics and extrinsics and six IMUs mounted on the pelvis, head, forearms, and lower legs. Its outputs are SMPL pose parameters $\theta$, root translation in a metric world coordinate frame aligned with the stereo rig, SMPL shape parameters $\beta$, and foot-ground contact probabilities (Tang et al., 2 Mar 2026).

The key geometric move is to replace monocular RGB with calibrated stereo. For corresponding keypoints

$(u_L, v_L) \quad \text{and} \quad (u_R, v_R)$

in the rectified left and right images, the disparity is

$d = u_L - u_R,$

and depth is recovered by

$Z = \dfrac{f\,B}{d},$

which is the standard stereo relation for focal length $f$ (in pixels) and baseline $B$. MediaPipe Pose is applied to both views to obtain 2D keypoints, canonical 3D keypoints in a body-root coordinate, and confidences. Confidence-weighted fusion yields root-relative 3D pose, and stereo projection yields world-space 3D keypoints. Because these keypoints are produced by calibrated baseline geometry, the translation estimate is metrically anchored rather than inferred up to arbitrary scale (Tang et al., 2 Mar 2026).
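The stereo relation can be made concrete with a short back-projection sketch. It assumes a rectified pair, a single focal length in pixels shared by both axes, and the left camera as the reference frame; the function name and the numeric values are illustrative only.

```python
import numpy as np

def stereo_backproject(uv_left, uv_right, f, cx, cy, baseline):
    """Back-project matched keypoints from a rectified stereo pair.

    uv_left, uv_right : (N, 2) pixel coordinates of corresponding keypoints
    f                 : focal length in pixels (assumed equal for x and y)
    cx, cy            : principal point in pixels
    baseline          : distance between the two cameras in metres
    Returns (N, 3) points in the left camera frame, in metres.
    """
    d = uv_left[:, 0] - uv_right[:, 0]        # disparity d = u_L - u_R
    d = np.where(d > 1e-6, d, np.nan)         # non-positive disparity has no valid depth
    Z = f * baseline / d                      # standard relation Z = f B / d
    X = (uv_left[:, 0] - cx) * Z / f
    Y = (uv_left[:, 1] - cy) * Z / f
    return np.stack([X, Y, Z], axis=1)

uv_l = np.array([[700.0, 360.0]])
uv_r = np.array([[660.0, 360.0]])
point = stereo_backproject(uv_l, uv_r, f=800.0, cx=640.0, cy=360.0, baseline=0.12)
# disparity 40 px gives Z = 800 * 0.12 / 40 = 2.4 m and X = 60 * 2.4 / 800 = 0.18 m
```

Since depth scales as $1/d$, a fixed disparity error of one pixel costs far more depth accuracy at long range than at short range, which is consistent with the reported degradation when the subject moves far from the camera.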

Subject-specific shape is then estimated explicitly with SMPL. A subject stands in T-pose, stereo matching produces a raw point cloud, the cloud is segmented to the subject’s bounding box and downsampled to about 4000 points, and the same frame provides a world-space skeleton. After rigid alignment into the SMPL coordinate system, shape and pose are optimized by minimizing a weighted sum of loss terms; the term definitions and the four reported weights are given in the cited paper. This produces a personalized, fixed $\beta$ for subsequent online inference, so that limb lengths, joint locations, and global translation are coupled through a single anthropometric model rather than a normalized template (Tang et al., 2 Mar 2026).

This stereo-inertial formulation is not labeled D-MAPS in the original title, but it instantiates the same design logic: depth establishes metric 3D structure, a parametric body model personalizes shape, and downstream estimation operates on that metric subject model rather than on scale-free skeletons.

4. Shape-aware fusion, translation estimation, and physical constraints

Once stereo depth and subject-specific shape are available, the stereo-inertial system fuses metric 3D keypoints, root-relative visual keypoints, IMU measurements, and shape parameters through a hierarchy built around state-space models. The component names are TransNet, IENet, KENet, FusionNet, and RefineNet. TransNet estimates global translation, IENet predicts canonical joint positions from IMUs, KENet predicts canonical joint positions from visual keypoints, FusionNet performs shape-aware multimodal fusion, and RefineNet further improves pose, translation, velocity, and contact (Tang et al., 2 Mar 2026).

TransNet receives a subset of nine metric 3D keypoints from the world-space stereo keypoints together with their confidences and predicts both absolute root translation and frame-to-frame translation. Its supervision combines an absolute-translation term with a cycle-consistency term.

Because the inputs are metric 3D keypoints derived from stereo, the global trajectory is supervised in true scale rather than normalized units (Tang et al., 2 Mar 2026).
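The exact loss expressions of the cited work are not reproduced here. As one plausible reading, an absolute-translation term can penalize the error of the predicted root position in metres, and a cycle-consistency term can require the predicted frame-to-frame displacements to agree with differences of the predicted absolute positions. The sketch below implements that reading; both function names and the squared-error form are assumptions.

```python
import numpy as np

def absolute_translation_loss(t_pred, t_gt):
    """Mean squared error between predicted and reference root positions, (T, 3) in metres."""
    return float(np.mean(np.sum((t_pred - t_gt) ** 2, axis=-1)))

def cycle_consistency_loss(t_pred, delta_pred):
    """Agreement between absolute and frame-to-frame predictions (ASSUMED form).

    t_pred     : (T, 3) predicted absolute root translation
    delta_pred : (T, 3) predicted displacement from frame k-1 to frame k (row 0 unused)
    """
    implied = np.diff(t_pred, axis=0)  # displacement implied by the absolute track
    return float(np.mean(np.sum((implied - delta_pred[1:]) ** 2, axis=-1)))

rng = np.random.default_rng(1)
T = 50
deltas = 0.01 * rng.normal(size=(T, 3))
deltas[0] = 0.0
track = np.array([0.0, 0.9, 2.5]) + np.cumsum(deltas, axis=0)  # consistent by construction

consistent = cycle_consistency_loss(track, deltas)
inconsistent = cycle_consistency_loss(track, 2.0 * deltas)      # displacements overestimated
offset_error = absolute_translation_loss(track + np.array([0.1, 0.0, 0.0]), track)
```

Under this reading the two heads regularize each other: the absolute head cannot drift without violating the displacement predictions, and the displacement head cannot accumulate error without contradicting the metrically anchored absolute track.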

IENet and KENet map two sensing modalities into a common canonical root coordinate. IENet ingests pelvis-frame IMU accelerations and rotations and predicts canonical joint positions, while KENet predicts canonical joint positions in the same coordinate from normalized root-relative visual keypoints and confidences. This common representation makes later fusion kinematic rather than sensor-specific. FusionNet then consumes both sets of canonical joint positions, visual confidences, shape parameters $\beta$, initial translations, and raw IMU features, and outputs refined translation, joint rotations $\theta$, and foot contact probabilities (Tang et al., 2 Mar 2026).

The shape-aware coupling appears directly in the loss. Pose supervision includes a joint-position term evaluated on the subject's own body model, that is, on the joints obtained by posing the personalized shape $\beta$ with the predicted rotations. This term does not merely regularize pose parameters; it enforces that predicted rotations generate the correct metric joint positions for the actual body shape. The system also predicts left and right foot contact, penalized by binary cross-entropy, and uses contact to define a foot-skating loss.

When a foot is in contact, local foot motion in the root frame must cancel root translation in world space; the loss therefore ties pose, translation, and shape into a single kinematic condition. A jerk loss on joint trajectories further suppresses temporal jitter, and the full loss combines pose, translation, contact, foot-skating, and jerk terms with the weights reported in the cited paper (Tang et al., 2 Mar 2026).
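The foot-skating and jerk terms can be illustrated with a minimal sketch. It is a simplified stand-in rather than the cited loss: root rotation is ignored by expressing the root-relative foot offset in world-aligned axes, contact probabilities weight the squared world-space foot displacement, and jerk is taken as the third finite difference of joint positions. Function names are hypothetical.

```python
import numpy as np

def foot_skating_loss(foot_local, root_trans, contact):
    """Penalize world-space foot motion while the foot is in contact.

    foot_local : (T, 3) foot position relative to the root, in world-aligned axes (assumption)
    root_trans : (T, 3) root translation in the world frame, metres
    contact    : (T,) contact probability for this foot
    """
    foot_world_step = np.diff(foot_local + root_trans, axis=0)  # (T-1, 3)
    return float(np.mean(contact[1:] * np.sum(foot_world_step ** 2, axis=1)))

def jerk_loss(joints):
    """Mean squared third finite difference of joint trajectories, joints: (T, J, 3)."""
    jerk = np.diff(joints, n=3, axis=0)
    return float(np.mean(np.sum(jerk ** 2, axis=-1)))

T = 20
t = np.arange(T, dtype=float)
root = np.stack([0.05 * t, np.zeros(T), np.zeros(T)], axis=1)  # root moves 5 cm per frame
contact = np.ones(T)

planted_local = np.array([0.1, -0.9, 0.0]) - root   # local motion cancels the root motion
sliding_local = np.tile([0.1, -0.9, 0.0], (T, 1))   # foot rigidly follows the root

planted = foot_skating_loss(planted_local, root, contact)
sliding = foot_skating_loss(sliding_local, root, contact)
smooth = jerk_loss((t ** 2)[:, None, None] * np.ones((T, 2, 3)))  # constant acceleration
```

The planted foot incurs no penalty because its root-relative motion exactly cancels the root translation, while the sliding foot is penalized in proportion to the squared root step. A constant-acceleration trajectory has zero jerk, so the jerk term suppresses jitter without penalizing smooth acceleration.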

This architecture makes explicit a central D-MAPS principle: local motion cannot be treated independently from global translation if the reconstructed body has an individualized metric shape. Inconsistency between these quantities produces artifacts such as wrong stride lengths or foot sliding; the cited formulation addresses them by construction rather than by post-processing.

5. Related depth-guided and metric-aware formulations

D-MAPS sits within a broader cluster of methods that use depth, metric camera geometry, or metric 3D representations to constrain human reconstruction. These methods do not all use the D-MAPS name, but they illuminate adjacent design choices (Vasilikopoulos et al., 2024; Zhang et al., 11 Jun 2025; Sárándi et al., 2020; Veges et al., 2020).

One neighboring design is to make depth an intermediate representation. D-PoSE predicts a dense human-only depth map and a part segmentation map from a single RGB image, then concatenates depth-informed features, segmentation-based attention features, and bounding-box information before regressing SMPL-X parameters. The predicted depth map is supervised by a dedicated depth loss, and the total objective combines depth, segmentation, SMPL parameter, 3D joint, 3D vertex, and 2D reprojection losses. This establishes a different but related route to metric 3D accuracy: depth is not used only for post hoc correction but as a supervised internal cue for pose and shape regression (Vasilikopoulos et al., 2024).

A second neighboring design is metric awareness through camera geometry rather than explicit depth maps. MetricHMR argues for the standard perspective projection model and introduces a ray map in which each pixel $(u, v)$ is associated with a camera ray, which under a pinhole model with intrinsic matrix $K$ takes the form

$\mathbf{r}(u, v) \propto K^{-1}\,[u,\ v,\ 1]^{\top}.$

The ray field encodes focal length, principal point, and bounding-box geometry, and is processed by a separate encoder before joint regression of SMPL pose, shape, global rotation, and metric translation. The paper emphasizes that weak-perspective formulations make body size and depth fundamentally ambiguous, whereas known or estimated intrinsics constrain the acceptable metric solution range (Zhang et al., 11 Jun 2025).
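A per-pixel ray map of this kind can be computed directly from pinhole intrinsics. The sketch below is a generic construction, not MetricHMR's code: it assumes zero skew, samples rays at pixel centres, and normalizes them to unit length; whether the original method normalizes or crops the map to the bounding box is not specified here.

```python
import numpy as np

def ray_map(height, width, fx, fy, cx, cy):
    """Unit camera rays for every pixel of a pinhole camera with zero skew.

    Returns an array of shape (height, width, 3); ray[v, u] is proportional to
    K^{-1} [u, v, 1]^T, evaluated at pixel centres (a convention chosen for this sketch).
    """
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

narrow = ray_map(4, 4, fx=100.0, fy=100.0, cx=1.5, cy=1.5)  # long focal length
wide = ray_map(4, 4, fx=2.0, fy=2.0, cx=1.5, cy=1.5)        # short focal length
```

The same pixel grid yields nearly parallel rays for a long focal length and strongly diverging rays for a short one, so a network that sees the ray map receives the information that distinguishes a small nearby body from a large distant one.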

A third line of work changes the output representation itself. MeTRo abandons image-aligned 2.5D heatmaps in favor of metric-space volumetric heatmaps defined in a fixed-size metric cube around the subject. Because the volume axes are all metric rather than partly image-based, the method predicts complete metric-scale root-relative poses without test-time focal length or person distance and remains robust to joints outside the image boundary (Sárándi et al., 2020). This suggests that D-MAPS can be interpreted not only as a fusion strategy but also as a representational commitment to metric 3D coordinates.
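To show what a metric-space volumetric heatmap implies for decoding, the sketch below converts a 3D heatmap into a joint position in metres by a soft-argmax over voxel centres. The soft-argmax decoder, the cube side length of 2.0 m, and the 8-voxel resolution are assumptions for illustration and are not taken from the cited paper.

```python
import numpy as np

def decode_metric_heatmap(logits, cube_side_m):
    """Soft-argmax over a metric volume centred on the subject.

    logits      : (Nx, Ny, Nz) unnormalized heatmap for one joint
    cube_side_m : side length of the cube in metres (HYPOTHETICAL value in the example)
    Returns the expected joint position (3,) in metres, relative to the cube centre.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()                                   # softmax over the whole volume
    coords = []
    for axis, n in enumerate(logits.shape):
        centres = (np.arange(n) + 0.5) / n * cube_side_m - cube_side_m / 2.0
        other_axes = tuple(a for a in range(3) if a != axis)
        coords.append(float(p.sum(axis=other_axes) @ centres))  # expectation along this axis
    return np.array(coords)

logits = np.full((8, 8, 8), -100.0)
logits[6, 4, 1] = 0.0                              # sharply peaked heatmap
joint_m = decode_metric_heatmap(logits, cube_side_m=2.0)
```

Every axis of the volume is indexed in metres, so the decoded coordinate is metric by construction and needs no focal length or subject distance at test time.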

Finally, weak depth supervision offers a looser form of depth guidance. In multi-person absolute 3D pose estimation, one can train on RGB-D images without 3D joint labels by predicting the depth values that should be observed at joint locations and penalizing discrepancies with a Geman–McClure robust loss. That formulation is pose-only rather than pose-and-shape, but it demonstrates that depth can impose metric constraints even when only weakly aligned with the final output space (Veges et al., 2020).
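A weak depth loss of this kind can be sketched as follows. The Geman–McClure function is written in one common form, $\rho(r) = r^2 / (r^2 + \sigma^2)$; the scale $\sigma$, the nearest-pixel lookup, and the convention that zero marks a missing depth reading are assumptions, as are the function names.

```python
import numpy as np

def geman_mcclure(residual, sigma=1.0):
    """Robust penalty r^2 / (r^2 + sigma^2): quadratic near zero, saturating below 1."""
    r2 = residual ** 2
    return r2 / (r2 + sigma ** 2)

def weak_depth_loss(pred_joint_depth, depth_map, joint_uv, sigma=0.1):
    """Compare predicted joint depths with sensor depth at the joints' pixels.

    pred_joint_depth : (J,) predicted depth of each joint in metres
    depth_map        : (H, W) sensor depth in metres, 0 where missing (assumed convention)
    joint_uv         : (J, 2) joint pixel coordinates as (u, v)
    """
    h, w = depth_map.shape
    u = np.clip(np.round(joint_uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(joint_uv[:, 1]).astype(int), 0, h - 1)
    observed = depth_map[v, u]
    valid = observed > 0
    if not valid.any():
        return 0.0
    return float(np.mean(geman_mcclure(pred_joint_depth - observed, sigma)[valid]))

depth_map = np.full((4, 4), 2.0)
depth_map[0, 0] = 0.0                                   # missing reading
uv = np.array([[1.0, 1.0], [2.0, 3.0], [0.0, 0.0]])

exact = weak_depth_loss(np.array([2.0, 2.0, 5.0]), depth_map, uv)
outlier = weak_depth_loss(np.array([2.0, 10.0, 5.0]), depth_map, uv)
```

Because the penalty saturates, a joint whose pixel actually shows an occluder or the background contributes a bounded error instead of dominating the gradient, which is what makes sensor depth usable as a weak label.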

Taken together, these methods show that the “depth-guided” and “metric-aware” elements of D-MAPS can be implemented in several non-equivalent ways: direct depth calibration of bone statistics, dense supervised intermediate depth, explicit perspective-ray conditioning, or metric-space output parameterizations. The common theme is the rejection of purely up-to-scale reconstruction.

6. Empirical behavior, limitations, and interpretive issues

The monocular D-MAPS framework reports competitive or better results on 3DPW, Human3.6M, and MPI-INF-3DHP, with full-system scores of 69.31 mm MPJPE, 46.68 mm PA-MPJPE, 82.61 mm MPVPE, and 7.14 mm/s² Accel on 3DPW; 51.18 mm MPJPE, 35.96 mm PA-MPJPE, and 3.92 mm/s² Accel on Human3.6M; and 73.45 mm MPJPE, 53.87 mm PA-MPJPE, and 7.89 mm/s² Accel on MPI-INF-3DHP. Its 3DPW ablation further shows that depth fusion alone improves over an RGB-only baseline, that D-MAPS and MoDAR are complementary, and that the complete model yields the best MPJPE, MPVPE, and Accel among the ablated variants (Cen et al., 4 Feb 2026).

The stereo-inertial realization evaluates JPE, PVE, SIP, TE, Jerk, and FS. It is reported to achieve lower translation error than monocular fusion baselines such as RobustCap and RobustCap3D, competitive or better JPE and PVE, and better or comparable foot-skating and jerk. Qualitatively, it produces drift-free global translation over long recordings and reduces foot-skating effects, while running at over 200 FPS without optimization-based post-processing (Tang et al., 2 Mar 2026).

Several interpretive points recur across this literature. First, metric consistency is not equivalent to temporal smoothness: the monocular D-MAPS paper explicitly notes that temporal smoothing does not solve wrong depth or wrong scale, and the stereo-inertial work similarly grounds long-term stability in metric stereo anchors rather than in acceleration integration alone. Second, explicit depth losses are not mandatory in every formulation. In the monocular D-MAPS pipeline, “Depth information acts solely as a feature cue without requiring depth-specific loss functions,” whereas D-PoSE trains with direct depth supervision. Third, evaluation metrics can obscure the intended property. The D-MAPS paper notes that rigid bone constraints can slightly increase PA-MPJPE because PA-MPJPE is scale-normalized, even when the reconstruction is metrically more faithful (Cen et al., 4 Feb 2026).
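The third point can be checked numerically. The sketch below implements MPJPE and PA-MPJPE with a standard similarity Procrustes alignment (rotation, translation, and scale solved via SVD) and applies them to a prediction that is correct in pose but 10% too large. The joint count and coordinates are synthetic.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def procrustes_align(pred, gt):
    """Best similarity transform (scale, rotation, translation) of pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid reflections
    R = Vt.T @ D @ U.T
    scale = np.trace(np.diag(S) @ D) / np.sum(P ** 2)
    return scale * P @ R.T + mu_g

def pa_mpjpe(pred, gt):
    return mpjpe(procrustes_align(pred, gt), gt)

rng = np.random.default_rng(2)
gt = 300.0 * rng.normal(size=(17, 3))   # synthetic joints in millimetres
pred_too_large = 1.1 * gt               # right pose, wrong metric scale

err_metric = mpjpe(pred_too_large, gt)
err_aligned = pa_mpjpe(pred_too_large, gt)
```

The scale error is visible in MPJPE but vanishes after Procrustes alignment. PA-MPJPE therefore cannot reward a method for recovering the correct metric scale, and constraints that fix scale at the cost of small pose deviations can register only as a penalty under that metric.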

The limitations are equally structured. In the monocular D-MAPS estimator, unreliable depth causes the sequence gate and frame weights to fall back toward template bone priors, so scale estimation degrades toward population averages. The method also depends on reasonably accurate 2D keypoints and human detection. In the stereo-inertial realization, the assumptions include a calibrated stereo camera with known baseline and intrinsics, reliable synchronization between camera and IMUs, a per-subject T-pose for shape estimation, and a single person in view. Reported limitations include finite stereo working volume, sensitivity to poor lighting, under-use of the IMUs’ 400 Hz sampling because inference is performed at the camera frame rate, the restriction of shape modeling to SMPL shape coefficients $\beta$, and degraded metric accuracy when occlusion increases or the subject moves far from the stereo camera (Tang et al., 2 Mar 2026).

A final conceptual caution comes from the metric-HMR literature: “metric” does not imply a unique absolute reconstruction; it denotes a reconstruction with reasonable physical scale under the chosen camera model. That observation does not weaken D-MAPS. Rather, it clarifies its aim: to compress the monocular or multimodal ambiguity class into a metrically plausible, temporally stable, and anthropometrically coherent solution that can support downstream geometric reasoning (Zhang et al., 11 Jun 2025).