MeshMimic: Geometry-Aware Humanoid Imitation

Updated 4 July 2026

MeshMimic is a geometry-aware framework that reconstructs both human motion and 3D scene structure to enable metric-aligned, contact-rich humanoid imitation.
It integrates monocular video input, kinematic consistency optimization, and contact-invariant retargeting in a real-to-sim-to-real pipeline for robust terrain interactions.
The method outperforms prior approaches with improved MPJPE and Chamfer metrics, demonstrating enhanced stability in tasks like stepping, climbing, and vaulting.

MeshMimic is a geometry-aware humanoid motion learning framework that couples 3D scene reconstruction, human motion recovery, contact-aware retargeting, and reinforcement learning to train humanoid policies directly from ordinary monocular RGB video. Its defining premise is that terrain is part of the motion: instead of imitating a reference trajectory on a flat or simplified ground model, MeshMimic reconstructs both the human and the surrounding environment, optimizes them into a metric and world-aligned interaction sequence, transfers the resulting interaction pattern to a humanoid robot, and then trains a policy in simulation before deployment on hardware. The framework is explicitly presented as a real-to-sim-to-real pipeline intended to avoid the cost and scene incompleteness of motion capture data while preserving coupled “motion-terrain” interactions such as stepping, climbing, vaulting, and box contact (Zhang et al., 17 Feb 2026).

1. Problem setting and conceptual scope

MeshMimic addresses a limitation that the paper describes as a decoupling of motion from scene geometry. In that formulation, motion synthesis frameworks may recover or imitate human movement, but they usually do not retain the geometric context of the surrounding terrain and objects. The stated consequence is physical inconsistency in terrain-aware tasks, including contact slippage, mesh penetration, foot skating, contact misalignment, hovering above the surface, and unstable terrain interactions.

The framework is positioned against several prior categories. MoCap-based methods provide accurate human motion but usually lack the surrounding environment geometry. WHAM and TRAM recover human motion but not the environment. VideoMimic uses video but is described as suffering from coarse scene modeling and weaker contact handling. OmniRetarget is described as improving object interaction, but as being limited to simpler geometric settings and as not generalizing well to irregular large-scale terrain. Within that comparison, MeshMimic’s central claim is that scene geometry must be reconstructed and explicitly incorporated into the imitation loop if a humanoid agent is to reproduce contact-rich human behaviors robustly on irregular terrain (Zhang et al., 17 Feb 2026).

The resulting objective is not pose imitation in isolation. The paper instead defines the target as coupled human-environment interaction recovered from video and then transferred to a humanoid robot. This makes the framework terrain-aware by construction rather than through a post hoc contact correction.

2. Reconstruction pipeline and metric world alignment

The pipeline is organized as eight stages: monocular video input, scene reconstruction, human detection and tracking, monocular body reconstruction, joint human-scene alignment and kinematic consistency optimization, contact-invariant retargeting, RL policy training in simulation, and deployment on the real robot.

For environment reconstruction, MeshMimic uses $\pi^3$ to estimate per-frame depth maps $D^t$, camera poses $[R^t \mid \mathbf{t}^t]$, and shared intrinsics $K$. The environment is represented as a reconstructed 3D mesh rather than a flat plane or a primitive-only proxy, and the reconstructed scene is converted into high-resolution collision geometry and TSDF-based surfaces for downstream optimization and retargeting. For the human actor, the pipeline uses ViTDet for person detection, SAM2 for cross-frame identity association and tracking, and SAM3D for monocular 3D human reconstruction. Using the official SAM3D-Body pipeline, the method converts the intermediate human mesh representation into SMPL-X parameters, yielding pose parameters $\boldsymbol{\theta}^t$, shape $\boldsymbol{\beta}$, 3D joints $\mathbf{J}^t$, orientation $\boldsymbol{\phi}^t$, and translation $\mathbf{t}^t$ (Zhang et al., 17 Feb 2026).

A key issue at this stage is that both the human and scene reconstructions are not metrically scaled and are initially expressed in camera coordinates. MeshMimic therefore introduces a joint human-scene alignment step. Because SAM3D provides a strong initialization, the paper states that pose and shape are kept fixed and only the per-frame translation $\mathbf{t}^{0:T}$ and a single global scene scale $s$ are optimized. The alignment objective is

$$\min_{\mathbf{t}^{0:T},\, s} \; \lambda_{\text{2D}}\,\mathcal{L}_{\text{2D}} + \lambda_{\text{CD}}\,\mathcal{L}_{\text{CD}}$$

where $\mathcal{L}_{\text{2D}}$ is the 2D joint reprojection error between projected SMPL-X joints and SAM3D-body 2D keypoints, $\mathcal{L}_{\text{CD}}$ is a symmetric Chamfer distance between camera-facing SMPL-X vertices and the metric-scale human point set, and $\lambda_{\text{2D}}$, $\lambda_{\text{CD}}$ weight the two terms. To avoid erroneous backside matches, only camera-facing vertices are used, selected by thresholding the angle between the vertex normal and the view direction.
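As a rough illustration of the two geometric ingredients in this step, the following NumPy sketch selects camera-facing vertices by an angle threshold and evaluates a brute-force symmetric Chamfer distance. The function names, the squared-distance form, and the `max_angle_deg` parameter are assumptions made for illustration; the paper's exact threshold and distance definition are not reproduced here.

```python
import numpy as np

def camera_facing_mask(normals, vertices, cam_center, max_angle_deg):
    """True where the vertex normal is within max_angle_deg of the direction to the camera."""
    view = cam_center - vertices                              # vertex -> camera
    view = view / np.linalg.norm(view, axis=1, keepdims=True)
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cos_angle = np.sum(n * view, axis=1)
    return cos_angle >= np.cos(np.deg2rad(max_angle_deg))

def symmetric_chamfer(a, b):
    """Mean squared nearest-neighbour distance a->b plus b->a (O(N*M) memory)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

In an alignment loop, the mask would be recomputed per frame and only `vertices[mask]` passed to `symmetric_chamfer` together with the human point set, so that back-facing vertices cannot be matched to points seen from the front.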

This stage converts the raw monocular reconstruction into a metric, world-aligned reference suitable for contact reasoning and robot retargeting. The paper treats that conversion as structurally necessary, not as a refinement of convenience.

3. Kinematic consistency optimization and contact reasoning

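The paper emphasizes that raw monocular reconstruction remains noisy under fast camera motion, blur, occlusion, and depth errors. The reported failure modes include interpenetration, hovering, and trajectory drift. MeshMimic addresses these issues with a kinematic consistency optimization whose intended total loss is

$$\mathcal{L}_{\text{total}} = \lambda_{\text{align}}\mathcal{L}_{\text{align}} + \lambda_{\text{contact}}\mathcal{L}_{\text{contact}} + \lambda_{\text{pen}}\mathcal{L}_{\text{pen}} + \lambda_{\text{smooth}}\mathcal{L}_{\text{smooth}} + \lambda_{\text{snap}}\mathcal{L}_{\text{snap}}$$

The components are, in order, the alignment loss, contact loss, penetration loss, trajectory smoothness loss, and foot-snapping loss, each with its own weight $\lambda$ (Zhang et al., 17 Feb 2026).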

Contact supervision is derived from depth-edge-guided contact prediction. The method computes a human silhouette boundary from the human segmentation using a morphological gradient, depth edges from monocular depth, and a dilated exclusion region. Contact pixels are then defined as a band obtained by combining these three masks. From these pixels, the method selects background points whose projections fall into the band as candidate scene contacts. Given corresponding scene contact points $\mathbf{p}_i$ and matched human vertex indices $k_i$, the contact loss penalizes the distance between each matched human vertex $\mathbf{v}_{k_i}$ and its scene contact point $\mathbf{p}_i$.

Its role is to anchor predicted contacting human vertices to the reconstructed scene surface and reduce hovering above terrain.
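A contact term of this kind can be sketched in a few lines. The version below assumes a mean squared Euclidean distance between matched vertices and scene contact points; the norm, the averaging, and the name `contact_loss` are illustrative choices rather than the paper's definition.

```python
import numpy as np

def contact_loss(human_vertices, vertex_ids, scene_points):
    """Mean squared distance between matched human vertices and scene contact points.

    human_vertices: (V, 3) world-frame vertices for one frame
    vertex_ids:     (N,) indices of vertices predicted to be in contact
    scene_points:   (N, 3) corresponding scene contact points
    """
    diff = human_vertices[vertex_ids] - scene_points
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

Because only the per-frame translation and the scene scale are free in the optimization described here, lowering this term amounts to shifting the whole body (or rescaling the scene) until the matched vertices rest on the surface.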

Penetration is handled with a TSDF volume constructed from the background point cloud and normals. For a vertex $\mathbf{v}$, the TSDF value $d(\mathbf{v})$ is positive outside and negative inside the surface. MeshMimic uses a slackened penetration penalty together with a Huber-style robust loss, so that shallow penetrations within the slack are tolerated and deep outliers do not dominate the objective.
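One possible reading of a slackened penalty followed by a robust loss is sketched below. It assumes the penalty is the penetration depth beyond a slack, $\max(0, -d - \text{slack})$, and uses the standard Huber function; the parameter names and default values are hypothetical and are not taken from the paper.

```python
import numpy as np

def huber(x, delta):
    """Standard Huber function: quadratic near zero, linear beyond delta."""
    ax = np.abs(x)
    return np.where(ax <= delta, 0.5 * ax ** 2, delta * (ax - 0.5 * delta))

def penetration_loss(sdf_values, slack=0.01, delta=0.05):
    """Robust penalty on vertices lying more than `slack` inside the surface.

    sdf_values: signed distances at the vertices (positive outside, negative inside).
    """
    depth = np.maximum(0.0, -sdf_values - slack)   # zero for free-space and shallow contact
    return float(huber(depth, delta).mean())
```

The slack keeps vertices that merely touch the surface from being pushed away, which would otherwise work against the contact term.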

Trajectory smoothness is enforced by a penalty on the global translation trajectory, expressed in terms of the number of frames $T$, the frame rate $f$, and the global translation $\mathbf{t}^t$ at frame $t$. Foot-ground hovering is further reduced by a foot-snapping term that activates only when a foot joint is already near the terrain.
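The two terms can be illustrated with the sketch below. It assumes that smoothness is measured by squared finite-difference acceleration of the translation and that foot snapping is a squared height penalty gated by a distance threshold `tau`; both forms and all names are assumptions for illustration.

```python
import numpy as np

def smoothness_loss(trans, fps):
    """Mean squared acceleration of a (T, 3) translation trajectory sampled at `fps`."""
    acc = (trans[2:] - 2.0 * trans[1:-1] + trans[:-2]) * fps ** 2
    return float(np.mean(np.sum(acc ** 2, axis=1)))

def foot_snap_loss(foot_heights, tau):
    """Squared height above terrain, applied only to feet already within `tau` of it."""
    near = np.abs(foot_heights) < tau
    if not np.any(near):
        return 0.0
    return float(np.mean(foot_heights[near] ** 2))
```

The gating matters: without it, a foot in mid-swing would be pulled toward the ground, whereas with it only feet that are plausibly in stance are snapped to the surface.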

The optimization variables remain restricted to global translation per frame and one scene scale factor, while pose and shape remain fixed after initialization. According to the paper, the effect is improved contact placement, improved surface adherence, improved trajectory smoothness, and reduced penetration and hovering.

4. Contact-invariant retargeting and policy learning

After human-scene reconstruction and optimization, MeshMimic retargets the motion to a humanoid robot using MeshRetarget, described as a contact-aware extension of interaction-mesh retargeting. Following OmniRetarget, the method constructs an interaction mesh from human and robot anatomical keypoints, sampled object points, and sampled terrain points. The robot configuration $\mathbf{q}^t$ is then optimized per frame by minimizing the Laplacian deformation energy of this interaction mesh, thereby preserving relative spatial structure among body parts, objects, and terrain (Zhang et al., 17 Feb 2026).
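To make the Laplacian deformation energy concrete, the sketch below computes uniform-weight Laplacian coordinates over a given neighbor list and sums the squared differences between source and target configurations. The uniform weights and the explicit `neighbors` list are simplifying assumptions; interaction-mesh methods typically derive connectivity from a tetrahedralization of the sampled points.

```python
import numpy as np

def laplacian_coords(points, neighbors):
    """Each point minus the mean of its neighbors (uniform weights)."""
    return np.stack([points[i] - points[nbrs].mean(axis=0)
                     for i, nbrs in enumerate(neighbors)])

def deformation_energy(src, tgt, neighbors):
    """Sum of squared differences between source and target Laplacian coordinates."""
    diff = laplacian_coords(src, neighbors) - laplacian_coords(tgt, neighbors)
    return float(np.sum(diff ** 2))
```

Because Laplacian coordinates are differences to local neighbors, the energy is unchanged by a global translation of the target, and it grows when a body keypoint moves relative to the nearby terrain or object points. That property is what lets the energy encode relative interaction geometry rather than absolute pose.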

The paper identifies point sampling as critical in large-scale scenes. If terrain points are sampled too far from the human, the deformation energy may not reflect local interaction quality. MeshMimic therefore samples not only global terrain points but also additional terrain points near the human. The optimizer is SQP-style and enforces hard constraints for collision avoidance, joint limits, velocity limits, and stance-foot anchoring to prevent foot skating.
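The combined global and local sampling strategy might look like the following sketch. The sampling counts, the spherical neighborhood, and the function name are hypothetical; the paper does not specify them in the text summarized here.

```python
import numpy as np

def sample_terrain_points(terrain, human_center, n_global, n_local, radius, rng):
    """Union of uniformly sampled terrain points and extra points near the human.

    terrain: (N, 3) terrain point cloud; rng: a numpy.random.Generator.
    """
    global_idx = rng.choice(len(terrain), size=n_global, replace=False)
    dist = np.linalg.norm(terrain - human_center, axis=1)
    near = np.flatnonzero(dist <= radius)
    if len(near) == 0:
        return terrain[np.unique(global_idx)]
    local_idx = rng.choice(near, size=min(n_local, len(near)), replace=False)
    return terrain[np.union1d(global_idx, local_idx)]
```

In a large scene, uniform sampling alone may place few or no points under the feet or hands, so the local samples are what tie the interaction mesh to the surfaces actually being contacted.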

A further terrain penetration correction is applied because collision-free human motion does not guarantee collision-free robot motion under morphology mismatch. The method builds a terrain TSDF with value $d(\cdot)$, evaluates it at the robot vertices $\mathbf{v}_i$, computes a correction direction from the average SDF gradient over the set $\mathcal{P}$ of penetrated or near-surface vertices,

$$\mathbf{g} = \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} \nabla d(\mathbf{v}_i)$$

translates the robot by $\alpha\,\mathbf{g}$, and chooses the smallest $\alpha \ge 0$ by line search such that

$$\min_i \, d(\mathbf{v}_i + \alpha\,\mathbf{g}) \ge -\epsilon$$

with $\epsilon$ allowing slight tolerance. The stated purpose is to preserve contact-invariant interaction geometry while ensuring collision-free robot placement.
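The correction can be sketched as a simple line search along the averaged gradient. The version below is a minimal illustration under stated assumptions: `sdf` and `sdf_grad` are caller-supplied functions standing in for TSDF queries, the step grid is uniform, and the default tolerances are hypothetical.

```python
import numpy as np

def penetration_correction(verts, sdf, sdf_grad, eps=0.005, near=0.01,
                           step=0.005, max_steps=200):
    """Smallest grid step along the mean SDF gradient that clears penetration.

    verts: (N, 3) robot vertices; sdf(verts) -> (N,); sdf_grad(verts) -> (N, 3).
    Returns the (3,) translation to apply to the robot.
    """
    d = sdf(verts)
    if np.min(d) >= -eps:                 # already within tolerance
        return np.zeros(3)
    active = d < near                     # penetrated or near-surface vertices
    g = sdf_grad(verts[active]).mean(axis=0)
    for k in range(1, max_steps + 1):
        shift = k * step * g
        if np.min(sdf(verts + shift)) >= -eps:
            return shift
    raise RuntimeError("no collision-free shift found within max_steps")
```

Searching for the smallest admissible shift, rather than jumping to a fixed clearance, keeps the robot as close as possible to the retargeted pose, which is consistent with the stated goal of preserving the interaction geometry.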

Policy learning is performed in IsaacLab using asymmetric PPO. The paper describes a relatively minimal BeyondMimic-style formulation in which actor and critic are 4-layer MLPs, with a 5-step observation history and a 5-step future motion horizon. The actor receives reference motion features, proprioception, and previous actions. The critic additionally receives privileged scene information. Global torso position is included as an additional observation during training and is obtained from an optical motion-capture system during deployment. The reward is lightweight and mostly tracking-based, consisting of anchor tracking, body tracking, an action rate penalty, and a soft joint limit penalty. To improve efficiency, the authors first pretrain a generic whole-body tracker on about 50 hours of non-interactive human motion data and then fine-tune on the scene-interactive references. The resulting policy runs on a Unitree G1 at 50 Hz on an NVIDIA Jetson Orin.

5. Benchmarks, tasks, and reported results

MeshMimic is evaluated on both reconstruction quality and downstream real2sim2real behavior. The task set includes flat walking, jumping onto a box, running single-leg jump onto a box, climbing boxes, side climbing, safety vaulting, and a multi-stage jump-climb-descend sequence. The named evaluation scenes are Walk1, JB1, JB2, CB1, CB2, SV1, SV2, and JCD1. The corresponding descriptions include, for example, JB1 as a jump onto a 40 cm box, CB2 as a side climb onto a 60 cm box, and JCD1 as jumping onto a 20 cm box, climbing onto a 60 cm box, and descending with single-hand support (Zhang et al., 17 Feb 2026).

For reconstruction, the paper evaluates on a subset of SLOPER4D using sequences where SAM2 tracking succeeds, with two sequences each for running, walking, and stair ascent/descent. The metrics are W-MPJPE, defined as world-frame MPJPE after aligning only the first two frames of each 100-frame segment; WA-MPJPE, defined as world-aligned MPJPE over the segment; and Chamfer distance between the predicted scene point cloud and a LiDAR point cloud. Values are reported for WHAM, TRAM, VideoMimic, and MeshMimic on all three metrics, except that Chamfer distance is unavailable for WHAM.

The paper further reports that MeshMimic improves over VideoMimic in WA-MPJPE, W-MPJPE, and Chamfer, and that its scene Chamfer also improves on that of TRAM.
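The difference between the two MPJPE variants lies only in which frames are used to fit the alignment. The sketch below assumes a rigid (rotation plus translation, no scale) Kabsch fit; whether the benchmark's alignment also estimates scale is not stated here, so treat that choice and the function names as assumptions.

```python
import numpy as np

def kabsch(src, dst):
    """Rigid transform (R, t) minimising the squared error of R @ src_i + t - dst_i."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def mpjpe_after_alignment(pred, gt, n_align):
    """pred, gt: (T, J, 3). Fit on the first n_align frames, evaluate on all frames."""
    R, t = kabsch(pred[:n_align].reshape(-1, 3), gt[:n_align].reshape(-1, 3))
    aligned = pred @ R.T + t
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())

def w_mpjpe(pred, gt):
    """Align using only the first two frames, so later drift is fully exposed."""
    return mpjpe_after_alignment(pred, gt, 2)

def wa_mpjpe(pred, gt):
    """Align using the whole segment."""
    return mpjpe_after_alignment(pred, gt, len(pred))
```

W-MPJPE is therefore the stricter measure of global trajectory quality: any drift accumulated after the first two frames of the 100-frame segment shows up directly in the error, whereas WA-MPJPE absorbs part of it into the alignment.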

For real2sim2real evaluation, the paper compares MMM+MMT, VMM+MMT, and VMM+VMT, where MMM+MMT denotes MeshMimic motion plus MeshMimic terrain, VMM+MMT denotes VideoMimic motion plus MeshMimic terrain, and VMM+VMT denotes VideoMimic motion plus VideoMimic terrain. The reported qualitative finding is that MMM+MMT yields the highest training reward and the best real-world success rate. VideoMimic motion is described as exhibiting foot-in-air artifacts, interpenetration, and trajectory drift, while VideoMimic terrain reconstruction introduces floating obstacles, hollow surfaces, and uneven or corrupted geometry.

An ablation studies the addition of global torso position as an observation cue. Success rates are reported without and with torso position for Walk1, JB1, JB2, SV1, SV2, CB1, CB2, and JCD1. The paper interprets the results as evidence that global torso position helps most on long-horizon, path-dependent tasks such as JB2, CB1, and JCD1, while it can hurt on short, highly dynamic tasks such as SV1, SV2, and CB2.

6. Relation to adjacent mesh-based research and stated limitations

MeshMimic belongs to a broader family of mesh- or geometry-aware learning systems, but it occupies a distinct problem class. It is not a learned mesh-based physical simulator of the kind represented by “MeshGraphNet-Transformer: Scalable Mesh-based Learned Simulation for Solid Mechanics,” which addresses industrial-scale solid-mechanics simulation and message-passing under-reaching on Lagrangian meshes (Iparraguirre et al., 30 Jan 2026). It is also not a mesh movement network such as “Towards Universal Mesh Movement Networks,” which treats r-adaptive mesh movement as a PDE-solver component and uses a Graph Transformer encoder and GAT-based decoder for zero-shot mesh relocation across PDEs and geometries (Zhang et al., 2024). Nor is it identical to “mesh-based video action imitation,” where M-VAI reconstructs human meshes from source video, smooths them with mesh2mesh, and transfers pose to an arbitrary target identity mesh; that formulation targets mesh-based action imitation but does not center the coupled reconstruction of terrain geometry and humanoid control policy learning (Fu et al., 2021).

This distinction matters because MeshMimic’s mesh-related element is not mesh editing or mesh-native simulation per se. Its core object is a geometry-aware humanoid imitation loop in which reconstructed scene structure is a prerequisite for contact-rich behavior learning. A plausible implication is that the term “mesh” in MeshMimic refers less to a standalone mesh processing task than to the role of reconstructed 3D geometry in embodied control.

The paper notes or implies several limitations. Performance can degrade in highly dynamic scenes with rapid motion, blur, and occlusions. Global torso position can be noisy when externally estimated. Failures still occur in some long-horizon or complex multi-contact tasks. Reconstruction quality depends on the quality of the monocular vision stack and on successful tracking by SAM2. The approach also assumes that the scene can be reasonably reconstructed from the input video and represented in a form suitable for TSDF construction and contact reasoning (Zhang et al., 17 Feb 2026).

In that sense, MeshMimic’s contribution is not only a lower-cost alternative to MoCap, but a reformulation of humanoid imitation around explicit motion-terrain coupling. Its reported results support the paper’s claim that monocular video, 3D scene reconstruction, kinematic consistency optimization, and contact-invariant retargeting can be combined into a single pipeline for learning terrain-aware humanoid behavior.