Visual Imitation Enables Contextual Humanoid Control (2505.03729v3)

Published 6 May 2025 in cs.RO and cs.CV

Abstract: How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them: casually capture a human motion video and feed it to humanoids. We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills, all from a single policy, conditioned on the environment and global root commands. VIDEOMIMIC offers a scalable path towards teaching humanoids to operate in diverse real-world environments.

Summary

  • The paper presents VideoMimic, a real-to-sim-to-real pipeline that allows humanoid robots to learn complex skills directly from monocular human videos.
  • VideoMimic leverages 4D human-scene reconstruction and motion retargeting to generate simulation data, used in a multi-stage RL process to train a generalist control policy.
  • The learned policy, conditioned on a local heightmap and root direction, enables a real Unitree G1 robot to robustly execute various context-aware tasks like climbing stairs and sitting.

The paper "Visual Imitation Enables Contextual Humanoid Control" (2505.03729) introduces VideoMimic, a real-to-sim-to-real pipeline designed to enable humanoid robots to learn complex, context-aware whole-body skills directly from everyday monocular human videos. The core idea is to capture a human performing a task in a specific environment (like climbing stairs or sitting on a chair), reconstruct the human motion and the surrounding environment in 3D, use this data to train a control policy for a humanoid robot in a physics simulator, and then deploy that policy on a real robot.

The pipeline consists of two main stages: Real-to-Sim Data Acquisition and Policy Learning.

Real-to-Sim Data Acquisition

This stage converts raw video footage into data usable for simulation-based reinforcement learning.

  1. Preprocessing: The pipeline uses off-the-shelf computer vision methods to extract initial data from the video. This includes human pose and shape estimation (SMPL models via VIMO [wang2024tram]), 2D joint keypoints (ViTPose [xu2022vitpose]), foot contact estimation (BSTRO [huang2022rich]), and structure-from-motion reconstruction of the scene (MegaSaM [li2024megasam] or MonST3R [zhang2024monst3r]) to obtain per-frame camera poses and a raw scene point cloud. An initial coarse alignment of the human in the scene is performed following SLAHMR [ye2023slahmr], based on limb lengths and focal length.
  2. Joint Human-Scene Reconstruction: The system jointly optimizes the human's 4D trajectory (global translation and orientation, local poses) and the scene point cloud scale to achieve a metrically accurate and aligned reconstruction. This is formulated as an optimization problem minimizing 3D and 2D joint reprojection losses and temporal smoothness regularizers. A human height prior is used to resolve the scale ambiguity inherent in monocular SfM. Optionally, a scale-adaptation pass can reshape the human model to match the robot's embodiment, making complex actions more feasible, although this is skipped for real-world deployment to work with metric scale. The optimization is solved using a Levenberg-Marquardt solver implemented in JAX [yi2024egoallo], capable of processing a 300-frame sequence quickly (around 20 ms on an NVIDIA A100 GPU after compilation). The objective function for this optimization is:

    $$\arg\min_{\alpha,\gamma,\phi,\theta}\; w_{3\text{D}} L_{3\text{D}} + w_{2\text{D}} L_{2\text{D}} + L_{\text{Smooth}}$$

    where $L_{3\text{D}}$ is the L1 distance between the estimated and lifted 3D joints, $L_{2\text{D}}$ is the L1 distance between the detected 2D keypoints and the projection of the estimated 3D joints, and $L_{\text{Smooth}}$ penalizes frame-to-frame changes in root translation ($\gamma$) and local pose ($\theta$). A minimal code sketch of these residual terms appears right after this list.

  3. Generating Simulation-Ready Data:
    • Gravity Alignment: The reconstructed 3D data is aligned with real-world gravity using GeoCalib [veicht2024geocalib], rotating the scene and human trajectories so the +z axis points upwards, as expected by physics simulators.
    • Pointcloud Filtering and Meshification: The dense, noisy point cloud is filtered (removing background/dynamic points, cropping, voxel downsampling) and converted into a lightweight mesh using NKSR [huang2023nksr]. Top-down ray casting is used to fill holes in the mesh. This mesh represents the static environment in the simulator.
    • Humanoid Motion Retargeting: The refined human motion trajectories are adapted to the kinematics and physical constraints of the target humanoid robot (Unitree G1). This is treated as an optimization problem (solved with Levenberg-Marquardt using PyRoki [kim2025pyroki]) to find robot joint angles and root poses that track the human motion while respecting joint limits, avoiding self-collision, ensuring contact with the environment mesh (derived from estimated foot contacts [huang2022rich]), and minimizing skating. The cost function includes terms for tracking kinematic tree correspondences, matching foot contact points, penalizing skating, avoiding collisions, respecting joint limits, and temporal smoothness.
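
To make the reconstruction objective in step 2 concrete, here is a minimal sketch of how its three residual terms could be evaluated. The paper optimizes this with a Levenberg-Marquardt solver written in JAX; the plain-NumPy version below is purely illustrative, the scene-scale and shape parameters are assumed to already be folded into the inputs, and all function names, argument names, and weight values are assumptions rather than the authors' API.

```python
import numpy as np

def project(points_w, K, T_cw):
    """Simplified pinhole projection of world-frame points into the image.
    points_w: (T, J, 3), K: (3, 3) intrinsics, T_cw: (T, 3, 4) world-to-camera."""
    pts_h = np.concatenate([points_w, np.ones_like(points_w[..., :1])], axis=-1)  # (T, J, 4)
    pts_c = np.einsum("tij,tkj->tki", T_cw, pts_h)                                # (T, J, 3)
    uv_h = pts_c @ K.T
    return uv_h[..., :2] / uv_h[..., 2:3]

def reconstruction_objective(joints_3d_est, joints_3d_lifted, joints_2d_det,
                             K, T_cw, root_trans, local_pose,
                             w_3d=1.0, w_2d=0.5):
    """Weighted sum of the three terms in the reconstruction objective.

    joints_3d_est    : (T, J, 3) SMPL joints posed by the current parameters
    joints_3d_lifted : (T, J, 3) joints lifted into the (scaled) scene point cloud
    joints_2d_det    : (T, J, 2) detected 2D keypoints (e.g. ViTPose)
    root_trans       : (T, 3)   global root translation (gamma)
    local_pose       : (T, P)   local body pose parameters (theta)
    """
    # L_3D: L1 distance between estimated and lifted 3D joints.
    l_3d = np.abs(joints_3d_est - joints_3d_lifted).sum()

    # L_2D: L1 distance between projected 3D joints and detected 2D keypoints.
    l_2d = np.abs(project(joints_3d_est, K, T_cw) - joints_2d_det).sum()

    # L_Smooth: frame-to-frame changes in root translation and local pose.
    l_smooth = (np.abs(np.diff(root_trans, axis=0)).sum()
                + np.abs(np.diff(local_pose, axis=0)).sum())

    return w_3d * l_3d + w_2d * l_2d + l_smooth
```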

This process yields pairs of retargeted humanoid motion trajectories and environment meshes, ready for physics simulation and policy training.
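
The retargeting step (step 3 above) can be sketched in the same spirit as one scalar cost over the robot trajectory. The terms below mirror those listed in the paper (correspondence tracking, contact matching, skating, joint limits, temporal smoothness); the self-collision term is omitted for brevity, the weights are placeholders, and none of the names correspond to the PyRoki API.

```python
import numpy as np

def retargeting_cost(q, robot_fk, human_keypoints, contact_flags,
                     terrain_height, joint_lower, joint_upper, w=None):
    """Rough per-sequence retargeting cost for a (T, 7 + DoF) trajectory q,
    where the first 7 entries per frame are assumed to be the root pose.

    robot_fk(q) is a placeholder forward-kinematics call returning matched
    robot keypoints (T, K, 3) and foot positions (T, F, 3).
    contact_flags (T, F) and terrain_height (T, F) come from the estimated
    foot contacts and the environment mesh.
    """
    w = w or dict(track=1.0, contact=1.0, skate=0.5, limits=10.0, smooth=0.1)
    robot_kps, feet = robot_fk(q)

    # Track kinematic-tree correspondences between human and robot keypoints.
    c_track = np.sum((robot_kps - human_keypoints) ** 2)

    # Pull the feet onto the environment mesh whenever a contact is estimated.
    c_contact = np.sum(contact_flags * (feet[..., 2] - terrain_height) ** 2)

    # Penalize skating: foot displacement between frames while in contact.
    foot_vel = np.diff(feet, axis=0)
    c_skate = np.sum(contact_flags[1:, :, None] * foot_vel ** 2)

    # Soft joint-limit penalty and temporal smoothness of the whole trajectory.
    joints = q[:, 7:]
    c_limits = np.sum(np.maximum(joints - joint_upper, 0.0) ** 2
                      + np.maximum(joint_lower - joints, 0.0) ** 2)
    c_smooth = np.sum(np.diff(q, axis=0) ** 2)

    return (w["track"] * c_track + w["contact"] * c_contact + w["skate"] * c_skate
            + w["limits"] * c_limits + w["smooth"] * c_smooth)
```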

Policy Learning

This stage trains a reinforcement learning policy in the IsaacGym [isaacgym] simulator using the generated motion-mesh data.

  1. RL Setup: The policy is trained using Proximal Policy Optimization (PPO) [ppo]. The training is highly parallelized across thousands of simulated environments on multiple GPUs (e.g., 8192 environments across 2 NVIDIA 4090 GPUs).
  2. Observations: The robot policy receives proprioceptive inputs (history of joint positions/velocities, angular velocity, projected gravity, previous actions) and local target observations. In the early training stages, these include target joint angles, root roll/pitch, and the desired root direction relative to the robot. Crucially, for the final deployable policy, the observation set is reduced to proprioception, a local $11 \times 11$ heightmap patch centered on the torso (sampled at 0.1 m intervals), and the desired root direction (as an x-y offset and yaw in the robot's local frame); a sketch of how this reduced observation might be assembled appears after this list. The critic receives additional privileged observations (Table 2 in the paper details the full observation sets).
  3. Actions: The policy outputs target joint positions, which are then passed through a PD controller. Actions are clipped during training, and a bounds loss is used to encourage outputs within the desired range.
  4. Rewards: The reward function is primarily composed of data-driven tracking terms: rewarding closeness to reference link/joint positions and velocities, and matching foot contact states. Penalties are included for high action rates (especially ankle actions), exceeding DOF limits, collisions, and skating (velocity during contact). The reward aims to balance tracking the human kinematic data with physical feasibility.
  5. Training Stages: A four-stage training process is used:
    • Stage 1: MoCap Pre-Training (MPT): The policy is initially trained on retargeted professional motion capture data (LAFAN [harvey2020robust]) on flat ground. This helps the policy learn basic motor skills and bridge the human-to-robot embodiment gap with relatively clean data. The policy is conditioned on target joint angles and root information.
    • Stage 2: Scene-Conditioned Tracking: The policy is fine-tuned using the video-reconstructed motion-mesh pairs. The local heightmap observation is introduced. The policy still receives target joint angles and root information from the reference motion. This stage uses a batched version of DeepMimic [DeepMimic], sampling motions and tracking them in their respective reconstructed environments.
    • Stage 3: Distillation: The policy is distilled using DAgger [dagger] to operate without observing target joint angles or target root roll/pitch. The only target information is the desired root direction, which can come from an external source like a joystick or high-level planner. This creates a generalist policy no longer tied to specific motion clips via full kinematic targets.
    • Stage 4: Under-conditioned RL Finetuning: The distilled policy is further trained using RL with the reduced observation set (proprioception, heightmap, desired root direction). This fine-tuning step significantly boosts performance and robustness, allowing the policy to learn recovery behaviors and handle noisier references by not being strictly constrained by full kinematic targets.
  6. Domain Randomization: Extensive domain randomization is applied during training to improve sim-to-real transfer. This includes randomization of DOF friction, random pushes, observation noise (additive bias and white noise for gravity, joint positions/velocities, velocities, etc.), odometry update rate, and heightmap sensing noise (white noise, offset noise, roll/pitch/yaw noise, sensor delay, update frequency, bad-distance probability).
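
As a rough illustration of the reduced observation mentioned in the observations item above, the sketch below assembles proprioception, the local heightmap, and the desired root direction at one control step. Only the $11 \times 11$ grid size, the 0.1 m spacing, and the x-y-offset-plus-yaw goal encoding come from the paper; the grid centering convention, the elevation query, and every name here are assumptions.

```python
import numpy as np

def policy_observation(proprio, elevation_fn, base_xy, base_yaw,
                       goal_xy_world, goal_yaw_world, grid=11, spacing=0.1):
    """Assemble the reduced observation of the final deployable policy.

    proprio       : flat vector of recent joint positions/velocities, angular
                    velocity, projected gravity and previous actions
    elevation_fn  : placeholder callable (x, y) -> terrain height in world frame
    base_xy (2,), base_yaw           : torso position and heading estimates
    goal_xy_world (2,), goal_yaw_world : desired root target in the world frame
    """
    c, s = np.cos(base_yaw), np.sin(base_yaw)

    # 11 x 11 grid of sample points, 0.1 m apart, centered on the torso and
    # rotated into the robot's yaw frame before querying the elevation map.
    offsets = (np.arange(grid) - grid // 2) * spacing
    gx, gy = np.meshgrid(offsets, offsets, indexing="ij")
    xs = base_xy[0] + c * gx - s * gy
    ys = base_xy[1] + s * gx + c * gy
    heightmap = np.vectorize(elevation_fn)(xs, ys).reshape(-1)

    # Desired root direction as an x-y offset and yaw in the robot's local frame.
    dx, dy = np.asarray(goal_xy_world, dtype=float) - np.asarray(base_xy, dtype=float)
    dyaw = goal_yaw_world - base_yaw
    local_goal = np.array([c * dx + s * dy,
                           -s * dx + c * dy,
                           np.arctan2(np.sin(dyaw), np.cos(dyaw))])  # wrap yaw

    return np.concatenate([np.asarray(proprio, dtype=float), heightmap, local_goal])
```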

Real-world Deployment

The final distilled and finetuned policy is deployed on a Unitree G1 humanoid robot.

  1. Hardware and Software: The policy runs onboard the robot's Jetson Orin NX at 50Hz using C++, ROS, and the Unitree SDK.
  2. Sensing: Proprioceptive data is read directly. The heightmap is generated in real-time from the robot's LiDAR using Fast-lio2 [xu2021fastlio2fastdirectlidarinertial] and probabilistic terrain mapping [Fankhauser2018ProbabilisticTerrainMapping]. Joystick inputs provide the desired root direction.
  3. Deployment Strategy: A progressive deployment strategy was used, starting with testing MoCap policies and then increasingly generalist policies. Key findings for successful real-world transfer include relaxing episode termination tolerances in simulation training and extensive domain randomization.
  4. Demonstrated Skills: The robot successfully executes diverse, context-aware skills learned from the 123 curated video clips, including climbing and descending various stairs, sitting on and standing up from chairs/benches, and traversing rough terrain. These skills are handled by a single policy conditioned on the local heightmap and desired root direction, without explicit task labels or skill selection logic. The policy shows resilience, capable of recovering from unexpected disturbances like foot slips.
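
Schematically, the deployment pieces above fit together in a simple 50 Hz loop. The paper's implementation runs in C++ with ROS and the Unitree SDK; the Python sketch below only illustrates the data flow, and every interface (robot, elevation_map, joystick, policy) as well as the PD gains are hypothetical placeholders.

```python
import time
import numpy as np

CONTROL_HZ = 50        # policy rate reported in the paper
KP, KD = 80.0, 2.0     # illustrative PD gains, not the paper's values

def control_loop(robot, elevation_map, joystick, policy):
    """Schematic 50 Hz deployment loop; every interface here is a hypothetical
    placeholder used only to show the data flow (the real system is C++/ROS)."""
    dt = 1.0 / CONTROL_HZ
    while True:
        t0 = time.monotonic()

        # 1. Proprioception and base pose estimate (LiDAR odometry on the robot).
        q, dq, ang_vel, gravity, base_xy, base_yaw = robot.read_state()

        # 2. Local 11 x 11 heightmap sampled from the LiDAR elevation map.
        heightmap = elevation_map.sample_grid(base_xy, base_yaw, size=11, spacing=0.1)

        # 3. Desired root direction from the joystick, already in the local frame.
        goal = joystick.read_local_goal()  # (dx, dy, dyaw)

        # 4. Policy inference -> target joint positions -> PD torque command.
        obs = np.concatenate([q, dq, ang_vel, gravity, heightmap, goal])
        q_target = policy(obs)
        tau = KP * (q_target - q) - KD * dq
        robot.send_torques(tau)

        # Hold the loop at 50 Hz.
        time.sleep(max(0.0, dt - (time.monotonic() - t0)))
```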

Results and Evaluation

  • Reconstruction: Quantitative evaluation on SLOPER4D [dai2023sloper4d] shows that VideoMimic's reconstruction pipeline outperforms previous methods (WHAM [shin2024wham], TRAM [wang2024tram]) in both human trajectory accuracy (lower WA/W-MPJPE) and scene geometry accuracy (lower Chamfer Distance) (Table 1). The pipeline also demonstrates versatility in handling dynamic scenes, multiple humans, and enables ego-view rendering.
  • Policy Learning: Ablation studies show that MoCap pre-training is crucial for successfully training on the noisier video-derived data (Figure 1). The multi-stage RL pipeline effectively distills complex tracking behaviors into a generalist policy conditioned only on environment and root commands.
  • Real-world: Qualitative results (Figure 2 and project video) demonstrate the learned policy's ability to perform complex, context-dependent tasks robustly on a real robot in diverse environments, including previously unseen stairs and furniture.

Limitations

The paper acknowledges several practical limitations: brittleness and artifacts in monocular 4D reconstruction, challenges in retargeting in cluttered scenes, the limited resolution of the 11x11 heightmap for fine manipulation, the assumption of rigid scenes in simulation, and the potential for jerky motions due to the limited scale and quality of the current video dataset (123 clips).

In summary, VideoMimic presents a promising data-driven approach for endowing humanoids with versatile, context-aware skills by leveraging readily available human video data, bridging the gap between visual observation, simulation learning, and real-world robot control.
