- The paper proposes X-Sim, a real-to-sim-to-real framework that learns robot manipulation policies from human videos by using object motion in simulation for rewards, eliminating the need for action labels.
- A key contribution is a novel online domain adaptation method that aligns real and simulated observations to distill robust, generalizable image-conditioned diffusion policies.
- Empirical evaluation shows X-Sim achieves an average 30% improvement in task progress over hand-tracking baselines across five manipulation tasks, demonstrating enhanced performance and data efficiency.
Cross-Embodiment Learning via Real-to-Sim-to-Real
The paper addresses a central challenge in robot manipulation learning: developing effective policies from action-less human demonstration videos, thereby circumventing the need for labor-intensive teleoperation data. The proposed real-to-sim-to-real framework uses object motion rather than human actions to train reinforcement learning (RL) policies, enabling cross-embodiment learning. Its key contributions are reconstructing real-world environments in simulation so that object-centric rewards can supervise policy training, and transferring the resulting policies directly to the real robot with image-based observations.
The pipeline begins by converting RGBD human videos into photo-realistic simulated scenes in which the demonstrated object trajectories define reward functions. These object-centric rewards train RL policies that move objects to the same end-poses shown in the human demonstrations, without requiring the robot's actions to mimic the human's. Training without human action labels is a substantial advantage: it sidesteps hand-tracking and kinesthetic-teaching pipelines, which often fail when the human-robot embodiment gap is large.
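To make the object-centric reward concrete, the sketch below shows one plausible formulation, assuming the reward is computed from the pose error between the simulated object and the corresponding waypoint of the object trajectory extracted from the human video; the exact reward shaping in the paper may differ, and the function name and scale parameters here are illustrative.

```python
import numpy as np

def object_centric_reward(obj_pos, obj_quat, target_pos, target_quat,
                          pos_scale=1.0, rot_scale=0.5):
    """Dense reward from object-pose tracking error (hypothetical form).

    obj_pos / obj_quat:       current simulated object pose, (3,) and (4,) wxyz
    target_pos / target_quat: pose of the current waypoint of the object
                              trajectory extracted from the human video
    """
    # Translational error: Euclidean distance to the demonstrated position.
    pos_err = np.linalg.norm(obj_pos - target_pos)

    # Rotational error: angle between unit quaternions via their inner product.
    dot = np.clip(np.abs(np.dot(obj_quat, target_quat)), 0.0, 1.0)
    rot_err = 2.0 * np.arccos(dot)

    # Exponentiated errors keep the reward bounded in (0, 1].
    return np.exp(-pos_scale * pos_err) * np.exp(-rot_scale * rot_err)


if __name__ == "__main__":
    r = object_centric_reward(
        obj_pos=np.array([0.42, 0.03, 0.11]),
        obj_quat=np.array([1.0, 0.0, 0.0, 0.0]),
        target_pos=np.array([0.45, 0.00, 0.10]),
        target_quat=np.array([0.99, 0.0, 0.0, 0.14]),
    )
    print(f"reward = {r:.3f}")
```

Evaluated at every simulation step against the demonstration waypoint currently being tracked, a reward of this kind credits the policy for reproducing the object's motion rather than the human's hand motion, which is what makes the supervision embodiment-agnostic.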
A salient contribution is the online domain adaptation method, which aligns real and simulated observations during deployment through continuous calibration. The adaptation mechanism uses synthetic rollouts to distill the simulation-trained policies into image-conditioned diffusion policies. These image-based policies generalize across varying lighting conditions and camera viewpoints without any teleoperation data, and they reach comparable proficiency with a fraction of the data that a traditional behavior-cloning baseline requires.
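The distillation step can be pictured as supervised learning on synthetic rollouts: the simulator re-renders states visited by the RL teacher under randomized lighting and camera poses, and the resulting image-action pairs train the image-conditioned student. The sketch below assumes hypothetical `teacher_act` and `render_randomized` interfaces and, for brevity, substitutes a plain convolutional regressor for the diffusion policy head used in the paper.

```python
import torch
import torch.nn as nn

class ImageConditionedPolicy(nn.Module):
    """Student policy mapping an RGB observation to a robot action.

    A plain regression head stands in for the diffusion policy head here.
    """
    def __init__(self, action_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, image):
        return self.head(self.encoder(image))


def distill(teacher_act, render_randomized, states, action_dim=7, epochs=10):
    """Distill a state-based sim teacher into an image-conditioned student.

    teacher_act(state) -> action tensor (the RL policy trained in simulation)
    render_randomized(state) -> (C, H, W) image tensor rendered with
        randomized lighting and camera pose for the same underlying state
    states: simulator states visited by teacher rollouts
    """
    # Build the synthetic dataset; targets come from the teacher, so detach.
    images = torch.stack([render_randomized(s) for s in states])
    actions = torch.stack([teacher_act(s) for s in states]).detach()

    student = ImageConditionedPolicy(action_dim)
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for _ in range(epochs):
        pred = student(images)
        loss = nn.functional.mse_loss(pred, actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```

Because each state can be re-rendered many times under different randomizations, the student sees far more visual variation than a single real-world data collection would provide, which is what buys the robustness to lighting and viewpoint changes noted above.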
The empirical evaluation spans five manipulation tasks across two environments and shows that the framework outperforms standard hand-tracking baselines by an average of 30% in task progress. The gap is most pronounced on tasks involving large object transformations, where baseline methods struggle with the visual and kinematic mismatches introduced by translating human motion to the robot.
The work is well positioned with respect to future directions in robot learning, particularly learning from the diverse human demonstrations available in in-the-wild video datasets. Its implications extend beyond the immediate gains in manipulation performance: it lays groundwork for closing the sim-to-real gap with robust, scalable learning methods, and it points toward integration with pre-trained robot foundation models that could be fine-tuned for novel environments without additional teleoperation data.
In summary, the paper presents a compelling solution to a critical bottleneck in robot learning from human videos, demonstrating that object-centric rewards and a real-to-sim-to-real strategy can effectively bridge the human-robot embodiment gap. The approach invites further work on scalable data generation and cross-domain learning frameworks, with the potential to shape future developments in robotic manipulation research.