RoboScape: Unified Physics-Driven World Model
- RoboScape is a unified, physics-informed world model that integrates RGB video and depth prediction to achieve realistic 3D geometry and motion.
- It employs dual-branch Spatial-Temporal Transformers with temporal depth and keypoint dynamics supervision to ensure geometric consistency and physical plausibility.
- RoboScape enhances simulation-based evaluation and robotic policy training by producing high-fidelity synthetic data, as validated by improved LPIPS, PSNR, and depth metrics.
RoboScape is a unified, physics-informed embodied world model developed to generate physically plausible and visually realistic robotic videos while serving as an effective simulator for data-driven robotics applications. The framework addresses the limitations of previous world models in capturing 3D geometry and motion dynamics, thereby advancing embodied intelligence research and enabling new forms of policy training, simulation-based evaluation, and scalable dataset synthesis.
1. Definition and Core Objectives
RoboScape is a unified physics-informed world model that jointly learns to generate RGB video sequences and incorporate physics knowledge, such as temporal depth and keypoint dynamics, within an integrated architecture. Its primary objective is to overcome the physical implausibility and low geometric consistency observed in prior embodied world models, particularly in contact-rich robotic manipulation scenarios. This is achieved by infusing explicit physics supervision into the generative process, resulting in enhanced 3D structure and physically realistic motion in synthesized videos. The codebase and model artifacts are available at https://github.com/tsinghua-fib-lab/RoboScape.
2. Architecture and Physics-Aware Joint Training
RoboScape employs a dual-branch architecture with parallel Spatial-Temporal Transformers for simultaneous RGB and depth video token modeling. The framework introduces two key physics-informed supervision strategies:
Temporal Depth Prediction
A temporal depth prediction branch runs in parallel with the RGB prediction pathway, enforcing 3D geometric consistency over time. Depth-branch features are projected and additively fused into each block of the RGB branch:

$$h^{(l)}_{\mathrm{rgb}} \leftarrow h^{(l)}_{\mathrm{rgb}} + \phi^{(l)}\big(h^{(l)}_{\mathrm{depth}}\big),$$

where $\phi^{(l)}$ is a learnable projection at layer $l$. Cross-entropy losses for RGB and depth token predictions are computed jointly:

$$\mathcal{L}_{\mathrm{CE}} = \mathcal{L}^{\mathrm{rgb}}_{\mathrm{CE}} + \mathcal{L}^{\mathrm{depth}}_{\mathrm{CE}}.$$
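As a concrete illustration, the fusion step can be sketched in PyTorch as below. This is a minimal sketch assuming per-layer hidden states from the two branches; the module and variable names (`DepthFusionBlock`, `h_rgb`, `h_depth`) are illustrative and not taken from the released code.

```python
import torch
import torch.nn as nn

class DepthFusionBlock(nn.Module):
    """Fuses depth-branch features into the RGB branch at one transformer layer.

    A learnable linear projection maps depth features into the RGB feature
    space, and the result is added residually, as in the fusion rule above.
    """
    def __init__(self, depth_dim: int, rgb_dim: int):
        super().__init__()
        self.proj = nn.Linear(depth_dim, rgb_dim)  # phi^(l): learnable per-layer projection

    def forward(self, h_rgb: torch.Tensor, h_depth: torch.Tensor) -> torch.Tensor:
        # h_rgb, h_depth: (batch, tokens, dim) hidden states at layer l
        return h_rgb + self.proj(h_depth)
```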
Keypoint Dynamics Learning
Keypoints are densely sampled using pretrained trackers, and the top-$K$ points with maximum motion amplitude are selected:

$$\mathcal{K} = \operatorname{TopK}_{k}\ \sum_{t=1}^{T-1} \big\| p^{k}_{t+1} - p^{k}_{t} \big\|_{2},$$

where $p^{k}_{t}$ denotes the position of keypoint $k$ in frame $t$. A temporal consistency loss penalizes deviations in visual token features at these locations:

$$\mathcal{L}_{\mathrm{kp}} = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \sum_{t=1}^{T-1} \big\| f_{t}\big(p^{k}_{t}\big) - f_{t+1}\big(p^{k}_{t+1}\big) \big\|^{2}_{2},$$

where $f_{t}(\cdot)$ is the visual token feature at a given location in frame $t$. Keypoint-guided attention reweights the cross-entropy loss to focus on high-motion regions:

$$\mathcal{L}^{\mathrm{w}}_{\mathrm{CE}} = \sum_{i} w_{i}\, \ell_{\mathrm{CE}}(i), \qquad w_{i} > 1 \ \text{for tokens near selected keypoints}.$$

The final training objective is

$$\mathcal{L} = \mathcal{L}^{\mathrm{w}}_{\mathrm{CE}} + \lambda_{1}\,\mathcal{L}^{\mathrm{depth}}_{\mathrm{CE}} + \lambda_{2}\,\mathcal{L}_{\mathrm{kp}},$$

with hyperparameters $\lambda_{1}$ and $\lambda_{2}$ balancing the physics-informed terms.
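The keypoint supervision can be sketched similarly. The helpers below (`top_k_keypoints`, `keypoint_consistency_loss`) are hypothetical illustrations of the two steps, assuming tracked pixel trajectories and per-frame token feature maps; they are not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def top_k_keypoints(tracks: torch.Tensor, k: int) -> torch.Tensor:
    """Select the k tracked points with the largest total motion amplitude.

    tracks: (num_points, T, 2) pixel trajectories from a pretrained tracker.
    Returns indices of the k points with maximal summed frame-to-frame motion.
    """
    motion = (tracks[:, 1:] - tracks[:, :-1]).norm(dim=-1).sum(dim=-1)  # (num_points,)
    return motion.topk(k).indices

def keypoint_consistency_loss(feats: torch.Tensor, tracks: torch.Tensor,
                              idx: torch.Tensor) -> torch.Tensor:
    """Penalize feature drift along selected keypoint trajectories.

    feats:  (T, H, W, C) visual token features per frame.
    tracks: (num_points, T, 2) integer token coordinates (x, y).
    idx:    indices returned by top_k_keypoints.
    """
    loss = feats.new_zeros(())
    for k in idx:
        x = tracks[k, :, 0].long()
        y = tracks[k, :, 1].long()
        f = feats[torch.arange(feats.shape[0]), y, x]  # (T, C) features along the track
        loss = loss + F.mse_loss(f[1:], f[:-1])        # consecutive-frame consistency
    return loss / len(idx)
```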
3. Video Generation, Physical Plausibility, and Benchmarking
RoboScape demonstrates substantial improvements in both visual fidelity and physical realism compared to state-of-the-art world models, including IRASim, iVideoGPT, Genie, and CogVideoX. In quantitative benchmarks:
| Model | LPIPS ↓ | PSNR ↑ | AbsRel ↓ | δ₁ ↑ | δ₂ ↑ | ΔPSNR ↑ |
|---|---|---|---|---|---|---|
| IRASim | 0.6674 | 11.57 | 0.6252 | 0.501 | 0.702 | 0.027 |
| iVideoGPT | 0.4963 | 16.12 | 0.7586 | 0.348 | 0.579 | 0.114 |
| Genie | 0.1683 | 19.76 | 0.4425 | 0.544 | 0.774 | 1.987 |
| CogVideoX | 0.2180 | 17.52 | 0.5243 | 0.605 | 0.760 | — |
| RoboScape | 0.1259 | 21.85 | 0.36 | 0.621 | 0.831 | 3.344 |
Metrics:
- LPIPS and PSNR: Visual similarity
- AbsRel, δ₁, δ₂: Depth/geometry consistency
- ΔPSNR: Policy controllability
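For concreteness, the depth-consistency metrics follow standard monocular-depth conventions. The sketch below assumes predicted and ground-truth depth maps as NumPy arrays and uses the usual 1.25 / 1.25² thresholds for δ₁ and δ₂; this is an assumption about the evaluation protocol, which the paper does not spell out here.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Standard depth metrics: AbsRel and threshold accuracies δ₁, δ₂.

    pred, gt: depth maps of identical shape; invalid (near-zero) ground-truth
    pixels are masked out before computing statistics.
    """
    mask = gt > eps
    pred, gt = pred[mask], gt[mask]

    # AbsRel: mean absolute relative error |pred - gt| / gt (lower is better).
    abs_rel = np.mean(np.abs(pred - gt) / gt)

    # δ_i: fraction of pixels with max(pred/gt, gt/pred) below 1.25**i.
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    delta2 = np.mean(ratio < 1.25 ** 2)
    return abs_rel, delta1, delta2
```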
Ablation studies confirm that both temporal depth and keypoint dynamics training are essential for simultaneously achieving geometric integrity and realistic motion. Omission of either term results in geometric distortion or implausible physical behavior in video prediction outputs.
4. Applications in Policy Training and Evaluation
RoboScape serves as a data generator for robotic policy learning and as a simulator for offline policy evaluation:
- Policy Training: Incorporating up to 200 RoboScape-generated demonstration trajectories was shown to boost policy success rates in downstream tasks (e.g., Robomimic-Lift), closely matching or exceeding those of policies trained solely on real-world data.
- Policy Evaluation: RoboScape provides differentiable, video-based simulation rollouts conditioned on sequences of robot actions. Success rates estimated within RoboScape correlate strongly with results from ground-truth physics simulators, outperforming video models that lack physics supervision.
This supports the framework's practical value for pre-screening policy candidates, reducing physical robot trial costs, and accelerating iterative development.
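A minimal sketch of how such a learned simulator can pre-screen policies follows; `world_model.step`, `policy`, and `is_success` are assumed interfaces chosen for illustration and do not correspond to a documented RoboScape API.

```python
import torch

def rollout_success_rate(world_model, policy, init_frames, is_success,
                         horizon: int = 50, episodes: int = 100) -> float:
    """Estimate a policy's success rate inside a learned world model.

    world_model.step(obs, action) -> next predicted observation (assumed API).
    policy(obs) -> action for the policy under evaluation.
    is_success(obs) -> bool, a task-specific success detector (assumed).
    """
    successes = 0
    for ep in range(episodes):
        obs = init_frames[ep % len(init_frames)]  # start from a real initial frame
        for _ in range(horizon):
            with torch.no_grad():
                action = policy(obs)                 # query the candidate policy
                obs = world_model.step(obs, action)  # imagined transition
            if is_success(obs):
                successes += 1
                break
    return successes / episodes
```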
5. Experimental Details and Dataset
RoboScape is trained on a large-scale, multi-modal video suite:
- Source: 50,000 videos from AgiBotWorld-Beta spanning 147 tasks and 72 skills
- Scale: 6.5 million training clips derived from this source, trained for 5 epochs on 32 A800 GPUs
- Annotations: Depth maps and keypoints generated via automated pipelines
Robust comparisons with leading baselines in both appearance (LPIPS/PSNR) and physics-aware (AbsRel, δ₁/δ₂, ΔPSNR) evaluations are reported across diverse robotic scenarios.
6. Broader Implications and Future Trajectory
The integrated design of RoboScape, which embeds physics knowledge (depth and keypoints) directly into joint RGB video learning, enables efficient and robust world model training without reliance on heavy, cascaded physics simulations. This approach suggests a path toward scalable robotics research by:
- Lowering hardware/data collection requirements
- Facilitating safe, efficient learning in simulation before real-world deployment
- Enabling new research directions in sim-to-real transfer and offline reinforcement learning where both visual and physical realism are essential
The framework's extensibility and emphasis on physical plausibility position it as a foundation for advanced embodied AI, with implications for future integrations with real-robot platforms and broader use in safety-critical fields such as healthcare and disaster response. An explicit next step identified by the authors is unifying RoboScape simulation with real-robot evaluation to further validate sim-to-real generalization.
7. Related Environments and Educational Integration
RoboScape connects directly to broader trends in robotics research, including those identified by earlier works on web-scale evolutionary robotics (1406.3337), photorealistic simulation environments (1810.06936), educational visual programming tools for ROS-based robots (2011.13706), and reinforcement-learning-based service robot planners (2103.05225). A plausible implication is that the efficient, physics-informed approach of RoboScape lowers entry barriers for both research and education by supplying high-fidelity synthetic data and robust, controllable environments for simulation-driven innovation.
RoboScape thus represents an architectural and methodological advance in the development of embodied AI, demonstrating that integrated physics supervision within world models contributes directly to higher-quality video prediction, more reliable policy training, and robust offline evaluation for modern robotics research.