Capturing action-conditioned physical dynamics for robotic manipulation

Develop action-conditioned, video-based world models that accurately capture the contact-rich interactions, robot kinematics, and fine-grained physical dynamics of robotic manipulation. Such models must predict object motion under specified 7-DoF end-effector actions and support closed-loop planning and control on RLBench manipulation tasks.

Background

The World-in-World paper (Zhang et al., 2025) evaluates visual world models across four embodied tasks and finds that, while these models often aid perception and navigation, their benefits for robotic manipulation are modest. The authors attribute this to the difficulty of accurately modeling the contact-rich interactions, robot kinematics, and fine-grained dynamics that manipulation depends on.

In RLBench-based manipulation experiments, even post-trained video generators show only small improvements over strong baselines, indicating that current models struggle with precise action-conditioned predictions in physically complex settings. This motivates the explicit identification of robust physical dynamics modeling as an unresolved challenge for embodied world models.
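To make "action-conditioned prediction in a closed loop" concrete, the following is a minimal sketch, not the paper's method. It assumes a 7-dimensional action laid out as three translation deltas, three rotation deltas, and one gripper command. It also swaps the learned video model for a hypothetical toy predictor (`toy_world_model`) that only integrates translation, and uses a generic random-shooting planner. All function names and parameters are illustrative.

```python
import math
import random

# Assumed action layout: 3 translation deltas, 3 rotation deltas, 1 gripper command.
ACTION_DIM = 7


def toy_world_model(position, action):
    """Hypothetical stand-in for a learned action-conditioned predictor.

    It only integrates the translation part of the action. A real video
    world model would instead predict future frames from pixels.
    """
    return [p + a for p, a in zip(position, action[:3])]


def plan_action(position, goal, model, rng, num_candidates=256, horizon=3, scale=0.05):
    """Random-shooting planner: sample action sequences, roll each one out
    through the model, and return the first action of the cheapest sequence."""
    best_cost, best_action = math.inf, None
    for _ in range(num_candidates):
        sequence = [
            [rng.uniform(-scale, scale) for _ in range(ACTION_DIM)]
            for _ in range(horizon)
        ]
        predicted, cost = position, 0.0
        for action in sequence:
            predicted = model(predicted, action)
            # Accumulate the distance to the goal at every predicted step.
            cost += math.dist(predicted, goal)
        if cost < best_cost:
            best_cost, best_action = cost, sequence[0]
    return best_action


def run_closed_loop(start, goal, steps=30, seed=0):
    """Re-plan at every step from the newly observed state."""
    rng = random.Random(seed)
    position = list(start)
    for _ in range(steps):
        action = plan_action(position, goal, toy_world_model, rng)
        # The "environment" here is the same toy dynamics; in practice this
        # would be a simulator step, for example in RLBench.
        position = toy_world_model(position, action)
    return position


if __name__ == "__main__":
    start, goal = [0.0, 0.0, 0.0], [0.3, -0.2, 0.1]
    final = run_closed_loop(start, goal)
    print("distance to goal:", math.dist(final, goal))
```

In this toy setting the model and the environment coincide, so planning is easy. The difficulty described above arises when the predictor's rollouts diverge from the true contact dynamics, so that the candidate ranked best by the model is not the best action in the real scene.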

References

This gap suggests that while current visual world models can effectively guide perception and navigation, capturing fine-grained physical dynamics and action-conditioned object motion remains an open challenge.

— World-in-World: World Models in a Closed-Loop World (2510.18135 - Zhang et al., 20 Oct 2025) in Section 4.1 (Benchmark Results), Robotic Manipulations
