Appropriate Role of Generative Video Models in Robotic Manipulation

Determine the functional role that large generative video models should serve within robotic manipulation systems, so that their visual predictions can be leveraged effectively despite the embodiment gap between the humans these models typically depict and the robots that must act. The goal is to ascertain how such models should be integrated into manipulation pipelines so that their imagined object motions can be translated into executable robot actions.

Background

Generative video models can synthesize plausible interactions conditioned on an initial image and language instruction, capturing intuitive physics and object priors that are valuable for open-world robot manipulation. However, these models predominantly depict human embodiments, creating an embodiment gap between the predicted motions and robot action spaces.
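To make the conditioning interface concrete, below is a minimal Python sketch of an image-and-language-conditioned video model. The `VideoGenerator` class and its `sample` method are hypothetical placeholders standing in for a large pretrained model, not a real library API; the stub merely repeats the initial frame so the example runs end to end.

```python
# Hypothetical interface for a generative video model conditioned on an
# initial image and a language instruction. The stub implementation is a
# placeholder; a real model would synthesize a plausible interaction.
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoGenerator:
    """Stand-in for a large pretrained image+text-to-video model."""
    num_frames: int = 16

    def sample(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Stub: repeat the initial frame num_frames times.
        # Output shape: (T, H, W, 3).
        return np.repeat(image[None], self.num_frames, axis=0)

if __name__ == "__main__":
    first_frame = np.zeros((256, 256, 3), dtype=np.uint8)  # e.g., a scene camera image
    model = VideoGenerator()
    video = model.sample(first_frame, "pick up the red mug")
    print(video.shape)  # (16, 256, 256, 3)
```

The salient point is that the model's output is pixels depicting an interaction, often performed by a human hand, rather than anything expressed in a robot's action space.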

The paper proposes Dream2Flow to bridge video generation and robot control via 3D object flow, arguing that reconstructing and tracking object motion, rather than the depicted embodiment, provides a general interface between predicted videos and robot actions. Nonetheless, the authors explicitly pose the broader question, of where video models should sit within manipulation systems, as open.
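The sketch below illustrates the shape of such a pipeline: generate a video, reconstruct 3D object flow from it, and convert that flow into end-effector waypoints. This is a hedged approximation under stated assumptions, not the paper's actual method; the function names `track_object_points_3d` and `flow_to_end_effector_trajectory`, the aligned-depth input, and the centroid-following conversion are all hypothetical placeholders.

```python
# Sketch of a video-to-action pipeline using 3D object flow as the
# intermediate interface. All functions here are illustrative stubs.
import numpy as np

def track_object_points_3d(video: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Reconstruct and track object points across the generated frames.

    Returns (T, K, 3): K tracked 3D points per frame. A real system would
    combine a 2D point tracker with depth-based 3D lifting; this stub
    emits a static point set so the example runs.
    """
    T, K = video.shape[0], 8
    return np.zeros((T, K, 3))

def flow_to_end_effector_trajectory(flow: np.ndarray) -> np.ndarray:
    """Convert 3D object flow into end-effector waypoints.

    Naively follows the centroid of the tracked points; an actual method
    might instead fit rigid motions, select grasps, or run a trajectory
    optimizer against the flow.
    """
    return flow.mean(axis=1)  # (T, 3) centroid path

video = np.zeros((16, 256, 256, 3), dtype=np.uint8)  # output of the video model
depth = np.zeros((16, 256, 256), dtype=np.float32)   # assumed aligned depth frames
object_flow = track_object_points_3d(video, depth)
waypoints = flow_to_end_effector_trajectory(object_flow)
print(waypoints.shape)  # (16, 3)
```

The design rationale is that object motion is embodiment-agnostic: the robot only needs to reproduce where the objects go, not how the depicted hand moved, which sidesteps the human-to-robot embodiment gap.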

References

"Despite their promise, it remains unclear what role such models should serve in a robot manipulation system."

Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow (Dharmarajan et al., 31 Dec 2025, arXiv:2512.24766), Section 1 (Introduction)