Dynamic, real-time interactive worlds in unified models
Develop unified multimodal generative models that support continuous, real-time, closed-loop interaction, so that they can create truly dynamic and interactive worlds. Current visual-prior unified models for text-to-image and text-to-video generation are restricted to single-shot synthesis or stepwise editing.
References
Thus, while Stage II unified architectures, the creation of truly dynamic and interactive worlds remains an open challenge and motivates Stage III.
— From Masks to Worlds: A Hitchhiker's Guide to World Models
(2510.20668 - Bai et al., 23 Oct 2025) in Section 4.2 (Benefits and Gaps)