Dynamic, real-time interactive worlds in unified models

Develop unified multimodal generative models that support continuous, real-time, closed-loop interaction and can thereby create truly dynamic and interactive worlds. This would overcome the current limitation of visual-prior unified models for text-to-image and text-to-video generation, which are restricted to single-shot synthesis or stepwise editing.

Background

Stage II consolidates processing and generation for multiple modalities into a single backbone and paradigm, reducing fragmentation and enabling cross-modal transfer. However, visual-prior unified models remain constrained to one-shot synthesis or incremental editing and lack continuous, real-time interaction capabilities.

The authors of the cited paper explicitly identify achieving truly dynamic and interactive worlds within the unified modeling paradigm as an open challenge, which motivates the transition to Stage III, focused on interactive generative models.

References

Thus, while Stage II unified architectures, the creation of truly dynamic and interactive worlds remains an open challenge and motivates Stage III.

— From Masks to Worlds: A Hitchhiker's Guide to World Models (2510.20668 - Bai et al., 23 Oct 2025) in Section 4.2 (Benefits and Gaps)
