Simultaneously achieving real-time latency and long-term geometric consistency in interactive world models

Determine whether, and how, an interactive world modeling system for autoregressive streaming video generation can achieve both latency low enough for real-time user interaction and long-term geometric consistency, so that scenes remain coherent when previously observed locations are revisited.

Background

The paper identifies a core challenge in interactive world modeling: current approaches tend to prioritize either speed (real-time generation via distillation) or memory (long-term geometric consistency via explicit or implicit memory), but not both. Distillation-based methods often sacrifice consistency, while memory-centric methods introduce complexity that hampers distillation, leading to a persistent trade-off between latency and consistency.

WorldPlay is proposed as an approach to address this gap through dual action representation, reconstituted context memory, and context forcing for distillation. The authors frame the simultaneous attainment of both properties as an open problem in the field, motivating their contributions.

References

As summarized in Table 1, the simultaneous achievement of both low latency and high consistency remains an open problem.

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling (2512.14614 - Sun et al., 16 Dec 2025) in Section 1 (Introduction), paragraph discussing Table 1