Learning and reasoning over long-horizon vision–language sequences

Establish effective methods to learn from and reason over long-horizon interleaved vision–language sequences that capture extended temporal dynamics and semantic coherence.

Background

The paper positions long-horizon multimodal learning, particularly from interleaved vision–language sequences such as long videos with aligned transcripts, as a key capability for world models. While short-clip video generation has advanced, sustaining temporal consistency and coherent reasoning over extended sequences remains challenging. Emu3.5 is proposed as a native multimodal next-token predictor trained on large-scale interleaved data to move toward this goal, but the authors explicitly acknowledge that the broader challenge remains open.

Addressing this problem is central to building world models that can generalize across time, maintain consistency of entities and narratives, and integrate visual and linguistic context over many steps. It underpins downstream capabilities such as visual narrative, visual guidance, world exploration, and embodied manipulation.

References

Recent advances in short-clip video generation have demonstrated the ability to capture short-term dynamics, but learning from and reasoning over long-horizon vision-language sequences remains a central open challenge.

Emu3.5: Native Multimodal Models are World Learners (arXiv:2510.26583, Cui et al., 30 Oct 2025), Section 1 (Introduction)