Learning and reasoning over long-horizon vision–language sequences
Establish effective methods to learn from and reason over long-horizon interleaved vision–language sequences that capture extended temporal dynamics and semantic coherence.
References
Recent advances in short-clip video generation have demonstrated the ability to capture short-term dynamics, but learning from and reasoning over long-horizon vision-language sequences remains a central open challenge.
— Emu3.5: Native Multimodal Models are World Learners
(arXiv:2510.26583, Cui et al., 30 Oct 2025), Section 1 (Introduction)