Enabling general-purpose multimodal interaction at scale
Determine effective approaches for enabling general-purpose multimodal interaction in large-scale models trained on interleaved vision–language data, so that a single system can flexibly process and generate interleaved inputs and outputs across tasks.
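To make the notion of "interleaved vision–language data" concrete, here is a minimal Python sketch of how mixed text and image chunks can be flattened into a single token stream for one autoregressive model. All names and values (TextChunk, ImageChunk, VISION_START, encode_image, the placeholder tokenizers) are illustrative assumptions, not APIs or design details from Emu3.5.

```python
# Sketch: flattening interleaved text/image chunks into one token stream.
# Everything here is a hypothetical stand-in, not the Emu3.5 implementation.
from dataclasses import dataclass
from typing import List, Union

# Sentinel token ids marking modality boundaries (hypothetical values).
VISION_START, VISION_END = 100_000, 100_001

@dataclass
class TextChunk:
    text: str

@dataclass
class ImageChunk:
    pixels: bytes  # stand-in for raw image data

def encode_text(text: str) -> List[int]:
    # Placeholder: a real system would use a trained subword tokenizer.
    return [ord(c) for c in text]

def encode_image(pixels: bytes) -> List[int]:
    # Placeholder: a real system would use a learned visual tokenizer
    # (e.g. a VQ codebook) that maps one image to thousands of tokens,
    # which is why predicting visual tokens efficiently matters.
    return [b % 1024 for b in pixels]

def interleave(chunks: List[Union[TextChunk, ImageChunk]]) -> List[int]:
    """Flatten mixed text/image chunks into a single token sequence."""
    tokens: List[int] = []
    for chunk in chunks:
        if isinstance(chunk, TextChunk):
            tokens.extend(encode_text(chunk.text))
        else:
            tokens.append(VISION_START)
            tokens.extend(encode_image(chunk.pixels))
            tokens.append(VISION_END)
    return tokens

sequence = interleave([
    TextChunk("Describe the scene:"),
    ImageChunk(b"\x01\x02\x03\x04"),
    TextChunk("A cat on a mat."),
])
print(sequence)
```

Because the model sees one unified token stream, the same next-token objective covers both text and image positions, which is what lets a single system both consume and emit interleaved multimodal content.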
References
In particular, it remains unclear how to effectively learn long videos interleaved with text, how to enable general-purpose multimodal interaction, and how to efficiently predict tens of thousands of visual tokens, which pose stringent demands on pre-training, post-training, and inference, respectively.
— Emu3.5: Native Multimodal Models are World Learners
(Cui et al., arXiv:2510.26583, 30 Oct 2025), Section 1 (Introduction)