Enabling general-purpose multimodal interaction at scale
Determine effective approaches for enabling general-purpose multimodal interaction in large-scale models trained on interleaved vision–language data, so that a single system can flexibly process and generate interleaved inputs and outputs across tasks.
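To make the notion of "interleaved vision–language data" concrete, here is a minimal Python sketch of how mixed text and image chunks can be flattened into a single token stream for one autoregressive model. All names and values (TextChunk, ImageChunk, VISION_START, encode_image, the placeholder tokenizers) are illustrative assumptions, not APIs or design details from Emu3.5.

```python
# Sketch: flattening interleaved text/image chunks into one token stream.
# Everything here is a hypothetical stand-in, not the Emu3.5 implementation.
from dataclasses import dataclass
from typing import List, Union

# Sentinel token ids marking modality boundaries (hypothetical values).
VISION_START, VISION_END = 100_000, 100_001

@dataclass
class TextChunk:
    text: str

@dataclass
class ImageChunk:
    pixels: bytes  # stand-in for raw image data

def encode_text(text: str) -> List[int]:
    # Placeholder: a real system would use a trained subword tokenizer.
    return [ord(c) for c in text]

def encode_image(pixels: bytes) -> List[int]:
    # Placeholder: a real system would use a learned visual tokenizer
    # (e.g. a VQ codebook) that maps one image to thousands of tokens,
    # which is why predicting visual tokens efficiently matters.
    return [b % 1024 for b in pixels]

def interleave(chunks: List[Union[TextChunk, ImageChunk]]) -> List[int]:
    """Flatten mixed text/image chunks into a single token sequence."""
    tokens: List[int] = []
    for chunk in chunks:
        if isinstance(chunk, TextChunk):
            tokens.extend(encode_text(chunk.text))
        else:
            tokens.append(VISION_START)
            tokens.extend(encode_image(chunk.pixels))
            tokens.append(VISION_END)
    return tokens

sequence = interleave([
    TextChunk("Describe the scene:"),
    ImageChunk(b"\x01\x02\x03\x04"),
    TextChunk("A cat on a mat."),
])
print(sequence)
```

Because the model sees one unified token stream, the same next-token objective covers both text and image positions, which is what lets a single system both consume and emit interleaved multimodal content.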
References
In particular, it remains unclear how to effectively learn long videos interleaved with text, how to enable general-purpose multimodal interaction, and how to efficiently predict tens of thousands of visual tokens, which pose stringent demands on pre-training, post-training, and inference, respectively.
— Emu3.5: Native Multimodal Models are World Learners
(Cui et al., arXiv:2510.26583, 30 Oct 2025), Section 1 (Introduction)