Efficient prediction of very large numbers of visual tokens
Develop inference methods that efficiently predict tens of thousands of discrete visual tokens within next-token prediction architectures for high-resolution image and video generation.
References
In particular, it remains unclear how to effectively learn long videos interleaved with text, how to enable general-purpose multimodal interaction, and how to efficiently predict tens of thousands of visual tokens, which pose stringent demands on pre-training, post-training, and inference, respectively.
— Emu3.5: Native Multimodal Models are World Learners
(2510.26583 - Cui et al., 30 Oct 2025) in Section 1 (Introduction)