
Efficient prediction of very large numbers of visual tokens

Develop inference methods that efficiently predict tens of thousands of discrete visual tokens within next-token prediction architectures for high-resolution image and video generation.


Background

Autoregressive multimodal models are bottlenecked by token-by-token decoding, especially for high-resolution images and long sequences that require thousands of visual tokens. Although Emu3.5 introduces Discrete Diffusion Adaptation (DiDA) to accelerate image generation, the authors explicitly note that it remains unclear how to efficiently predict such large token counts in general.
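The scale of the bottleneck can be illustrated with a minimal cost-model sketch. This is not the paper's method: the step counts, the 16-step denoising budget, and the tokenizer configuration below are illustrative assumptions, chosen only to contrast linear-in-length sequential decoding with a parallel, diffusion-style scheme whose step count is fixed.

```python
# Toy cost model: forward passes needed to generate N visual tokens.
# All numbers here are hypothetical, for illustration only.

def autoregressive_passes(num_tokens: int) -> int:
    """Next-token prediction: one forward pass per generated token,
    so cost grows linearly with sequence length."""
    return num_tokens

def parallel_denoising_passes(num_steps: int = 16) -> int:
    """A diffusion-style scheme: a fixed number of denoising steps,
    each predicting all tokens at once; cost is independent of N.
    The 16-step budget is an assumption, not a measured figure."""
    return num_steps

# A hypothetical 1024x1024 image tokenized at 16x spatial downsampling
# yields a 64x64 grid, i.e. 4096 discrete visual tokens.
n_tokens = 64 * 64
print(autoregressive_passes(n_tokens))   # 4096 sequential passes
print(parallel_denoising_passes())       # 16 parallel passes
```

Under these toy assumptions the sequential scheme needs hundreds of times more model invocations, which is the gap that methods like DiDA aim to close; the open question is how to achieve this in general for tens of thousands of tokens.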

Solving this problem is critical for practical deployment of native multimodal generators that must produce long, high-fidelity sequences under real-time or interactive constraints, and it impacts both architectural and systems-level design for inference.

References

In particular, it remains unclear how to effectively learn long videos interleaved with text, how to enable general-purpose multimodal interaction, and how to efficiently predict tens of thousands of visual tokens, which pose stringent demands on pre-training, post-training, and inference, respectively.

Emu3.5: Native Multimodal Models are World Learners (2510.26583 - Cui et al., 30 Oct 2025) in Section 1 (Introduction)