Emu3.5: Unified Multimodal World Model
- Emu3.5 is a large-scale multimodal world model integrating vision and language data for unified next-token prediction and long-horizon reasoning.
- The model employs a 34.1B parameter decoder-only Transformer with Discrete Diffusion Adaptation, achieving a 20× speed improvement in image token generation.
- It uses reinforcement learning post-training and extensive video-text data to enable robust world modeling, any-to-image editing, and embodied planning.
Emu3.5 is a large-scale native multimodal world model designed to perform unified next-token prediction across both vision and language sequences, enabling complex long-horizon reasoning, generation, and world simulation. It represents a substantial advancement in scalable multimodal AI, leveraging vast interleaved video-text data, highly optimized inference (Discrete Diffusion Adaptation), and rigorous reinforcement learning post-training for generalizable world modeling (Cui et al., 30 Oct 2025).
1. Model Architecture and Representational Space
Emu3.5 utilizes a decoder-only Transformer architecture scaled to 34.1 billion parameters, incorporating 64 layers with hidden size 5,120 and intermediate size 25,600. Grouped Query Attention (64 attention heads, 8 key-value heads) facilitates computationally efficient long-context processing, while RMSNorm (pre-normalization) and QK-Norm stabilize activation and attention statistics.
Vision and language tokens are jointly embedded:
- Vocabulary: 282,926 tokens (151,854 text, 131,072 vision via IBQ; Editor’s term: “joint vocab space”)
- Context Length: Up to 32,768 tokens (suitable for extended multimodal documents and video)
- Positional Encoding: Rotary Positional Embedding (RoPE), supporting long-horizon temporal reasoning
- Visual Tokenizer: Index Backpropagation Quantization (IBQ), codebook size 131,072, vector dim 256
- Activation: SwiGLU
- Dropout: 0.1
This configuration enables high-resolution image encoding (up to 1024–2048 px), compact vision representation, and tight integration with multilingual text input (QwenTokenizer for text).
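At inference time, the visual tokenizer maps continuous patch features to discrete ids in the joint vocabulary. The sketch below shows a generic nearest-codebook lookup under the stated codebook shape; it is not the IBQ training procedure itself, and the function and tensor names are illustrative assumptions.

```python
import torch

def quantize_to_vision_tokens(patch_features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Assign each patch feature to its nearest codebook entry (generic VQ lookup).

    patch_features: [num_patches, 256] continuous encoder outputs
    codebook:       [codebook_size, 256] learned codebook (131,072 entries, dim 256 in the paper)
    Returns:        [num_patches] integer vision-token ids; offsetting these ids into the
                    282,926-entry joint text+vision vocabulary is omitted here.
    """
    distances = torch.cdist(patch_features, codebook)  # pairwise L2 distances
    return distances.argmin(dim=-1)

# Example with reduced sizes (the real codebook has 131,072 entries; a 1024 px image
# corresponds to roughly 4,096 vision tokens):
codebook = torch.randn(1_024, 256)
patches = torch.randn(64, 256)
token_ids = quantize_to_vision_tokens(patches, codebook)
```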
| Parameter | Value |
|---|---|
| Parameters | 34.1B |
| Layers | 64 |
| Vocabulary | 282,926 |
| Context Len | 32,768 |
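For reference, these hyperparameters can be collected into a single configuration sketch; the field names below are illustrative, not the released code's API.

```python
from dataclasses import dataclass

@dataclass
class Emu35Config:
    """Hyperparameters reported for Emu3.5 (field names are hypothetical)."""
    num_layers: int = 64
    hidden_size: int = 5_120
    intermediate_size: int = 25_600      # SwiGLU feed-forward width
    num_attention_heads: int = 64        # Grouped Query Attention
    num_key_value_heads: int = 8
    vocab_size: int = 282_926            # 151,854 text + 131,072 IBQ vision tokens
    max_context_length: int = 32_768     # with RoPE positional encoding
    dropout: float = 0.1

cfg = Emu35Config()
# Sanity check: token embeddings alone account for
# vocab_size * hidden_size ~= 1.45B of the 34.1B total parameters.
print(f"{cfg.vocab_size * cfg.hidden_size / 1e9:.2f}B embedding parameters")
```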
2. Unified Training Paradigm and Data Corpus
Emu3.5 is trained end-to-end with a single cross-entropy loss for next-token prediction on interleaved vision-language data. The core objective is $\mathcal{L}(\theta) = -\sum_{t} w_t \log p_\theta(x_t \mid x_{<t})$, where $w_t$ is a modality-balancing weight set to $1.0$ for text tokens and a separate value for vision tokens.
All communication with the model is in document-style sequences:
[IMAGE TOKEN 1, ..., IMAGE TOKEN N, TEXT TOKEN 1, ..., TEXT TOKEN K, IMAGE TOKEN N+1, ...]
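Over such an interleaved sequence, the modality-balanced objective can be sketched as follows; this is a minimal PyTorch illustration, and the vision-token weight is a hypothetical placeholder since only the text-token weight of $1.0$ is fixed above.

```python
import torch
import torch.nn.functional as F

def modality_balanced_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           is_vision: torch.Tensor,
                           vision_weight: float = 0.5) -> torch.Tensor:
    """Next-token cross-entropy with per-token modality weights.

    logits:        [seq_len, vocab_size] predictions, already shifted for next-token prediction
    targets:       [seq_len] ground-truth ids (text and vision tokens share one vocabulary)
    is_vision:     [seq_len] bool mask marking vision-token positions
    vision_weight: placeholder down-weighting for vision tokens (text tokens use 1.0)
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(is_vision,
                          torch.full_like(per_token, vision_weight),
                          torch.ones_like(per_token))
    return (weights * per_token).sum() / weights.sum()
```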
The training corpus exceeds 13 trillion tokens, drawn primarily from video-centric web sources (>10T tokens in pretraining, 55% of which are video-derived, sourced from 63M videos). Videos are segmented and paired with Whisper-generated transcripts, covering domains such as education, how-to, science, and sports, among others. Image-text pairs (500M), video-text pairs (30M), and X2I data (27.35M samples) supplement the corpus.
Data undergoes basic and advanced filtering, annotation, and augmentation:
- Resolution/duration filtering, speech balancing, talking-head filtering
- Scene segmentation, keyframe extraction
- LLM-generated summaries, Qwen2.5-VL captioning, multimodal contextual augmentation
Training proceeds in two pretraining stages (a video-centric stage followed by a second stage that titrates in high-resolution image data), followed by unified SFT (150B tokens, general multimodal tasks) and large-scale RL-based post-training (multi-task GRPO with rewards normalized to [1, 10]). Multilingual text-only data (3T tokens) provides the language foundation.
3. Discrete Diffusion Adaptation (DiDA): Efficient Parallel Inference
Emu3.5 addresses the classical autoregressive bottleneck—token-by-token decoding—by introducing Discrete Diffusion Adaptation (DiDA):
- DiDA reframes image token prediction as bidirectional parallel denoising, while leaving text generation causal and sequential.
- Entire image token blocks are initialized, then iteratively denoised using a discrete scheme (causal/bidirectional attention masks, token permutation within image block).
- Attention masking: noisy image tokens attend causally to prior clean tokens and bidirectionally within the same image; text uses standard causal masking (a masking sketch follows this list).
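A minimal sketch of this hybrid mask, assuming a single noisy image block embedded in an otherwise clean (text or previously decoded) sequence; it illustrates the masking rule only and is not the DiDA implementation.

```python
import torch

def dida_attention_mask(seq_len: int, img_start: int, img_end: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for one parallel denoising step.

    Text and clean tokens keep standard causal attention; the noisy image block
    spanning [img_start, img_end) additionally attends bidirectionally to itself,
    while still attending causally to all earlier clean tokens.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # causal default
    mask[img_start:img_end, img_start:img_end] = True        # bidirectional inside the image block
    return mask

# Example: 8-token sequence with a noisy image block at positions 3-5.
print(dida_attention_mask(8, 3, 6).int())
```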
DiDA yields a dramatic speedup: 4,096 image tokens are generated in roughly 10 seconds (1024 px images) versus ~120 seconds for vanilla autoregressive decoding, a ~20× acceleration, with minimal quality loss (GenEval unchanged; DPG-Bench and GEdit-Bench within roughly one point of the AR baseline, as shown in the table below).
DiDA leverages self-distillation for training consistency and is integrated within the FlagScale infrastructure (FSM scheduler, async request management, FP8 quantization).
| Gen Method | Tokens | Time (s) | GenEval | DPG-Bench | GEdit-Bench G_O |
|---|---|---|---|---|---|
| AR | 4,096 | 120 | 0.86 | 88.26 | 7.59 |
| DiDA | 4,096 | 10 | 0.86 | 87.46 | 7.56 |
4. Multimodal Generation, Reasoning, and World Modeling
Emu3.5 demonstrates strong capabilities across long-horizon multimodal sequence modeling:
- Interleaved generation: Native support for multi-step, mixed image/text outputs. Visual narrative, guidance, world exploration, and embodied manipulation scenarios are tested via human/automatic preference benchmarks.
- Any-to-Image (X2I): Robust image editing given arbitrary multimodal instructions (text, images). Consistently top performance on X2I benchmarks (ImgEdit, GEdit-Bench, ICE-Bench, etc.).
- Text-to-Image (T2I): State-of-the-art scores (GenEval, TIIF, DPG-Bench). High compositionality in prompt-following and multi-region, long-text rendering (e.g., TIIF Basic: 87.05, Advanced: 84.65, Designer: 94.03).
- World Modeling: Supports spatiotemporally coherent environment prediction, open-world navigation, and stepwise decomposition for embodied planning. Relative to Gemini 2.5 Flash Image, Emu3.5 achieves superior results in world exploration and manipulation (win rates of 65.5% and 67.1%, respectively).
5. Post-Training, RL Techniques, and Mutual Task Improvement
Supervised fine-tuning (unified across T2I, VLQA, X2I, narrative, and world-modeling tasks) is followed by large-scale RL with Group Relative Policy Optimization (GRPO). Task-specific rewards include CLIP- and VLM-based alignment scores, OCR fidelity, face preservation, aesthetics, and functional metrics; rewards are normalized to a common [1, 10] scale for stability.
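A rough sketch of how heterogeneous rewards can be placed on a shared [1, 10] scale and turned into group-relative advantages is given below; the combination weights, example scores, and function names are assumptions for illustration, not the paper's training code.

```python
import torch

def rescale_rewards(rewards: torch.Tensor, low: float = 1.0, high: float = 10.0,
                    eps: float = 1e-6) -> torch.Tensor:
    """Min-max rescale one reward model's scores onto a shared [low, high] range so that
    heterogeneous signals (CLIP/VLM alignment, OCR fidelity, aesthetics, ...) are comparable."""
    r_min, r_max = rewards.min(), rewards.max()
    return low + (high - low) * (rewards - r_min) / (r_max - r_min + eps)

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: center and standardize rewards within one prompt's
    group of sampled rollouts, in the style of GRPO."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: combine two reward signals on the shared scale (equal weights are illustrative),
# then compute advantages over a group of four rollouts for the same prompt.
clip_scores = torch.tensor([0.21, 0.34, 0.29, 0.41])
ocr_scores = torch.tensor([0.90, 0.55, 0.75, 0.80])
combined = 0.5 * rescale_rewards(clip_scores) + 0.5 * rescale_rewards(ocr_scores)
advantages = grpo_advantages(combined)
```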
A notable consequence of this unified post-training is cross-task mutual reinforcement: for example, T2I post-training enriches narrative/story capabilities, consistent with scaling-law improvements in loss and generalization as data and compute grow. The unified architecture facilitates transfer learning between multimodal domains.
Fine-grained visual reconstruction is enabled by IBQ visual tokenization and SigLIP-based distillation; both vanilla and diffusion-based image/video decoders are supported, with LoRA distillation reducing the number of sampling steps.
6. Technical Comparison and Benchmarks
Emu3.5 matches or exceeds Gemini 2.5 Flash Image (Nano Banana) in image generation/editing, significantly outperforming prior open-source models (e.g., SDXL, Qwen-Image-Edit) in quality and inference speed. For interleaved multimodal generation, Emu3.5 demonstrates statistically significant preference rates and robustness. The TIIF-Bench comparison below illustrates this parity:
| Model | Overall | Basic | Advanced | Designer |
|---|---|---|---|---|
| Emu3.5 | 89.48 | 87.05 | 84.65 | 94.03 |
| Gemini 2.5 FI | 88.62 | 87.08 | 83.46 | 93.41 |
Emu3.5’s open-source release includes weights, code, tokenizer, vocabularies, data recipes, training/inference infrastructure, and DiDA implementation.
7. Significance and Outlook
Emu3.5 sets a new standard for native, next-token unified multimodal models:
- Unified paradigm: All tasks—generation, perception, any-to-image editing, agentic reasoning—handled in a single discrete token space.
- Scalable: Leverages unprecedented data scale (10T+ tokens), supports long-context modeling, high-resolution output.
- Efficient: DiDA yields inference performance comparable to closed-source diffusion models, with full integration in open research pipelines.
- World modeling: Inherits agentic capabilities suitable for simulation, embodied manipulation, and open-environment reasoning; the framework aligns with the emerging paradigm of video/world models (cf. Sora).
- Open-source: Fosters reproducibility and community advancement.
A plausible implication is that Emu3.5 marks the transition point where unified multimodal next-token prediction rivals—and in several domains surpasses—compositional and diffusion architectures, providing a robust foundation for future multimodal embodied AI. The integrated world modeling and agentic capabilities further suggest applications in open-world simulation, robotic planning, and complex cross-domain multimodal reasoning.