Emu3.5: Unified Multimodal World Model
- Emu3.5 is a large-scale multimodal world model integrating vision and language data for unified next-token prediction and long-horizon reasoning.
- The model employs a 34.1B parameter decoder-only Transformer with Discrete Diffusion Adaptation, achieving a 20× speed improvement in image token generation.
- It uses reinforcement learning post-training and extensive video-text data to enable robust world modeling, any-to-image editing, and embodied planning.
Emu3.5 is a large-scale native multimodal world model designed to perform unified next-token prediction across both vision and language sequences, enabling complex long-horizon reasoning, generation, and world simulation. It represents a substantial advancement in scalable multimodal AI, leveraging vast interleaved video-text data, highly optimized inference (Discrete Diffusion Adaptation), and rigorous reinforcement learning post-training for generalizable world modeling (Cui et al., 30 Oct 2025).
1. Model Architecture and Representational Space
Emu3.5 utilizes a decoder-only Transformer architecture scaled to 34.1 billion parameters, incorporating 64 layers with hidden size 5,120 and intermediate size 25,600. Grouped Query Attention (64 attention heads, 8 key-value heads) facilitates computationally efficient long-context processing, while RMSNorm (pre-normalization) and QK-Norm stabilize activation and attention statistics.
Vision and language tokens are jointly embedded:
- Vocabulary: 282,926 tokens (151,854 text, 131,072 vision via IBQ; Editor’s term: “joint vocab space”)
- Context Length: Up to 32,768 tokens (suitable for extended multimodal documents and video)
- Positional Encoding: Rotary Positional Embedding (RoPE), supporting long-horizon temporal reasoning
- Visual Tokenizer: Index Backpropagation Quantization (IBQ), codebook size 131,072, vector dim 256
- Activation: SwiGLU
- Dropout: 0.1
This configuration enables high-resolution image encoding (up to 1024–2048 px), compact vision representation, and tight integration with multilingual text input (QwenTokenizer for text).
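At inference time, the visual tokenizer maps continuous patch features to discrete ids in the joint vocabulary. The sketch below shows a generic nearest-codebook lookup under the stated codebook shape; it is not the IBQ training procedure itself, and the function and tensor names are illustrative assumptions.

```python
import torch

def quantize_to_vision_tokens(patch_features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Assign each patch feature to its nearest codebook entry (generic VQ lookup).

    patch_features: [num_patches, 256] continuous encoder outputs
    codebook:       [codebook_size, 256] learned codebook (131,072 entries, dim 256 in the paper)
    Returns:        [num_patches] integer vision-token ids; offsetting these ids into the
                    282,926-entry joint text+vision vocabulary is omitted here.
    """
    distances = torch.cdist(patch_features, codebook)  # pairwise L2 distances
    return distances.argmin(dim=-1)

# Example with reduced sizes (the real codebook has 131,072 entries; a 1024 px image
# corresponds to roughly 4,096 vision tokens):
codebook = torch.randn(1_024, 256)
patches = torch.randn(64, 256)
token_ids = quantize_to_vision_tokens(patches, codebook)
```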
| Parameter | Value |
|---|---|
| Parameters | 34.1B |
| Layers | 64 |
| Vocabulary | 282,926 |
| Context Len | 32,768 |
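For reference, these hyperparameters can be collected into a single configuration sketch; the field names below are illustrative, not the released code's API.

```python
from dataclasses import dataclass

@dataclass
class Emu35Config:
    """Hyperparameters reported for Emu3.5 (field names are hypothetical)."""
    num_layers: int = 64
    hidden_size: int = 5_120
    intermediate_size: int = 25_600      # SwiGLU feed-forward width
    num_attention_heads: int = 64        # Grouped Query Attention
    num_key_value_heads: int = 8
    vocab_size: int = 282_926            # 151,854 text + 131,072 IBQ vision tokens
    max_context_length: int = 32_768     # with RoPE positional encoding
    dropout: float = 0.1

cfg = Emu35Config()
# Sanity check: token embeddings alone account for
# vocab_size * hidden_size ~= 1.45B of the 34.1B total parameters.
print(f"{cfg.vocab_size * cfg.hidden_size / 1e9:.2f}B embedding parameters")
```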
2. Unified Training Paradigm and Data Corpus
Emu3.5 is trained end-to-end with a single cross-entropy loss for next-token prediction on interleaved vision-language data. The core objective is $\mathcal{L}(\theta) = -\sum_{t} w_t \log p_\theta(x_t \mid x_{<t})$, where $w_t$ is a modality-balancing weight set to $1.0$ for text tokens and a separate value for vision tokens.
All communication with the model is in document-style sequences:
[IMAGE TOKEN 1, ..., IMAGE TOKEN N, TEXT TOKEN 1, ..., TEXT TOKEN K, IMAGE TOKEN N+1, ...]
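Over such an interleaved sequence, the modality-balanced objective can be sketched as follows; this is a minimal PyTorch illustration, and the vision-token weight is a hypothetical placeholder since only the text-token weight of $1.0$ is fixed above.

```python
import torch
import torch.nn.functional as F

def modality_balanced_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           is_vision: torch.Tensor,
                           vision_weight: float = 0.5) -> torch.Tensor:
    """Next-token cross-entropy with per-token modality weights.

    logits:        [seq_len, vocab_size] predictions, already shifted for next-token prediction
    targets:       [seq_len] ground-truth ids (text and vision tokens share one vocabulary)
    is_vision:     [seq_len] bool mask marking vision-token positions
    vision_weight: placeholder down-weighting for vision tokens (text tokens use 1.0)
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(is_vision,
                          torch.full_like(per_token, vision_weight),
                          torch.ones_like(per_token))
    return (weights * per_token).sum() / weights.sum()
```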
The training corpus exceeds 13 trillion tokens, drawn primarily from video-centric web sources (>10T tokens in pretraining, 55% of which are video-derived, sourced from 63M videos). Videos are segmented and paired with Whisper-generated transcripts, covering domains such as education, how-to, science, and sports, among others. Image-text pairs (500M), video-text pairs (30M), and X2I data (27.35M samples) supplement the corpus.
Data undergoes basic and advanced filtering, annotation, and augmentation:
- Resolution/duration filtering, speech balancing, talking-head filtering
- Scene segmentation, keyframe extraction
- LLM-generated summaries, Qwen2.5-VL captioning, multimodal contextual augmentation
Training proceeds in two pretraining stages (a video-centric stage followed by a second stage that titrates in high-resolution image data), followed by unified SFT (150B tokens, general multimodal tasks) and large-scale RL-based post-training (multi-task GRPO with rewards normalized to [1, 10]). Multilingual text-only data (3T tokens) provides the language foundation.
3. Discrete Diffusion Adaptation (DiDA): Efficient Parallel Inference
Emu3.5 addresses the classical autoregressive bottleneck—token-by-token decoding—by introducing Discrete Diffusion Adaptation (DiDA):
- DiDA reframes image token prediction as bidirectional parallel denoising, while leaving text generation causal and sequential.
- Entire image token blocks are initialized, then iteratively denoised using a discrete scheme (causal/bidirectional attention masks, token permutation within image block).
- Attention masking: noisy image tokens attend causally to prior clean tokens and bidirectionally within the same image; text uses standard causal masking (a masking sketch follows this list).
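A minimal sketch of this hybrid mask, assuming a single noisy image block embedded in an otherwise clean (text or previously decoded) sequence; it illustrates the masking rule only and is not the DiDA implementation.

```python
import torch

def dida_attention_mask(seq_len: int, img_start: int, img_end: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for one parallel denoising step.

    Text and clean tokens keep standard causal attention; the noisy image block
    spanning [img_start, img_end) additionally attends bidirectionally to itself,
    while still attending causally to all earlier clean tokens.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # causal default
    mask[img_start:img_end, img_start:img_end] = True        # bidirectional inside the image block
    return mask

# Example: 8-token sequence with a noisy image block at positions 3-5.
print(dida_attention_mask(8, 3, 6).int())
```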
DiDA yields a dramatic speedup: 4,096 image tokens are generated in roughly 10 seconds (1024 px images) versus ~120 seconds for vanilla autoregressive decoding, a ~20× acceleration, with minimal quality loss (GenEval unchanged; DPG-Bench and GEdit-Bench within roughly one point of the AR baseline, as shown in the table below).
DiDA leverages self-distillation for training consistency and is integrated within the FlagScale infrastructure (FSM scheduler, async request management, FP8 quantization).
| Gen Method | Tokens | Time (s) | GenEval | DPG-Bench | GEdit-Bench G_O |
|---|---|---|---|---|---|
| AR | 4,096 | 120 | 0.86 | 88.26 | 7.59 |
| DiDA | 4,096 | 10 | 0.86 | 87.46 | 7.56 |
4. Multimodal Generation, Reasoning, and World Modeling
Emu3.5 demonstrates strong capabilities across long-horizon multimodal sequence modeling:
- Interleaved generation: Native support for multi-step, mixed image/text outputs. Visual narrative, guidance, world exploration, and embodied manipulation scenarios are tested via human/automatic preference benchmarks.
- Any-to-Image (X2I): Robust image editing given arbitrary multimodal instructions (text, images). Consistently top performance on X2I benchmarks (ImgEdit, GEdit-Bench, ICE-Bench, etc.).
- Text-to-Image (T2I): State-of-the-art scores (GenEval, TIIF, DPG-Bench). High compositionality in prompt-following and multi-region, long-text rendering (e.g., TIIF Basic: 87.05, Advanced: 84.65, Designer: 94.03).
- World Modeling: Supports spatiotemporally coherent environment prediction, open-world navigation, and stepwise decomposition for embodied planning. Relative to Gemini 2.5 Flash Image, Emu3.5 achieves superior results in world exploration and manipulation (win rates of 65.5% and 67.1%, respectively).
5. Post-Training, RL Techniques, and Mutual Task Improvement
Supervised fine-tuning (unified across T2I, VLQA, X2I, narrative, and world-modeling tasks) is followed by large-scale RL with Group Relative Policy Optimization (GRPO). Task-specific rewards include CLIP- and VLM-based alignment scores, OCR fidelity, face preservation, aesthetics, and functional metrics; rewards are normalized to a common [1, 10] scale for stability.
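A rough sketch of how heterogeneous rewards can be placed on a shared [1, 10] scale and turned into group-relative advantages is given below; the combination weights, example scores, and function names are assumptions for illustration, not the paper's training code.

```python
import torch

def rescale_rewards(rewards: torch.Tensor, low: float = 1.0, high: float = 10.0,
                    eps: float = 1e-6) -> torch.Tensor:
    """Min-max rescale one reward model's scores onto a shared [low, high] range so that
    heterogeneous signals (CLIP/VLM alignment, OCR fidelity, aesthetics, ...) are comparable."""
    r_min, r_max = rewards.min(), rewards.max()
    return low + (high - low) * (rewards - r_min) / (r_max - r_min + eps)

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: center and standardize rewards within one prompt's
    group of sampled rollouts, in the style of GRPO."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: combine two reward signals on the shared scale (equal weights are illustrative),
# then compute advantages over a group of four rollouts for the same prompt.
clip_scores = torch.tensor([0.21, 0.34, 0.29, 0.41])
ocr_scores = torch.tensor([0.90, 0.55, 0.75, 0.80])
combined = 0.5 * rescale_rewards(clip_scores) + 0.5 * rescale_rewards(ocr_scores)
advantages = grpo_advantages(combined)
```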
A notable consequence of this unified post-training is cross-task mutual reinforcement: for example, T2I post-training enriches narrative/story capabilities, consistent with scaling-law improvements in loss and generalization as data and compute grow. The unified architecture facilitates transfer learning between multimodal domains.
Fine-grained visual reconstruction is enabled by IBQ visual tokenization and SigLIP-based distillation; both vanilla and diffusion-based image/video decoders are supported, with LoRA distillation reducing the number of sampling steps.
6. Technical Comparison and Benchmarks
Emu3.5 matches or exceeds Gemini 2.5 Flash Image (Nano Banana) in image generation/editing, significantly outperforming prior open-source models (e.g., SDXL, Qwen-Image-Edit) in quality and inference speed. For interleaved multimodal generation, Emu3.5 demonstrates statistically significant preference rates and robustness. The TIIF-Bench comparison below illustrates this parity:
| Model | Overall | Basic | Advanced | Designer |
|---|---|---|---|---|
| Emu3.5 | 89.48 | 87.05 | 84.65 | 94.03 |
| Gemini 2.5 FI | 88.62 | 87.08 | 83.46 | 93.41 |
Emu3.5’s open-source release includes weights, code, tokenizer, vocabularies, data recipes, training/inference infrastructure, and DiDA implementation.
7. Significance and Outlook
Emu3.5 sets a new standard for native, next-token unified multimodal models:
- Unified paradigm: All tasks—generation, perception, any-to-image editing, agentic reasoning—handled in a single discrete token space.
- Scalable: Leverages unprecedented data scale (10T+ tokens), supports long-context modeling, high-resolution output.
- Efficient: DiDA yields inference performance comparable to closed-source diffusion models, with full integration in open research pipelines.
- World modeling: Inherits agentic capabilities suitable for simulation, embodied manipulation, and open-environment reasoning; the framework aligns with the emerging paradigm of video/world models (cf. Sora).
- Open-source: Fosters reproducibility and community advancement.
A plausible implication is that Emu3.5 marks the transition point where unified multimodal next-token prediction rivals—and in several domains surpasses—compositional and diffusion architectures, providing a robust foundation for future multimodal embodied AI. The integrated world modeling and agentic capabilities further suggest applications in open-world simulation, robotic planning, and complex cross-domain multimodal reasoning.