Uni-MoE 2.0: Omnimodal MoE Architecture
- The paper introduces Uni-MoE 2.0, which advances multimodal LLMs by integrating a dynamic-capacity Mixture-of-Experts framework that efficiently processes up to 10 input modalities.
- It leverages a novel Omni-Modality 3D RoPE to align spatial, temporal, and sequential features, ensuring coherent self-attention across text, images, audio, speech, and video.
- Progressive training augmented with reinforcement learning techniques optimizes expert routing and calibration, yielding competitive performance across 85 diverse benchmarks.
Uni-MoE 2.0 denotes an open-source omnimodal large model architecture and training framework that advances unified multimodal LLMs (MLLMs) through a dynamic-capacity Mixture-of-Experts (MoE) design, unified spatio-temporal positional encoding (Omni-Modality 3D RoPE), and a progressive multimodal training regime including advanced reinforcement learning techniques. The system builds upon the Lychee family of models, specifically evolving from the original Uni-MoE (Li et al., 18 May 2024), and targets language-centric understanding, reasoning, and generation spanning text, images, audio, speech, and video. The foundation of Uni-MoE 2.0 is a dense Qwen2.5-7B backbone, with architectural and algorithmic innovations enabling efficient and high-fidelity omnimodal representations, controllable generation, and scalable performance across 85 diverse evaluation benchmarks (Li et al., 16 Nov 2025).
1. Dynamic-Capacity Mixture-of-Experts Framework
Uni-MoE 2.0 introduces a MoE mechanism balancing computational efficiency and capability for up to 10 cross-modal input types through three expert roles: routed experts (activated adaptively, typically domain-specific), shared experts (global, always-on knowledge), and null experts (zero output, enabling selective skipping). For each token $x_t$, the router computes logits $z_{t}$ over the routable experts; after a softmax, this yields probabilities $p_{t,i}$. A permutation $\sigma$ sorts the $p_{t,i}$ in descending order, and for each token $t$, $k_t$ is the smallest $k$ satisfying $\sum_{j=1}^{k} p_{t,\sigma(j)} \ge \tau$ (with $\tau$ a threshold hyperparameter), specifying the routed expert set $\mathcal{R}_t = \{\sigma(1), \dots, \sigma(k_t)\}$. A fixed set $\mathcal{S}$ of shared experts is always activated.
The MoE block forward pass for token $x_t$ aggregates expert outputs as $y_t = \sum_{e \in \mathcal{S}} E_e(x_t) + \sum_{e \in \mathcal{R}_t} g_{t,e}\, E_e(x_t)$, where the weights $g_{t,e}$ come from a masked softmax over the selected routed experts. Non-differentiable top-$k_t$ selection is addressed by a straight-through, ODE-inspired gradient estimator ("GRIN"), in which an indicator $\mathbb{1}[i = \arg\max_j p_{t,j}]$ marks whether an expert is argmax-selected, enabling effective gradient propagation to the router.
Computationally, dynamic expert selection activates a variable number of experts per token (in practice, the chosen threshold yields $1.5$–$3$ experts per token), substantially reducing compute versus fixed top-$k$ MoE (Li et al., 16 Nov 2025).
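The routing rule above can be sketched in a few lines. The following PyTorch snippet is a minimal illustration, assuming a [num_tokens, num_experts] logit matrix, an illustrative threshold default of $\tau = 0.7$, and hypothetical helper names; the GRIN straight-through estimator is omitted, so gradients here reach the router only through the masked-softmax weights.

```python
import torch
import torch.nn.functional as F

def dynamic_capacity_route(logits: torch.Tensor, tau: float = 0.7):
    """Threshold-based dynamic expert selection (illustrative sketch).

    logits: [num_tokens, num_routed_experts] router scores z_t.
    Returns a boolean selection mask and the masked-softmax weights g_{t,e}.
    """
    probs = F.softmax(logits, dim=-1)                       # p_{t,i}
    sorted_p, order = probs.sort(dim=-1, descending=True)   # permutation sigma
    cum = sorted_p.cumsum(dim=-1)
    # Keep ranks 1..k_t, where k_t is the smallest k whose cumulative mass
    # reaches tau: rank j is kept iff the mass accumulated before it is < tau.
    keep_sorted = (cum - sorted_p) < tau
    mask = torch.zeros_like(probs).scatter(-1, order, keep_sorted.float()).bool()
    # Masked softmax over the selected experts only.
    weights = F.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)
    return mask, weights

def moe_forward(x, router, routed_experts, shared_experts, tau: float = 0.7):
    """Aggregate shared (always-on) and dynamically routed expert outputs."""
    mask, w = dynamic_capacity_route(router(x), tau)
    y = sum(e(x) for e in shared_experts)          # shared experts: always active
    for i, expert in enumerate(routed_experts):
        sel = mask[:, i]                           # tokens routed to expert i
        if sel.any():                              # unselected ("null") experts skip compute
            y[sel] = y[sel] + w[sel, i:i + 1] * expert(x[sel])
    return y
```

Because at least one routed expert always clears the threshold, the masked softmax is well defined for every token; tokens whose probability mass concentrates on one expert activate only that expert, which is what drives the 1.5–3 experts-per-token behavior described above.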
2. Omni-Modality 3D Rotary Position Encoding
To enable self-attention across modalities with disparate spatial and temporal structures, Uni-MoE 2.0 generalizes standard 1D Rotary Position Embedding (RoPE) to a 3D scheme. Each query–key pair is rotated by a composite transform $R(t, h, w)$ built from rotation matrices $R_t$, $R_h$, $R_w$ parameterized by absolute temporal, height, and width position IDs. Text, audio, image, and video positions are mapped as follows:
- Text: $t$ is the sequence index; the spatial indices $h$ and $w$ are held fixed.
- Audio: $t$ is derived from absolute time; $h$ and $w$ are held fixed.
- Image: $t$ is fixed; $h$ is the row and $w$ the column of the patch grid.
- Video: combines per-frame image 2D RoPE with $t$ incremented every 2 seconds, aligned with the audio timeline.
This scheme maintains alignment among spatial, temporal, and sequential modalities, ensuring properly integrated cross-attention for omnimodal comprehension and generation.
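The per-modality position-ID mappings can be made concrete with a short sketch. The helper names, the zeroed spatial indices for text, and the frame-rate handling below are assumptions for illustration; the audio mapping from absolute time would follow the same pattern as the text case.

```python
import torch

def text_position_ids(seq_len: int, t0: int = 0) -> torch.Tensor:
    """Text: temporal index t follows the sequence position; h and w stay fixed."""
    t = torch.arange(t0, t0 + seq_len)
    return torch.stack([t, torch.zeros_like(t), torch.zeros_like(t)], dim=-1)

def image_position_ids(rows: int, cols: int, t_fixed: int) -> torch.Tensor:
    """Image: a single fixed temporal index; h is the row, w the column of each patch."""
    h, w = torch.meshgrid(torch.arange(rows), torch.arange(cols), indexing="ij")
    t = torch.full_like(h, t_fixed)
    return torch.stack([t, h, w], dim=-1).reshape(-1, 3)

def video_position_ids(num_frames: int, rows: int, cols: int,
                       fps: float, t0: int = 0) -> torch.Tensor:
    """Video: per-frame 2D grid, with t advancing once every 2 seconds so the
    video index stays aligned with the time-derived audio index."""
    chunks = []
    for f in range(num_frames):
        t_idx = t0 + int((f / fps) // 2.0)
        chunks.append(image_position_ids(rows, cols, t_idx))
    return torch.cat(chunks, dim=0)
```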
3. Progressive, Reinforcement-Augmented Multimodal Training
Training of Uni-MoE 2.0 proceeds through several distinct, escalating stages:
- Cross-Modal Pretraining: All perception encoders (e.g., for images, audio) are frozen and multimodal inputs are mapped to text descriptions.
- Supervised Fine-Tuning:
  - Warm-up: three dense experts are trained on single-modality data.
  - MoE Fine-tuning: the pre-trained experts initialize the MoE layers, which are then trained on mixed, instruction-style data across modalities.
  - Annealing Stage: balanced sampling (~5B tokens per major modality) calibrates expert usage and distribution.
- Omnimodal RL: direct reinforcement optimization (GSPO/DPO) further calibrates expert routing and cross-modal alignment.
- Generative Specialization: the base model is frozen while MoE-TTS (text-to-speech) and Task-DiT (image generation) receive targeted fine-tuning (<1B tokens); a configuration sketch of the full schedule follows this list.
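As referenced above, the stage progression can be summarized as a configuration sketch. The module names, frozen/trainable splits, and data descriptions are assumptions inferred from the text, not the released training configuration.

```python
# Illustrative stage schedule mirroring the progression described above.
TRAINING_STAGES = [
    {"stage": "cross_modal_pretraining",
     "frozen": ["vision_encoder", "audio_encoder"],
     "data": "multimodal inputs paired with text descriptions"},
    {"stage": "sft_warmup",
     "trainable": ["dense_experts"],
     "data": "single-modality instruction data"},
    {"stage": "sft_moe",
     "trainable": ["moe_layers", "router"],
     "data": "mixed instruction-style data across modalities"},
    {"stage": "sft_annealing",
     "trainable": ["moe_layers", "router"],
     "data": "balanced sampling, ~5B tokens per major modality"},
    {"stage": "omnimodal_rl",
     "objective": "direct reinforcement (GSPO/DPO)"},
    {"stage": "generative_specialization",
     "frozen": ["base_model"],
     "trainable": ["moe_tts", "task_dit"],
     "data": "<1B tokens"},
]
```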
4. Multimodal Data Regimen and Generative Tokens
Uni-MoE 2.0 is trained on approximately 75B open-source multimodal tokens, derived from:
| Modality | Pretrain (B tokens) | SFT (B tokens) | Annealing (B tokens) |
|---|---|---|---|
| Image | 13 | 22 | 5 |
| Video | 0.2 | 19 | 5 |
| Audio | 16 (15 ASR, 1 caption) | 5 | 6 |
| Text | – | 1 | 4 |
Special-purpose generative tokens enable explicit text conditioning for image and speech output, e.g., the <speech_start>, <speech_timbre=Jenny>, <speech_prompt>, <speech_end> sequence for TTS, and <TASK[i]> / <IMG[i]> tokens for image synthesis and editing via Task-DiT and a frozen PixWizard DiT. This construction facilitates harmonized conditional generation across modalities (Li et al., 16 Nov 2025).
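A small sketch of how such conditioning sequences might be assembled is given below; the helper functions are hypothetical, only the token spellings come from the text, and the placement of the content text relative to <speech_prompt> is an assumption.

```python
def build_tts_prompt(text: str, timbre: str = "Jenny") -> str:
    """Assemble a TTS conditioning sequence from the generative tokens listed above."""
    return f"<speech_start><speech_timbre={timbre}><speech_prompt>{text}<speech_end>"

def build_image_prompt(task_id: int, img_id: int, instruction: str) -> str:
    """Tag an image synthesis/editing request with <TASK[i]>/<IMG[i]> tokens for Task-DiT."""
    return f"<TASK[{task_id}]><IMG[{img_id}]>{instruction}"

# Hypothetical usage:
# build_tts_prompt("Read this sentence aloud.")
# build_image_prompt(2, 0, "Remove the background.")
```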
5. Benchmark Evaluation and Performance
Comprehensive evaluation on 85 public benchmarks demonstrates that Uni-MoE 2.0 achieves state-of-the-art or highly competitive results relative to Qwen2.5-Omni (trained on 1.2T tokens), Ming-Lite-Omni, Baichuan-Omni 1.5, and MiniCPM-o 2.6. Summary comparisons:
| Task/Metric | Uni-MoE 2.0-Omni | Qwen2.5-Omni | Margin |
|---|---|---|---|
| Video understanding (Video-MME) | 66.4 | 59.8 | +6.6 |
| Spatial video reasoning (VSI-Bench) | 56.0 | 19.3 | +36.7 |
| ASR WER↓ (Libri-clean/other-long) | 2.04/4.20 | 7.73/7.98 | -5.69/-3.78 |
| Image editing (GEdit-Bench) | 6.02 | 7.42 | -1.40 |
| Denoising (PSNR) | 25.70 | 22.19 | +3.51 |
Across 76 key tasks, Uni-MoE 2.0 leads on over 50: +7% average on video understanding (n=8), +7% on omnimodal tasks, +4% on audiovisual reasoning, and 4.2% lower WER on long-form speech. Performance is also robust for image generation, low-level restoration, speech QA, and controllable depth-to-image (Li et al., 16 Nov 2025).
6. Comparative Analysis and Foundations
Relative to the original Uni-MoE (Li et al., 18 May 2024), Uni-MoE 2.0 extends core principles by:
- Advancing static sparse MoE routing to dynamic-capacity routing that balances routed and shared (global) expert activation.
- Expanding sparse MoE to handle up to 10 input modalities via Omni-Modality 3D RoPE and task-specific generative tokens.
- Generalizing progressive training: the original LoRA-tuned multimodal SFT is extended with direct reinforcement (GSPO/DPO) and explicit expert calibration via balanced data annealing.
- Achieving parameter and computational efficiency, e.g., through dynamic expert allocation (1.5–3 experts per token) compared to the previous fixed top-$k$ routing.
The result is unified coverage, efficiency, and transferability across diverse multimodal and omnimodal evaluation paradigms, while remaining substantially open-source.
7. Implications and Future Directions
Uni-MoE 2.0 empirically validates the effectiveness of dynamic-capacity routing, multimodal-aligned position encoding, and reinforcement-augmented progressive training for scalable OLMs. The explicit use of generative tokens and balanced cross-modal data annotation addresses historic mode collapse and performance biases observed in prior MLLMs. Future research will plausibly extend to further expert-role diversification, hierarchical MoE gating, expansion to medical and spatiotemporal modalities, and more refined alignment between cross-modal representations and expert selection, building upon the flexible, open foundation defined by Uni-MoE 2.0 (Li et al., 16 Nov 2025, Li et al., 18 May 2024).