Uni-MoE 2.0: Omnimodal MoE Architecture
- The paper introduces Uni-MoE 2.0, which advances multimodal LLMs by integrating a dynamic-capacity Mixture-of-Experts framework that efficiently processes up to 10 input modalities.
- It leverages a novel Omni-Modality 3D RoPE to align spatial, temporal, and sequential features, ensuring coherent self-attention across text, images, audio, speech, and video.
- Progressive training augmented with reinforcement learning techniques optimizes expert routing and calibration, yielding competitive performance across 85 diverse benchmarks.
Uni-MoE 2.0 denotes an open-source omnimodal large model architecture and training framework that advances unified multimodal LLMs (MLLMs) through a dynamic-capacity Mixture-of-Experts (MoE) design, unified spatio-temporal positional encoding (Omni-Modality 3D RoPE), and a progressive multimodal training regime including advanced reinforcement learning techniques. The system builds upon the Lychee family of models, specifically evolving from the original Uni-MoE (Li et al., 18 May 2024), and targets language-centric understanding, reasoning, and generation spanning text, images, audio, speech, and video. The foundation of Uni-MoE 2.0 is a dense Qwen2.5-7B backbone, with architectural and algorithmic innovations enabling efficient and high-fidelity omnimodal representations, controllable generation, and scalable performance across 85 diverse evaluation benchmarks (Li et al., 16 Nov 2025).
1. Dynamic-Capacity Mixture-of-Experts Framework
Uni-MoE 2.0 introduces a MoE mechanism balancing computational efficiency and capability for up to 10 cross-modal input types through three expert roles: routed experts (activated adaptively, typically domain-specific), shared experts (global, always-on knowledge), and null experts (zero output, enabling selective skipping). For each token $x_t$, the router computes logits $z_{t}$ over the routable experts; after a softmax, this yields probabilities $p_{t,i}$. A permutation $\sigma$ sorts the $p_{t,i}$ in descending order, and for each token $t$, $k_t$ is the smallest $k$ satisfying $\sum_{j=1}^{k} p_{t,\sigma(j)} \ge \tau$ (with $\tau$ a threshold hyperparameter), specifying the routed expert set $\mathcal{R}_t = \{\sigma(1), \dots, \sigma(k_t)\}$. A fixed set $\mathcal{S}$ of shared experts is always activated.
The MoE block forward pass for token $x_t$ aggregates expert outputs as $y_t = \sum_{e \in \mathcal{S}} E_e(x_t) + \sum_{e \in \mathcal{R}_t} g_{t,e}\, E_e(x_t)$, where the weights $g_{t,e}$ come from a masked softmax over the selected routed experts. Non-differentiable top-$k_t$ selection is addressed by a straight-through, ODE-inspired gradient estimator ("GRIN"), in which an indicator $\mathbb{1}[i = \arg\max_j p_{t,j}]$ marks whether an expert is argmax-selected, enabling effective gradient propagation to the router.
Computationally, dynamic expert selection activates a variable number of experts per token (in practice, the chosen threshold yields $1.5$–$3$ experts per token), substantially reducing compute versus fixed top-$k$ MoE (Li et al., 16 Nov 2025).
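The routing rule above can be sketched in a few lines. The following PyTorch snippet is a minimal illustration, assuming a [num_tokens, num_experts] logit matrix, an illustrative threshold default of $\tau = 0.7$, and hypothetical helper names; the GRIN straight-through estimator is omitted, so gradients here reach the router only through the masked-softmax weights.

```python
import torch
import torch.nn.functional as F

def dynamic_capacity_route(logits: torch.Tensor, tau: float = 0.7):
    """Threshold-based dynamic expert selection (illustrative sketch).

    logits: [num_tokens, num_routed_experts] router scores z_t.
    Returns a boolean selection mask and the masked-softmax weights g_{t,e}.
    """
    probs = F.softmax(logits, dim=-1)                       # p_{t,i}
    sorted_p, order = probs.sort(dim=-1, descending=True)   # permutation sigma
    cum = sorted_p.cumsum(dim=-1)
    # Keep ranks 1..k_t, where k_t is the smallest k whose cumulative mass
    # reaches tau: rank j is kept iff the mass accumulated before it is < tau.
    keep_sorted = (cum - sorted_p) < tau
    mask = torch.zeros_like(probs).scatter(-1, order, keep_sorted.float()).bool()
    # Masked softmax over the selected experts only.
    weights = F.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)
    return mask, weights

def moe_forward(x, router, routed_experts, shared_experts, tau: float = 0.7):
    """Aggregate shared (always-on) and dynamically routed expert outputs."""
    mask, w = dynamic_capacity_route(router(x), tau)
    y = sum(e(x) for e in shared_experts)          # shared experts: always active
    for i, expert in enumerate(routed_experts):
        sel = mask[:, i]                           # tokens routed to expert i
        if sel.any():                              # unselected ("null") experts skip compute
            y[sel] = y[sel] + w[sel, i:i + 1] * expert(x[sel])
    return y
```

Because at least one routed expert always clears the threshold, the masked softmax is well defined for every token; tokens whose probability mass concentrates on one expert activate only that expert, which is what drives the 1.5–3 experts-per-token behavior described above.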
2. Omni-Modality 3D Rotary Position Encoding
To enable self-attention across modalities with disparate spatial and temporal structures, Uni-MoE 2.0 generalizes standard 1D Rotary Position Embedding (RoPE) to a 3D scheme. Each query–key pair is rotated by a composite transform $R(t, h, w)$ built from rotation matrices $R_t$, $R_h$, $R_w$ parameterized by absolute temporal, height, and width position IDs. Text, audio, image, and video positions are mapped as follows:
- Text: $t$ is the sequence index; the spatial indices $h$ and $w$ are held fixed.
- Audio: $t$ is derived from absolute time; $h$ and $w$ are held fixed.
- Image: $t$ is fixed; $h$ is the row and $w$ the column of the patch grid.
- Video: combines per-frame image 2D RoPE with $t$ incremented every 2 seconds, aligned with the audio timeline.
This scheme maintains alignment among spatial, temporal, and sequential modalities, ensuring properly integrated cross-attention for omnimodal comprehension and generation.
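The per-modality position-ID mappings can be made concrete with a short sketch. The helper names, the zeroed spatial indices for text, and the frame-rate handling below are assumptions for illustration; the audio mapping from absolute time would follow the same pattern as the text case.

```python
import torch

def text_position_ids(seq_len: int, t0: int = 0) -> torch.Tensor:
    """Text: temporal index t follows the sequence position; h and w stay fixed."""
    t = torch.arange(t0, t0 + seq_len)
    return torch.stack([t, torch.zeros_like(t), torch.zeros_like(t)], dim=-1)

def image_position_ids(rows: int, cols: int, t_fixed: int) -> torch.Tensor:
    """Image: a single fixed temporal index; h is the row, w the column of each patch."""
    h, w = torch.meshgrid(torch.arange(rows), torch.arange(cols), indexing="ij")
    t = torch.full_like(h, t_fixed)
    return torch.stack([t, h, w], dim=-1).reshape(-1, 3)

def video_position_ids(num_frames: int, rows: int, cols: int,
                       fps: float, t0: int = 0) -> torch.Tensor:
    """Video: per-frame 2D grid, with t advancing once every 2 seconds so the
    video index stays aligned with the time-derived audio index."""
    chunks = []
    for f in range(num_frames):
        t_idx = t0 + int((f / fps) // 2.0)
        chunks.append(image_position_ids(rows, cols, t_idx))
    return torch.cat(chunks, dim=0)
```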
3. Progressive, Reinforcement-Augmented Multimodal Training
Training of Uni-MoE 2.0 proceeds through several distinct, escalating stages:
- Cross-Modal Pretraining: All perception encoders (e.g., for images, audio) are frozen and multimodal inputs are mapped to text descriptions.
- Supervised Fine-Tuning:
  - Warm-up: three dense experts are trained on single-modality data.
  - MoE Fine-tuning: the pre-trained experts initialize the MoE layers, which are then trained on mixed, instruction-style data across modalities.
  - Annealing Stage: balanced sampling (~5B tokens per major modality) calibrates expert usage and distribution.
- Omnimodal RL: direct reinforcement optimization (GSPO/DPO) further calibrates expert routing and cross-modal alignment.
- Generative Specialization: the base model is frozen while MoE-TTS (text-to-speech) and Task-DiT (image generation) receive targeted fine-tuning (<1B tokens); a configuration sketch of the full schedule follows this list.
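As referenced above, the stage progression can be summarized as a configuration sketch. The module names, frozen/trainable splits, and data descriptions are assumptions inferred from the text, not the released training configuration.

```python
# Illustrative stage schedule mirroring the progression described above.
TRAINING_STAGES = [
    {"stage": "cross_modal_pretraining",
     "frozen": ["vision_encoder", "audio_encoder"],
     "data": "multimodal inputs paired with text descriptions"},
    {"stage": "sft_warmup",
     "trainable": ["dense_experts"],
     "data": "single-modality instruction data"},
    {"stage": "sft_moe",
     "trainable": ["moe_layers", "router"],
     "data": "mixed instruction-style data across modalities"},
    {"stage": "sft_annealing",
     "trainable": ["moe_layers", "router"],
     "data": "balanced sampling, ~5B tokens per major modality"},
    {"stage": "omnimodal_rl",
     "objective": "direct reinforcement (GSPO/DPO)"},
    {"stage": "generative_specialization",
     "frozen": ["base_model"],
     "trainable": ["moe_tts", "task_dit"],
     "data": "<1B tokens"},
]
```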
4. Multimodal Data Regimen and Generative Tokens
Uni-MoE 2.0 is trained on approximately 75B open-source multimodal tokens, derived from:
| Modality | Pretrain (B tokens) | SFT (B tokens) | Annealing (B tokens) |
|---|---|---|---|
| Image | 13 | 22 | 5 |
| Video | 0.2 | 19 | 5 |
| Audio | 16 (15 ASR, 1 caption) | 5 | 6 |
| Text | – | 1 | 4 |
Special-purpose generative tokens enable explicit text conditioning for image and speech output, e.g., the <speech_start>, <speech_timbre=Jenny>, <speech_prompt>, <speech_end> sequence for TTS, and <TASK[i]> / <IMG[i]> tokens for image synthesis and editing via Task-DiT and a frozen PixWizard DiT. This construction facilitates harmonized conditional generation across modalities (Li et al., 16 Nov 2025).
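A small sketch of how such conditioning sequences might be assembled is given below; the helper functions are hypothetical, only the token spellings come from the text, and the placement of the content text relative to <speech_prompt> is an assumption.

```python
def build_tts_prompt(text: str, timbre: str = "Jenny") -> str:
    """Assemble a TTS conditioning sequence from the generative tokens listed above."""
    return f"<speech_start><speech_timbre={timbre}><speech_prompt>{text}<speech_end>"

def build_image_prompt(task_id: int, img_id: int, instruction: str) -> str:
    """Tag an image synthesis/editing request with <TASK[i]>/<IMG[i]> tokens for Task-DiT."""
    return f"<TASK[{task_id}]><IMG[{img_id}]>{instruction}"

# Hypothetical usage:
# build_tts_prompt("Read this sentence aloud.")
# build_image_prompt(2, 0, "Remove the background.")
```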
5. Benchmark Evaluation and Performance
Comprehensive evaluation on 85 public benchmarks demonstrates that Uni-MoE 2.0 achieves state-of-the-art or highly competitive results relative to Qwen2.5-Omni (trained on 1.2T tokens), Ming-Lite-Omni, Baichuan-Omni 1.5, and MiniCPM-o 2.6. Summary comparisons:
| Task/Metric | Uni-MoE 2.0-Omni | Qwen2.5-Omni | Margin |
|---|---|---|---|
| Video understanding (Video-MME) | 66.4 | 59.8 | +6.6 |
| Spatial video reasoning (VSI-Bench) | 56.0 | 19.3 | +36.7 |
| ASR WER↓ (Libri-clean/other-long) | 2.04/4.20 | 7.73/7.98 | -5.69/-3.78 |
| Image editing (GEdit-Bench) | 6.02 | 7.42 | -1.40 |
| Denoising (PSNR) | 25.70 | 22.19 | +3.51 |
Across 76 key tasks, Uni-MoE 2.0 leads on over 50: +7% average on video understanding (n=8), +7% on omnimodal tasks, +4% on audiovisual reasoning, and 4.2% lower WER on long-form speech. Performance is also robust for image generation, low-level restoration, speech QA, and controllable depth-to-image (Li et al., 16 Nov 2025).
6. Comparative Analysis and Foundations
Relative to the original Uni-MoE (Li et al., 18 May 2024), Uni-MoE 2.0 extends core principles by:
- Advancing static sparse MoE routing to dynamic-capacity routing that balances routed and shared (global) expert activation.
- Expanding sparse MoE to handle up to 10 input modalities via Omni-Modality 3D RoPE and task-specific generative tokens.
- Generalizing progressive training: the original LoRA-tuned multimodal SFT is extended with direct reinforcement (GSPO/DPO) and explicit expert calibration via balanced data annealing.
- Achieving parameter and computational efficiency, e.g., through dynamic expert allocation (1.5–3 experts per token) compared to the previous fixed top-$k$ routing.
The result is unified coverage, efficiency, and transferability across diverse multimodal and omnimodal evaluation paradigms, while remaining substantially open-source.
7. Implications and Future Directions
Uni-MoE 2.0 empirically validates the effectiveness of dynamic-capacity routing, multimodal-aligned position encoding, and reinforcement-augmented progressive training for scalable OLMs. The explicit use of generative tokens and balanced cross-modal data annotation addresses historic mode collapse and performance biases observed in prior MLLMs. Future research will plausibly extend to further expert-role diversification, hierarchical MoE gating, expansion to medical and spatiotemporal modalities, and more refined alignment between cross-modal representations and expert selection, building upon the flexible, open foundation defined by Uni-MoE 2.0 (Li et al., 16 Nov 2025, Li et al., 18 May 2024).