Uni-MoE 2.0: Omnimodal MoE Architecture

Updated 18 November 2025
  • The paper introduces Uni-MoE 2.0, which advances multimodal LLMs by integrating a dynamic-capacity Mixture-of-Experts framework that efficiently processes up to 10 input modalities.
  • It leverages a novel Omni-Modality 3D RoPE to align spatial, temporal, and sequential features, ensuring coherent self-attention across text, images, audio, speech, and video.
  • Progressive training augmented with reinforcement learning techniques optimizes expert routing and calibration, yielding competitive performance across 85 diverse benchmarks.

Uni-MoE 2.0 denotes an open-source omnimodal large model architecture and training framework that advances unified multimodal LLMs (MLLMs) through a dynamic-capacity Mixture-of-Experts (MoE) design, unified spatio-temporal positional encoding (Omni-Modality 3D RoPE), and a progressive multimodal training regime including advanced reinforcement learning techniques. The system builds upon the Lychee family of models, specifically evolving from the original Uni-MoE (Li et al., 18 May 2024), and targets language-centric understanding, reasoning, and generation spanning text, images, audio, speech, and video. The foundation of Uni-MoE 2.0 is a dense Qwen2.5-7B backbone, with architectural and algorithmic innovations enabling efficient and high-fidelity omnimodal representations, controllable generation, and scalable performance across 85 diverse evaluation benchmarks (Li et al., 16 Nov 2025).

1. Dynamic-Capacity Mixture-of-Experts Framework

Uni-MoE 2.0 introduces a MoE mechanism balancing computational efficiency and capability for up to 10 cross-modal input types through three expert roles: routed experts (activated adaptively, typically domain-specific), shared experts (global, always-on knowledge), and null experts (zero output, enabling selective skipping). The router computes, for each token $x$, the logits $r = W_r x$; after a softmax, this yields $p_j = \operatorname{Softmax}(r)_j$ for the $N_r$ routable experts. A permutation $\pi$ sorts $p$ in descending order, and for each token $i$, $k_i$ is the smallest $k$ satisfying $\sum_{j=1}^{k} p_{\pi(j)} \geq P$ (with $P$ a threshold hyperparameter), specifying the routed experts $R_i = \{\pi(1), \ldots, \pi(k_i)\}$. A fixed set $S$ of $N_s$ shared experts is always activated.
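To make the dynamic-capacity routing concrete, the following sketch implements the Top-$P$ selection rule described above in plain NumPy. The function name, the toy logits, and the printed values are illustrative assumptions; only the rule itself (sort the softmax probabilities and keep the smallest prefix whose mass reaches $P$) comes from the description above.

```python
import numpy as np

def top_p_route(logits: np.ndarray, P: float = 0.7):
    """Dynamic-capacity (Top-P) expert selection for one token.

    Returns the routed expert indices R_i and their softmax probabilities.
    Hypothetical helper sketching the rule described in the text, not the
    released Uni-MoE 2.0 implementation.
    """
    # Softmax over router logits r = W_r x.
    p = np.exp(logits - logits.max())
    p /= p.sum()

    # Sort probabilities in descending order (the permutation pi).
    order = np.argsort(-p)

    # k_i = smallest k whose cumulative probability mass reaches P.
    cumulative = np.cumsum(p[order])
    k_i = int(np.searchsorted(cumulative, P) + 1)

    routed = order[:k_i]  # R_i = {pi(1), ..., pi(k_i)}
    return routed, p[routed]

# Toy example: a peaked router puts most mass on two experts,
# so only those two are activated for this token.
logits = np.array([2.0, 1.5, -1.0, -2.0, -3.0])
routed, probs = top_p_route(logits, P=0.7)
print(routed, probs)  # [0 1] with probabilities roughly [0.59, 0.36]
```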

The MoE block forward pass for token $x_i$ aggregates the selected experts' outputs as $o_i = \sum_{j \in R_i \cup S} p_j \cdot \mathrm{Expert}_j(x_i)$, where the weights $p_j$ are renormalized over the selected set with a masked softmax. Because Top-$P$ selection is non-differentiable, Uni-MoE 2.0 adopts a straight-through, ODE-inspired gradient estimator in the style of GRIN: a Bernoulli-sampled surrogate stands in for the hard selection during the backward pass, with an indicator $\delta_{\max}$ marking whether an expert is the argmax choice, so that router gradients propagate through the discrete routing decision. Computationally, dynamic expert selection activates $N_s + \mathbb{E}_i[k_i]$ experts per token; in practice, $N_s = 2$ and $P = 0.7$ yield roughly $1.5$–$3$ active experts per token, substantially reducing compute versus fixed top-$K$ MoE routing (Li et al., 16 Nov 2025).
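The sketch below illustrates the aggregation step for a single token, assuming the routed set $R_i$ has already been selected as above. The expert callables, the exact form of the masked softmax, and how shared experts are weighted relative to routed ones are assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def moe_forward(x, experts, routed, shared, router_logits):
    """Aggregate expert outputs for one token x (illustrative sketch).

    `experts` maps expert index -> callable; `routed` comes from Top-P
    selection and `shared` is the always-on set S. The masked softmax
    renormalizes router probabilities over the selected experts only,
    mirroring o_i = sum_{j in R_i ∪ S} p_j * Expert_j(x_i) from the text;
    everything beyond that equation is an assumption of this sketch.
    """
    selected = sorted(set(routed) | set(shared))

    # Masked softmax: probabilities over the selected experts only.
    masked = np.full_like(router_logits, -np.inf)
    masked[selected] = router_logits[selected]
    weights = np.exp(masked - masked[selected].max())
    weights /= weights.sum()

    # Weighted sum of expert outputs; unselected (null) experts are skipped.
    return sum(weights[j] * experts[j](x) for j in selected)

# Toy usage: five experts implemented as random linear maps.
rng = np.random.default_rng(0)
dim = 8
experts = {j: (lambda W: (lambda v: W @ v))(rng.normal(size=(dim, dim)))
           for j in range(5)}
x = rng.normal(size=dim)
logits = rng.normal(size=5)
out = moe_forward(x, experts, routed=[0, 2], shared=[4], router_logits=logits)
print(out.shape)  # (8,)
```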

2. Omni-Modality 3D Rotary Position Encoding

To enable self-attention across modalities with disparate spatial and temporal structures, Uni-MoE 2.0 generalizes standard 1D Rotary Position Embedding (RoPE) to a 3D scheme. Each attention dimension pair $(2d, 2d+1)$ is rotated by

R3D(pos)=Rt(post)Rh(posh)Rw(posw),R_{3D}(\text{pos}) = R_t(\text{pos}_t)\otimes R_h(\text{pos}_h)\otimes R_w(\text{pos}_w),

where the $R_*$ are $2\times 2$ rotation matrices parameterized by absolute position IDs. Text, audio, image, and video positions are mapped as follows:

  • Text: $\text{pos}_t$ is the sequence index, with $\text{pos}_h = \text{pos}_w = 0$.
  • Audio: $\text{pos}_t$ is derived from absolute time, with $\text{pos}_h = \text{pos}_w = \text{pos}_t$.
  • Image: $\text{pos}_t$ is fixed, $\text{pos}_h$ is the row index, and $\text{pos}_w$ is the column index.
  • Video: combines per-frame image 2D RoPE with $\text{pos}_t$ incremented every 2 seconds, aligned with audio.

This scheme maintains alignment among spatial, temporal, and sequential modalities, ensuring properly integrated cross-attention for omnimodal comprehension and generation.
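A minimal sketch of how per-modality 3D position IDs could be assembled under this scheme is given below. The helper names, the audio frame rate, and the video frame duration are illustrative assumptions; only the $(\text{pos}_t, \text{pos}_h, \text{pos}_w)$ mapping rules follow the description above, and applying the rotations themselves is omitted.

```python
from typing import List, Tuple

Pos3D = Tuple[int, int, int]  # (pos_t, pos_h, pos_w)

def text_positions(num_tokens: int) -> List[Pos3D]:
    # Text: pos_t is the sequence index, pos_h = pos_w = 0.
    return [(t, 0, 0) for t in range(num_tokens)]

def audio_positions(num_frames: int, frames_per_second: int = 25) -> List[Pos3D]:
    # Audio: pos_t derived from absolute time; pos_h = pos_w = pos_t.
    # The frame rate is an assumption of this sketch.
    return [(f // frames_per_second,) * 3 for f in range(num_frames)]

def image_positions(height: int, width: int, pos_t: int = 0) -> List[Pos3D]:
    # Image: fixed pos_t; pos_h is the row, pos_w is the column.
    return [(pos_t, h, w) for h in range(height) for w in range(width)]

def video_positions(num_frames: int, height: int, width: int,
                    seconds_per_frame: float = 0.5) -> List[Pos3D]:
    # Video: per-frame 2D image RoPE, with pos_t advancing every 2 seconds
    # so it stays aligned with the audio timeline.
    positions: List[Pos3D] = []
    for f in range(num_frames):
        pos_t = int(f * seconds_per_frame // 2)
        positions.extend(image_positions(height, width, pos_t=pos_t))
    return positions

# Example: a short caption, a 2x2 image, and an 8-frame clip in one sequence.
tokens = text_positions(4) + image_positions(2, 2) + video_positions(8, 2, 2)
print(tokens[:6])
```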

3. Progressive, Reinforcement-Augmented Multimodal Training

Training of Uni-MoE 2.0 proceeds through several distinct, escalating stages:

  1. Cross-Modal Pretraining: All perception encoders (e.g., for images, audio) are frozen and multimodal inputs are mapped to text descriptions.
  2. Supervised Fine-Tuning:
    • Warm-up: Three dense experts trained on single-modality data.
    • MoE Fine-tuning: Pre-trained experts initialize MoE layers, which are trained on mixed, instruction-style data across modalities.
  3. Annealing Stage: Balanced sampling (~5B tokens per major modality) calibrates expert usage and distribution.
  4. Omnimodal RL:
    • GSPO (Group Sequence Policy Optimization): Maximizes expected reward on roll-outs with KL regularization, $L_{\mathrm{GSPO}}(\theta) = -\mathbb{E}[r(\tau)\log\pi_\theta(\tau)] + \lambda\,\mathrm{KL}(\pi_\theta\,\|\,\pi_{\text{ref}})$.
    • DPO (Direct Preference Optimization): Optimizes preferences between output pairs, with $L_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x,y^+,y^-)}\left[\log\sigma\big(s_\theta(x,y^+) - s_\theta(x,y^-)\big)\right]$ (see the sketch after this list). These methods are alternated for stability and improved multimodal reasoning.
  5. Generative Specialization: Base model frozen; MoE-TTS (text-to-speech) and Task-DiT (image generation) receive targeted finetuning (<1B tokens).
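As a concrete illustration of the preference objective in stage 4, the sketch below evaluates the simplified DPO loss $-\log\sigma\big(s_\theta(x,y^+) - s_\theta(x,y^-)\big)$ from per-response scores. The score arrays and function name are illustrative, and the sketch follows the simplified formula quoted above rather than the full DPO objective with a reference model and temperature.

```python
import numpy as np

def dpo_loss(scores_pos: np.ndarray, scores_neg: np.ndarray) -> float:
    """Simplified DPO preference loss over a batch of (y+, y-) pairs.

    scores_pos[i] and scores_neg[i] play the role of s_theta(x_i, y_i^+)
    and s_theta(x_i, y_i^-). A sketch of the formula quoted in the text;
    the actual training recipe may differ.
    """
    margin = scores_pos - scores_neg
    # -log sigmoid(m) = log(1 + exp(-m)) = logaddexp(0, -m), numerically stable.
    losses = np.logaddexp(0.0, -margin)
    return float(losses.mean())

# Toy usage: preferred responses score higher, so the loss is small.
pos = np.array([2.1, 0.8, 1.5])
neg = np.array([0.3, 0.5, -0.2])
print(round(dpo_loss(pos, neg), 4))  # approximately 0.2917
```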

4. Multimodal Data Regimen and Generative Tokens

Uni-MoE 2.0 is trained on approximately 75B open-source multimodal tokens, derived from:

| Modality | Pretrain (B tokens) | SFT (B tokens) | Annealing (B tokens) |
|----------|---------------------|----------------|----------------------|
| Image    | 13                  | 22             | 5                    |
| Video    | 0.2                 | 19             | 5                    |
| Audio    | 16 (15 ASR, 1 caption) | 5           | 6                    |
| Text     | 1                   | 4              |                      |

Special-purpose generative tokens enable explicit text conditioning for image and speech output, e.g., the <speech start>, <speech_timbre=Jenny>, <speech_prompt>, <speech_end> sequence for TTS, and <TASK[i]> / <IMG[i]> tokens for image synthesis and editing via Task-DiT with a frozen PixWizard DiT. This construction facilitates harmonized conditional generation across modalities (Li et al., 16 Nov 2025).
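To show how such generative tokens might be strung together into a conditioning sequence, a small illustrative sketch follows. The token spellings are taken from the text, but the helper functions and the exact ordering and concatenation conventions are assumptions, not the released tokenizer format.

```python
def build_tts_prompt(text: str, timbre: str = "Jenny") -> str:
    # TTS conditioning: wrap the text prompt in the speech control tokens
    # listed in the text; the layout here is an assumption of this sketch.
    return (f"<speech start><speech_timbre={timbre}>"
            f"<speech_prompt>{text}<speech_end>")

def build_image_prompt(task_id: int, img_id: int, instruction: str) -> str:
    # Image synthesis/editing: prepend task and image tokens for Task-DiT.
    return f"<TASK[{task_id}]><IMG[{img_id}]>{instruction}"

print(build_tts_prompt("Hello from Uni-MoE 2.0."))
print(build_image_prompt(0, 0, "Add a red hat to the cat."))
```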

5. Benchmark Evaluation and Performance

Comprehensive evaluation on 85 public benchmarks demonstrates that Uni-MoE 2.0 establishes SOTA or highly competitive status relative to Qwen2.5-Omni (which is trained on 1.2T tokens), Ming-Lite-Omni, Baichuan-Omni 1.5, and MiniCPM-o 2.6. Summary comparisons:

| Task / Metric | Uni-MoE 2.0-Omni | Qwen2.5-Omni | Margin |
|---------------|------------------|--------------|--------|
| Video understanding (Video-MME) | 66.4 | 59.8 | +6.6 |
| Spatial video reasoning (VSI-Bench) | 56.0 | 19.3 | +36.7 |
| ASR WER ↓ (Libri-clean/other-long) | 2.04 / 4.20 | 7.73 / 7.98 | −5.69 / −3.78 |
| Image editing (GEdit-Bench) | 6.02 | 7.42 | −1.40 |
| Denoising (PSNR) | 25.70 | 22.19 | +3.51 |

Across 76 key tasks, Uni-MoE 2.0 leads on over 50: +7% average on video understanding (n=8), +7% on omnimodal tasks, +4% on audiovisual reasoning, and 4.2% lower WER on long-form speech. Performance is also robust for image generation, low-level restoration, speech QA, and controllable depth-to-image (Li et al., 16 Nov 2025).

6. Comparative Analysis and Foundations

Relative to the original Uni-MoE (Li et al., 18 May 2024), Uni-MoE 2.0 extends core principles by:

  • Advancing static sparse MoE routing to dynamic-capacity, balancing routed and shared/global expert activation.
  • Expanding sparse MoE to handle up to 10 input modalities via Omni-Modality 3D RoPE and task-specific generative tokens.
  • Generalizing progressive training: original LoRA-tuned multi-modal SFT is expanded with direct reinforcement (GSPO/DPO) and explicit expert calibration via balanced data annealing.
  • Improving parameter and computational efficiency through dynamic expert allocation (1.5–3 experts per token) rather than the previous fixed top-$K$ routing.

The result is unified coverage, efficiency, and transferability across diverse multimodal and omnimodal evaluation paradigms, while remaining substantially open-source.

7. Implications and Future Directions

Uni-MoE 2.0 empirically validates the effectiveness of dynamic-capacity routing, multimodal-aligned position encoding, and reinforcement-augmented progressive training for scalable omnimodal large models. The explicit use of generative tokens and balanced cross-modal data annealing addresses the mode collapse and performance biases observed in prior MLLMs. Future research will plausibly extend to further expert-role diversification, hierarchical MoE gating, expansion to medical and spatiotemporal modalities, and more refined alignment between cross-modal representations and expert selection, building on the flexible, open foundation defined by Uni-MoE 2.0 (Li et al., 16 Nov 2025, Li et al., 18 May 2024).
