Uni-MoE-2.0 Omni Model
- The paper introduces a unified omnimodal model that leverages dynamic sparse MoE layers and cross-modal alignment to excel across vision, speech, text, and molecular tasks.
- The model employs a novel architecture with routed, shared, and null experts, ensuring efficient computational routing and expert utilization for modality-specific processing.
- The progressive training strategy, including MoE fine-tuning and reinforcement learning, yields state-of-the-art results on 85 multimodal benchmarks and diverse downstream applications.
Uni-MoE-2.0-Omni is an open-source omnimodal large model (OLM) that employs advanced Mixture-of-Experts (MoE) schemes to deliver unified, scalable performance on multimodal tasks. Architecturally, it extends dense transformer backbones (e.g., Qwen2.5-7B, LLaMA 3) with dynamic-capacity sparse MoE layers, cross-modal alignment mechanisms, and a progressive, multi-phase training regime. The model is evaluated across a comprehensive suite of benchmarks in vision, speech, language, and cross-modal reasoning, where it consistently delivers state-of-the-art or highly competitive results relative to leading OLMs and multimodal transformers (Li et al., 16 Nov 2025, AI et al., 28 Oct 2025).
1. MoE Architecture: Dynamic Capacity and Expert Design
Uni-MoE-2.0-Omni replaces standard feed-forward networks in the transformer backbone with sparse, dynamically routed MoE layers. Three expert classes are instantiated:
- Routed Experts: High-capacity, modality-specialized experts activated per token based on router decisions.
- Shared Experts: Compact, generalist experts (each 1/8 the size of a routed expert) universally applied for cross-modal knowledge transfer.
- Null Experts: Zero-parameter experts to enable token-level computational skipping, reducing redundant processing.
Routing operates by computing logits $z = W_r h$ for each token's hidden state $h$, yielding probabilities $p = \mathrm{softmax}(z)$, with expert selection governed by a Top-$p$ coverage criterion: for each token, the minimal set $\mathcal{S}$ of highest-probability experts is activated such that $\sum_{i \in \mathcal{S}} p_i \ge \tau$ for a coverage threshold $\tau$. Null experts participate in routing and, when selected, contribute no computation. The aggregate output is
$$
y = \sum_{i \in \mathcal{S}} p_i \, E_i^{\mathrm{r}}(h) + \sum_{j} E_j^{\mathrm{s}}(h),
$$
where $E^{\mathrm{r}}$ and $E^{\mathrm{s}}$ denote routed and shared experts (Li et al., 16 Nov 2025).
Capacity is dynamically allocated per expert:
$$
C = \left\lceil \gamma \cdot \frac{T}{N} \right\rceil,
$$
with $T$ tokens, $N$ routed experts, and capacity factor $\gamma \ge 1$, balancing compute efficiency and expressivity (Li et al., 16 Nov 2025). Training incorporates auxiliary load-balancing losses to promote uniform expert utilization and prevent mode collapse (2502.01074, AI et al., 28 Oct 2025).
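The routing scheme above can be illustrated with a minimal PyTorch-style sketch; the class name `OmniMoELayer`, the `coverage_p` threshold, and the expert sizes are illustrative assumptions, and per-expert capacity clipping plus the load-balancing loss are omitted for brevity.

```python
# Illustrative sketch of Top-p coverage routing with routed, shared, and null
# experts; module names and hyperparameters are assumptions, not released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniMoELayer(nn.Module):
    def __init__(self, d_model, n_routed=8, n_null=2, coverage_p=0.7):
        super().__init__()
        self.n_routed, self.coverage_p = n_routed, coverage_p
        # The router scores routed + null experts; null experts own no parameters.
        self.router = nn.Linear(d_model, n_routed + n_null, bias=False)
        self.routed = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_routed)])
        # Shared expert: a compact generalist FFN applied to every token.
        self.shared = nn.Sequential(nn.Linear(d_model, d_model // 2), nn.GELU(),
                                    nn.Linear(d_model // 2, d_model))

    def forward(self, h):                                  # h: [num_tokens, d_model]
        probs = F.softmax(self.router(h), dim=-1)
        sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
        # Keep the minimal expert set whose cumulative probability reaches coverage_p.
        keep = (sorted_p.cumsum(dim=-1) - sorted_p) < self.coverage_p
        routed_out = torch.zeros_like(h)
        for e in range(self.n_routed):                     # selected null experts add nothing
            sel = (keep & (sorted_idx == e)).any(dim=-1)
            if sel.any():
                routed_out[sel] = probs[sel, e].unsqueeze(-1) * self.routed[e](h[sel])
        # (Per-expert capacity clipping, C = ceil(gamma * T / N), is omitted here.)
        return routed_out + self.shared(h)                 # shared expert always applies
```

Tokens whose probability mass lands mostly on null experts thus skip the heavy routed FFNs entirely, which is the mechanism behind token-level computational skipping.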
2. Cross-Modal Input Encoding and Alignment
Input processing pipelines integrate diverse modalities using unified strategies:
- Vision: Images and video frames are encoded via frozen vision backbones (e.g., Qwen2.5-VL), generating high-dimensional region tokens. For video, rotary positional encodings (VideoRoPE, 3D RoPE) impose temporal consistency.
- Speech: Raw waveforms are converted to mel spectrograms and passed through Whisper encoders; the resulting acoustic embeddings are projected into the model’s semantic space. Generation utilizes VAE-GAN based continuous latent tokens for higher-fidelity synthesis.
- Text: Instructions and textual inputs are tokenized using subword vocabularies (SentencePiece/ByteT5, SELFIES for molecules) and projected for fusion.
- Fusion: Embeddings from all modalities are concatenated into a single sequence, delimited by special boundary tokens (e.g., ⟨img⟩, ⟨aud⟩), and processed through shared transformer layers; modality-specific routers enable expert selection tailored to input type (AI et al., 28 Oct 2025). A minimal fusion sketch follows below.
For molecular tasks, SELFIES and molecular graphs (via RDKit and GNN encoders) provide universal string and graph representations, projected for alignment with LLM token spaces. Attention masks and padding manage multi-modality batching (2502.01074).
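As a concrete illustration of the fusion step above, the following sketch projects per-modality features into the LLM embedding space and interleaves them with boundary tokens; the module and token names are illustrative assumptions rather than the released implementation.

```python
# Illustrative sketch of multimodal sequence fusion with special boundary
# tokens; projector and token choices are assumptions for exposition only.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_model, d_vision, d_audio, vocab_size):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Lightweight projectors map frozen encoder outputs into the LLM space.
        self.vision_proj = nn.Linear(d_vision, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        # Learned embeddings for boundary markers: 0:<img> 1:</img> 2:<aud> 3:</aud>
        self.special = nn.Embedding(4, d_model)

    def forward(self, text_ids, image_feats=None, audio_feats=None):
        parts = [self.text_embed(text_ids)]                # [T_text, d_model]
        if image_feats is not None:                        # [T_img, d_vision]
            parts += [self.special.weight[0:1],
                      self.vision_proj(image_feats),
                      self.special.weight[1:2]]
        if audio_feats is not None:                        # [T_aud, d_audio]
            parts += [self.special.weight[2:3],
                      self.audio_proj(audio_feats),
                      self.special.weight[3:4]]
        # The fused sequence is consumed by the shared MoE transformer layers;
        # batching additionally requires the attention masks and padding noted above.
        return torch.cat(parts, dim=0)
```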
3. Cross-Modal Alignment: Omni-Modality 3D RoPE
To align temporal, spatial, and modality-specific dependencies within self-attention, Uni-MoE-2.0-Omni applies 3D rotary positional embeddings:
- Decomposition of rotary angles along the temporal ($t$), height ($h$), and width ($w$) axes:
$$
\Theta(t, h, w) = \big[\,\theta_t(t) \;\|\; \theta_h(h) \;\|\; \theta_w(w)\,\big], \qquad d = d_t + d_h + d_w,
$$
where the embedding dimension $d$ is partitioned across the three axes and the per-axis rotations are combined per token for unified spatial-temporal alignment (Li et al., 16 Nov 2025).
- Text: Aligned along the temporal axis.
- Audio: Uses temporal positioning with fixed spatial coordinates.
- Images: Constant temporal, positional height/width assignments per patch.
- Video: Frames and audio are co-encoded by interleaving tokens along the temporal axis $t$.
This scheme facilitates cross-modal interoperability and directly supports long-context video, audio, and image-text tasks.
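A minimal sketch of how per-token (t, h, w) position triples could be assigned under this scheme; the exact index conventions are assumptions inferred from the description above.

```python
# Illustrative assignment of (t, h, w) positions for omni-modality 3D RoPE;
# index conventions are assumptions, not the paper's exact definition.
def text_positions(n_tokens, t_start=0):
    # Text advances along the temporal axis only; spatial axes stay fixed.
    return [(t_start + i, 0, 0) for i in range(n_tokens)]

def audio_positions(n_frames, t_start=0):
    # Audio uses temporal positioning with constant spatial coordinates.
    return [(t_start + i, 0, 0) for i in range(n_frames)]

def image_positions(height, width, t_const=0):
    # Image patches share one temporal index and vary along height/width.
    return [(t_const, h, w) for h in range(height) for w in range(width)]

def video_positions(n_frames, height, width, t_start=0):
    # Video frames (and any co-encoded audio) are interleaved along t.
    positions = []
    for f in range(n_frames):
        positions += [(t_start + f, h, w) for h in range(height) for w in range(width)]
    return positions
```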
4. Training Strategy: Progressive Phases and RL
Training is partitioned into four main phases:
- Cross-Modal Pretraining: The dense backbone is frozen; modality-specific encoders (vision/audio) and Q-former layers learn to condition LLM token embeddings.
- Expert Warm-Up: Dense “expert” models for speech, vision, and generation are fine-tuned to initialize routed expert parameters.
- MoE Fine-Tuning + Annealing: The full MoE model is trained on mixed multimodal instruction data, with routed experts initialized and modalities balanced. Annealing rebalances data composition.
- Omnimodal Reinforcement Learning: Group Sequence Policy Optimization (GSPO) maximizes expected sequence-level rewards, followed by Direct Preference Optimization (DPO):
$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$
using human- or teacher-derived preferences $(y_w, y_l)$ to refine chain-of-thought reasoning and sequence-level task adherence (Li et al., 16 Nov 2025).
The curriculum alternates modality-specific batches and RL steps, with sample-level balancing across input types to ensure equitable learning.
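For reference, the DPO objective above reduces to a few lines given per-sequence log-probabilities under the policy and a frozen reference model; the function and argument names below are illustrative.

```python
# Sketch of the standard DPO loss used in the preference-refinement phase.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """All arguments are [batch] tensors of summed log-probs for chosen/rejected responses."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(beta * margin)) pushes the policy toward the preferred sequence.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```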
5. Dataset Curation and Data Selection
Uni-MoE-2.0-Omni is trained on roughly 75B tokens of open-source multimodal data (summing the components below), curated to maximize cross-modal alignment:
- Image/Video/Text: 13B tokens pretraining, 22B tokens instruction SFT, 5B tokens annealing.
- Audio (ASR, speech Q&A, music): Aggregated from corpora such as LibriSpeech and GigaSpeech, including 15B tokens ASR, 1B audio-caption, 5B for domain-specific Q&A.
- Speech and Image Gen: Special tokens (<speech_start>…<speech_end>, <IMG>, <TASK>) are used for conditional generation with external diffusion models; 2B tokens single-speaker, 5B multi-speaker/style, 5B for TTS activation; 4.8M samples image generation, 5.7M editing, etc.
Preprocessing standardizes inputs (padding images, chunking audio, sampling video frames), and filtering removes low-quality samples. For molecular tasks, active learning reduces the required training samples to 40% of the full set while maintaining performance (2502.01074).
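The input-standardization step might look like the following minimal helpers; window lengths, padding values, and function names are assumptions, not the paper's exact settings.

```python
# Illustrative preprocessing helpers for audio chunking and image padding;
# concrete sizes are placeholders, not the settings used in the paper.
import numpy as np

def chunk_audio(waveform, sample_rate=16000, window_s=30.0):
    """Split a 1-D waveform into fixed-length windows, zero-padding the final chunk."""
    win = int(sample_rate * window_s)
    n_chunks = int(np.ceil(len(waveform) / win))
    padded = np.zeros(n_chunks * win, dtype=waveform.dtype)
    padded[:len(waveform)] = waveform
    return padded.reshape(n_chunks, win)

def pad_to_square(image, fill=0):
    """Pad an H x W x C image to a square canvas before resizing for the vision encoder."""
    h, w, c = image.shape
    side = max(h, w)
    canvas = np.full((side, side, c), fill, dtype=image.dtype)
    canvas[:h, :w] = image
    return canvas
```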
6. Empirical Results and Benchmarking
Uni-MoE-2.0-Omni achieves state-of-the-art performance across 85 multimodal benchmarks (Li et al., 16 Nov 2025, AI et al., 28 Oct 2025):
| Domain/Task | Benchmark / Metric | Uni-MoE-2.0-Omni Result | Notable Comparison |
|---|---|---|---|
| Video Understanding | Video-MME, VSI-Bench | +7% avg. (VSI ↑56.0 vs 19.3) | Qwen2.5-Omni |
| Omnimodal Comprehension | WorldSense, StreamingBench | +7% avg. (WorldSense↑44.7) | Qwen2.5-Omni |
| Audio ASR | LibriSpeech (WER, lower is better) | 4.2 vs 7.98 | Qwen2.5-Omni |
| Image Generation | Wise, FID | Wise↑0.44, FID↓18.04 | JanusPro, Bagel |
| Image Editing | GEdit-Bench, Emu | 6.02 vs 3.20; Emu↑0.076 vs 0.039 | PixWizard |
| Text→Speech | Seed-TTS-Eval | Zh-WER 0.99%, En-WER 1.59% | open-source best |
| Video→Text | StreamingMultiturnBench | Avg. 71.6 vs 67.9 | Qwen2.5-Omni |
Additional metrics include improvements in controllable image generation (e.g. Canny-FID, Depth-RMSE), restoration (SSIM/PSNR), and chain-of-thought reasoning (+5% on MathVista, enhanced text-to-image faithfulness).
For molecular tasks, robust gains are reported (forward reaction, retrosynthesis, reagent/catalyst prediction, yields regression, molecular captioning, procedure recovery), with stable gradient norms and evidence of a universal convergent molecular space (2502.01074).
7. Innovations, Scalability, and Future Directions
Key technical advances:
- Generative Segmentation: Framed as semantic-preserving edits for simultaneous high-quality generation and pixel-wise comprehension.
- High-Fidelity Text Rendering: Pixel-level control via Glyph encoders for text placement in images.
- Identity Preservation: Composite VAE-based losses in editing tasks to maintain semantic integrity.
- Continuous Speech Latents: VAE-GAN acoustic vectors enhance speech synthesis fidelity.
- Dialect-aware ASR: ContextASR data and multi-dialect fine-tuning for robust entity recall.
Scaling laws demonstrate logarithmic performance growth with increased data and model size (2502.01074). Modular design facilitates adaptability to future modalities and larger expert counts, with dynamic MoE routing maintaining computational efficiency.
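To make the logarithmic trend concrete, one can fit score ≈ a + b·log(tokens) to benchmark scores measured at successive data scales; the sketch below is a generic least-squares fit, and the inputs would be actual measurements rather than values supplied here.

```python
# Generic least-squares fit of a logarithmic scaling curve, score ~= a + b*log(tokens);
# the (tokens, scores) inputs are to be supplied from real evaluations.
import numpy as np

def fit_log_scaling(tokens, scores):
    """Fit score = a + b * log(tokens) and return (a, b)."""
    t = np.asarray(tokens, dtype=float)
    X = np.stack([np.ones_like(t), np.log(t)], axis=1)
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(scores, dtype=float), rcond=None)
    return float(coeffs[0]), float(coeffs[1])
```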
Empirical and architectural results collectively indicate that Uni-MoE-2.0-Omni delivers unified, scalable, and computationally efficient performance across the omnimodal spectrum, setting a foundation for future extensible OLMs and universal models for scientific and cross-domain reasoning.