
UniMoE-Audio: Unified Audio Generation Model

Updated 16 October 2025
  • UniMoE-Audio is a unified audio generation model that employs dynamic Top-P routing and hybrid expert roles to seamlessly synthesize both speech and music.
  • It uses a three-stage training curriculum—independent specialist training, MoE warmup, and joint training—to balance domain-specific learning with cross-domain synergy.
  • Experimental results show state-of-the-art performance in speech intelligibility and music aesthetics while efficiently managing computational resources.

UniMoE-Audio is a unified audio generation model that implements a Dynamic-Capacity Mixture-of-Experts (MoE) architecture to advance speech and music synthesis within a single framework. The distinguishing innovation is the dynamic expert allocation mechanism, particularly Top-P routing, and a hybrid expert design that together address the conflicting requirements and severe data imbalance between speech and music, two domains that have historically resisted successful unification in generative audio modeling (Liu et al., 15 Oct 2025). The resulting system delivers state-of-the-art performance across a suite of speech and music generation tasks through carefully staged training that preserves both domain specialization and cross-domain synergy.

1. Dynamic-Capacity Mixture-of-Experts Framework

UniMoE-Audio’s core architecture is a Dynamic-Capacity Mixture-of-Experts (MoE) implemented within the Transformer feed-forward (FFN) sublayers. Input tokens $X \in \mathbb{R}^{N \times d}$ are processed by a gating module that produces per-token activation probabilities over $E$ experts:

$$P = \mathrm{Softmax}(X W_g)$$

where $W_g \in \mathbb{R}^{d \times E}$ is the trainable gating matrix.

Unlike conventional MoE systems that use a static Top-K routing per token, UniMoE-Audio employs a Top-P routing strategy: for each token, experts are sorted by their activation probabilities, and the smallest set $I$ is selected such that

$$\sum_{i \in I} P_i \geq p$$

where $p$ is a predefined threshold (not a hard number of experts). The token's FFN output is then

$$O = \sum_{i \in I} \frac{P_i}{\sum_{j \in I} P_j} \cdot E_i(X)$$

This dynamic allocation matches computation to input complexity: complex/ambiguous tokens trigger more experts, while simpler signals may use fewer.
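
To make the routing concrete, the following minimal PyTorch sketch implements per-token Top-P selection as described above; the function name, threshold default, and tensor shapes are illustrative assumptions, not the paper's actual code.

```python
import torch

def top_p_route(logits: torch.Tensor, p: float = 0.7):
    """logits: [N, E] gating scores for N tokens over E experts.
    Returns a boolean mask [N, E] of selected experts and their
    renormalized mixture weights (both zero outside the set I)."""
    probs = torch.softmax(logits, dim=-1)                    # P = Softmax(X W_g)
    sorted_p, idx = torch.sort(probs, dim=-1, descending=True)
    cum = torch.cumsum(sorted_p, dim=-1)
    # Keep the smallest prefix whose cumulative probability reaches p;
    # subtracting sorted_p shifts the test so that the expert which
    # crosses the threshold is itself included.
    keep_sorted = (cum - sorted_p) < p
    mask = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, idx, keep_sorted)
    weights = probs * mask                                   # zero out unselected experts
    weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize over I
    return mask, weights
```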

Hybrid expert roles are critical (see the combined sketch after this list):

  • Conditionally Routed Experts focus on domain-specific features—e.g., some experts specialize in speech, others in music—and are only activated when selected by the gating mechanism.
  • Shared Experts are always active, capturing domain-agnostic or common audio principles.
  • Null Experts are parameter-free and output zeros, enabling adaptive compute skipping for tokens not requiring further processing.
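
Combining the three roles, a hybrid MoE FFN layer could be organized roughly as below, reusing top_p_route from the previous sketch; the class structure, the unweighted shared-expert sum, and the hidden sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class HybridMoEFFN(nn.Module):
    def __init__(self, d: int, n_routed: int, n_null: int, n_shared: int, p: float = 0.7):
        super().__init__()
        self.gate = nn.Linear(d, n_routed + n_null, bias=False)   # W_g over routed + null
        ffn = lambda: nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: [N, d]
        mask, w = top_p_route(self.gate(x), self.p)
        out = sum(e(x) for e in self.shared)                       # shared: always active
        for i, expert in enumerate(self.routed):                   # routed: sparse dispatch
            tok = mask[:, i]
            if tok.any():
                out[tok] += w[tok, i, None] * expert(x[tok])
        return out                                                 # null experts add nothing
```

Note that in this sketch the null experts need no parameters at all: gate columns beyond the routed experts map to no module, so the probability mass they absorb simply scales down the routed contributions for tokens that need little processing.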

2. Three-Stage Training Curriculum

The training schedule directly addresses domain imbalance and promotes effective knowledge sharing:

a) Independent Specialist Training

Dense "proto-expert" models are pretrained independently on each domain’s (e.g., Chinese TTS, English TTS, text-to-music, video-to-music) imbalanced raw data. Each proto-expert thereby acquires intensive domain-specific parameterizations, without interference from other tasks.

b) MoE Integration and Warmup

Trained proto-experts' FFNs are split (typically halved, with each half forming a routed expert), while shared layers are initialized by parameter averaging. A balanced, high-quality subset of all domains is used to train only the new routing (gate) module and shared components, aligning dynamic routing to actual domain boundaries.
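
As a schematic of this initialization step, assuming each proto-expert FFN is a standard up/down projection pair, the split and averaging might look like the following (all names are hypothetical):

```python
import torch

def split_ffn(w_in: torch.Tensor, w_out: torch.Tensor):
    """w_in: [4d, d] up-projection and w_out: [d, 4d] down-projection of a
    dense proto-expert FFN. Returns two half-width (w_in, w_out) expert pairs."""
    h = w_in.shape[0] // 2
    return (w_in[:h], w_out[:, :h]), (w_in[h:], w_out[:, h:])

def average_shared(params: list[torch.Tensor]) -> torch.Tensor:
    """Initialize a shared layer as the mean of the proto-experts' weights."""
    return torch.stack(params).mean(dim=0)
```

During warmup only the gate and the averaged shared components would then be trainable, with the split routed experts kept frozen, consistent with the stage description above.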

c) Synergistic Joint Training

The full UniMoE-Audio network undergoes end-to-end supervised fine-tuning over a carefully rebalanced corpus curated from all domains. An auxiliary load-balancing loss (gradually decayed during training) ensures that expert utilization remains balanced:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{gen}} + \lambda(t) \cdot \mathcal{L}_{\text{bal}}$$

This balances cross-domain learning and prevents catastrophic forgetting or resource monopolization by high-data tasks.
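
A minimal sketch of this objective, assuming a Switch-Transformer-style balance term and a linear decay schedule for $\lambda(t)$ (both assumptions, as the source does not specify their exact forms):

```python
import torch

def load_balance_loss(probs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """probs: [N, E] gate probabilities; mask: [N, E] selected experts.
    Penalizes uneven utilization via E * sum_i f_i * q_i, where f_i is the
    fraction of tokens routed to expert i and q_i its mean gate probability."""
    E = probs.shape[-1]
    f = mask.float().mean(dim=0)      # f_i: empirical routing fraction
    q = probs.mean(dim=0)             # q_i: mean gate probability
    return E * torch.sum(f * q)

def total_loss(gen_loss: torch.Tensor, probs: torch.Tensor, mask: torch.Tensor,
               step: int, decay_steps: int, lam0: float = 1e-2) -> torch.Tensor:
    lam = lam0 * max(0.0, 1.0 - step / decay_steps)   # lambda(t), decayed to zero
    return gen_loss + lam * load_balance_loss(probs, mask)
```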

3. Experimental Results and Benchmark Performance

UniMoE-Audio demonstrates strong results across both speech and music generation tasks. In speech synthesis, on SeedTTS-EN, LibriSpeech, and AISHELL-3:

  • Content intelligibility: Achieves state-of-the-art Word/Character Error Rate (WER/CER).
  • Perceptual quality: UTMOS of 4.36 on SeedTTS-EN, rivaling or exceeding state-of-the-art specialized models.

In music generation:

  • Aesthetic metrics: Excels on benchmarks for Production Complexity, Production Quality, and Content Enjoyment.
  • Semantic alignment: High CLAP and CLaMP3 scores establish strong correlation between generated music and text/video prompts.
  • Reference-based fidelity: Slightly trails in some Fréchet Audio Distance (FAD) and KL-divergence metrics, favoring more creative over imitative generation.

Compared to a dense Unify-Baseline trained on the same data, UniMoE-Audio avoids catastrophic task interference: naive joint models exhibit significant music generation degradation, while UniMoE-Audio maintains strong performance, indicating genuinely synergistic learning.

4. Expert Routing Dynamics and Synergistic Learning

The expert activation patterns provide direct insight into how domain conflicts are mitigated:

  • During balanced joint training, dynamic routing calibrated in the warmup stage maintains the original domain assignments, preserving specialist knowledge.
  • Visualizations show that distinct experts are predominantly allocated to their originating domains (some experts for speech tokens, others for music).
  • Shared experts facilitate cross-domain transfer, allowing the system to generalize and benefit both domains from overlapping representations.
  • Null experts, frequently activated for trivial or simple tokens, further reduce unnecessary computation—implicitly regularizing the model.

This approach systematically overcomes the tendency of large unified architectures to degrade in data-poor domains.

5. Implications and Prospects for Universal Audio Generation

The Top-P dynamic MoE design, combined with a staged curriculum, offers several strategic advantages for advancing universal audio generation:

  • Resource efficiency: Adaptive allocation of compute per token enables scalable large-model deployment and reduced inference latency.
  • Task conflict mitigation: Clear expert demarcation and calibrated routing insulate tasks from mutual interference, making joint training feasible even under severe data imbalance.
  • Extensibility: The modular expert pool concept lends itself to incorporating additional domains (e.g., environmental sounds, effects) or modalities (vision, text) by introducing new routed experts and retraining the gating module.
  • Guidance for future architectures: Analysis of expert utilization and computation skipping suggests directions for further optimizing large-scale, multi-domain generative models.

A plausible implication is that patterns established in UniMoE-Audio—dynamic routing, hybrid expert roles, and curriculum-based domain specialization—may generalize to other multimodal or multi-task generative systems, providing a robust template for unifying historically incompatible domains under a single generative model.

6. Summary Table: UniMoE-Audio Expert and Training Strategies

| Component | Function | Implementation Detail |
|---|---|---|
| Top-P Routing | Dynamic expert allocation per token | $\sum_{i\in I} P_i \geq p$, adaptive set $I$ |
| Routed Experts | Domain-specific pattern extraction | Activated per gating; instantiated from split proto-experts |
| Shared Experts | Domain-agnostic knowledge sharing | Always active |
| Null Experts | Computation skipping for simple tokens | Output constant zeros |
| Specialist Training | Domain isolation, strong initial prototypes | Independent dense models per domain |
| MoE Warmup | Routing calibration on balanced subset | Gate and shared-expert training |
| Joint Training | End-to-end cross-domain optimization | Balanced dataset, load-balancing loss |

7. Context and Significance

UniMoE-Audio represents a concrete solution to the integration of speech and music generation under one scalable, resource-efficient model, explicitly overcoming the main challenges of domain imbalance and antagonistic transfer that have historically limited progress in universal audio generation (Liu et al., 15 Oct 2025). The demonstrated synergy between data engineering, model design, and curriculum learning in this context suggests a template for further advances in unified content generation frameworks across modalities and domains.
