MoE Upcycling: Transforming Dense Models
- MoE Upcycling is a technique that converts dense neural sub-layers into sparsely activated Mixture-of-Experts modules, enhancing scalability.
- It leverages pretrained weights by partitioning and reusing them while introducing routers and diverse expert initialization methods like Drop-Upcycling and Delta Decomposition.
- This approach delivers improved resource efficiency and performance boosts across language, vision, and multimodal applications.
The Mixture-of-Experts (MoE) upcycling technique is a class of methods for transforming a pretrained dense model into a sparsely activated MoE architecture by reusing, partitioning, or augmenting the existing model weights. Initially advanced in the context of neural language and vision models to address the prohibitive cost of large-scale pretraining, MoE upcycling leverages the “sunk cost” of dense model training to enable scalable, high-capacity networks without training from scratch. The technique is now foundational for resource-efficient scaling in LLMs, vision transformers, multimodal encoders, and domain-specialized systems.
1. Core Principles and Algorithmic Formulation
MoE upcycling centers on replacing selected dense sub-layers—typically feedforward multilayer perceptron (MLP) or feedforward network (FFN) modules in transformer architectures—with sparsely activated MoE equivalents. The process preserves the global architecture (e.g., number of layers, attention structure), augments the parameter count, and introduces conditional computation via routers or gating networks.
The generic upcycling workflow comprises:
- Layer Selection and Replacement: Within the pretrained dense model, choose which MLP/FFN layers to convert to MoE. Common practice is to substitute approximately half of these layers, but configurations vary (Komatsuzaki et al., 2022).
- Expert Initialization: For each layer to be upcycled, replace the dense MLP/FFN with an MoE block containing multiple experts. Each expert is initialized by directly copying the dense model's weights. Additional strategies include advanced allocation via importance-based weight sampling, co-activation clustering, or genetic parameter merging to increase expert diversity (Zhu et al., 7 Jun 2024, Hui et al., 2 Oct 2024).
- Router Introduction and Initialization: A new router (gating mechanism) is introduced to direct tokens to experts. Typically, routers are randomly initialized, although advanced variants can be upcycled from attention heads or constructed using domain/task priors for adaptive routing (Ran et al., 31 Aug 2025, Zhou et al., 21 Aug 2024).
The routing mechanism yields a sparse activation pattern per token:

$$p_i(x) = \mathrm{softmax}(W_r x)_i, \qquad i = 1, \dots, E,$$

for token representation $x$ and experts $1, \dots, E$. The top-$k$ routing scheme activates only $k$ experts per token, with the effective output

$$y = \sum_{i \in \mathcal{T}_k(x)} p_i(x) \, E_i(x),$$

where $\mathcal{T}_k(x)$ are the selected experts and $E_i(x)$ their outputs (Komatsuzaki et al., 2022).
Training is resumed (“continued pretraining” or “post-pretraining”), allowing the router and expert weights to diverge and specialize.
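A minimal PyTorch sketch of this workflow is shown below, assuming nothing beyond the description above: `DenseFFN` stands in for a pretrained FFN sub-layer, and the illustrative `UpcycledMoE` class copies its weights into every expert and adds a randomly initialized top-$k$ router. The class and parameter names are placeholders, not any specific paper's implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Stand-in for a pretrained transformer FFN/MLP sub-layer."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class UpcycledMoE(nn.Module):
    """MoE block whose experts are all initialized from one dense FFN."""
    def __init__(self, dense_ffn: DenseFFN, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_model = dense_ffn.up.in_features
        # Expert initialization: each expert starts as an exact copy of the dense FFN.
        self.experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
        # Router introduction: a new, randomly initialized gating network.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)          # p_i(x) per token
        top_p, top_idx = probs.topk(self.top_k, dim=-1)    # select top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Upcycling: replace a dense FFN sub-layer with its MoE counterpart.
dense = DenseFFN(d_model=64, d_hidden=256)
moe = UpcycledMoE(dense, num_experts=8, top_k=2)
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```

Note that with identical experts the block's output at initialization is the dense output scaled by the sum of the selected routing weights, which is one reason variants such as virtual group initialization coordinate the router setup to restore exact functional parity with the dense model.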
2. Model Surgery, Initialization Variants, and Parameter Efficiency
While basic upcycling duplicates dense weights into all experts, recent work has introduced several refinements to address redundancy and facilitate convergence:
- Virtual Group Initialization: For “fine-grained” MoE architectures, the dense FFN is partitioned into shards, with the MoE constructed so each group collectively covers all shards. Router initialization is coordinated to maintain functional parity with the original dense function at startup (He et al., 10 Oct 2024).
- Partial Re-initialization (Drop-Upcycling): To overcome the slow convergence and lack of expert specialization that arise from identical expert initialization, Drop-Upcycling randomly re-initializes a specified fraction of FFN weights per expert. This introduces diversity while retaining most pretrained knowledge; well-chosen re-initialization ratios yield both rapid convergence and strong generalization (Nakamura et al., 26 Feb 2025). A minimal re-initialization sketch follows this list.
- Delta Decomposition (DeRS Paradigm): Given the high similarity of upcycled experts in early training, each expert's weight is decomposed as $W_i = W_{\mathrm{base}} + \Delta W_i$, with $\Delta W_i$ stored or trained as a lightweight, sparse, or low-rank delta. This reduces parameter count while retaining expert flexibility (Huang et al., 3 Mar 2025); a decomposition sketch appears after the summary table below.
- Attention Parameter Upcycling: Beyond FFN/MLP layers, BAM (Branch-Attend-Mix) introduces the upcycling of specialist attention modules, initializing Mixture-of-Attention (MoA) layers directly from multiple independently fine-tuned dense experts. Query/output parameters can be upcycled individually; key/value parameters can be shared for memory efficiency (Zhang et al., 15 Aug 2024).
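The Drop-Upcycling bullet above can be made concrete with the following sketch. It reuses the `DenseFFN` class from the earlier snippet and re-initializes a random fraction of each expert copy along the FFN's intermediate dimension; the granularity (whole hidden units) and the helper name `drop_upcycle_expert` are illustrative assumptions rather than the paper's reference implementation.

```python
import copy
import torch
import torch.nn as nn

def drop_upcycle_expert(dense_ffn: nn.Module, reinit_ratio: float = 0.5) -> nn.Module:
    """Copy a dense FFN and randomly re-initialize a fraction of its hidden units.

    Illustrative sketch: the re-initialized slice is chosen along the FFN's
    intermediate dimension, so the corresponding rows of the up-projection and
    columns of the down-projection are reset while the remaining pretrained
    weights are kept.
    """
    expert = copy.deepcopy(dense_ffn)
    d_hidden = expert.up.out_features
    num_reinit = int(reinit_ratio * d_hidden)
    idx = torch.randperm(d_hidden)[:num_reinit]  # hidden units to reset
    with torch.no_grad():
        # Fresh random weights for the selected hidden units only.
        expert.up.weight[idx] = torch.empty(num_reinit, expert.up.in_features).normal_(std=0.02)
        expert.up.bias[idx] = 0.0
        expert.down.weight[:, idx] = torch.empty(expert.down.out_features, num_reinit).normal_(std=0.02)
    return expert

# Each expert gets a different random subset, breaking the symmetry of
# identical initialization while retaining most pretrained knowledge.
dense = DenseFFN(d_model=64, d_hidden=256)   # DenseFFN from the earlier sketch
experts = [drop_upcycle_expert(dense, reinit_ratio=0.5) for _ in range(8)]
```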
The table below summarizes several representative upcycling techniques:
| Technique | Expert Initialization | Additional Features |
|---|---|---|
| Vanilla Upcycling | Duplicate dense FFN | Random router, MoE GLU/MLP |
| Virtual Group (He et al., 10 Oct 2024) | Sharded FFN, grouped experts | Router grouping, scaling factor |
| Drop-Upcycling (Nakamura et al., 26 Feb 2025) | Duplicate & partial re-init. | Expert-level randomness, tuned re-init ratio |
| DeRS (Huang et al., 3 Mar 2025) | Base FFN + lightweight delta | Sparse/low-rank delta during upcycling |
| BAM (Zhang et al., 15 Aug 2024) | Full FFN + full attention | MoA soft routing, shared vs. expert KV |
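To illustrate the DeRS row of the table, the sketch below parameterizes each expert as a shared, frozen base layer plus a low-rank per-expert delta. The class name `DeltaExpert` and the low-rank form $B_i A_i$ are illustrative choices consistent with the description above, not the DeRS training recipe itself.

```python
import torch
import torch.nn as nn

class DeltaExpert(nn.Module):
    """One expert weight expressed as a shared base plus a low-rank delta.

    W_i = W_base + B_i @ A_i, where only A_i and B_i are expert-specific.
    """
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                      # shared, frozen pretrained weight
        out_f, in_f = base.weight.shape
        # Lightweight per-expert delta, with B_i zero-initialized so that the
        # upcycled expert starts functionally identical to the dense layer.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

# Eight experts share one pretrained base layer; only the small deltas differ.
base = nn.Linear(64, 256)
for p in base.parameters():
    p.requires_grad_(False)
experts = nn.ModuleList([DeltaExpert(base, rank=8) for _ in range(8)])
x = torch.randn(10, 64)
print(experts[0](x).shape)  # torch.Size([10, 256])
```

With the delta factor $B_i$ initialized to zero, every expert matches the dense layer at the start of continued training, and per-expert storage shrinks from `out_features * in_features` weights to `rank * (out_features + in_features)`.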
3. Routing Mechanisms and Diversification Strategies
Early upcycling schemes used random router initialization with “Top-K” or “Expert Choice” dispatching. To enhance specialization and avoid expert collapse:
- Gating Logit Normalization: Applied in Skywork-MoE, this rescales the router logits before the softmax,

$$\tilde{z} = \lambda \cdot \frac{z - \mu(z)}{\sigma(z)},$$

where $\mu(z)$ and $\sigma(z)$ are the per-token mean and standard deviation of the logits and $\lambda$ controls the distribution sharpness, increasing specialization and reducing token "dropouts" (Wei et al., 3 Jun 2024). A minimal sketch of this normalization follows this list.
- Mixture-of-Routers (Router Upcycling): Instead of a single linear router, multiple routers are initialized from attention heads of the dense model. Each token representation $x$ is projected into multiple queries $q_j$, and routing is performed via an attention-like alignment with expert keys $k_i$. Aggregating the per-query scores, e.g.

$$s_i = \sum_j \frac{q_j^{\top} k_i}{\sqrt{d}},$$

leads to more diverse, well-specialized expert assignments and improved convergence (Ran et al., 31 Aug 2025).
- Domain/Task-aware Routing: Nexus and MoE-LPR incorporate domain or language priors into routing—via projected domain embeddings or designated “frozen” experts for original-language preservation—allowing for continual modular model extension and catastrophic forgetting mitigation (Gritsch et al., 28 Aug 2024, Zhou et al., 21 Aug 2024).
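A minimal sketch of the gating logit normalization bullet is given below; it standardizes each token's router logits and rescales them by a sharpness factor before the softmax. This is a plain reading of the description above, not Skywork-MoE's released code.

```python
import torch
import torch.nn.functional as F

def normalized_gating(logits: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Standardize router logits per token, then rescale by lambda before softmax.

    Larger lambda sharpens the routing distribution, encouraging stronger
    expert specialization; smaller lambda flattens it.
    """
    mu = logits.mean(dim=-1, keepdim=True)
    sigma = logits.std(dim=-1, keepdim=True)
    z = lam * (logits - mu) / (sigma + 1e-6)
    return F.softmax(z, dim=-1)

logits = torch.randn(4, 8)                                # 4 tokens, 8 experts
print(normalized_gating(logits, lam=1.0).max(-1).values)
print(normalized_gating(logits, lam=4.0).max(-1).values)  # sharper distribution
```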
4. Empirical Performance and Resource Efficiency
A consistent finding across studies is that sparse upcycling achieves favorable trade-offs in compute versus downstream performance relative to both dense continuation and MoE-from-scratch baselines.
LLMs:
- Upcycled T5 models (Base, Large, XL) outperform dense baselines on SuperGLUE by 1.5–2 points, requiring only 50% additional training computation versus the initial dense pretraining (Komatsuzaki et al., 2022).
- Upcycling Nemotron-4 15B (E8) with virtual grouping achieves a 2.3% absolute MMLU improvement over continued dense training under the same training compute (He et al., 10 Oct 2024).
- At large scale, upcycled Skywork-MoE and Llama 3 MoE both demonstrate strong downstream performance, e.g., competitive 0-shot MMLU relative to Llama 3-8B at a small fraction of typical pretraining FLOPs (Vavre et al., 13 Dec 2024).
Vision and Multimodal Models:
- Upcycled ViT-B/16 models require only 13% of the dense training compute to achieve +1% improvement on 10-shot ImageNet compared to 58% for dense continuation (Komatsuzaki et al., 2022).
- CLIP-MoE and CLIP-UP surpass dense and scratch-trained MoE counterparts by 6–7% Recall@1 on COCO while using only 30% inference FLOPs of larger dense models (Wang et al., 3 Feb 2025).
Resource Implications:
- Drop-Upcycling and DeRS can match the performance of larger dense models or substantially reduce parameter scale via lightweight delta decomposition (Nakamura et al., 26 Feb 2025, Huang et al., 3 Mar 2025).
- UME achieves an 11.9% CER/WER reduction in ASR while cutting training time by up to 86.7% compared to training from scratch (Fu et al., 23 Dec 2024).
Trade-offs:
- Sparse upcycling can impose inference slowdowns of up to 40% for larger models due to routing and the increased number of active parameters, and is most advantageous when higher accuracy is desired under moderate compute budgets (Doubov et al., 13 Nov 2024).
- Scaling laws reveal that upcycling's efficiency diminishes with excessive sunk pretraining; there exists a critical threshold in the number of additional upcycling tokens beyond which training from scratch is preferable (Liew et al., 5 Feb 2025).
5. Extensions and Adaptive Approaches
Recent innovations further broaden the MoE upcycling paradigm:
- Parameter and Routing Compression: DeRS directly compresses upcycled experts through sparsification or quantization of delta weights, enabling bandwidth- and storage-limited deployment with negligible loss (Huang et al., 3 Mar 2025); a compression sketch follows this list.
- Instruction Fine-tuning Upcycling: UpIT exploits intermediate instruction-tuning checkpoints as naturally diverse experts, applies genetic expert expansion for scalability, and seed-based router warm-up to maximize expert specialization with minimal data (Hui et al., 2 Oct 2024).
- Multitask and Domain Adaptivity: Nexus enables open-ended expansion by efficiently integrating new, independently trained domain experts via adaptive embedding-based routing, supporting "plug-and-play" community model assembly (Gritsch et al., 28 Aug 2024).
- Fine-grained Scientific Induction: The Innovator series introduces multi-stage upcycling (induction, splitting, routing warmup, integration) to build fine-grained science experts within hybrid MoE architectures, achieving 25% performance gains on scientific benchmarks with 99% retention of general ability (Liao et al., 24 Jul 2025).
- Early Stopping and Undertraining: Recent evidence from pipeline analysis demonstrates that aggressive early stopping during expert fine-tuning improves downstream merging and MoE upcycling accuracy by avoiding over-memorization of hard examples, counter to prior assumptions (Horoi et al., 17 Jun 2025).
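As a concrete illustration of the compression bullet above, the sketch below applies magnitude-based sparsification to a stored expert delta; the surviving entries could then be kept in a sparse format or further quantized for deployment. Function and parameter names (`sparsify_delta`, `keep_ratio`) are assumptions for illustration, not the DeRS implementation.

```python
import torch

def sparsify_delta(delta: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep only the largest-magnitude entries of an expert delta.

    Illustrative compression step: the retained entries (plus their indices)
    can be stored sparsely or quantized before deployment.
    """
    k = max(1, int(keep_ratio * delta.numel()))
    # Threshold at the k-th largest absolute value.
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    return torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))

delta = torch.randn(256, 64) * 0.01               # a hypothetical expert delta
sparse = sparsify_delta(delta, keep_ratio=0.1)
print((sparse != 0).float().mean())               # ~0.10 of entries retained
```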
6. Practical Guidance and Future Research Directions
MoE upcycling has matured into a flexible, efficient paradigm for repurposing dense checkpoints into high-capacity, adaptable architectures, but best practices are nuanced:
- Employ diversity-inducing strategies (partial re-initialization, instruction checkpoint selection, expert expansion) to avoid slow convergence and expert collapse.
- Tune router initialization and optimization (normalization, domain embedding, mixture-of-attention) to ensure effective specialization and balanced load.
- Balance the allocation of dense pretraining and additional upcycling based on empirical scaling laws to avoid diminishing returns (Liew et al., 5 Feb 2025).
- For multi-task or continual learning, publish intermediate checkpoints and favor early-stopped experts for subsequent upcycling or merging (Horoi et al., 17 Jun 2025).
- Parameter-efficient storage and adaptive extension via lightweight delta representations (DeRS), adjugate experts (Grove MoE), or modular domain embeddings (Nexus) are promising areas for further exploration.
The MoE upcycling framework continues to enable advances across language, vision, multimodal, and domain-specialized models, supporting scalable, efficient, and evolvable AI systems.