
Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

Published 28 Apr 2026 in cs.CL and cs.AI | (2604.25578v1)

Abstract: We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing $3$--$14\times$ more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.

Summary

  • The paper introduces efficient upcycling of dense models into sparse Mixture-of-Experts architectures, activating only ~5% of parameters per token for expert specialization.
  • It employs a multi-phase curriculum over 5.1 trillion tokens to boost performance in both high- and low-resource languages, setting new benchmarks in efficiency.
  • The open-source release and unsupervised expert routing demonstrate replicability and scalability, addressing multilinguality challenges without excessive compute.

Marco-MoE: Open Multilingual Mixture-of-Expert LLMs with Efficient Upcycling

Introduction

Marco-MoE targets open, efficient multilingual language modeling by pairing Mixture-of-Experts (MoE) architectures with fine-grained expert specialization and an efficient upcycling methodology. The work addresses an inherent trade-off in scaling multilingual LLMs: broadening language coverage at a fixed parameter budget typically reduces per-language proficiency, a phenomenon often termed the “curse of multilinguality.” Marco-MoE combines extreme sparsity (approximately 5% of parameters activated per token), a multi-phase data curriculum spanning 5.1T tokens, and full release of datasets, training recipes, and model weights.

The key methodological contributions are:

  • The first sparse multilingual upcycling paradigm targeting compact models.
  • Sub-matrix expert initialization, facilitating fine-grained specialization.
  • Full open-sourcing, enabling replicability, scrutiny, and extension.

Sparse MoE Architecture and Upcycling Methodology

Marco-MoE adopts a decoder-only transformer, replacing traditional FFN layers with highly sparse MoE layers, following the conditional computation principle. The architecture is optimized for efficiency and specialization: Marco-Nano-Base (8B total/0.6B active) and Marco-Mini-Base (17.3B total/0.86B active) variants demonstrate the design, activating only a fraction of parameters per token. Fine-grained expert specialization is implemented via sub-matrix weight splitting, rather than coarse FFN replication.
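To make the conditional-computation idea concrete, the following is a minimal sketch of a sparse MoE layer standing in for a dense FFN. The summary does not describe the router, so top-k softmax gating with renormalized gate weights, the ReLU activation, and all sizes below are assumptions chosen for illustration, not Marco-MoE's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, router_w, experts_up, experts_down, top_k):
    """Sparse MoE layer (illustrative).

    x:            (tokens, d_model)
    router_w:     (d_model, n_experts)
    experts_up:   (n_experts, d_model, d_expert)
    experts_down: (n_experts, d_expert, d_model)
    """
    probs = softmax(x @ router_w)                       # (tokens, n_experts)
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]    # chosen experts per token
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)   # renormalize over chosen experts

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            e = top_idx[t, slot]
            hidden = np.maximum(x[t] @ experts_up[e], 0.0)   # assumed ReLU expert FFN
            out[t] += gates[t, slot] * (hidden @ experts_down[e])
    return out, top_idx

# Toy sizes (hypothetical): 16 small experts, 2 active per token.
rng = np.random.default_rng(0)
tokens, d_model, d_expert, n_experts, top_k = 4, 8, 4, 16, 2
x = rng.normal(size=(tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts_up = rng.normal(size=(n_experts, d_model, d_expert))
experts_down = rng.normal(size=(n_experts, d_expert, d_model))

out, top_idx = moe_forward(x, router_w, experts_up, experts_down, top_k)
active_fraction = top_k / n_experts   # share of expert parameters touched per token
```

In this toy setup only `top_k / n_experts` of the expert parameters are used for any given token, which is the mechanism behind a small active-parameter count relative to the total.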

The upcycling pipeline comprises three steps:

  1. Partitioning pre-trained dense FFN weights into multiple sub-matrices (pseudo-experts).
  2. Drop-Upcycling: stochastic re-initialization of select weight subsets, calibrated by empirical mean and variance.
  3. Weight scaling by $N^{1/3}$, where $N$ is the number of experts per layer, stabilizing optimization and rectifying magnitude mismatches due to the gating mechanism.

This approach demonstrates rapid convergence and robust expert specialization, outpacing vanilla dense-FFN or coarse-grained MoE upcycling baselines.

Figure 1: Fine-grained upcycling strategy from dense to MoE models, enabling efficient and specialized initialization.
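The three upcycling steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the two-matrix FFN layout, splitting along the hidden dimension, the `drop_ratio` parameter, and applying the $N^{1/3}$ factor as a multiplier on the down-projection only are all choices made here because the summary does not specify them.

```python
import numpy as np

def upcycle_ffn(w_up, w_down, n_experts, drop_ratio, rng):
    """Turn one dense FFN into n_experts fine-grained experts (illustrative).

    w_up:   (d_model, d_ff)   dense up-projection
    w_down: (d_ff, d_model)   dense down-projection
    """
    d_model, d_ff = w_up.shape
    assert d_ff % n_experts == 0, "hidden size must divide evenly"

    # Step 1: partition the dense weights into sub-matrices along the hidden dim.
    ups = np.split(w_up, n_experts, axis=1)      # each (d_model, d_ff / N)
    downs = np.split(w_down, n_experts, axis=0)  # each (d_ff / N, d_model)

    scale = n_experts ** (1.0 / 3.0)             # Step 3 factor, N^(1/3)
    experts = []
    for up, down in zip(ups, downs):
        up, down = up.copy(), down.copy()
        d_exp = up.shape[1]

        # Step 2: re-initialize a random subset of hidden units from a normal
        # distribution matching the empirical mean and std of the sub-matrix.
        n_drop = int(round(drop_ratio * d_exp))
        idx = rng.choice(d_exp, size=n_drop, replace=False)
        up[:, idx] = rng.normal(up.mean(), up.std(), size=(d_model, n_drop))
        down[idx, :] = rng.normal(down.mean(), down.std(), size=(n_drop, d_model))

        # Step 3: rescale (assumed placement: down-projection only).
        experts.append((up, down * scale))
    return experts

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
w_up = rng.normal(size=(8, 32))
w_down = rng.normal(size=(32, 8))
experts = upcycle_ffn(w_up, w_down, n_experts=8, drop_ratio=0.5, rng=rng)
```

Each resulting expert holds a slice of the original hidden units, so the experts start from different parts of the dense model rather than from identical copies of the whole FFN.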

Pre-Training Data and Curriculum

A multi-stage, language-diversifying curriculum is central to Marco-MoE's training efficiency. Four sequential phases progressively expand language coverage and emphasize reasoning and cultural data:

  • Phase 1: High-resource languages, reasoning, and instruction data (2.4T tokens)
  • Phase 2: Upsampling reasoning and Chinese; reducing English data (1.7T)
  • Phase 3: Introduction of nine new medium-resource/low-resource languages (0.5T)
  • Phase 4: Focus on curated synthetic/cultural multilingual data (0.5T)
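As a quick consistency check, the four phases can be written down as a small schedule. The token budgets are taken from the list above; the `Phase` class, the `phase_at` helper, and the short focus strings are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    focus: str
    tokens_trillions: float

# Budgets as listed in the summary.
CURRICULUM = [
    Phase("phase1", "high-resource languages, reasoning, instruction data", 2.4),
    Phase("phase2", "upsampled reasoning and Chinese, reduced English", 1.7),
    Phase("phase3", "nine new medium-/low-resource languages", 0.5),
    Phase("phase4", "curated synthetic/cultural multilingual data", 0.5),
]

def phase_at(tokens_seen_trillions):
    """Return the phase that is active after a given number of training tokens."""
    cumulative = 0.0
    for phase in CURRICULUM:
        cumulative += phase.tokens_trillions
        if tokens_seen_trillions < cumulative:
            return phase
    return CURRICULUM[-1]

total_tokens = sum(p.tokens_trillions for p in CURRICULUM)
```

The per-phase budgets sum to 5.1T tokens, matching the total quoted for the curriculum.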

Empirical evaluation shows that staged curriculum switching yields monotonic gains on English, general multilingual, and regional/cultural benchmarks, saturating only when the model approaches hardware or data constraints.

Figure 2: Evolution of data mixture ratios through pre-training phases, reflecting the shift toward broader and more specialized multilingual coverage.

Performance Evaluation and Analysis

Overall Results

Benchmarked against a suite of strong open-weight LLMs (Qwen3, Gemma3, Trinity, Granite4, Llama3.2, SmolLM3, Tiny-Aya), Marco-MoE models (both Nano and Mini) exhibit the following properties:

  • Superior or state-of-the-art performance across English and multilingual tasks for their compute class.
  • Robust gains as the number of active parameters increases, demonstrating smooth scaling.
  • Outperformance of dense models with 3-14× more active parameters, especially on efficiency metrics.

Figure 3: Marco-MoE-Instruct models outperform open-weight baselines on English, generic multilingual, and regional benchmarks, despite low parameter activation.

Efficiency and Scaling

Marco-Mini-Base (0.86B active) and Marco-Nano-Base (0.6B active) set a new efficiency frontier, leading in performance-to-FLOP ratio. Notably, these models excel in long-tail and low-resource languages, with their margin over baselines widening as language resources become scarcer.

Figure 4: Performance-to-compute ratio (left) and simultaneous proficiency in English and multilingual evaluation (right) highlight superior scaling and generalization for Marco-MoE.

Figure 5: Geographic region-specific multilingual performance demonstrates Marco-MoE’s superiority, especially in resource-scarce language families.

Expert Routing, Linguistic Structure, and Language Scaling

Hierarchical analysis of expert activation signatures shows that the routing mechanism in Marco-MoE models correlates strongly with known language families. Cross-lingual transfer arises naturally for Romance, Germanic, Slavic, Austronesian, and Indic languages, whereas languages described as typologically isolated (e.g., Thai, Vietnamese, Arabic, Hebrew) induce distinct expert pathways, which minimizes interference.

Figure 6: Language-labeled expert activation correlations demonstrate clustering by linguistically-related groups and specialization for isolated languages.

Figure 7: Hierarchical clustering of expert routing recovers canonical language families, illustrating unsupervised phylogenetic structure in routing patterns.

Further, the framework easily scales to 64 languages with consistent performance, with only a modest compute increase.
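The kind of routing analysis described in this section can be illustrated with a small sketch: build a per-language expert-activation profile, correlate the profiles, and cluster them hierarchically. Everything here is an assumption made for illustration. The routing data is synthetic, the language names are placeholders, and Pearson correlation with average-linkage clustering is one common choice, not necessarily the procedure used in the paper.

```python
import numpy as np

def activation_profile(expert_ids, n_experts):
    """Fraction of routed tokens that went to each expert."""
    counts = np.bincount(np.asarray(expert_ids).ravel(), minlength=n_experts)
    return counts / counts.sum()

def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering on a distance matrix (average linkage)."""
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Synthetic routing logs: two made-up "families" that prefer different experts.
rng = np.random.default_rng(0)
n_experts = 8
family_a = np.array([0.22] * 4 + [0.03] * 4)
family_b = family_a[::-1].copy()
languages = ["lang_a1", "lang_a2", "lang_b1", "lang_b2"]   # placeholder names
prefs = [family_a, family_a, family_b, family_b]

profiles = np.stack([
    activation_profile(rng.choice(n_experts, size=2000, p=p / p.sum()), n_experts)
    for p in prefs
])

corr = np.corrcoef(profiles)                  # language-by-language correlation
clusters = average_linkage(1.0 - corr, n_clusters=2)
```

On this synthetic data the two placeholder families separate cleanly because their profiles were constructed to differ; with real routing logs the cluster structure is an empirical question.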

Post-Training: Instruction Tuning and On-Policy Distillation

Marco-MoE-Instruct models are produced via SFT on curated instruction, reasoning, and regional data, followed by cascaded On-Policy Distillation (OPD) from strong teacher models (Qwen3-30B, Qwen3-80B). OPD leverages trajectories sampled from the student’s policy, providing dense, high-signal updates that correct exposure bias and distribution shift inherent to SFT and vanilla off-policy distillation. This yields consistent incremental performance improvements for both general and cultural evaluation axes.
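A minimal sketch of the on-policy idea follows: tokens are sampled from the student, and the teacher provides a per-token signal at every position of that student-generated sequence. The summary does not state the loss, so the per-token reverse KL used here, KL(student || teacher), is an assumption, and the random logits stand in for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sample_from_student(student_logits, rng):
    """Draw one token per position from the student's own distribution."""
    probs = np.exp(log_softmax(student_logits))
    return np.array([rng.choice(len(p), p=p) for p in probs])

def reverse_kl_per_token(student_logits, teacher_logits):
    """KL(student || teacher) at each position (assumed OPD objective)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return (np.exp(s) * (s - t)).sum(axis=-1)

# Toy stand-ins: 5 positions, vocabulary of 10 (hypothetical sizes).
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(5, 10))
teacher_logits = rng.normal(size=(5, 10))

sampled_tokens = sample_from_student(student_logits, rng)   # on-policy trajectory
loss_per_token = reverse_kl_per_token(student_logits, teacher_logits)
loss = loss_per_token.mean()
```

Because every position contributes its own loss term, the signal is dense compared with a single sequence-level reward, and because the sequence comes from the student, the teacher's feedback covers states the student actually visits.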

Open Research Implications

Marco-MoE's methodological openness (release of the full data, recipes, code, and model weights) enables reproducibility and broader scrutiny. The framework supports the viability of MoE, especially fine-grained expert specialization and upcycling, as a way to relieve multilingual capacity bottlenecks under constrained compute, rather than relying on massive-scale monolithic pretraining.

Strong empirical claims:

  • Marco-MoE-Instruct matches or surpasses state-of-the-art open-weight baselines on the reported benchmarks while activating 3-14× fewer parameters.
  • Structured expert routing trajectories inherently mirror typological language relationships without explicit supervision.
  • MoE enables genuinely additive growth to language coverage—scalable to at least 64 languages—without destructive performance interference, as typically observed in dense LLMs.

Future Directions

Key avenues for further research:

  • Extending Marco-MoE’s language reach into unseen, ultra-low-resource, or endangered language domains.
  • Dynamic, modular routing or expert expansion for incremental language addition without full model retraining.
  • Enhanced architectural or token-level routing mechanisms for efficient super-long input contexts and long-range dependency modeling.

Conclusion

Marco-MoE substantiates sparse fine-grained MoE architectures, upcycled from dense precursors, as a robust paradigm for efficient, open, and truly multilingual LLMs. Through architectural innovations, systematic multilingual dataset construction, multi-phase curriculum learning, and advanced distillation strategies, Marco-MoE sets a new practical and analytical benchmark in compact multilingual LLM research. By enabling community access to all components, it provides a reproducible foundation for ongoing expansion and scientific inquiry into high-performance, resource-efficient multilingual NLU.
