Sigma-MoE-Tiny: Ultra Sparse MoE Model

Updated 19 December 2025
  • Sigma-MoE-Tiny is an ultra-sparse, decoder-only transformer model featuring a 40:1 parameter sparsity with 96 experts per layer and one active expert per token.
  • The architecture employs a per-token FP32 gating network and a progressive sparsification schedule to optimize expert selection and balance load across layers.
  • Benchmark results show Sigma-MoE-Tiny outperforms comparably-sized models on tasks like MMLU, GSM8K, and HumanEval, validating its design and sparsity strategies.

Sigma-MoE-Tiny is an open-source, decoder-only transformer LLM built on an extreme-sparsity Mixture-of-Experts (MoE) architecture. With a 40:1 parameter sparsity ratio and 96 experts per MoE layer, of which only one is active for each token, it reaches the highest sparsity among open-source MoE models while maintaining stable optimization and strong benchmark performance despite the small number of activated parameters (Hu et al., 18 Dec 2025).

1. Architectural Structure

Sigma-MoE-Tiny features a 56-layer decoder-only transformer in which every feed-forward block is implemented as an MoE module. The per-layer configuration comprises:

  • Hidden size $d_\text{model} = 1536$.
  • Feed-forward intermediate size $d_\text{ff} = 768$.
  • Self-attention with 16 query/key and 4 value heads.
  • 20 billion total parameters, with only 0.5 billion activated per token (sparsity = 40).

Each MoE feed-forward block comprises $N_E = 96$ two-layer SwiGLU FFN experts. A token-level FP32 gating network computes per-token logits $z_{i,j} = w_i^\top h_j$ and gating probabilities $p_{i,j} = \frac{\exp(z_{i,j})}{\sum_{k=1}^{96} \exp(z_{k,j})}$. Only the top-1 scoring expert ($K = 1$) is selected and activated for that token. This scheme maximizes both computational sparsity and expert specialization.
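A minimal PyTorch sketch of this per-token FP32 top-1 routing over SwiGLU experts, using the stated dimensions ($d_\text{model} = 1536$, $d_\text{ff} = 768$, 96 experts). Class and variable names are illustrative, and scaling the expert output by the gating probability follows the common Switch-Transformer convention rather than a detail confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, D_FF, N_EXPERTS = 1536, 768, 96  # dimensions quoted above


class SwiGLUExpert(nn.Module):
    """Two-layer SwiGLU feed-forward expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Top1MoE(nn.Module):
    """Top-1 MoE layer with a token-level gating network kept in FP32."""
    def __init__(self, d_model: int = D_MODEL, d_ff: int = D_FF, n_experts: int = N_EXPERTS):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)  # FP32 gating weights w_i
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_tokens, d_model); logits z_{i,j} = w_i^T h_j computed in FP32
        logits = self.router(h.float())
        probs = logits.softmax(dim=-1)        # gating probabilities p_{i,j}
        top_p, top_idx = probs.max(dim=-1)    # top-1 expert per token
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scaling by the gating probability is an assumption (Switch-style).
                out[mask] = expert(h[mask]) * top_p[mask].unsqueeze(-1).to(h.dtype)
        return out
```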

2. Sparsity Regime and Load Balancing

Sigma-MoE-Tiny reaches a sparsity metric of $s = \frac{20\,\text{B}}{0.5\,\text{B}} = 40$. At this extreme level, expert selection becomes highly imbalanced, especially in lower layers, posing notable challenges for the conventional load-balance loss (LBL) introduced in prior work [Fedus et al., 2022]. LBL is computed as:

$$\text{LBL} = \frac{1}{N_E} \sum_{i=1}^{N_E} f_i \, p_i,$$

where $f_i$ is the fraction of tokens routed to expert $i$ and $p_i$ is its mean gating probability over the batch. Under severe sparsity, this objective is minimized by uniform gating probabilities $p_i$ while the token fractions $f_i$ remain free to be non-uniform, which can cause expert load collapse in lower layers. Load imbalance is quantified by the relative deviation $\Delta_i = (n_i - \bar{n}) / \bar{n}$, where $n_i$ is the number of tokens assigned to expert $i$ and $\bar{n} = N_B / N_E$ is the per-expert average over a batch of $N_B$ tokens.
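As a concrete illustration, the batch-level quantities $f_i$, $p_i$, and $\Delta_i$ might be computed from the router outputs as sketched below; tensor names and shapes are assumptions, not the authors' implementation.

```python
import torch

def load_balance_stats(probs: torch.Tensor, top_idx: torch.Tensor, n_experts: int = 96):
    """Compute LBL = (1/N_E) * sum_i f_i * p_i and the per-expert deviation Delta_i.

    probs:   (n_tokens, n_experts) gating probabilities p_{i,j}
    top_idx: (n_tokens,) index of the selected top-1 expert per token
    """
    n_tokens = probs.shape[0]
    # f_i: fraction of tokens routed to expert i (hard assignment counts)
    counts = torch.bincount(top_idx, minlength=n_experts).float()
    f = counts / n_tokens
    # p_i: mean gating probability assigned to expert i over the batch
    p = probs.mean(dim=0)
    lbl = (f * p).sum() / n_experts
    # Delta_i = (n_i - n_bar) / n_bar with n_bar = N_B / N_E
    n_bar = n_tokens / n_experts
    delta = (counts - n_bar) / n_bar
    return lbl, delta
```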

To resolve routing collapse and restore stable expert utilization, Sigma-MoE-Tiny adopts a progressive sparsification schedule in the first 8 layers, gradually reducing the number of active experts per layer from $k_\ell(0) \in \{8, 8, 6, 6, 4, 4, 2, 2\}$ down to 1 over the first 90% of training, while the remaining 48 layers use $k_\ell = 1$ throughout. This approach incurs only a temporary increase of roughly 25% in activated parameters and yields no long-term stability cost.
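One way to express this per-layer schedule as a function of training progress is sketched below; the stepwise linear interpolation between $k_\ell(0)$ and 1 is an assumption, since only the initial values and the 90% transition point are specified in the paper.

```python
# Initial active-expert counts for the first 8 layers; deeper layers stay at top-1.
K_INIT = [8, 8, 6, 6, 4, 4, 2, 2]
SPARSIFY_FRACTION = 0.9  # all layers reach top-1 after 90% of training

def active_experts(layer: int, progress: float) -> int:
    """Number of active experts k_ell at training progress in [0, 1].

    Assumes a stepwise linear decay from k_ell(0) to 1 over the first 90% of
    training; the exact interpolation used by the authors is not specified.
    """
    if layer >= len(K_INIT) or progress >= SPARSIFY_FRACTION:
        return 1
    k0 = K_INIT[layer]
    frac = progress / SPARSIFY_FRACTION  # 0 -> 1 across the schedule
    return max(1, round(k0 - frac * (k0 - 1)))
```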

3. Training Regimen and Corpus

Pre-training employs both public corpora (deduplicated Nemotron-CC, DCLM, and FineWeb-Edu) and proprietary synthetic data, ensuring broad domain coverage across general knowledge, math, and code. Optimization uses AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-9}$; weight decay 0.1), gradient clipping at 1.0, and parameter initialization from $\mathcal{N}(0, 0.02)$. The maximum sequence length is 4K.

The learning rate follows a warmup-stable-decay policy: linear warmup to $2.6 \times 10^{-4}$ over 2K steps, held constant for 60% of tokens, cosine decay to $1.6 \times 10^{-4}$ over the next 30%, and linear decay to $2.6 \times 10^{-5}$ in the final 10%. The batch size grows from 1,920 to 7,680 over the first 40% of steps and is then fixed. The LBL coefficient is $\lambda = 10^{-3}$. Progressive sparsification is employed for the first 90% of training, with full top-1 sparsity only in the final 10%. No irrecoverable loss spikes occur, and expert assignments remain balanced.
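A sketch of the warmup-stable-decay policy as a step-to-learning-rate function; the piecewise boundaries mirror the description above, while the exact handling of phase transitions and the cosine shape are assumptions.

```python
import math

PEAK_LR, COSINE_END_LR, FINAL_LR = 2.6e-4, 1.6e-4, 2.6e-5
WARMUP_STEPS = 2_000

def lr_at(step: int, total_steps: int) -> float:
    """Warmup-stable-decay: linear warmup, constant to 60%,
    cosine decay to 1.6e-4 by 90%, linear decay to 2.6e-5 at the end."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    t = step / total_steps  # fraction of training completed
    if t <= 0.6:
        return PEAK_LR
    if t <= 0.9:
        u = (t - 0.6) / 0.3  # position inside the cosine phase, 0 -> 1
        return COSINE_END_LR + 0.5 * (PEAK_LR - COSINE_END_LR) * (1 + math.cos(math.pi * u))
    u = (t - 0.9) / 0.1      # position inside the final linear phase
    return COSINE_END_LR + (FINAL_LR - COSINE_END_LR) * u
```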

4. Comparative Performance Metrics

Sigma-MoE-Tiny demonstrates competitive or superior performance against both dense and sparse models with higher active parameter counts. The following table summarizes key pre-trained ("Base") evaluation results:

| Model | Activated / Total Params | MMLU (5-shot EM %) | GSM8K (8-shot EM %) | HumanEval (0-shot Pass@1) |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6 / 0.6 B | 52.81 | 59.59 | 29.27 |
| Gemma-3-4B | 4.0 / 4.0 B | 59.51 | 43.97 | 35.98 |
| DeepSeek-V2-Lite | 2.4 / 15.7 B | 58.30 | 41.10 | 29.90 |
| Sigma-MoE-Tiny | 0.5 / 20 B | 64.81 | 71.65 | 42.07 |

In ablation studies, switching to the target sparsity at later points in training produces a performance drop of $\lesssim 0.3$ points, validating that the progressive schedule has a negligible impact on final accuracy.

Post-training ("Aligned") comparisons on competitive reasoning, mathematics, and coding benchmarks reveal continued leading results:

| Model | Activated / Total Params | MMLU-Redux (avg@1) | MATH-500 (avg@1) | HumanEval (avg@1) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 7 / 7 B | 68.5 | 92.8 | 64.0 |
| Qwen3-1.7B | 1.7 / 1.7 B | 73.9 | 93.4 | 70.1 |
| Phi-3.5-MoE | 6.6 / 42 B | 78.6 | 59.5 | 75.0 |
| Sigma-MoE-Tiny | 0.5 / 20 B | 79.8 | 94.6 | 79.9 |

Sigma-MoE-Tiny thus matches or exceeds comparably-sized and much larger dense/sparse models.

5. Load Balancing: Analysis and Approaches

At extreme sparsity, standard LBL becomes insufficient: it can be minimized via uniform gating probabilities $p_i$ rather than by enforcing genuine balance in expert utilization $f_i$. In lower network layers, this produces substantial load imbalance. The progressive sparsification strategy reintroduces flexibility, letting the routing network explore expert diversity and avoid collapse during early optimization, with full sparsity deferred to late training. Alternative approaches, such as a native top-1 LBL that directly minimizes $\|\mathbf{f}\|^2$, may over-constrain specialization and harm performance. This suggests that hybrid or temperature-annealed balancing objectives could further improve both stability and expert specialization.
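For illustration, the two balancing objectives discussed above could be written as follows. Because the hard token fractions $f_i$ are non-differentiable, the squared-norm variant here uses the mean gating probabilities as a soft surrogate; this surrogate is an assumption, not the formulation given in the paper.

```python
import torch

def standard_lbl(probs: torch.Tensor, top_idx: torch.Tensor, n_experts: int = 96) -> torch.Tensor:
    """Standard LBL = (1/N_E) * sum_i f_i * p_i.

    f_i comes from hard (non-differentiable) token counts, so gradients flow
    only through the mean probabilities p_i; uniform p_i can therefore minimize
    the loss even when f_i is badly imbalanced (the "short-circuit" above).
    """
    f = torch.bincount(top_idx, minlength=n_experts).float() / probs.shape[0]
    p = probs.mean(dim=0)
    return (f.detach() * p).sum() / n_experts

def squared_fraction_lbl(probs: torch.Tensor) -> torch.Tensor:
    """A ||f||^2-style objective using soft fractions (mean probabilities) as a
    differentiable stand-in for hard counts; its minimum is the uniform
    distribution, which can over-constrain expert specialization."""
    f_soft = probs.mean(dim=0)
    return (f_soft ** 2).sum()
```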

6. Insights and Future Research Directions

Key findings include:

  • The 40:1 sparsity regime is trainable and can match or outperform larger dense and MoE models.
  • Progressive sparsification is critical to avoid load collapse and ensure balanced expert activation.
  • There is no long-term stability or performance loss from the adopted transition schedule—empirically, loss spikes are absent and metrics are robust.
  • The observed LBL “short-circuit” at high sparsity highlights an important failure mode; research should pursue router architectures or loss objectives that effectively balance utilization, particularly in lower layers.
  • Prospective directions include adaptive per-layer sparsity, dynamic expert capacity (growth/shrinkage), meta- or reinforcement-learning joint loss optimization, and router designs with targeted expert under-utilization bias.

Sigma-MoE-Tiny’s design and analytical results provide a reference point for next-generation ultra-sparse MoE architectures, demonstrating that principled scheduling, robust optimization, and careful load-balancing can enable extremely sparse transformers without sacrificing benchmark performance (Hu et al., 18 Dec 2025).
