Sigma-MoE-Tiny: Sparse MoE Model
- Sigma-MoE-Tiny is an extremely sparse Mixture-of-Experts model that uses a 20B-parameter architecture with only 0.5B parameters activated per token.
- It employs a 56-layer Transformer backbone with innovations like Group Query Attention and a progressive sparsification schedule to ensure effective load distribution and stability.
- Empirical results demonstrate that Sigma-MoE-Tiny achieves superior reasoning and code generation performance, outperforming several dense and MoE baselines.
Sigma-MoE-Tiny is an extremely sparse Mixture-of-Experts (MoE) LLM designed to maximize parameter efficiency and computational throughput while maintaining competitive language modeling and reasoning performance. The model demonstrates that, with sophisticated architectural and training techniques, it is feasible for a 20-billion-parameter foundation model to activate as few as 0.5 billion parameters per token—achieving high empirical performance and stability under extreme sparsity constraints (Hu et al., 18 Dec 2025, Zadouri et al., 2023).
1. Architectural Overview
Sigma-MoE-Tiny employs a deep decoder-only Transformer backbone with 56 layers, Group Query Attention (GQA) with 16 heads, QK-Norm, and RMSNorm pre-normalization. Each Transformer block incorporates an MoE module with 96 small, two-layer SwiGLU feed-forward network (FFN) experts. For each input token position and layer, only a single expert is activated ("top-1 MoE"), yielding a total-to-activated parameter ratio of 40:1: the model contains 20 billion parameters in total, with only 0.5 billion active per token.
Gating is conducted via a learned affine transformation that produces one logit per expert from the token's hidden state, normalized with a softmax at the default temperature. The expert with the highest gating probability is selected and solely processes that token, ensuring hard routing at the token level (Hu et al., 18 Dec 2025).
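To make the routing concrete, the following is a minimal sketch of a top-1 MoE block with SwiGLU experts and FP32 softmax gating as described above. PyTorch, the module names, and the dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Small two-layer SwiGLU feed-forward expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Top1MoE(nn.Module):
    """Hard top-1 routing over n_experts SwiGLU experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 96):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(n_experts))

    def forward(self, h):                      # h: (tokens, d_model)
        logits = self.router(h.float())        # gating computed in FP32
        probs = F.softmax(logits, dim=-1)      # softmax-normalized gate
        top_p, top_idx = probs.max(dim=-1)     # a single expert per token
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # the selected expert alone processes the token, scaled by its gate
                out[mask] = (top_p[mask].unsqueeze(-1) * expert(h[mask])).to(h.dtype)
        return out, probs, top_idx
```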
2. Extreme-Sparsity and Load-Balancing Challenges
Load balancing between experts is a recognized challenge in sparse MoE regimes, particularly under extreme sparsity (here, routing each token to 1 of 96 experts). The conventional Load-Balancing Loss (LBL),

$\mathcal{L}_{\mathrm{LBL}} = N \sum_{i=1}^{N} f_i \, p_i,$

with $f_i$ denoting the fraction of tokens routed to expert $i$ and $p_i$ the average gating probability for expert $i$, is minimized when either $f_i$ or $p_i$ is uniform over the experts. However, in the lowest layers this loss becomes ineffective due to "routing collapse": softmax uniformity is achieved, but the actual routing ($f_i$) is highly imbalanced, with empirical observations of minimum-loaded experts receiving ≈0% of tokens and the most loaded over 200% of the ideal share (Hu et al., 18 Dec 2025).
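Assuming the LBL takes the conventional $N \sum_i f_i p_i$ form implied by the definitions above, the loss and its failure mode can be expressed compactly; `probs` and `top_idx` are the router outputs from the gating sketch in Section 1, and the function is illustrative rather than the authors' implementation.

```python
import torch

def load_balancing_loss(probs: torch.Tensor, top_idx: torch.Tensor) -> torch.Tensor:
    """probs: (tokens, n_experts) softmax gate; top_idx: (tokens,) selected expert."""
    n_experts = probs.shape[-1]
    # f_i: fraction of tokens actually routed to expert i (non-differentiable)
    f = torch.bincount(top_idx, minlength=n_experts).float() / top_idx.numel()
    # p_i: average gate probability assigned to expert i (differentiable)
    p = probs.mean(dim=0)
    # The loss attains its minimum (1.0) whenever p is uniform, even if f is
    # badly skewed -- the "routing collapse" observed in the lowest layers.
    return n_experts * torch.sum(f * p)
```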
Auxiliary-loss-free bias methods and stricter balance objectives can worsen collapse under 1-of-96 sparsity. A variant that optimizes the routing fractions through a temperature-controlled softmax proxy improves balance marginally, but aggressive constraints may impair expert specialization, underscoring an open trade-off (Hu et al., 18 Dec 2025).
3. Progressive Sparsification and Training Stabilization
To address the loss of effective load balancing during early training and prevent irrecoverable expert collapse, Sigma-MoE-Tiny implements a two-phase progressive sparsification schedule over the number of activated experts per layer, indexed by the fraction of total training tokens consumed (Hu et al., 18 Dec 2025). For the initial 90% of tokens, the first eight layers activate multiple experts per token, affording the LBL a genuine routing space and ensuring nontrivial load distributions, while layers 9–56 always use a single expert. Upon transitioning to full top-1 sparsity in all layers, there is a modest (~25%) drop in activated parameters, but no loss spikes are observed. This schedule is critical for expert diversity and model stability; a sketch follows below.
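A minimal sketch of the two-phase schedule, assuming a placeholder warm-phase expert count `K_WARM` for the first eight layers (the exact value is not reproduced here):

```python
K_WARM = 4              # hypothetical warm-phase expert count for layers 1-8
SWITCH_FRACTION = 0.9   # full top-1 sparsity after 90% of the training tokens

def active_experts(layer: int, token_fraction: float) -> int:
    """layer: 1-indexed Transformer layer; token_fraction: share of training tokens consumed."""
    if layer <= 8 and token_fraction < SWITCH_FRACTION:
        return K_WARM   # relaxed sparsity gives the LBL a genuine routing space
    return 1            # strict 1-of-96 routing elsewhere and after the switch
```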
4. Training and Post-Training Protocols
Pre-training utilizes a mix of high-quality public corpora (Nemotron-CC, deduplicated DCLM, FineWeb-Edu) and proprietary synthetic data, spanning general knowledge, mathematics, and coding domains. The optimizer is AdamW with weight decay 0.1 and gradient clipping at 1.0. Training runs on NVIDIA A100 40GB infrastructure with 4-way tensor parallelism, 96-way expert parallelism, and a micro-batch size of 8 (Hu et al., 18 Dec 2025).
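For reference, the reported setup can be collected into a configuration object; this is a hedged summary, and optimizer constants not reproduced above are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SigmaMoETinyPretrainConfig:
    optimizer: str = "AdamW"
    weight_decay: float = 0.1
    grad_clip_norm: float = 1.0
    tensor_parallel: int = 4
    expert_parallel: int = 96
    micro_batch_size: int = 8
    adam_betas: Optional[Tuple[float, float]] = None  # not reproduced in this summary
    adam_eps: Optional[float] = None                  # not reproduced in this summary
```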
Learning-rate scheduling comprises a warmup to the peak rate, a constant phase for 60% of the tokens, a cosine decay, and a final linear decay over the last 10% of tokens, as sketched below. The batch size increases from 1920 to 7680 over the first 40% of tokens and is then held constant, and a global-batch LBL coefficient is applied throughout. Post-training entails four curricular stages extending the context window from 4K to 128K tokens with a mix of Short-CoT and Long-CoT data, domain balancing (math:code:science:other = 3.5:3.5:2:1), and staged learning rates.
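The shape of the learning-rate schedule can be sketched as a piecewise function of the training-token fraction; the peak and floor rates and the warmup fraction are hypothetical placeholders, since the exact values are not reproduced above.

```python
import math

PEAK_LR, COSINE_END_LR, FINAL_LR = 3e-4, 3e-5, 1e-5   # hypothetical values
WARMUP_FRAC = 0.01                                    # hypothetical warmup length
CONST_END = WARMUP_FRAC + 0.60                        # constant phase spans 60% of tokens
LINEAR_START = 0.90                                   # linear decay over the final 10%

def learning_rate(token_fraction: float) -> float:
    t = min(max(token_fraction, 0.0), 1.0)
    if t < WARMUP_FRAC:                                # linear warmup to the peak rate
        return PEAK_LR * t / WARMUP_FRAC
    if t < CONST_END:                                  # constant phase
        return PEAK_LR
    if t < LINEAR_START:                               # cosine decay
        s = (t - CONST_END) / (LINEAR_START - CONST_END)
        return COSINE_END_LR + 0.5 * (PEAK_LR - COSINE_END_LR) * (1 + math.cos(math.pi * s))
    s = (t - LINEAR_START) / (1.0 - LINEAR_START)      # final linear decay
    return COSINE_END_LR + (FINAL_LR - COSINE_END_LR) * s
```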
5. Empirical Results and Comparative Benchmarks
Sigma-MoE-Tiny demonstrates strong empirical performance relative to both dense and MoE baselines with a significantly smaller active-parameter footprint. Selected pre-training evaluation results (numbers in parentheses denote activated/total parameters):
| Model | MMLU (EM) | BBH (3-shot) | GSM8K (8-shot) | HumanEval (0-shot) |
|---|---|---|---|---|
| Qwen3-0.6B (0.6/0.6B) | 58.30 | 44.10 | 41.10 | 29.90 |
| Gemma-3-4B (4/4B) | 59.51 | 51.70 | 43.97 | 35.98 |
| DeepSeek-V2-Lite (2.4/15.7B) | 52.81 | 41.47 | 59.59 | 29.27 |
| Sigma-MoE-Tiny (0.5/20B) | 64.81 | 63.23 | 71.65 | 42.07 |
Post-training, Sigma-MoE-Tiny achieves state-of-the-art or near state-of-the-art results against 7–8B parameter dense baselines and other leading MoE models:
| Model | MMLU-Redux | MMLU-Pro | GPQA-Diamond avg@8 | MATH-500 | AIME’24/25 | HumanEval | LiveCodeBench |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | — | — | — | — | — | — | — |
| DeepSeek-R1-Distill-Llama-8B | — | — | — | — | — | — | — |
| Qwen3-1.7B | — | — | — | — | — | — | — |
| Phi-3.5-MoE | 78.6% | — | 47.1% | — | — | — | 42.5% |
| Sigma-MoE-Tiny (0.5/20B) | 79.8% | 63.7% | 46.4% | 94.6% | 65.4/48.8% | 79.9% | 42.2% |
A key outcome is that with only 0.5B parameters activated per token, Sigma-MoE-Tiny achieves or surpasses the performance of models with several times its active parameter count, including on reasoning and code generation tasks (Hu et al., 18 Dec 2025).
6. Load-Balancing Insights and Methodological Recommendations
Sigma-MoE-Tiny establishes that training stability and expert balance under 1-of-96 sparsity can be realized through a synergy of GQA, QK-Norm, FP32 gating, and RMSNorm, combined with a global-batch LBL and progressive sparsification. The conventional LBL alone can be "cheated" by pushing the softmax probabilities toward uniformity without balancing the actual routing fractions ($f_i$). Relaxing the sparsity of the lowest layers early in training produces a genuine load distribution, allowing the LBL and optimizer dynamics to seed expert diversity before switching to strict sparsity.
Transitioning to the target sparsity for the final 10% of tokens induces only a negligible performance drop (∼0.2% on MMLU). However, perfectly balanced routing via a strictly enforced LBL may impede expert specialization; methods that directly regularize the actual token allocations ($f_i$), rather than merely proxying through the gate probabilities ($p_i$), are indicated as a promising line of future investigation (Hu et al., 18 Dec 2025).
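A simple per-layer diagnostic for this discrepancy is to compare the realized load fractions $f_i$ against the averaged gate probabilities $p_i$; the helper below is an illustrative sketch, not an interface from the paper.

```python
import torch

def routing_imbalance_report(probs: torch.Tensor, top_idx: torch.Tensor) -> dict:
    """probs: (tokens, n_experts) softmax gate; top_idx: (tokens,) selected expert."""
    n = probs.shape[-1]
    f = torch.bincount(top_idx, minlength=n).float() / top_idx.numel()
    p = probs.mean(dim=0)
    ideal = 1.0 / n
    return {
        "max_load_vs_ideal": (f.max() / ideal).item(),          # >2.0 means >200% of the ideal share
        "min_load_vs_ideal": (f.min() / ideal).item(),          # near 0 signals starved experts
        "gate_uniformity_gap": (p - ideal).abs().max().item(),  # can stay small even when f is skewed
    }
```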
Limitations include the exclusive focus on decoder-only architectures; extensions to adaptive layer-wise sparsity, richer gating mechanisms, auxiliary-loss-free formulations, and stochastic expert dropout remain open for further study.
7. Context and Future Prospects
Sigma-MoE-Tiny stands as a technical milestone within the MoE paradigm, pushing mixture-of-experts sparsity and parameter efficiency to limits not previously demonstrated in open-source LLMs. It combines robust empirical performance, high throughput, and resilience against load-balancing collapse by explicitly integrating architectural, optimization, and scheduling innovations. Key future directions include loss formulations targeting actual routing fractions, adaptive sparsity, and mechanistic diversity in the gating network.
Sigma-MoE-Tiny and related parameter-efficient MoE approaches provide evidence that large-scale foundation models need not compromise on performance or stability when leveraging extreme sparsity, provided the methodological challenges associated with expert balancing are systematically addressed (Hu et al., 18 Dec 2025, Zadouri et al., 2023).