Switch Transformer: Sparse MoE Scaling
- Switch Transformer is a sparsely-activated Mixture-of-Experts architecture that routes each token to a single expert using a top-1 selection rule.
- It employs a simplified routing mechanism with an auxiliary load-balancing loss to ensure efficient and uniform expert utilization.
- Empirical evaluations demonstrate up to 7× pre-training speedup and robust multilingual scaling across models with billions to trillions of parameters.
The Switch Transformer is a sparsely-activated Mixture-of-Experts (MoE) variant of the standard Transformer architecture, characterized by a simple top-1 expert routing rule that enables efficient scaling to models with orders of magnitude more parameters—ranging from billions to trillions—without increasing per-token computational cost. By replacing every feed-forward sublayer (FFN) with a Switch-FFN MoE layer, where each token is routed to a single expert, the architecture maintains the same per-token FLOPs as dense Transformers while exploiting massive model sparsity. This design achieves up to 7× improvements in pre-training efficiency under matched computational resources and demonstrates stable training, low inter-device communication overhead, and applicability to multilingual pre-training across over 100 languages (Fedus et al., 2021).
1. Architectural Design and Main Components
The overall structure of the Switch Transformer mirrors that of a standard Transformer, with the exception that every FFN sublayer is replaced by a sparse Switch-FFN. The canonical Transformer block comprises multi-head self-attention, residual connections with layer normalization, a dense FFN (typically realized as $\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1)\, W_2$), and a final residual/normalization step. In the Switch Transformer, the FFN is replaced with a layer composed of $N$ experts $E_1, \ldots, E_N$. Each token embedding $x$ is routed to exactly one expert for processing:

$$y = p_{i^*}(x)\, E_{i^*}(x)$$
where $i^*$ indicates the selected expert for token $x$. Each expert maintains its own parameters (including its own FFN matrices $W_1$ and $W_2$), which are physically partitioned across devices, permitting unconditional growth of the parameter count (by increasing $N$) without increasing per-device computation or memory consumption.
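The per-token computation described above can be sketched in a few lines of NumPy. This is an illustrative single-token sketch, not the paper's implementation; the function name `switch_ffn` and the `(W1, W2)` expert parameterization are assumptions for exposition.

```python
import numpy as np

def switch_ffn(x, W_r, experts):
    """Illustrative single-token Switch-FFN forward pass.

    x       : (d_model,) token embedding
    W_r     : (d_model, N) router projection
    experts : list of N (W1, W2) pairs; W1: (d_model, d_ff), W2: (d_ff, d_model)
    """
    logits = x @ W_r                        # one router logit per expert
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    i = int(np.argmax(probs))               # top-1 expert selection
    W1, W2 = experts[i]
    h = np.maximum(0.0, x @ W1)             # expert FFN: ReLU(x W1) W2
    return probs[i] * (h @ W2), i           # output scaled by the gate value
```

Only the selected expert's weights are touched, which is why per-token FLOPs stay constant no matter how many experts exist.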
2. Routing Mechanism and Load Balancing
Switch Transformers employ a simplified routing algorithm, distinguishing them from earlier MoE approaches. The router computes the logits $h(x)$ for each expert by projecting the token:

$$h(x) = W_r\, x$$
A softmax converts these logits to probabilities over the $N$ experts:

$$p_i(x) = \frac{e^{h(x)_i}}{\sum_{j=1}^{N} e^{h(x)_j}}$$
For each token, only the expert with the highest score is chosen:

$$i^* = \operatorname*{arg\,max}_i \; p_i(x)$$
The forward pass routes $x$ to expert $E_{i^*}$ and scales the expert output by the gate value $p_{i^*}(x)$. To prevent overloaded experts and maintain uniform utilization, a differentiable load-balancing auxiliary loss, $\mathcal{L}_{\text{aux}}$, is introduced:

$$\mathcal{L}_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i$$
Here, $f_i$ represents the fraction of tokens assigned to expert $i$ in a batch, $P_i$ is the expected fraction under the softmax (the mean router probability allocated to expert $i$), and $\alpha = 10^{-2}$. This mechanism ensures both actual and expected token distributions over experts approximate uniformity.
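The auxiliary loss above is straightforward to compute from a batch of router outputs. The following is a minimal sketch under the definitions just given; `load_balance_loss` is an illustrative name, not an API from the paper's codebase.

```python
import numpy as np

def load_balance_loss(router_probs, expert_index, alpha=1e-2):
    """Load-balancing loss: alpha * N * sum_i f_i * P_i.

    router_probs : (T, N) softmax router probabilities for T tokens
    expert_index : (T,) argmax-selected expert per token
    """
    T, N = router_probs.shape
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_index, minlength=N) / T
    # P_i: mean router probability assigned to expert i
    P = router_probs.mean(axis=0)
    return alpha * N * float(f @ P)
```

At perfect balance ($f_i = P_i = 1/N$) the loss attains its minimum value $\alpha$, so any skew toward a subset of experts is penalized.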
3. Computational and Communication Analysis
A comparison between dense and Switch Transformer computational profiles is summarized below:
| Model Type | FLOPs per Token | Communication Overhead |
|---|---|---|
| Dense (T5-like) | $2\, d_{\text{model}}\, d_{ff}$ multiply-adds per FFN sublayer | Single all-reduce of the activation tensor |
| Switch Transformer | $2\, d_{\text{model}}\, d_{ff}$ multiply-adds per FFN sublayer (one expert) | Two all-to-alls of the activation tensor |
where $d_{ff} = 4\, d_{\text{model}}$ in T5. Gating costs $O(N\, d_{\text{model}})$ per token but is negligible relative to FFN computation. In practice, empirical evaluation on 32 TPUv3 cores indicates that Switch Transformer models add only a small communication overhead while delivering roughly 2–7× real-world speedup compared to FLOP-matched dense models. The sparse activation property ensures that FLOPs per token remain constant as $N$ increases, unlocking a new scaling axis for model size.
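A back-of-envelope calculation makes the constant-FLOP claim concrete. The dimensions below assume a T5-Base-like layer ($d_{\text{model}} = 768$, $d_{ff} = 3072$) purely for illustration; the expert FFN cost never changes because each token still visits exactly one expert, and only the tiny router projection grows with $N$.

```python
# Per-token FLOPs for a Switch-FFN layer (illustrative T5-Base-like dims).
d_model, d_ff = 768, 3072              # d_ff = 4 * d_model, as in T5
ffn_flops = 2 * (2 * d_model * d_ff)   # two matmuls, 2 FLOPs per multiply-add

for n_experts in (1, 8, 64):
    router_flops = 2 * d_model * n_experts   # top-1 gating projection W_r
    frac = router_flops / ffn_flops
    # Expert FFN cost is constant: each token is processed by one expert only.
    print(f"N={n_experts:3d}  router/FFN FLOP ratio = {frac:.4f}")
```

Even at 64 experts the router accounts for about 1% of the layer's per-token compute, which is why expert count becomes an essentially free axis for parameter scaling.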
4. Training Stability and Implementation Strategies
The Switch Transformer incorporates several strategies to address MoE training instabilities:
- Selective Precision: Computations and storage primarily use bfloat16 for efficiency, except router logits, which are locally cast to float32 before softmax for stable probability computation.
- Downscaled Initialization: All weights are initialized from a truncated normal with mean $0$ and standard deviation $\sqrt{s/n}$ (with $n$ the fan-in), using a reduced scale $s = 0.1$ (i.e., 10% of the typical $s = 1.0$), significantly reducing early-stage variance.
- Expert Capacity Constraint: Each expert can process up to $\frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}$ tokens. Overflowing tokens are not processed by any expert and simply pass through the layer's residual connection. Using a capacity factor of $1.0$–$1.25$ keeps overflow below roughly $1\%$ of tokens.
- Dropout Differentiation: While standard layers utilize a dropout rate of $0.1$, the expert FFNs employ a higher dropout rate of $0.4$ during fine-tuning, particularly to mitigate overfitting on small datasets.
These techniques collectively yield stable training dynamics, even for trillion-parameter models.
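The capacity constraint above can be sketched directly from its definition. Both function names are illustrative; the drop-on-overflow policy follows the residual-connection behavior described in the bullet list.

```python
import numpy as np

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    """Capacity = (tokens per batch / number of experts) * capacity factor."""
    return int(np.ceil(tokens_per_batch / num_experts * capacity_factor))

def dispatch_with_capacity(expert_index, num_experts, capacity):
    """Return a boolean mask of tokens that fit within expert capacity.

    Overflow tokens (mask False) are not processed by any expert and
    bypass the layer through the residual connection.
    """
    kept = np.zeros(len(expert_index), dtype=bool)
    counts = np.zeros(num_experts, dtype=int)
    for t, i in enumerate(expert_index):   # tokens processed in batch order
        if counts[i] < capacity:
            counts[i] += 1
            kept[t] = True
    return kept
```

For example, a batch of 1024 tokens routed over 64 experts with capacity factor 1.25 gives each expert a buffer of 20 token slots; a perfectly balanced router would need only 16.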
5. Empirical Model Scaling and Speedup
All models were benchmarked under matched FLOPs/sec and pre-training objectives using 32 TPUv3 cores:
| Model | Params | Pretrain FLOPs/seq | Ex/sec | Speedup vs. Dense Counterpart |
|---|---|---|---|---|
| T5-Base | 223M | 124B | 1600 | – |
| Switch-Base | 3.8B | 124B | 1000 | 7× faster to same perplexity |
| T5-Large | 739M | 425B | 470 | – |
| Switch-Large | 7.4B | 425B | 330 | 2.5× faster |
| T5-XXL | 11B | 6.3T | 200 | – |
| Switch-XXL | 395B | 6.3T | 110 | faster |
| Switch-C (colossal) | 1.6T | 890B (experts only) | 200 | 4× over T5-XXL |
Each Switch model reaches the negative log perplexity of its dense counterpart in substantially fewer training steps and less wall-clock time, despite its much greater parameter count.
6. Multilingual Application and Scaling
The Switch Transformer architecture extends effectively to multilingual learning. Utilizing the mC4 corpus covering 101 languages, mSwitch-Base (128 experts, same FLOPs as mT5-Base) was benchmarked against mT5-Base. After an equal number of pre-training steps, mSwitch-Base attained lower negative log-likelihood (NLL) than mT5-Base in all 101 languages. A detailed analysis shows a mean speedup of 5×, and 91% of languages attained at least 4× faster convergence to the final mT5-Base perplexity.
7. Significance and Scaling Implications
By routing each token to a single expert, the Switch Transformer eliminates the overhead associated with conventional MoEs, including the cost of routing to $k \ge 2$ experts and the elevated per-expert batch size requirements that top-$k$ routing imposes. The simplicity of the top-1 routing rule makes the gating computation trivial relative to the FFN, ensuring the total FLOPs per token are unaffected by increases in expert count. Consequently, Switch Transformers enable orders-of-magnitude growth in parameter count without altering per-token cost. Key points include:
- Robust, low-communication scaling to models with trillions of parameters.
- High computational efficiency: 2–7× faster pre-training at constant FLOPs per token.
- Effective load balancing and stability strategies essential for deep sparse architectures.
A plausible implication is that the Switch Transformer's design introduces a new "parameter axis" of scaling in neural LLMs, permitting vastly increased capacity without prohibitive resource demands (Fedus et al., 2021).