
Switch Transformer: Sparse MoE Scaling

Updated 11 February 2026
  • Switch Transformer is a sparsely-activated Mixture-of-Experts architecture that routes each token to a single expert using a top-1 selection rule.
  • It employs a simplified routing mechanism with an auxiliary load-balancing loss to ensure efficient and uniform expert utilization.
  • Empirical evaluations demonstrate up to 7× pre-training speedup and robust multilingual scaling across models with billions to trillions of parameters.

The Switch Transformer is a sparsely-activated Mixture-of-Experts (MoE) variant of the standard Transformer architecture, characterized by a simple top-1 expert routing rule that enables efficient scaling to models with orders of magnitude more parameters—ranging from billions to trillions—without increasing per-token computational cost. By replacing every feed-forward sublayer (FFN) with a Switch-FFN MoE layer, where each token is routed to a single expert, the architecture maintains the same per-token FLOPs as dense Transformers while exploiting massive model sparsity. This design achieves up to 7× improvements in pre-training efficiency under matched computational resources and demonstrates stable training, low inter-device communication overhead, and applicability to multilingual pre-training across over 100 languages (Fedus et al., 2021).

1. Architectural Design and Main Components

The overall structure of the Switch Transformer mirrors that of a standard Transformer, with the exception that every FFN sublayer is replaced by a sparse Switch-FFN. The canonical Transformer block comprises multi-head self-attention, residual connections with layer normalization, a dense FFN (typically realized as $h = \mathrm{ReLU}(x W_1 + b_1) W_2 + b_2$), and a final residual/normalization step. In the Switch Transformer, the FFN is replaced with a layer composed of $E$ experts. Each token embedding $x \in \mathbb{R}^d$ is routed to exactly one expert for processing:

$$y = E_{k(x)}(x)$$

where $k(x) \in \{1, \ldots, E\}$ indicates the selected expert for token $x$. Each expert maintains its own parameters (including $W_1^i, W_2^i$), which are physically partitioned across devices, permitting the parameter count to grow (by increasing $E$) without increasing per-device computation or memory consumption.
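As a concrete sketch of this layer (a minimal NumPy illustration, not the reference implementation; the function name and the per-expert weight tuples are assumptions made for clarity), top-1 routing over $E$ expert FFNs can be written as:

```python
import numpy as np

def switch_ffn(x, W_r, experts):
    """Top-1 routed Switch-FFN layer (illustrative sketch).

    x:       (n_tokens, d) token embeddings
    W_r:     (d, E) router projection
    experts: list of E tuples (W1, b1, W2, b2), one ReLU FFN per expert
    """
    logits = x @ W_r                                      # (n_tokens, E)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax over experts
    k = probs.argmax(axis=-1)                             # chosen expert per token
    y = np.empty_like(x)
    for i, (W1, b1, W2, b2) in enumerate(experts):
        mask = k == i
        h = np.maximum(x[mask] @ W1 + b1, 0.0)            # expert FFN (ReLU)
        # scale by the router probability so the router receives gradient signal
        y[mask] = (h @ W2 + b2) * probs[mask, i:i + 1]
    return y, k, probs
```

Only one expert's weights touch each token, so the per-token FFN cost is independent of the number of experts.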

2. Routing Mechanism and Load Balancing

Switch Transformers employ a simplified routing algorithm, distinguishing them from earlier MoE approaches. The router computes the logits for each expert by projecting the token:

$$h(x) = W_r x \in \mathbb{R}^E$$

A softmax converts these logits to probabilities $p_i(x)$ over the $E$ experts:

$$p_i(x) = \frac{\exp(h_i)}{\sum_{j=1}^{E} \exp(h_j)}$$

For each token, only the expert with the highest score is chosen:

$$k(x) = \arg\max_i p_i(x)$$

The forward pass routes $x$ to expert $E_{k(x)}$. To prevent overloaded experts and maintain uniform utilization, a differentiable load-balancing auxiliary loss $L_\mathrm{aux}$ is introduced:

$$L_\mathrm{aux} = \alpha E \sum_{i=1}^{E} f_i P_i$$

Here, $f_i$ represents the fraction of tokens assigned to expert $i$ in a batch, $P_i$ is the mean router probability for expert $i$ (the expected fraction under the softmax), and $\alpha \approx 10^{-2}$. Minimizing this loss pushes both the actual and expected token distributions over experts toward uniformity.
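The loss above is straightforward to compute from the router outputs. A minimal NumPy sketch (the function name is an assumption; only the formula is from the text):

```python
import numpy as np

def load_balancing_loss(probs, k, alpha=1e-2):
    """Auxiliary loss L_aux = alpha * E * sum_i f_i * P_i.

    probs: (n_tokens, E) router softmax probabilities
    k:     (n_tokens,) index of the expert selected for each token
    """
    n, E = probs.shape
    f = np.bincount(k, minlength=E) / n   # f_i: actual fraction of tokens per expert
    P = probs.mean(axis=0)                # P_i: expected fraction under the softmax
    return alpha * E * np.sum(f * P)
```

When both distributions are uniform ($f_i = P_i = 1/E$), the loss attains its minimum value $\alpha$; any imbalance raises it, which is what drives the router toward even utilization.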

3. Computational and Communication Analysis

A comparison between dense and Switch Transformer computational profiles is summarized below:

| Model Type | FLOPs per token (FFN sublayer) | Communication overhead |
| --- | --- | --- |
| Dense (T5-like) | $2 d \cdot d_\mathrm{ff}$ | single all-reduce of $O(\mathrm{batch} \cdot d)$ |
| Switch Transformer | $2 d \cdot d_\mathrm{ff}$ (one expert) $+\ O(E \cdot d)$ gating | two all-to-alls of $(\mathrm{batch}/E) \times d$ |

Here $d_\mathrm{ff} \approx 4d$ in T5. Gating costs $O(E \cdot d)$ FLOPs but is negligible relative to the FFN computation. In practice, empirical evaluation on 32 TPUv3 cores indicates that Switch Transformer models add only ~10–20% communication overhead while delivering 2–7× real-world speedups over FLOP-matched dense models. The sparse activation property ensures that FLOPs per token remain constant as $E$ increases, unlocking a new scaling axis for model size.
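A quick arithmetic check makes the point concrete. Using T5-Base-like dimensions as an illustrative assumption ($d = 768$, $d_\mathrm{ff} = 3072$, $E = 128$):

```python
# FLOPs-per-token arithmetic for the FFN sublayer.
# d = 768, d_ff = 3072 are T5-Base-like illustrative values.
d, d_ff, E = 768, 3072, 128

dense_flops = 2 * d * d_ff          # two matmuls: d -> d_ff and d_ff -> d
switch_expert_flops = 2 * d * d_ff  # the single selected expert is a full FFN
router_flops = E * d                # top-1 gating projection

print(dense_flops)                  # 4718592
print(switch_expert_flops)          # 4718592: identical per-token FFN cost
print(router_flops / dense_flops)   # ~0.02: gating is negligible
```

The per-token cost is identical regardless of $E$; only the router's small $E \cdot d$ projection grows with the expert count.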

4. Training Stability and Implementation Strategies

The Switch Transformer incorporates several strategies to address MoE training instabilities:

  • Selective Precision: Computations and storage primarily use bfloat16 for efficiency, except router logits, which are locally cast to float32 before the softmax for stable probability computation.
  • Downscaled Initialization: All weights are initialized from a truncated normal $W_{jk} \sim \mathrm{TruncNormal}(0, \sigma^2)$ with $\sigma = \sqrt{s/\mathrm{fan_{in}}}$ and $s = 0.1$ (i.e., one-tenth of the typical scale), significantly reducing early-stage variance.
  • Expert Capacity Constraint: Each expert can process up to $C = (\mathrm{batch}/E) \times \mathrm{capacity\_factor}$ tokens. Tokens that overflow an expert's capacity are not processed by any expert and pass through via the residual connection. Using $\mathrm{capacity\_factor} \approx 1.0$–$1.25$ keeps overflow below 1%.
  • Dropout Differentiation: While standard layers utilize a dropout rate of ~0.1, the expert FFNs employ a higher rate of ~0.4 during fine-tuning, particularly to mitigate overfitting on small datasets.

These techniques collectively yield stable training dynamics, even for trillion-parameter models.
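The capacity constraint above can be sketched as follows (a simplified sequential-priority version for illustration; the function name is an assumption, and real implementations dispatch tokens in parallel into fixed-size per-expert buffers):

```python
import numpy as np

def dispatch_with_capacity(k, n_tokens, n_experts, capacity_factor=1.25):
    """Drop tokens that exceed each expert's capacity C.

    k: (n_tokens,) expert index chosen for each token.
    Returns (keep, capacity): keep[t] is True if token t is processed by its
    expert; dropped (overflow) tokens pass through the residual connection.
    """
    capacity = int(capacity_factor * n_tokens / n_experts)
    keep = np.zeros(n_tokens, dtype=bool)
    count = np.zeros(n_experts, dtype=int)
    for t in range(n_tokens):          # tokens claim expert slots in order
        e = k[t]
        if count[e] < capacity:
            keep[t] = True
            count[e] += 1
    return keep, capacity
```

With a well-balanced router, almost every token finds a free slot, which is why a modest capacity factor of 1.0–1.25 keeps the drop rate below 1%.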

5. Empirical Model Scaling and Speedup

All models were benchmarked under matched FLOPs/sec and pre-training objectives using 32 TPUv3 cores:

| Model | Params | Pre-train FLOPs/seq | Examples/sec | Speedup vs. dense counterpart |
| --- | --- | --- | --- | --- |
| T5-Base | 223M | 124B | 1600 | — |
| Switch-Base | 3.8B | 124B | 1000 | ~7× faster to same perplexity |
| T5-Large | 739M | 425B | 470 | — |
| Switch-Large | 7.4B | 425B | 330 | ~2.5× faster |
| T5-XXL | 11B | 6.3T | 200 | — |
| Switch-XXL | 395B | 6.3T | 110 | ~4× faster |
| Switch-C (colossal) | 1.6T | 890B (experts only) | 200 | ~4× over T5-XXL |

Each Switch model reaches the negative log perplexity of its dense counterpart in substantially less wall-clock time, leveraging a far larger parameter count at the same FLOPs per token.

6. Multilingual Application and Scaling

The Switch Transformer architecture extends effectively to multilingual learning. Utilizing the mC4 corpus covering 101 languages (107 tasks), mSwitch-Base (128 experts, same FLOPs as mT5-Base) was benchmarked against mT5-Base. After $10^6$ training steps, mSwitch-Base attained lower negative log-likelihood (NLL) than mT5-Base in all 101 languages, with a mean speedup of 5×; 91% of languages reached the final mT5-Base perplexity at least 4× faster.

7. Significance and Scaling Implications

By routing each token to a single expert, the Switch Transformer eliminates the overhead associated with conventional MoEs, which route each token to $k \geq 2$ experts and therefore incur extra computation and larger per-expert batch requirements. The simplicity of the top-1 routing rule makes the gating computation trivial relative to the FFN, ensuring the total FLOPs per token are unaffected by increases in expert count. Consequently, Switch Transformers enable a 10×–100× growth in parameter count without altering per-token costs. Key points include:

  • Robust, low-communication scaling to models with trillions of parameters.
  • High computational efficiency: 2–7× faster pre-training at constant FLOPs per token.
  • Effective load balancing and stability strategies essential for deep sparse architectures.

A plausible implication is that the Switch Transformer's design introduces a new "parameter axis" of scaling in neural LLMs, permitting vastly increased capacity without prohibitive resource demands (Fedus et al., 2021).

References

  1. Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
