SwiGLU Activation Function in Transformers
- The SwiGLU activation function combines the Swish nonlinearity with gated linear units to enable flexible, smooth information flow in Transformer feed-forward networks.
- It improves model performance by enhancing gradient propagation and outperforms traditional activations like ReLU and GELU in benchmark evaluations.
- The Smooth-SwiGLU extension mitigates FP8 instability by rescaling activation spikes, ensuring robust low-precision training in large-scale language models.
The SwiGLU (Swish-Gated Linear Unit) activation function is a neural activation mechanism designed for Transformer feed-forward networks, combining the gating principle of Gated Linear Units (GLUs) with the smooth nonlinearity of the Swish function. Its integration into large-scale Transformer LLMs, notably LLaMA and PaLM, has produced consistent improvements over traditional activations such as ReLU and GELU. SwiGLU’s smooth gating not only enables more flexible information flow but also preserves gradient propagation. However, the advent of FP8-precision trillion-token training regimes revealed an instability intrinsic to SwiGLU, prompting the development of a per-channel rescaling fix known as Smooth-SwiGLU.
1. Mathematical Definition
SwiGLU is defined as follows. Let $x \in \mathbb{R}^{d}$ denote the input to a Transformer MLP, and consider two linear projections: $z_1 = xW$ and $z_2 = xV$. Here, $W, V \in \mathbb{R}^{d \times d_{\mathrm{ff}}}$. The Swish function with slope $\beta$ is defined as $\mathrm{Swish}_\beta(z) = z\,\sigma(\beta z)$, where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function. In nearly all practical deployments, $\beta = 1$.
The core SwiGLU nonlinearity is the elementwise product $\mathrm{SwiGLU}(x) = \mathrm{Swish}_1(xW) \odot (xV)$ or, for a single neuron, $\mathrm{swiglu}(z_1, z_2) = \mathrm{Swish}_1(z_1)\,z_2$, where $\odot$ denotes the Hadamard product.
The resulting feed-forward sublayer for a Transformer is

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \big(\mathrm{Swish}_1(xW) \odot xV\big)\,W_2,$$

where $W_2 \in \mathbb{R}^{d_{\mathrm{ff}} \times d}$.
To ensure parameter and FLOP parity with the standard two-matrix FFN ($4d$ hidden dimension), set $d_{\mathrm{ff}} = \tfrac{2}{3}\cdot 4d = \tfrac{8}{3}d$.
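As a concrete reference, the definitions above can be sketched in a few lines of NumPy (function and variable names here are illustrative, not taken from any particular codebase):

```python
import numpy as np

def swish(z, beta=1.0):
    # Swish_beta(z) = z * sigmoid(beta * z)
    return z / (1.0 + np.exp(-beta * z))

def swiglu_ffn(x, W, V, W2):
    # FFN_SwiGLU(x) = (Swish_1(xW) ⊙ xV) W2
    return (swish(x @ W) * (x @ V)) @ W2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # toy sizes for illustration
x = rng.standard_normal(d_model)
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

y = swiglu_ffn(x, W, V, W2)
assert y.shape == (d_model,)
```

Note the three weight matrices ($W$, $V$, $W_2$) versus two in a standard FFN, which is why the hidden dimension is shrunk for parity.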
2. Comparison to Conventional Activations
SwiGLU is structurally related to several FFN activation schemes—namely ReLU, GELU, and GLU—as summarized in the table:
| Activation | Formula | Gating/Nonlinearity |
|---|---|---|
| ReLU | $\max(0, xW_1)$ | Piecewise linear |
| GELU | $\mathrm{GELU}(xW_1)$ | Gaussian CDF |
| GLU | $\sigma(xW) \odot xV$ | Sigmoid gating |
| SwiGLU | $\mathrm{Swish}_1(xW) \odot xV$ | Swish gating |
Unlike GLU, which uses a sigmoid activation for gating, SwiGLU replaces the sigmoid with Swish, yielding a gate with smooth, nonzero derivatives everywhere, thereby mitigating dead zone problems present in hard-saturating gates. SwiGLU retains the architecture’s learnable capacity, while improving expressiveness and gradient flow compared to ReLU and GELU (Shazeer, 2020).
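The claim about nonzero gradients can be checked numerically. The sketch below (helper names are illustrative) evaluates the analytic Swish derivative at several points; unlike ReLU, whose gradient is exactly zero for all negative inputs, Swish has a nonzero derivative almost everywhere:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish_grad(z, beta=1.0):
    # d/dz [z * sigmoid(beta*z)]
    #   = sigmoid(beta*z) + beta*z * sigmoid(beta*z) * (1 - sigmoid(beta*z))
    s = sigmoid(beta * z)
    return s + beta * z * s * (1.0 - s)

# At the origin the Swish gradient is 0.5, and it stays nonzero
# deep into the negative region where ReLU's gradient is exactly 0
assert swish_grad(0.0) == 0.5
assert all(swish_grad(z) != 0.0 for z in [-4.0, -1.0, 1.0, 4.0])
```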
3. Integration into the Transformer Architecture
In Transformer architectures, SwiGLU replaces the standard FFN sublayer:
- Standard FFN: $\mathrm{FFN}(x) = \max(0, xW_1)\,W_2$
- SwiGLU FFN: $\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \big(\mathrm{Swish}_1(xW) \odot xV\big)\,W_2$
A typical dimensional choice for the T5 base model is $d_{\mathrm{model}} = 768$ with $d_{\mathrm{ff}} = 2048$ for the SwiGLU variant, representing a reduction from $d_{\mathrm{ff}} = 3072$ in the standard FFN to ensure the FLOP and parameter count remains unchanged.
The gating mechanism modulates one linear stream by the Swish output of another, introducing smooth, data-dependent control over the information flow (Shazeer, 2020).
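The parity argument can be verified with a few lines of arithmetic (sizes follow the T5-base example; the 2/3 factor is the rule stated in Section 1):

```python
# parameter parity: standard FFN (W1, W2 with hidden 4d) vs SwiGLU (W, V, W2)
d_model = 768                    # T5-base width
d_ff_std = 4 * d_model           # 3072, standard FFN hidden size
d_ff_glu = (2 * d_ff_std) // 3   # 2048, SwiGLU hidden size (2/3 * 4d)

params_std = 2 * d_model * d_ff_std  # two weight matrices
params_glu = 3 * d_model * d_ff_glu  # three weight matrices

assert d_ff_glu == 2048
assert params_std == params_glu  # identical weight counts
```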
4. Empirical Performance and Observed Benefits
SwiGLU delivers consistent improvements across pre-training and downstream benchmark tasks. Key empirical results from (Shazeer, 2020):
- Pre-training log-perplexity (lower is better) at 524K steps:
- ReLU: 1.677
- GELU: 1.679
- GLU: 1.663
- GEGLU: 1.633 (best)
- SwiGLU: 1.636
- GLUE dev set average (higher is better):
- ReLU: 83.80; GELU: 83.86; GLU: 84.20; SwiGLU: 84.36; GEGLU: 84.12
- SuperGLUE dev set average:
- ReLU: 72.76; GELU: 72.98; GLU: 73.95; SwiGLU: 74.56; GEGLU: 73.96
- SQuAD v1.1 dev EM/F1:
- ReLU: 83.18/90.87; SwiGLU: 83.42/91.03
SwiGLU consistently outperforms ReLU and GELU, and performs comparably to GEGLU. On SuperGLUE, SwiGLU produced the single best variant within the set of tested gating activations.
Reported training regimes used Adafactor with inverse square root learning rate schedule and omitted bias terms (aligning with the T5 codebase) (Shazeer, 2020).
5. Numerical Stability and the Outlier Amplification Problem in Low Precision
When extending LLM training into FP8 precision and trillion-token regimes, previously unobserved instabilities arise with SwiGLU (Fishman et al., 2024). These are rooted in an “outlier amplification” effect specific to SwiGLU:
- The quadratic branch can, over long training spans, produce extreme activation spikes.
- These outliers exhaust the limited dynamic range of FP8 quantization, resulting in gradient overflows and catastrophic divergence, manifesting after several hundred billion tokens of training—an effect absent in shorter BF16 or higher-precision runs.
The source of these activation spikes is traced to a weight alignment phenomenon: over long training spans, the gate and linear input weights $w_1, w_2$ of a given channel become increasingly correlated. This aligns the SwiGLU output toward $\mathrm{Swish}_1(w^\top x)\,(w^\top x)$, which for large $|w^\top x|$ behaves quadratically, $\approx (w^\top x)^2$. This effect was empirically validated: the correlation between $w_1$ and $w_2$ approaches unity and channel norms explode prior to loss divergence in FP8 LLMs (Fishman et al., 2024).
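The quadratic behavior is easy to check numerically: since $\mathrm{Swish}(z)\,z / z^2 = \sigma(z)$, the ratio approaches 1 as $z$ grows, so an aligned channel's output grows like $z^2$:

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

# ratio of the aligned-channel output Swish(z)*z to z^2 approaches 1,
# i.e. the channel behaves quadratically for large pre-activations
ratios = [(swish(z) * z) / z**2 for z in [2.0, 8.0, 32.0]]

assert ratios[0] < ratios[1] < ratios[2]   # monotonically approaching 1
assert abs(ratios[-1] - 1.0) < 1e-6
```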
6. Smooth-SwiGLU: Stabilizing FP8 Training
A channel-wise rescaling modification—Smooth-SwiGLU—ensures FP8 stability without altering the effective function:

$$\mathrm{FFN}_{\text{Smooth-SwiGLU}}(x) = Q\big(\big(\mathrm{Swish}_1(xW) \odot xV\big)\,\mathrm{diag}(s)^{-1}\big)\,\big(\mathrm{diag}(s)\,W_2\big),$$

where $Q$ denotes FP8 quantization and $s_j$ is the per-channel maximum absolute value of the quadratic branch, computed on a calibration batch. This technique "squeezes" large activations into the FP8 representable range and restores the original scale after the subsequent linear transformation (Fishman et al., 2024).
Theoretical justification derives from the invariance of the end-to-end mapping; all scaling is internal to the computation graph during training, with no effect on inference.
Empirically, Smooth-SwiGLU enables FP8 models to maintain stability and match BF16 baselines even in multi-trillion-token runs. Throughput improvements of up to 33.5% over BF16 and a 30% reduction in optimizer memory were observed, with zero-shot accuracies and perplexities within 0.3% of their BF16-trained counterparts (Fishman et al., 2024).
7. Practical Implementation Considerations
Key practical notes for deploying SwiGLU-based Transformer FFNs:
- Omit biases on the FFN projections $W$, $V$, and $W_2$ to match common implementations (e.g., T5).
- Maintain constant compute and parameter budgets by reducing hidden size according to $d_{\mathrm{ff}} = \tfrac{2}{3}\cdot 4d$.
- No additional hyperparameters are introduced; standard Swish suffices.
- In some configurations, omitting dropout during pre-training independently improves results.
- SwiGLU introduces one extra matrix multiplication in the first FFN stage, but the reduced hidden size makes overall FLOPs and memory usage equivalent to standard 2-matrix FFNs.
- The smooth gating of Swish produces nonzero gradients at the origin, mitigating “dead ReLU” phenomena.
- No special initialization is required beyond Xavier/Glorot.
With Smooth-SwiGLU, the FP8 failure mode is fully mitigated, and all intended computational advantages of quantized training regimes are unlocked (Shazeer, 2020, Fishman et al., 2024).