
SwiGLU Activation Function in Transformers

Updated 24 February 2026
  • SwiGLU activation function is a neural mechanism that combines the Swish nonlinearity with gated linear units to enable flexible, smooth information flow in Transformer feed-forward networks.
  • It improves model performance by enhancing gradient propagation and outperforms traditional activations like ReLU and GELU in benchmark evaluations.
  • The Smooth-SwiGLU extension mitigates FP8 instability by rescaling activation spikes, ensuring robust low-precision training in large-scale language models.

The SwiGLU (Swish-Gated Linear Unit) activation function is a neural activation mechanism designed for Transformer feed-forward networks, combining the gating principle of Gated Linear Units (GLUs) with the smooth nonlinearity of the Swish function. Its integration into large-scale Transformer LLMs, notably LLaMA and PaLM, has produced consistent improvements over traditional activations such as ReLU and GELU. SwiGLU’s smooth gating not only enables more flexible information flow but also preserves gradient propagation. However, the advent of FP8-precision trillion-token training regimes revealed an instability intrinsic to SwiGLU, prompting the development of a per-channel rescaling fix known as Smooth-SwiGLU.

1. Mathematical Definition

SwiGLU is defined as follows. Let $x \in \mathbb{R}^{d_{\mathrm{model}}}$ denote the input to a Transformer MLP, and consider two projections

$$z_1 = xW \in \mathbb{R}^{d'_{\mathrm{ff}}}, \qquad z_2 = xV \in \mathbb{R}^{d'_{\mathrm{ff}}}$$

where $W, V \in \mathbb{R}^{d_{\mathrm{model}} \times d'_{\mathrm{ff}}}$. The Swish function with slope $\beta$ is

$$\mathrm{Swish}_\beta(u) = u \cdot \sigma(\beta u)$$

where $\sigma(\cdot)$ is the sigmoid function. In nearly all practical deployments, $\beta = 1$.

The core SwiGLU nonlinearity is the elementwise product

$$\mathrm{SwiGLU}(x) = \mathrm{Swish}_1(xW) \otimes (xV)$$

or, for a single neuron,

$$\mathrm{SwiGLU}_{w_1, w_2}(x) = (x^\top w_1) \cdot \mathrm{Swish}(x^\top w_2) = (x^\top w_1)(x^\top w_2)\,\sigma(x^\top w_2)$$

where $\otimes$ denotes the Hadamard (elementwise) product.

The resulting feed-forward sublayer for a Transformer is

$$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \left( \mathrm{Swish}_1(xW) \otimes xV \right) W_2$$

where $W_2 \in \mathbb{R}^{d'_{\mathrm{ff}} \times d_{\mathrm{model}}}$.

To ensure parameter and FLOP parity with the standard two-matrix FFN (hidden dimension $d_{\mathrm{ff}}$), set $d'_{\mathrm{ff}} = \lfloor \tfrac{2}{3} d_{\mathrm{ff}} \rfloor$.
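The definitions above can be sketched in a few lines of NumPy. The dimensions and variable names here are illustrative assumptions, not taken from the source:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def swish(u, beta=1.0):
    # Swish_beta(u) = u * sigmoid(beta * u); beta = 1 in practice
    return u * sigmoid(beta * u)

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward sublayer: (Swish_1(xW) ⊗ xV) W2."""
    return (swish(x @ W) * (x @ V)) @ W2

# Toy dimensions chosen for illustration only
d_model, d_ff = 8, 32
d_ff_prime = int(2 * d_ff / 3)  # parity rule: d'_ff = floor(2/3 * d_ff)

rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
W = rng.standard_normal((d_model, d_ff_prime))
V = rng.standard_normal((d_model, d_ff_prime))
W2 = rng.standard_normal((d_ff_prime, d_model))

y = swiglu_ffn(x, W, V, W2)  # output has shape (d_model,)
```

Note that the `x @ V` stream passes through unchanged; only the `x @ W` stream is warped by Swish before the elementwise gate.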

2. Comparison to Conventional Activations

SwiGLU is structurally related to several FFN activation schemes—namely ReLU, GELU, and GLU—as summarized in the table:

Activation | Formula | Gating/Nonlinearity
ReLU | $\max(xW_1, 0)\,W_2$ | Piecewise linear
GELU | $(xW_1)\,\Phi(xW_1)\,W_2$ | Gaussian CDF $\Phi$
GLU | $[\sigma(xW) \otimes xV]\,W_2$ | Sigmoid gating
SwiGLU | $[\mathrm{Swish}_1(xW) \otimes xV]\,W_2$ | Swish gating

Unlike GLU, which gates with a sigmoid, SwiGLU uses Swish as the gate nonlinearity, yielding a gate with smooth, nonzero derivatives everywhere and mitigating the dead-zone problems of saturating gates. SwiGLU retains the architecture's learnable capacity while improving expressiveness and gradient flow relative to ReLU and GELU (Shazeer, 2020).
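The gradient claim can be checked numerically: the analytic derivative of Swish is nonzero even for negative inputs, where ReLU's gradient vanishes. A small illustrative sketch:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def swish_grad(u):
    # d/du [u * sigmoid(u)] = sigmoid(u) * (1 + u * (1 - sigmoid(u)))
    return sigmoid(u) * (1.0 + u * (1.0 - sigmoid(u)))

def relu_grad(u):
    return float(u > 0)

u = -2.0
print(swish_grad(u))  # small but nonzero: gradient still flows
print(relu_grad(u))   # exactly zero: a "dead" region
```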

3. Integration into the Transformer Architecture

In Transformer architectures, SwiGLU replaces the standard FFN sublayer:

  • Standard FFN: $x \rightarrow \text{Linear} \rightarrow \text{Activation} \rightarrow \text{Linear} \rightarrow$ output
  • SwiGLU FFN: $x \rightarrow \{W, V \text{ projections in parallel}\} \rightarrow \{\text{Swish}, \text{Identity}\} \rightarrow \text{Hadamard product} \rightarrow W_2 \rightarrow$ output

A typical dimensional choice for the T5-base model is $x \in \mathbb{R}^{768}$, $d'_{\mathrm{ff}} = 2048$, $W, V: 768 \rightarrow 2048$, $W_2: 2048 \rightarrow 768$, a reduction from $d_{\mathrm{ff}} = 3072$ in the standard FFN so that the FLOP and parameter counts remain unchanged.
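The parity claim can be verified with simple arithmetic, using the T5-base numbers above:

```python
d_model, d_ff = 768, 3072      # standard T5-base FFN width
d_ff_prime = 2 * d_ff // 3     # SwiGLU hidden size: floor(2/3 * d_ff)

standard_params = 2 * d_model * d_ff       # W1: d_model×d_ff, W2: d_ff×d_model
swiglu_params = 3 * d_model * d_ff_prime   # W, V: d_model×d'_ff, W2: d'_ff×d_model

print(d_ff_prime, standard_params == swiglu_params)
```

Three matrices of width 2048 cost exactly as many parameters as two matrices of width 3072.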

The gating mechanism modulates one linear stream by the Swish output of another, introducing smooth, data-dependent control over the information flow (Shazeer, 2020).

4. Empirical Performance and Observed Benefits

SwiGLU delivers consistent improvements across pre-training and downstream benchmark tasks. Key empirical results from (Shazeer, 2020):

  • Pre-training heldout-set log-perplexity (lower is better) at 524K steps:
    • ReLU: 1.677
    • GELU: 1.679
    • GLU: 1.663
    • GEGLU: 1.633 (best)
    • SwiGLU: 1.636
  • GLUE dev set average (higher is better):
    • ReLU: 83.80; GELU: 83.86; GLU: 84.20; SwiGLU: 84.36; GEGLU: 84.12
  • SuperGLUE dev set average:
    • ReLU: 72.76; GELU: 72.98; GLU: 73.95; SwiGLU: 74.56; GEGLU: 73.96
  • SQuAD v1.1 dev EM/F1:
    • ReLU: 83.18/90.87; SwiGLU: 83.42/91.03

SwiGLU consistently outperforms ReLU and GELU and performs comparably to GEGLU. On SuperGLUE, SwiGLU was the best-performing variant among the tested gating activations.

Reported training regimes used Adafactor with inverse square root learning rate schedule and omitted bias terms (aligning with the T5 codebase) (Shazeer, 2020).

5. Numerical Stability and the Outlier Amplification Problem in Low Precision

When extending LLM training into FP8 precision and trillion-token regimes, previously unobserved instabilities arise with SwiGLU (Fishman et al., 2024). These are rooted in an “outlier amplification” effect specific to SwiGLU:

  • The quadratic branch $(x^\top w_1)(x^\top w_2)$ can, over long training spans, produce extreme activation spikes.
  • These outliers exhaust the limited dynamic range of FP8 quantization, resulting in gradient overflows and catastrophic divergence, manifesting after several hundred billion tokens of training—an effect absent in shorter BF16 or higher-precision runs.

The source of these activation spikes is traced to a weight-alignment phenomenon: with sufficient training under $\ell_2$ regularization, $w_1 \to \pm w_2$. This drives the SwiGLU output toward $(x^\top w)^2\,\sigma(x^\top w)$, which behaves quadratically for large $|x^\top w|$. The effect was validated empirically: the correlation $\operatorname{corr}(w_1, w_2)$ approaches unity and channel norms explode prior to loss divergence in FP8 LLMs (Fishman et al., 2024).
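To see why alignment matters for FP8, note that $(x^\top w)^2\,\sigma(x^\top w)$ grows quadratically for positive pre-activations, so even a modest spike exceeds the largest finite value of the common E4M3 FP8 format. The format constant below is an assumption for illustration, not stated in the source:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def aligned_swiglu(u):
    # When w1 aligns with w2, the SwiGLU output tends toward u^2 * sigmoid(u),
    # which is approximately u^2 for large positive u.
    return u**2 * sigmoid(u)

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 FP8 format (assumed)

u = 25.0                       # a modest pre-activation spike
spike = aligned_swiglu(u)
print(spike > FP8_E4M3_MAX)    # the quadratic output already overflows FP8 range
```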

6. Smooth-SwiGLU: Stabilizing FP8 Training

A channel-wise rescaling modification—Smooth-SwiGLU—ensures FP8 stability without altering the effective function:

$$\mathrm{SmoothSwiGLU}_{\hat w_{1,i}, \hat w_{2,i}}(x) = s_i^{-1}\, Q\!\left( s_i\, (\hat w_{1,i}^{\top} Q(x))\, \mathrm{Swish}(\hat w_{2,i}^{\top} Q(x)) \right)$$

where $Q(\cdot)$ denotes FP8 quantization and $s_i$ is a per-channel scaling factor, computed from the maximum absolute value of the quadratic branch on a calibration batch so that the scaled activations fit the FP8 representable range. This technique "squeezes" large activations into range and restores the original scale after the subsequent linear transformation (Fishman et al., 2024).

Theoretical justification derives from the invariance of the end-to-end mapping; all scaling is internal to the computation graph during training, with no effect on inference.
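The invariance argument can be illustrated with a toy version of the per-channel rescaling, using a saturating clip as a crude stand-in for $Q(\cdot)$. The quantizer, scale rule, and values here are illustrative assumptions, not the implementation of Fishman et al.:

```python
import numpy as np

FP8_MAX = 448.0  # assumed E4M3 FP8 dynamic range

def fake_quant(u):
    # Toy stand-in for Q(.): saturating clip (real FP8 also rounds mantissas)
    return np.clip(u, -FP8_MAX, FP8_MAX)

def smooth_rescale(act):
    """Rescale each channel so the quadratic-branch output fits the FP8 range,
    then restore the original scale, leaving the mapping unchanged."""
    s = FP8_MAX / np.max(np.abs(act), axis=0)  # per-channel scale (calibration)
    return (1.0 / s) * fake_quant(s * act)

acts = np.array([[600.0, 3.0],
                 [-1200.0, -2.0]])   # channel 0 spikes past FP8_MAX
naive = fake_quant(acts)             # saturates channel 0: information lost
smoothed = smooth_rescale(acts)      # round-trips exactly under this toy Q
```

Under this toy quantizer the rescaled path reproduces the original activations exactly, while the naive path clips the spiking channel.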

Empirically, Smooth-SwiGLU enables FP8 models to maintain stability and match BF16 baselines even in multi-trillion-token runs. Throughput improvements of up to 33.5% over BF16 and a 30% reduction in optimizer memory were observed, with zero-shot accuracies and perplexities within 0.3% of their BF16-trained counterparts (Fishman et al., 2024).

7. Practical Implementation Considerations

Key practical notes for deploying SwiGLU-based Transformer FFNs:

  • Omit biases on $W, V$ to match common implementations (e.g., T5).
  • Maintain constant compute and parameter budgets by reducing the hidden size to $d'_{\mathrm{ff}} = \lfloor \tfrac{2}{3} d_{\mathrm{ff}} \rfloor$.
  • No additional hyperparameters are introduced; the standard Swish slope $\beta = 1$ suffices.
  • In some configurations, pre-training without dropout yields the best results.
  • SwiGLU introduces one extra matrix multiplication in the first FFN stage, but the reduced hidden size makes overall FLOPs and memory usage equivalent to standard 2-matrix FFNs.
  • The smooth gating of Swish produces nonzero gradients at the origin, mitigating “dead ReLU” phenomena.
  • No special initialization is required beyond Xavier/Glorot.

With Smooth-SwiGLU, the FP8 failure mode is fully mitigated, and all intended computational advantages of quantized training regimes are unlocked (Shazeer, 2020, Fishman et al., 2024).

References

 1. Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv:2002.05202.
 2. Fishman, M., Chmiel, B., Banner, R., and Soudry, D. (2024). "Scaling FP8 training to trillion-token LLMs." arXiv:2409.12517.
