
SwiGLU Activation in Transformer Models

Updated 2 December 2025
  • SwiGLU activation is a gated nonlinearity that combines linear projections with the Swish function to enhance model expressivity in transformer feed-forward networks.
  • It delivers measurable performance gains in dense transformers and audio models by improving gradient propagation and enabling dynamic, input-dependent gating.
  • Its effectively quadratic output scaling, which emerges when the gate and value weights align, destabilizes reduced-precision (FP8) training; modifications like Smooth-SwiGLU mitigate this to ensure stable training.

The SwiGLU (Swish Gated Linear Unit) activation is a neural network nonlinearity widely adopted in large-scale language and audio models for its empirical performance improvements over traditional activations such as GeLU and ReLU. Characterized by its gating structure and use of the Swish nonlinearity, SwiGLU has demonstrated advantages in dense transformer architectures and has been central to both language modeling and cross-domain applications. However, its quadratic scaling behavior introduces distinctive optimization challenges under reduced-precision training, leading to the development of modifications for stable scaling.

1. Mathematical Definition and Core Properties

SwiGLU is a member of the gated linear unit (GLU) family, originally formulated to inject dynamic, input-dependent gating into feed-forward neural network (FFN) blocks. Let $x \in \mathbb{R}^d$ denote the input. As implemented in LLMs and macaron transformers, the SwiGLU output is defined channel-wise as:

$$\mathrm{SwiGLU}_{w_1,w_2}(x) = (x^\top w_1) \cdot \mathrm{Swish}(x^\top w_2)$$

where $\mathrm{Swish}(z) = z \cdot \sigma(z)$ and $\sigma(z) = 1/(1+e^{-z})$.

An equivalent vectorized form, commonly used in transformer FFNs, is
$$\mathrm{FFN}(x) = W_\mathrm{out}\left[\mathrm{Swish}(x W_g) \odot (x W_v)\right]$$
with two separate projections (a "gate" $W_g$ and a "value" $W_v$), followed by an output projection $W_\mathrm{out}$. The product structure renders the activation quadratic when $w_1$ and $w_2$ become aligned, distinguishing SwiGLU from the linear growth of GeLU or ReLU at large input magnitudes (Fishman et al., 19 Sep 2024, Yadav et al., 14 Jul 2025, Zhang et al., 6 Feb 2024).
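
The vectorized form maps directly onto a small module. The following is a minimal PyTorch sketch, assuming bias-free projections and an arbitrary expansion width d_ff; the names SwiGLUFFN, w_gate, w_value, and w_out are illustrative and not taken from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Minimal SwiGLU feed-forward block: W_out [ Swish(x W_g) ⊙ (x W_v) ]."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)    # gate projection W_g
        self.w_value = nn.Linear(d_model, d_ff, bias=False)   # value projection W_v
        self.w_out = nn.Linear(d_ff, d_model, bias=False)     # output projection W_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W_g) multiplied elementwise with (x W_v), then projected by W_out.
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

# Usage: y = SwiGLUFFN(d_model=512, d_ff=1408)(torch.randn(2, 16, 512))
```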

2. Empirical Benefits and Transformer Integration

Empirical studies have motivated widespread replacement of GeLU by SwiGLU in transformer-based large models due to measurable improvements in downstream metrics. In AudioMAE++, incorporation of SwiGLU in macaron-style transformer blocks yields a gain of 3–7 HEAR points on benchmarks relative to traditional GeLU FFNs, outperforming even up-scaled standard baseline architectures. The gating mechanism is posited to enhance gradient propagation and model expressivity by allowing dynamic fusion of two input projections, which appears to bolster sequence modeling and masked reconstruction (Yadav et al., 14 Jul 2025).

A typical usage within a modern residual block is:

  1. Pre-norm and half-residual through a standard FFN (often GeLU).
  2. Pre-norm and full-residual through Multi-Head Attention.
  3. Pre-norm and half-residual through a SwiGLU-activated FFN.
  4. Final layer normalization.

This "dual-FFN" macaron pattern, with SwiGLU as the final point-wise nonlinearity, produces strong scaling performance and robust representations (Yadav et al., 14 Jul 2025).

3. Outlier Amplification and Stability in Low-Precision Training

Under prolonged training with reduced-precision arithmetic (FP8), extended experiments reveal a previously latent path to catastrophic instability in SwiGLU networks. When training LLMs (e.g., Llama2-7B) on trillion-token corpora, models using standard SwiGLU activations in FP8 experience sudden loss divergence after several hundred billion tokens, while higher-precision (BF16) training remains stable.

Mechanistically, persistent $\ell_2$-regularized optimization drives the "linear" and "gate" weight vectors $(w_1, w_2)$ toward mutual alignment over time. This alignment transforms the output scaling from linear to quadratic in $x$, enabling sporadic, extremely large activation "spikes" that overflow FP8's dynamic range. Analytically, the stationary weight correlation satisfies $w_1 \approx \pm w_2$, and the quadratic blowup is empirically confirmed by surging weight norms and correlation, as well as diverging channel activations (Fishman et al., 19 Sep 2024).
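
A small numerical illustration (not from the paper) makes the quadratic blow-up concrete: with aligned weights $w_1 = w_2 = w$, each channel output becomes $(x^\top w)^2\,\sigma(x^\top w)$, which exceeds the FP8 E4M3 maximum finite value of 448 once the pre-activation reaches roughly 21, whereas a linearly growing activation such as GeLU or ReLU would need an input near 448 to do the same.

```python
import torch

def swiglu_channel(x, w1, w2):
    # Channel-wise SwiGLU: (x.w1) * Swish(x.w2)
    z1, z2 = x @ w1, x @ w2
    return z1 * z2 * torch.sigmoid(z2)

torch.manual_seed(0)
w = torch.randn(64)
x = w / w.norm()                          # unit-norm input aligned with w, so x.w = ||w|| ~ 8

for scale in (1, 2, 4):
    z = (scale * x) @ w                                   # pre-activation magnitude
    out = swiglu_channel(scale * x, w, w)                 # aligned weights: w1 = w2 = w
    print(f"pre-activation {z.item():6.1f}   output {out.item():9.1f}   "
          f"overflows E4M3: {abs(out.item()) > 448.0}")
```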

4. Smooth-SwiGLU: Stabilization for Trillion-Token FP8 Training

To remedy the quadratic outlier vulnerability unique to SwiGLU in FP8 regimes, Smooth-SwiGLU introduces a per-channel normalization scheme. Before quantization, the linear branch is rescaled to bound its dynamic range:

$$\mathrm{SmoothSwiGLU}_{\hat{w}_{1,i},\hat{w}_{2,i}}(x) = s_i^{-1}\, Q\!\left[s_i\,(\hat{w}_{1,i}^\top Q(x)) \cdot \mathrm{Swish}(\hat{w}_{2,i}^\top Q(x))\right]$$

where $s_i$ is the channel-wise maximum of the pre-activation $|w_{1,i,j} x_j|$ and $Q(\cdot)$ denotes FP8 quantization.

Rescaling factors are absorbed into adjacent linear layers during inference, ensuring exact functional equivalence to the original SwiGLU and zero additional overhead. Experimental results demonstrate stable convergence up to 2T tokens, full recovery of end-to-end FP8 speed and memory benefits, and parity in zero-shot performance with BF16 baselines (Fishman et al., 19 Sep 2024).
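
A hedged sketch of the per-channel rescaling is given below. The fake_quant_e4m3 helper is a stand-in for a real FP8 quantizer (it only clamps to the E4M3 dynamic range rather than rounding), and the scale is taken here as the reciprocal of the channel-wise maximum of the linear branch so that the scaled product stays representable; the precise scale convention and quantization points may differ from Fishman et al.

```python
import torch
import torch.nn.functional as F

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fake_quant_e4m3(t: torch.Tensor) -> torch.Tensor:
    # Stand-in for Q(.): real FP8 quantization also rounds; here we only clamp.
    return t.clamp(-E4M3_MAX, E4M3_MAX)

def smooth_swiglu(x, w_gate, w_value):
    # x: (tokens, d_model); w_gate, w_value: (d_model, d_ff)
    xq = fake_quant_e4m3(x)
    linear = xq @ w_value                   # "linear" (value) branch, one column per channel i
    gate = F.silu(xq @ w_gate)              # Swish-activated gate branch
    # Per-channel scale: reciprocal of the channel-wise max of the linear branch,
    # so the scaled product stays inside the FP8 dynamic range.
    s = 1.0 / linear.abs().amax(dim=0).clamp_min(1e-6)
    y = fake_quant_e4m3(s * linear * gate)  # quantize the bounded product
    return y / s                            # undo the scale (absorbed into W_out at inference)
```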

5. Sparse Regimes and Hardware-Driven Evaluations

SwiGLU's gating architecture is compatible with sparse inference but exhibits moderate sparsity and hardware affinity compared to alternatives such as ReLU². When using output-magnitude-based sparsity definitions, LLMs with SwiGLU activations achieve ≈80–85% neuron skipping at a Cumulative Error of Tail Truncation (CETT) threshold of 0.2, with <1% drop in end-task accuracy up to ≈75% sparsity.

However, at extreme sparsity (>80%), SwiGLU's accuracy degrades more sharply than ReLU², which supports ≈90–95% sparsity for comparable error and attains superior predictor recall and memory-access patterns. SwiGLU models display a token-level reuse ratio of only ≈7% at 5% average neuron activation, limiting hardware efficiency when deployed on offload/flash architectures. These findings suggest that while SwiGLU remains common in high-accuracy dense transformers, ReLU² or related GLU variants offer superior efficiency in context-sensitive sparse LLM deployment (Zhang et al., 6 Feb 2024).

| Activation | Max Stable Sparsity (CETT = 0.2) | Hardware Reuse Ratio | Accuracy Degradation @ 90% Sparsity |
|---|---|---|---|
| SwiGLU | 80–85% | ~7% | 2–3% |
| ReLU² | 90–95% | ~30% | 0.1–0.5% |
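
To make the output-magnitude-based sparsity definition above concrete, the sketch below zeroes the lowest-magnitude intermediate channels of a SwiGLU FFN per token. The per-token top-k cutoff and the keep_ratio parameter are illustrative stand-ins for a calibrated CETT threshold, not the procedure used in Zhang et al.

```python
import torch
import torch.nn.functional as F

def sparse_swiglu_ffn(x, w_gate, w_value, w_out, keep_ratio=0.2):
    # Gated intermediate activations: Swish(x W_g) ⊙ (x W_v)
    h = F.silu(x @ w_gate) * (x @ w_value)
    # Keep only the top keep_ratio fraction of channels per token by output magnitude.
    k = max(1, int(keep_ratio * h.shape[-1]))
    thresh = h.abs().topk(k, dim=-1).values[..., -1:]     # per-token magnitude cutoff
    h = torch.where(h.abs() >= thresh, h, torch.zeros_like(h))
    sparsity = (h == 0).float().mean().item()             # fraction of skipped neurons
    return h @ w_out, sparsity

# Example: roughly 80% of intermediate channels are skipped per token.
d_model, d_ff = 64, 256
x = torch.randn(8, d_model)
y, sparsity = sparse_swiglu_ffn(x, torch.randn(d_model, d_ff),
                                torch.randn(d_model, d_ff),
                                torch.randn(d_ff, d_model))
print(f"measured sparsity: {sparsity:.2f}")
```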

6. Recommendations and Practical Usage

Best practices, as synthesized from empirical and analytic results, are as follows:

  • For dense, high-precision transformer training, SwiGLU offers strong absolute performance and benefits from dynamic gating.
  • For trillion-token FP8 or similarly aggressive low-precision training scenarios, per-channel scaling as in Smooth-SwiGLU is required to avoid quadratic divergence and ensure full throughput and stability.
  • In sparse inference or on hardware-constrained deployments, ReLU²-style GLU activations are preferable when maximal neuron skipping and hardware locality are priorities.
  • To leverage existing SwiGLU checkpoints for sparse regimes, adaptive activation thresholding and lightweight predictor networks can provide up to 80% sparsity with minimal impact on end-task metrics.

A plausible implication is that, while gated activations such as SwiGLU will persist in dense models prioritizing accuracy, architectural choices for next-generation sparse or low-precision models will likely favor variants with improved robustness to outlier activation and stronger hardware affinity.

7. Cross-Domain Applications and Scalability

SwiGLU has demonstrated favorable scaling characteristics in cross-modal tasks beyond language modeling. In masked audio modeling, AudioMAE++ integrates SwiGLU into macaron transformer blocks, with experimental results showing significant improvements in accuracy across diverse classification and speech benchmarks compared to conventional FFN activations. The general principle of gating via Swish nonlinearity—preserved in both transformer-based LLMs and audio encoders—highlights its broad applicability and the capacity for synergistic gains when combined with advanced architectural patterns (Yadav et al., 14 Jul 2025).

In summary, SwiGLU is a highly effective, gated activation function with proven utility in dense transformer optimization, but requires careful mitigation of its quadratic scaling risks for stable deployment in ultra-large-scale, low-precision training regimes. Modifications such as Smooth-SwiGLU and the competitive performance of alternative GLU variants critically inform activation selection for future large-scale, efficient neural architectures.
