
SwiGLU: Swish-Gated Linear Unit

Updated 3 December 2025
  • SwiGLU is a feed-forward architectural building block that replaces standard FFNs in Transformers using a smooth Swish-based gating mechanism.
  • It combines dual linear projections with element-wise operations to improve convergence and boost performance across tasks like language modeling, audio analysis, and wireless signal processing.
  • Empirical results demonstrate improved optimization and robust downstream performance at parameter counts matched to standard FFNs by reducing the hidden size of the gated block.

The Swish-Gated Linear Unit (SwiGLU) is a feed-forward architectural building block for neural networks, particularly prominent as a substitute for the standard multilayer perceptron (MLP) or feed-forward network (FFN) in Transformer models. It generalizes the original Gated Linear Unit (GLU) by replacing the sigmoid gating function with the non-monotonic, smooth Swish activation. SwiGLU has been empirically demonstrated to deliver improved optimization and downstream performance across diverse modalities, including language, audio, and wireless signal processing, without increasing parameter count or computational cost compared to typical two-layer FFNs (Yadav et al., 14 Jul 2025, Shazeer, 2020, Masrur et al., 14 Jan 2025).

1. Formal Definition and Mathematical Formulation

The SwiGLU block combines two linear projections and applies element-wise gating via the Swish activation function, followed by a final linear recombination:

Let $x \in \mathbb{R}^d$ be the FFN input, $W, V \in \mathbb{R}^{d \times d_{ff}}$ be expansion matrices, and $O \in \mathbb{R}^{d_{ff} \times d}$ be the output projection. The operation is

$$F_{\mathrm{SwiGLU}}(x) = \bigl(\mathrm{Swish}(xW) \odot (xV)\bigr)\, O$$

with $\odot$ denoting the element-wise (Hadamard) product. The Swish activation is

$$\mathrm{Swish}(u) = u \cdot \sigma(u) = \frac{u}{1 + e^{-u}}$$

This construction merges activation and gating into one smoother, non-monotonic function in contrast to prior Gated Linear Unit variants such as:

  • GLU: $\bigl(xW \odot \sigma(xV)\bigr)\, O$
  • GeGLU: $\bigl(\mathrm{GELU}(xW) \odot (xV)\bigr)\, O$ (Shazeer, 2020)

The feed-forward sublayer in Transformer architectures thus replaces the canonical two-layer GELU-FFN (or ReLU-FFN):

$$F_{\mathrm{GELU}}(x) = \mathrm{GELU}(xW_1 + b_1)\, W_2 + b_2$$

with the SwiGLU transformation. An equivalent formulation computes a single joint projection into $2d_{ff}$, splits it, applies Swish to one half, and uses it to gate the other (Shazeer, 2020).
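
As a concrete reference, a minimal PyTorch sketch of the three-matrix form is shown below; module and projection names are illustrative rather than taken from the cited papers, and `torch.nn.functional.silu` supplies the Swish activation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block computing (Swish(x W) ⊙ (x V)) O."""

    def __init__(self, d_model: int, d_ff: int, bias: bool = False):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=bias)  # gate projection
        self.V = nn.Linear(d_model, d_ff, bias=bias)  # value projection
        self.O = nn.Linear(d_ff, d_model, bias=bias)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu(u) = u * sigmoid(u), i.e. the Swish activation
        return self.O(F.silu(self.W(x)) * self.V(x))

# Example: roughly parameter-matched to a d_model=512, hidden-2048 vanilla FFN
ffn = SwiGLUFFN(d_model=512, d_ff=1365)
y = ffn(torch.randn(8, 16, 512))  # (batch, sequence, d_model)
```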

2. Integration with Transformer Architectures

SwiGLU functions as a drop-in alternative to the standard FFN in the Transformer block. In contemporary implementations:

  • AudioMAE++ employs SwiGLU inside macaron-style blocks, replacing the traditional two-layer GELU-FFN with a single SwiGLU FFN (Yadav et al., 14 Jul 2025).
  • The L-SwiGLU Transformer architecture integrates SwiGLU after multi-head attention, paired with RMSNorm, and omits positional embeddings and class tokens, utilizing global average pooling for output aggregation (Masrur et al., 14 Jan 2025).
  • Canonical layernorm, residual connections, and block ordering remain unchanged unless otherwise specified.

A typical encoder block with SwiGLU in the L-SwiGLU variant is structured as:

Input → RMSNorm → MHA + residual → RMSNorm → SwiGLU + residual → Output
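
A hedged sketch of this block ordering, reusing the hypothetical `SwiGLUFFN` module from Section 1 and a hand-rolled RMSNorm; it illustrates the listed structure, not the exact implementation of Masrur et al.:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (no mean centering, no bias)."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class SwiGLUEncoderBlock(nn.Module):
    """Pre-norm block: RMSNorm -> MHA + residual -> RMSNorm -> SwiGLU FFN + residual."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = RMSNorm(d_model)
        self.ffn = SwiGLUFFN(d_model, d_ff)  # SwiGLU FFN sketched in Section 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))
```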

3. Computational Complexity and Parameters

The SwiGLU block uses three learned matrices ($W$, $V$, $O$) versus two in the standard FFN, but with a reduced hidden/intermediate size, typically by a factor of $2/3$, so total parameter count and floating-point operation (FLOP) cost are matched to those of the two-layer FFN (Shazeer, 2020, Masrur et al., 14 Jan 2025). Specifically, for input/output dimension $d$ and hidden size $h$:

| Architecture | Parameter Count | Per-token FLOPs |
| --- | --- | --- |
| Vanilla FFN | $2dh$ | $4dh$ |
| SwiGLU | $3dh_s$ ($h_s \approx 2h/3$) | $6dh_s$ |

This adjustment ensures computational parity and supports direct replacement in existing architectures.
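
A small helper illustrating the $h_s \approx 2h/3$ matching rule; rounding the result up to a hardware-friendly multiple is a common implementation convenience, not something prescribed by the cited papers:

```python
def swiglu_hidden_size(d_model: int, ffn_mult: int = 4, multiple_of: int = 64) -> int:
    """Choose h_s so that 3*d*h_s roughly equals the 2*d*h parameters of a
    vanilla FFN with hidden size h = ffn_mult * d_model, i.e. h_s ≈ 2h/3."""
    h = ffn_mult * d_model
    h_s = (2 * h) // 3
    # Optionally round up to a multiple for efficient matmul shapes.
    return multiple_of * ((h_s + multiple_of - 1) // multiple_of)

# d_model = 512: vanilla hidden 2048 -> h_s ≈ 1365, rounded up to 1408
print(swiglu_hidden_size(512))
```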

4. Empirical Results and Benchmarks

Empirical evaluations consistently show that SwiGLU delivers modest but measurable improvements in both pre-training and fine-tuning settings:

  • In T5-style Transformer models, SwiGLU-based FFNs achieved lower held-out English text log-perplexity than ReLU-, GELU-, and Swish-activated FFN baselines, along with GLUE/SuperGLUE averages of $84.36$/$74.56$ (Shazeer, 2020).
  • In AudioMAE++, swapping the GELU-FFN for SwiGLU in macaron blocks improved the aggregated HEAR score by roughly $3.8$ points, with AudioMAE++ Base reaching $91.8$ versus $63.0$ for the AudioMAE baseline, and consistent improvements across 10 audio classification and speech benchmarks (Yadav et al., 14 Jul 2025).
  • In indoor localization with distributed wireless sensors, the L-SwiGLU ViT yields an 8.51% reduction in 90th-percentile 2D error over a vanilla MLP Transformer (from 0.388 m to 0.355 m) and outperforms a 14.1$\times$ larger vanilla model by 46.13% on the same metric (Masrur et al., 14 Jan 2025).

These findings are robust across task domains (language, audio, signal processing) and model scales.

5. Functional and Theoretical Motivation

SwiGLU’s core theoretical appeal arises from the per-dimension adaptive gating conferred by the GLU structure, with the Swish function yielding several advantages over classical sigmoid or GELU gates:

  • The non-monotonic Swish gate permits non-vanishing gradients for moderately negative inputs, in contrast to saturating sigmoid gates, supporting better flow of information and gradients during backpropagation (Yadav et al., 14 Jul 2025); see the derivative sketch after this list.
  • Per-dimension control: each hidden feature can be selectively amplified or suppressed, yielding more expressive subspace projections, which has proven valuable for reconstructing highly variable spectrogram patches and for filtering informative paths in noisy wireless signal scenarios (Yadav et al., 14 Jul 2025, Masrur et al., 14 Jan 2025).
  • Empirical studies suggest Swish gating improves convergence relative to sigmoid gating or plain ReLU/GELU nonlinearities (Yadav et al., 14 Jul 2025, Shazeer, 2020); no explicit analysis of gradient norms or dynamic range is reported in these works.
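
As a supporting calculation (not drawn from the cited papers), the Swish derivative follows from the product rule and stays nonzero for moderately negative pre-activations:

$$\mathrm{Swish}'(u) = \sigma(u) + u\,\sigma(u)\bigl(1 - \sigma(u)\bigr) = \sigma(u)\bigl(1 + u(1 - \sigma(u))\bigr)$$

For example, $\mathrm{Swish}'(-1) = \sigma(-1)^2 \approx 0.07$, whereas a ReLU-style hard gate contributes exactly zero gradient at the same point.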

6. Implementation Details

SwiGLU transformations typically initialize $W$, $V$, and $O$ using the same Xavier scheme as the surrounding Transformer layers. No auxiliary scaling or per-gate bias parameters are required; all gating and output mixing are subsumed within the three linear transforms. The block admits efficient batched implementation due to its reliance on matrix multiplications and componentwise operations (Yadav et al., 14 Jul 2025, Shazeer, 2020).
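
A hedged sketch of the fused-projection variant mentioned in Section 1, with Xavier initialization as described above (module and attribute names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedSwiGLUFFN(nn.Module):
    """SwiGLU FFN with a single joint projection into 2*d_ff, split into
    gate and value halves; equivalent to stacking W and V into one matrix."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_ff, bias=False)  # [W | V]
        self.out_proj = nn.Linear(d_ff, d_model, bias=False)     # O
        nn.init.xavier_uniform_(self.in_proj.weight)   # Xavier init, matching
        nn.init.xavier_uniform_(self.out_proj.weight)  # the surrounding layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.in_proj(x).chunk(2, dim=-1)  # split joint projection
        return self.out_proj(F.silu(gate) * value)
```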

A direct comparison among gating-unit FFN variants is summarized below:

| Variant | Gating Function | Empirical Outcome |
| --- | --- | --- |
| GLU | Sigmoid | Stronger than ReLU/GELU |
| GeGLU | GELU | Slightly better than GLU |
| SwiGLU | Swish | Equal or better than GeGLU; best on some benchmarks (Shazeer, 2020) |

All variants retain matching parameterization when hidden size is scaled accordingly. Results across language and structured modalities favor SwiGLU in terms of downstream accuracy, data efficiency, and representation power.
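
To make the comparison concrete, the three variants differ only in which activation gates one of the two projections; a generic sketch (illustrative names, not a library API) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Gating activations for the variants in the table above.
GATE_FNS = {
    "glu": torch.sigmoid,  # GLU
    "geglu": F.gelu,       # GeGLU
    "swiglu": F.silu,      # SwiGLU (Swish)
}

class GatedFFN(nn.Module):
    """Generic gated FFN: (act(x W) ⊙ (x V)) O, with a configurable gate."""

    def __init__(self, d_model: int, d_ff: int, variant: str = "swiglu"):
        super().__init__()
        self.act = GATE_FNS[variant]
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.O = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.O(self.act(self.W(x)) * self.V(x))

# Same hidden size gives identical parameter counts across variants.
models = {name: GatedFFN(512, 1365, name) for name in GATE_FNS}
```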
