SwiGLU: Swish-Gated Linear Unit
- SwiGLU is a feed-forward architectural building block that replaces standard FFNs in Transformers using a smooth Swish-based gating mechanism.
- It gates one linear projection with a Swish-activated second projection via an element-wise product, improving convergence and performance across tasks such as language modeling, audio analysis, and wireless signal processing.
- Parameter counts are matched to standard FFNs by reducing the hidden size, so the reported gains in optimization and downstream performance come without extra parameter or FLOP cost.
The Swish-Gated Linear Unit (SwiGLU) is a feed-forward architectural building block for neural networks, particularly prominent as a substitute for the standard multilayer perceptron (MLP) or FFN in Transformer models. It generalizes the original Gated Linear Unit (GLU) by replacing the sigmoid gating function with the non-monotonic, smooth Swish activation. SwiGLU has been empirically demonstrated to deliver improved optimization and downstream performance across diverse modalities including language, audio, and wireless signal processing, without increasing parameter or computational complexity compared to typical two-layer FFNs (Yadav et al., 14 Jul 2025, Shazeer, 2020, Masrur et al., 14 Jan 2025).
1. Formal Definition and Mathematical Formulation
The SwiGLU block combines two linear projections and applies element-wise gating via the Swish activation function, followed by a final linear recombination:
Let $x \in \mathbb{R}^{d}$ be the FFN input, $W, V \in \mathbb{R}^{d \times h}$ be the expansion matrices, and $W_2 \in \mathbb{R}^{h \times d}$ be the output projection. The operation is
$$\mathrm{SwiGLU}(x) = \bigl(\mathrm{Swish}(xW) \odot xV\bigr)\, W_2,$$
with $\odot$ denoting the element-wise (Hadamard) product. The Swish activation is
$$\mathrm{Swish}_{\beta}(z) = z \cdot \sigma(\beta z),$$
used with $\beta = 1$ (i.e., SiLU) in (Shazeer, 2020).
This construction merges activation and gating into one smoother, non-monotonic function in contrast to prior Gated Linear Unit variants such as:
- GLU: $\mathrm{GLU}(x) = \sigma(xW) \odot xV$
- GeGLU: $\mathrm{GeGLU}(x) = \mathrm{GELU}(xW) \odot xV$ (Shazeer, 2020)
The feed-forward sublayer in Transformer architectures thus replaces the canonical two-layer GELU-FFN (or ReLU-FFN)
$$\mathrm{FFN}(x) = \mathrm{GELU}(xW_1)\, W_2$$
with the SwiGLU transformation. An equivalent formulation computes a single joint projection into $\mathbb{R}^{2h}$, splits it into two halves, applies Swish to one half, and uses the result to gate the other element-wise (Shazeer, 2020).
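A minimal PyTorch sketch of both formulations is given below; the module names `SwiGLUFFN` and `SwiGLUFFNFused` and the bias-free layout are illustrative choices, not taken from the cited implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: (Swish(x W) ⊙ x V) W2, biases omitted."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)   # gate branch W
        self.v = nn.Linear(d_model, d_hidden, bias=False)   # value branch V
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish with beta = 1 is SiLU; it gates the second projection element-wise.
        return self.w2(F.silu(self.w(x)) * self.v(x))

class SwiGLUFFNFused(nn.Module):
    """Equivalent formulation: one projection into 2h, split, gate one half with the other."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.wv = nn.Linear(d_model, 2 * d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.wv(x).chunk(2, dim=-1)
        return self.w2(F.silu(gate) * value)
```

Both modules compute the same function; the fused variant simply performs the two expansion matmuls as one wider projection.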
2. Integration with Transformer Architectures
SwiGLU functions as a drop-in alternative to the standard FFN in the Transformer block. In contemporary implementations:
- AudioMAE++ employs SwiGLU inside macaron-style blocks, replacing the traditional two-layer GELU-FFN with a single SwiGLU FFN (Yadav et al., 14 Jul 2025).
- The L-SwiGLU Transformer architecture integrates SwiGLU after multi-head attention, paired with RMSNorm, and omits positional embeddings and class tokens, utilizing global average pooling for output aggregation (Masrur et al., 14 Jan 2025).
- Canonical layernorm, residual connections, and block ordering remain unchanged unless otherwise specified.
A typical encoder block with SwiGLU in the L-SwiGLU variant is structured as:
Input → RMSNorm → MHA → residual → RMSNorm → SwiGLU FFN → residual → Output
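A sketch of this block ordering is shown below, assuming a pre-norm arrangement and PyTorch ≥ 2.4 (for `nn.RMSNorm`); the class name `SwiGLUEncoderBlock` and all dimensions are illustrative rather than the configuration used in (Masrur et al., 14 Jan 2025).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUEncoderBlock(nn.Module):
    """Pre-norm encoder block: Input -> RMSNorm -> MHA -> residual -> RMSNorm -> SwiGLU FFN -> residual."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_hidden: int = 512):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.RMSNorm(d_model)
        # SwiGLU FFN weights (W, V, W2), biases omitted.
        self.w = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                                 # first residual connection
        h = self.norm2(x)
        x = x + self.w2(F.silu(self.w(h)) * self.v(h))   # SwiGLU FFN + second residual
        return x

# Usage: tokens of shape (batch, sequence, d_model)
y = SwiGLUEncoderBlock()(torch.randn(8, 100, 256))
```

Following the L-SwiGLU description, positional embeddings and a class token would be omitted, and the block outputs aggregated by global average pooling before the task head.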
3. Computational Complexity and Parameters
The SwiGLU block uses three learned matrices ($W$, $V$, $W_2$) versus two in the standard FFN, but with a reduced hidden/intermediate size, typically scaled by a factor of $2/3$, so that total parameter count and floating-point operation (FLOP) cost are matched to those of the two-layer FFN (Shazeer, 2020, Masrur et al., 14 Jan 2025). Specifically, for input/output dimension $d$, baseline hidden size $h$, and reduced SwiGLU hidden size $h' = \tfrac{2}{3}h$:
| Architecture | Parameter Count | Per-token FLOPs |
|---|---|---|
| Vanilla FFN | $2dh$ | $4dh$ |
| SwiGLU | $3dh'$ ($= 2dh$ for $h' = \tfrac{2}{3}h$) | $6dh'$ ($= 4dh$) |
This adjustment ensures computational parity and supports direct replacement in existing architectures.
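The parameter-matching arithmetic can be verified in a few lines; the helper functions below are purely illustrative and count bias-free weights only.

```python
def ffn_params(d: int, h: int) -> int:
    """Vanilla two-layer FFN: a d x h expansion plus an h x d projection (no biases)."""
    return 2 * d * h

def swiglu_params(d: int, h_prime: int) -> int:
    """SwiGLU FFN: W and V (each d x h') plus W2 (h' x d), no biases."""
    return 3 * d * h_prime

d, h = 768, 3072                  # e.g., a BERT-base-sized FFN (illustrative)
h_prime = (2 * h) // 3            # reduced hidden size h' = 2h/3 -> 2048
print(ffn_params(d, h))           # 4718592
print(swiglu_params(d, h_prime))  # 4718592 -> parameter counts match
```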
4. Empirical Results and Benchmarks
Empirical evaluations consistently show that SwiGLU delivers modest but measurable improvements in both pre-training and fine-tuning settings:
- In T5-style Transformer models, the SwiGLU FFN reached a held-out English text log-perplexity of $1.944$, comparable to GeGLU and better than the ReLU-, GELU-, and Swish-activated FFN baselines, with GLUE/SuperGLUE average scores of $84.36/74.56$ after fine-tuning (Shazeer, 2020).
- In AudioMAE++, swapping the GELU-FFN for SwiGLU in macaron blocks improved the aggregated HEAR score (e.g., $91.8$ for AudioMAE++ Base vs. $63.0$ for the AudioMAE baseline), with consistent gains across 10 audio classification and speech benchmarks (Yadav et al., 14 Jul 2025).
- In indoor localization with distributed wireless sensors, the L-SwiGLU ViT yields an 8.51% reduction in 90th-percentile 2D error relative to a vanilla MLP-FFN Transformer (from 0.388 m to 0.355 m) and outperforms a 14.1× larger vanilla model by 46.13% on the same metric (Masrur et al., 14 Jan 2025).
These findings are robust across task domains (language, audio, signal processing) and model scales.
5. Functional and Theoretical Motivation
SwiGLU’s core theoretical appeal arises from the per-dimension adaptive gating conferred by the GLU structure, with the Swish function yielding several advantages over classical sigmoid or GELU gates:
- The non-monotonic Swish gate permits non-vanishing gradients for negative inputs, unlike sigmoid, supporting better flow of information and gradient during backpropagation (Yadav et al., 14 Jul 2025).
- Per-dimension control: each hidden feature can be selectively amplified or suppressed, yielding more expressive subspace projections; this has proven valuable for reconstructing highly variable spectrogram patches and for filtering informative propagation paths in noisy wireless signal scenarios (Yadav et al., 14 Jul 2025, Masrur et al., 14 Jan 2025).
- Empirical studies suggest Swish gating improves convergence relative to sigmoid gates or plain ReLU/GELU nonlinearities (Yadav et al., 14 Jul 2025, Shazeer, 2020); the cited works do not report an explicit analysis of gradient norms or dynamic range, though the negative-input gradient behaviour can be checked numerically, as sketched after this list.
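A quick, purely illustrative check with PyTorch autograd shows the gate's non-vanishing gradients on negative inputs:

```python
import torch
import torch.nn.functional as F

# Gradient of the Swish/SiLU gate at a few negative and non-negative points.
z = torch.tensor([-4.0, -2.0, -1.0, 0.0, 1.0], requires_grad=True)
F.silu(z).sum().backward()
print(z.grad)  # ≈ [-0.053, -0.091, 0.072, 0.500, 0.928]

# Unlike ReLU, whose gradient is exactly zero for z < 0, Swish keeps small
# non-zero gradients on negative inputs; the slope is even negative below
# roughly z ≈ -1.28, reflecting the gate's non-monotonic shape.
```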
6. Implementation Details
SwiGLU transformations typically initialize $W$, $V$, and $W_2$ using the same Xavier scheme as the surrounding Transformer layers. No auxiliary scaling or per-gate bias parameters are required; all gating and output mixing are subsumed within the three linear transforms. The block admits efficient batched implementation because it relies only on matrix multiplications and component-wise operations (Yadav et al., 14 Jul 2025, Shazeer, 2020).
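A minimal initialization sketch, assuming the bias-free three-matrix layout from Section 1 (the dimensions are arbitrary placeholders):

```python
import torch.nn as nn

d_model, d_hidden = 768, 2048
w  = nn.Linear(d_model, d_hidden, bias=False)  # gate branch W
v  = nn.Linear(d_model, d_hidden, bias=False)  # value branch V
w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection W2

# Same Xavier (Glorot) scheme as the surrounding Transformer layers; no extra
# per-gate biases or scaling parameters are introduced.
for layer in (w, v, w2):
    nn.init.xavier_uniform_(layer.weight)
```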
7. Comparative Summary with Related Units
A direct comparison among gating-unit FFN variants is summarized below:
| Variant | Gating Function | Empirical Outcome |
|---|---|---|
| GLU | Sigmoid | Stronger than ReLU/GELU |
| GeGLU | GELU | Slightly better than GLU |
| SwiGLU | Swish | Equal or better than GeGLU; best on some benchmarks (Shazeer, 2020) |
All variants retain matching parameterization when hidden size is scaled accordingly. Results across language and structured modalities favor SwiGLU in terms of downstream accuracy, data efficiency, and representation power.
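The three variants differ only in the gate activation, which the hypothetical helper below makes explicit (the weights are random placeholders, not trained parameters):

```python
import torch
import torch.nn.functional as F

def gated_ffn(x, W, V, W2, gate):
    """Generic gated FFN: (gate(x W) ⊙ x V) W2; the choice of `gate` selects the variant."""
    return (gate(x @ W) * (x @ V)) @ W2

d, h = 256, 512
W, V, W2 = torch.randn(d, h), torch.randn(d, h), torch.randn(h, d)
x = torch.randn(8, d)

y_glu    = gated_ffn(x, W, V, W2, torch.sigmoid)  # GLU
y_geglu  = gated_ffn(x, W, V, W2, F.gelu)         # GeGLU
y_swiglu = gated_ffn(x, W, V, W2, F.silu)         # SwiGLU (Swish with beta = 1)
```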
References
- (Shazeer, 2020) GLU Variants Improve Transformer
- (Yadav et al., 14 Jul 2025) AudioMAE++: learning better masked audio representations with SwiGLU FFNs
- (Masrur et al., 14 Jan 2025) Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors