GLU Variants in Neural Networks

Updated 16 April 2026
  • GLU Variants are a family of elementwise gating mechanisms that replace standard activations with multiplicative interactions, boosting model expressivity.
  • Variants such as GEGLU, SwiGLU, and PolyGLU adjust nonlinearities and parameterizations to improve convergence, accuracy, and efficiency across tasks.
  • Innovative designs like MGLU and IGLU-Approx address computational and memory overheads through optimized kernel implementations and shared weights.

Gated Linear Units (GLU) and their numerous variants constitute a broad architectural family of elementwise gating mechanisms, now foundational in modern neural networks, particularly large transformers. The canonical GLU pattern replaces conventional single-path activations with a multiplicative interaction between two learned projections, where one branch is modulated by a nonlinearity (the "gate"). This template admits a vast design space: varying the gating function, the combination pattern, the parameterization, and the architectural placement yields both efficiency and representational advantages across domains.

1. Mathematical Structure and Key GLU Variants

At its core, a GLU processes an input vector $x$ via two projections (whose parameters may be shared or decoupled), computing

$$\mathrm{GLU}(x) = f(xW) \odot g(xV)$$

where $f$ is typically the identity on the linear or "value" branch and $g$ is the nonlinearity applied to the "gate" branch; $\odot$ denotes elementwise multiplication. Variants arise by altering $g$, by changing the duplication pattern, and by augmenting this basic form.
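
As a concrete reference, the pattern above fits in a few lines. The sketch below (PyTorch; the names GLU, d_in, and d_hidden are illustrative) applies a sigmoid gate, matching the original GLU formulation.

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Minimal GLU(x) = f(xW) * g(xV) with f = identity and g = sigmoid."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.value = nn.Linear(d_in, d_hidden)  # "value" branch, xW
        self.gate = nn.Linear(d_in, d_hidden)   # "gate" branch, xV

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.value(x) * torch.sigmoid(self.gate(x))
```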

Main GLU classes:

  • Original GLU: $f(u) = u$, $g(u) = \sigma(u)$ (sigmoid) (Shazeer, 2020)
  • Bilinear GLU: both branches linear, $g(u) = u$ (Pearce et al., 2024)
  • ReGLU/GEGLU/SwiGLU/MishGLU: different nonlinearities for $g$ (ReLU, GELU, SiLU "Swish", Mish) (Shazeer, 2020, Le et al., 2024)
  • High-order GLU: the elementwise product is extended with additional multiplicands, so that both projected branches enter nonlinearly ("second-order") or further factors are included (Huang, 2024)
  • Decoupled gating: GaLU separates the weights of the linear and gating branches, optionally using hard gates such as $g(u) = \mathbb{1}[u > 0]$ (Fiat et al., 2019)
  • Dynamic/routing gates: PolyGLU dynamically routes each neuron to one of several candidate activation functions (Medeiros, 7 Mar 2026)

Table 1 summarizes standard GLU forms; $W$ and $V$ are projection matrices, and the Gate column gives the gating nonlinearity $g$. A minimal code sketch of several of these variants follows the table.

Variant | Formulation | Gate $g$
GLU | $(xW) \odot \sigma(xV)$ | Sigmoid
Bilinear | $(xW) \odot (xV)$ | Linear
ReGLU | $(xW) \odot \max(0, xV)$ | ReLU
GEGLU | $(xW) \odot \mathrm{GELU}(xV)$ | GELU
SwiGLU | $(xW) \odot \mathrm{SiLU}(xV)$ | Swish/SiLU
Second-order GLU | product extended with an additional multiplicand | as above
GaLU | $(xW) \odot \mathbb{1}[xV > 0]$ (decoupled weights) | Hard gate
PolyGLU | $(xW) \odot g_r(xV)$, $g_r$ a per-neuron routed mixture | ReLU, GELU, Tanh, SiLU
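
The variants in Table 1 differ only in the gate nonlinearity, so a single parameterized block covers GLU, Bilinear, ReGLU, GEGLU, and SwiGLU. The sketch below follows the common transformer feed-forward usage (value and gate projections followed by an output projection); the gate registry, class name, and the presence of the output projection are illustrative choices rather than the exact layout of any one cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

GATES = {
    "glu": torch.sigmoid,       # original GLU
    "bilinear": lambda u: u,    # Bilinear: both branches linear
    "reglu": F.relu,            # ReGLU
    "geglu": F.gelu,            # GEGLU
    "swiglu": F.silu,           # SwiGLU (Swish/SiLU gate)
}

class GLUFeedForward(nn.Module):
    """Transformer-style feed-forward block with a selectable GLU gate."""
    def __init__(self, d_model: int, d_ff: int, variant: str = "swiglu"):
        super().__init__()
        self.w_value = nn.Linear(d_model, d_ff, bias=False)  # xW
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # xV
        self.w_out = nn.Linear(d_ff, d_model, bias=False)    # projection back to d_model
        self.gate_fn = GATES[variant]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(self.w_value(x) * self.gate_fn(self.w_gate(x)))

# Example: a SwiGLU block applied to a batch of token embeddings
ffn = GLUFeedForward(d_model=64, d_ff=256, variant="swiglu")
out = ffn(torch.randn(2, 10, 64))  # (batch, sequence, d_model)
```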

2. Theoretical Properties and Universal Approximation

GLU architectures admit a distinct piecewise-quadratic or higher-order structure, fundamentally augmenting the expressivity of single-path MLPs. For scalar, one-hidden-layer models:

  • Standard MLP: equivalent to a piecewise-linear spline, so its RMSE decays at the first-order spline rate as parameter count grows.
  • GLU/second-order: admits piecewise-quadratic fits; the multiplicative gate raises the local approximation order by one, improving the error-decay rate accordingly (Queiruga, 16 Feb 2026).
  • Gated Quadratic Unit (GQU): incorporates an additional linear branch, with asymptotic error scaling between the two preceding rates, outperforming both the MLP and the basic GLU on function-approximation tasks.

This scaling advantage is achieved without extra parameter count relative to width-matched baselines, as GLU variants split the hidden dimension between the linear and gating branches (Queiruga, 16 Feb 2026, Shazeer, 2020).
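
To see where the extra order comes from, consider a single scalar ReLU-gated unit. This is a worked example in the document's notation, not a result quoted from the cited papers: on the gate's active region the output is quadratic in the input, whereas a plain ReLU unit is only linear there.

$$\mathrm{ReGLU}(x) = (wx + b)\,\max(0,\, vx + c) = \begin{cases} vw\,x^2 + (wc + bv)\,x + bc, & vx + c > 0 \\ 0, & \text{otherwise.} \end{cases}$$

Summing many such units therefore yields a piecewise-quadratic fit, which is the structural source of the one-order gap in approximation rate noted above.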

3. Modern Variants: Gating Function, Range, and Adaptivity

a. Gating Function Choices

Replacing the canonical sigmoid gate yields measurable accuracy and convergence improvements. Empirically, gates such as GELU (GEGLU), Swish/SiLU (SwiGLU), and Mish (MiGLU) frequently outperform the original GLU, with GEGLU and SwiGLU now widely deployed in transformers (Shazeer, 2020, Le et al., 2024). Table 2 provides accuracy metrics across a diverse pool of gates (Le et al., 2024):

Gate $g$ | Accuracy | Precision | Recall | F1
Sigmoid (GLU) | 0.98 | 0.90 | 0.97 | 0.93
GEGLU | 0.97 | 0.88 | 0.94 | 0.91
SwiGLU | 0.97 | 0.91 | 0.93 | 0.92
MiGLU (Mish) | 0.98 | 0.91 | 0.95 | 0.93
ReGLU | 0.97 | 0.89 | 0.94 | 0.92

b. Expanding the Gating Range

Standard gates restrict outputs to $(0, 1)$. Introducing a trainable scaling parameter $\alpha$ expands this range, e.g. to $(-\alpha, 1 + \alpha)$, as in "xGLU". Empirically, this expansion closes much of the performance difference between first-order (single multiplication) and second-order (extra multiplicand) GLU, and reliably improves perplexity in transformer language modeling (Huang, 2024).
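
A minimal sketch of such an expanded-range gate is below. The affine rescaling $(1 + 2\alpha)\,g(u) - \alpha$ is one natural way to realize the wider range with a single trainable scalar; the name alpha and the sigmoid base gate are assumptions for illustration, not the exact parameterization of (Huang, 2024).

```python
import torch
import torch.nn as nn

class ExpandedGateGLU(nn.Module):
    """GLU whose gate range is widened by a trainable scalar alpha.

    With alpha = 0 this reduces to the standard sigmoid gate in (0, 1);
    alpha > 0 stretches the gate output to (-alpha, 1 + alpha).
    """
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.value = nn.Linear(d_in, d_hidden)
        self.gate = nn.Linear(d_in, d_hidden)
        self.alpha = nn.Parameter(torch.zeros(1))  # trainable range-expansion scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))
        g_expanded = (1.0 + 2.0 * self.alpha) * g - self.alpha
        return self.value(x) * g_expanded
```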

c. Routing and Mixture-of-Gates

PolyGLU incorporates a per-neuron, learnable, input-conditioned mixture over a small set of candidate activation functions (e.g., ReLU, GELU, Tanh, SiLU), via a Gumbel-Softmax "router" trained end-to-end. This mechanism yields emergent, nearly deterministic gate selection with pronounced depth-wise specialization (e.g., GELU early, Tanh deep), at just 0.23% parameter overhead. After training, routing converges to almost zero entropy, with each neuron using a single activation per example (Medeiros, 7 Mar 2026).
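
The sketch below illustrates the routing idea with a Gumbel-Softmax mixture over a small candidate set. For brevity it uses static per-neuron logits, whereas the paper's router is input-conditioned; the candidate list, temperature, and names are likewise illustrative assumptions rather than the exact PolyGLU design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedGate(nn.Module):
    """Per-neuron mixture over candidate activations, selected via Gumbel-Softmax."""
    CANDIDATES = (F.relu, F.gelu, torch.tanh, F.silu)

    def __init__(self, d_hidden: int, tau: float = 1.0):
        super().__init__()
        # One routing logit per neuron and per candidate activation.
        self.logits = nn.Parameter(torch.zeros(d_hidden, len(self.CANDIDATES)))
        self.tau = tau

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # Straight-through Gumbel-Softmax: hard one-hot forward pass, soft gradients.
        weights = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)   # (d_hidden, K)
        acts = torch.stack([fn(u) for fn in self.CANDIDATES], dim=-1)      # (..., d_hidden, K)
        return (acts * weights).sum(dim=-1)
```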

4. Architectural Parameterizations and Efficiency

The two-matrix GLU (or its variants) doubles weight loading within a block. This incurs memory bandwidth and compute overhead, particularly salient during inference in LLMs. Recent work addresses this via:

  • Masked Gated Linear Units (MGLU): replace the two projection matrices with one shared weight tensor and elementwise-learned binary masks (Mixture of Elementwise Gating, MoEG), so distinct gate/value subspaces are drawn from a single weight pool. Combined with optimized "FlashMGLU" CUDA kernels, this cuts inference memory usage by up to 47% and latency by 34% versus standard GLUs, even when several masks are used (Tajima et al., 29 Jun 2025); a schematic sketch follows this list.
  • ReLU-based approximations: IGLU-Approx implements the Cauchy-gated IGLU via rational ReLU-only approximations, eliminating transcendental calls and achieving speed on par with ReLU or HardTanh, while capturing heavy-tailed robustness benefits (Kang et al., 6 Mar 2026).
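
The sketch below is a schematic of the shared-weight masking idea: one weight matrix serves both branches, and learned binary masks carve out the gate and value subspaces. The straight-through mask estimator, the SiLU gate, and the naming are assumptions for illustration; this is not the FlashMGLU kernel itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedGLU(nn.Module):
    """GLU-style block where the gate and value branches share one weight matrix.

    Learned binary masks (trained with a straight-through sigmoid) select the
    gate and value subspaces from a single weight pool, roughly halving the
    weights that must be loaded compared with a two-matrix GLU.
    """
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_hidden, d_in))
        nn.init.xavier_uniform_(self.weight)
        # One mask logit tensor per branch (gate, value).
        self.mask_logits = nn.Parameter(torch.zeros(2, d_hidden, d_in))

    def _binary_masks(self) -> torch.Tensor:
        soft = torch.sigmoid(self.mask_logits)
        hard = (soft > 0.5).float()
        return hard + soft - soft.detach()      # straight-through estimator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_mask, value_mask = self._binary_masks()
        gate = x @ (self.weight * gate_mask).t()
        value = x @ (self.weight * value_mask).t()
        return value * F.silu(gate)             # SwiGLU-style gate on the masked branch
```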

Table 3 summarizes efficiency trade-offs:

Variant | Memory footprint | Speed | Downstream accuracy
GLU | baseline | baseline | baseline
MGLU (FlashMGLU) | up to 47% lower inference memory | up to 34% lower latency | matches or exceeds SwiGLU (Tajima et al., 29 Jun 2025)
IGLU-Approx | comparable to ReLU | on par with ReLU/HardTanh | parity with IGLU, ReLU, GELU (Kang et al., 6 Mar 2026)

5. GLU Placement: Feedforward, Attention, and Residual Modules

While GLUs were first deployed in feedforward MLP blocks, their gating structure enables effective use in other contexts:

  • Attention Value Gating: GLU Attention applies elementwise gating (often SiLU) to the value projection within scaled dot-product attention, yielding improved convergence speed and final accuracy with marginal or zero extra parameters (Wang, 16 Jun 2025); a minimal sketch follows this list.
  • Residual/Filter Layer Integration: Placing GLU-powered Gated Residual Networks (GRN) as intermediate “filters” in transformers, rather than within attention softmax, maximizes noise suppression and signal amplification, as measured by downstream mutual information estimates (Le et al., 2024).
  • SE Attention and CNNs: GLUSE combines channelwise SE recalibration with a parallel GLU-inspired gating branch, improving channel selectivity for very low-footprint CNNs deployed in ultra-resource-constrained environments (Le et al., 16 Apr 2025).
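
The sketch below shows the attention-value gating idea in a single-head scaled dot-product attention block: applying SiLU (a self-gating nonlinearity) to the value projection is the only change from a standard block and adds no parameters. The single-head layout and names are simplifications for illustration, not necessarily the exact construction of (Wang, 16 Jun 2025).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedValueAttention(nn.Module):
    """Single-head scaled dot-product attention with a SiLU-gated value path."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k = self.q(x), self.k(x)
        v = F.silu(self.v(x))  # elementwise (self-)gating of the value projection
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```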

6. Empirical Benchmarks and Deployment Regimes

Extensive evaluations confirm GLU-based blocks—specifically, GEGLU, SwiGLU, their expanded-range and MGLU variants—are competitive or superior to classical activations (ReLU, GELU) across vision, language, and imbalanced classification. Concrete findings include:

  • Vision: IGLU (σ = 0.1) outperforms ReLU and GELU on CIFAR-10/100 with ResNet-20, especially under class imbalance. ResNet-GLUSE delivers 94%+ accuracy on EuroSAT with substantially fewer parameters than MobileViT (Kang et al., 6 Mar 2026, Le et al., 16 Apr 2025).
  • Language modeling: SwiGLU and PolyGLU match or surpass GELU and vanilla GLU on WikiText and SuperGLUE/GLUE, with enhanced convergence and reduced perplexity (Medeiros, 7 Mar 2026, Shazeer, 2020).
  • Efficiency: MGLU’s FlashMGLU kernel attains a substantial speedup over a naive MGLU implementation, while IGLU-Approx recovers full IGLU performance at ReLU cost (Kang et al., 6 Mar 2026, Tajima et al., 29 Jun 2025).
  • Mutual information: Sigmoid-gated GLU maximizes post-filter MI as measured by MINE, providing large discrimination gains in low-data, high-noise regimes (Le et al., 2024).
  • Interpretability: Bilinear GLU variants enable spectral tensor decomposition, producing mono-semantic, interpretable eigenfeatures; large LMs can be finetuned into bilinear form with negligible perplexity penalty (Pearce et al., 2024).

7. Design Principles, Limitations, and Future Directions

The growing GLU variant taxonomy is characterized by a modular design space:

  • Nonlinearity/gate choice: Each new nonlinearity alters representational power and gradient dynamics; experimental verifications remain essential.
  • Gating range expansion: Trainable expansion parameters robustly improve gradient flow and bring first- and second-order variants to parity (Huang, 2024).
  • Efficiency: Shared-matrix masking (MGLU) and rational approximations (IGLU-Approx) are practical solutions to bandwidth and hardware bottlenecks for deployment in LLM and embedded scenarios (Tajima et al., 29 Jun 2025, Kang et al., 6 Mar 2026).
  • Routing: PolyGLU’s per-neuron activation palette introduces a new axis of network flexibility, though empirical gains are context-dependent and may converge to fixed specializations at scale (Medeiros, 7 Mar 2026).
  • Theoretical guarantees: Decoupled gating (GaLU) admits tight memorization and generalization bounds for shallow networks, and MLP-vs-GLU scaling laws elucidate GLU’s structural approximation strengths (Fiat et al., 2019, Queiruga, 16 Feb 2026).

Limitations include possible diminishing returns beyond moderate mixture components or gate expansion, the challenge of deploying GLU-based blocks at extreme scale without custom kernel optimization, and open questions around optimal gating distributions and per-layer specialization in deep transformers.

Continued research directions include adapting GLU routing for sparse/efficient experts, integrating GLU attention into large-scale multi-head contexts, exploring cross-modal GLU deployments, and fusing gating mechanisms with emerging neuromorphic computation platforms. The modularity of the GLU formalism ensures enduring relevance across deep learning frontiers.
