GLU Variants in Neural Networks

Updated 16 April 2026
  • GLU Variants are a family of elementwise gating mechanisms that replace standard activations with multiplicative interactions, boosting model expressivity.
  • Variants such as GEGLU, SwiGLU, and PolyGLU adjust nonlinearities and parameterizations to improve convergence, accuracy, and efficiency across tasks.
  • Innovative designs like MGLU and IGLU-Approx address computational and memory overheads through optimized kernel implementations and shared weights.

Gated Linear Units (GLU) and their numerous variants constitute a broad architectural family of elementwise gating mechanisms, now foundational in modern neural networks, particularly large transformers. The canonical GLU pattern replaces conventional single-path activations with a multiplicative interaction between two learned projections, where one branch is modulated by a nonlinearity (the "gate"). This template admits a vast design space: varying the gating function, the combination pattern, the parameterization, and the architectural placement yields both efficiency and representational advantages across domains.

1. Mathematical Structure and Key GLU Variants

At its core, a GLU processes an input vector $x$ via two projections (whose parameters may be shared or decoupled), computing

$$\mathrm{GLU}(x) = f(xW) \odot g(xV)$$

where $f$ is typically the identity on the linear or "value" branch and $g$ is the nonlinearity applied to the "gate" branch; $\odot$ denotes elementwise multiplication. Variants arise by altering $g$, by changing the duplication pattern, and by augmenting this basic form.
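
As a concrete reference, the pattern above fits in a few lines. The sketch below (PyTorch; the names GLU, d_in, and d_hidden are illustrative) applies a sigmoid gate, matching the original GLU formulation.

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Minimal GLU(x) = f(xW) * g(xV) with f = identity and g = sigmoid."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.value = nn.Linear(d_in, d_hidden)  # "value" branch, xW
        self.gate = nn.Linear(d_in, d_hidden)   # "gate" branch, xV

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.value(x) * torch.sigmoid(self.gate(x))
```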

Main GLU classes:

  • Original GLU: $f(u) = u$, $g(u) = \sigma(u)$ (sigmoid) (Shazeer, 2020)
  • Bilinear GLU: both branches linear, $g(u) = u$ (Pearce et al., 2024)
  • ReGLU/GEGLU/SwiGLU/MishGLU: different nonlinearities for $g$ (ReLU, GELU, SiLU "Swish", Mish) (Shazeer, 2020, Le et al., 2024)
  • High-order GLU: the elementwise product is extended with additional multiplicands, so that both projected branches enter nonlinearly ("second-order") or further factors are included (Huang, 2024)
  • Decoupled gating: GaLU separates the weights of the linear and gating branches, optionally using hard gates such as $g(u) = \mathbb{1}[u > 0]$ (Fiat et al., 2019)
  • Dynamic/routing gates: PolyGLU dynamically routes each neuron to one of several candidate activation functions (Medeiros, 7 Mar 2026)

Table 1 summarizes standard GLU forms; $W$ and $V$ are projection matrices, and the Gate column gives the gating nonlinearity $g$. A minimal code sketch of several of these variants follows the table.

Variant | Formulation | Gate $g$
GLU | $(xW) \odot \sigma(xV)$ | Sigmoid
Bilinear | $(xW) \odot (xV)$ | Linear
ReGLU | $(xW) \odot \max(0, xV)$ | ReLU
GEGLU | $(xW) \odot \mathrm{GELU}(xV)$ | GELU
SwiGLU | $(xW) \odot \mathrm{SiLU}(xV)$ | Swish/SiLU
Second-order GLU | product extended with an additional multiplicand | as above
GaLU | $(xW) \odot \mathbb{1}[xV > 0]$ (decoupled weights) | Hard gate
PolyGLU | $(xW) \odot g_r(xV)$, $g_r$ a per-neuron routed mixture | ReLU, GELU, Tanh, SiLU
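
The variants in Table 1 differ only in the gate nonlinearity, so a single parameterized block covers GLU, Bilinear, ReGLU, GEGLU, and SwiGLU. The sketch below follows the common transformer feed-forward usage (value and gate projections followed by an output projection); the gate registry, class name, and the presence of the output projection are illustrative choices rather than the exact layout of any one cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

GATES = {
    "glu": torch.sigmoid,       # original GLU
    "bilinear": lambda u: u,    # Bilinear: both branches linear
    "reglu": F.relu,            # ReGLU
    "geglu": F.gelu,            # GEGLU
    "swiglu": F.silu,           # SwiGLU (Swish/SiLU gate)
}

class GLUFeedForward(nn.Module):
    """Transformer-style feed-forward block with a selectable GLU gate."""
    def __init__(self, d_model: int, d_ff: int, variant: str = "swiglu"):
        super().__init__()
        self.w_value = nn.Linear(d_model, d_ff, bias=False)  # xW
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # xV
        self.w_out = nn.Linear(d_ff, d_model, bias=False)    # projection back to d_model
        self.gate_fn = GATES[variant]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(self.w_value(x) * self.gate_fn(self.w_gate(x)))

# Example: a SwiGLU block applied to a batch of token embeddings
ffn = GLUFeedForward(d_model=64, d_ff=256, variant="swiglu")
out = ffn(torch.randn(2, 10, 64))  # (batch, sequence, d_model)
```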

2. Theoretical Properties and Universal Approximation

GLU architectures admit a distinct piecewise-quadratic or higher-order structure, fundamentally augmenting the expressivity of single-path MLPs. For scalar, one-hidden-layer models:

  • Standard MLP: equivalent to a piecewise-linear spline, so its RMSE decays at the first-order spline rate as parameter count grows.
  • GLU/second-order: admits piecewise-quadratic fits; the multiplicative gate raises the local approximation order by one, improving the error-decay rate accordingly (Queiruga, 16 Feb 2026).
  • Gated Quadratic Unit (GQU): incorporates an additional linear branch, with asymptotic error scaling between the two preceding rates, outperforming both the MLP and the basic GLU on function-approximation tasks.

This scaling advantage is achieved without extra parameter count relative to width-matched baselines, as GLU variants split the hidden dimension between the linear and gating branches (Queiruga, 16 Feb 2026, Shazeer, 2020).
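
To see where the extra order comes from, consider a single scalar ReLU-gated unit. This is a worked example in the document's notation, not a result quoted from the cited papers: on the gate's active region the output is quadratic in the input, whereas a plain ReLU unit is only linear there.

$$\mathrm{ReGLU}(x) = (wx + b)\,\max(0,\, vx + c) = \begin{cases} vw\,x^2 + (wc + bv)\,x + bc, & vx + c > 0 \\ 0, & \text{otherwise.} \end{cases}$$

Summing many such units therefore yields a piecewise-quadratic fit, which is the structural source of the one-order gap in approximation rate noted above.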

3. Modern Variants: Gating Function, Range, and Adaptivity

a. Gating Function Choices

Replacing the canonical sigmoid gate yields measurable accuracy and convergence improvements. Empirically, gates such as GELU (GEGLU), Swish/SiLU (SwiGLU), and Mish (MiGLU) frequently outperform the original GLU, with GEGLU and SwiGLU now widely deployed in transformers (Shazeer, 2020, Le et al., 2024). Table 2 provides accuracy metrics across a diverse pool of gates (Le et al., 2024):

Gate $g$ | Accuracy | Precision | Recall | F1
Sigmoid (GLU) | 0.98 | 0.90 | 0.97 | 0.93
GEGLU | 0.97 | 0.88 | 0.94 | 0.91
SwiGLU | 0.97 | 0.91 | 0.93 | 0.92
MiGLU (Mish) | 0.98 | 0.91 | 0.95 | 0.93
ReGLU | 0.97 | 0.89 | 0.94 | 0.92

b. Expanding the Gating Range

Standard gates restrict outputs to $(0, 1)$. Introducing a trainable scaling parameter $\alpha$ expands this range, e.g. to $(-\alpha, 1 + \alpha)$, as in "xGLU". Empirically, this expansion closes much of the performance difference between first-order (single multiplication) and second-order (extra multiplicand) GLU, and reliably improves perplexity in transformer language modeling (Huang, 2024).
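
A minimal sketch of such an expanded-range gate is below. The affine rescaling $(1 + 2\alpha)\,g(u) - \alpha$ is one natural way to realize the wider range with a single trainable scalar; the name alpha and the sigmoid base gate are assumptions for illustration, not the exact parameterization of (Huang, 2024).

```python
import torch
import torch.nn as nn

class ExpandedGateGLU(nn.Module):
    """GLU whose gate range is widened by a trainable scalar alpha.

    With alpha = 0 this reduces to the standard sigmoid gate in (0, 1);
    alpha > 0 stretches the gate output to (-alpha, 1 + alpha).
    """
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.value = nn.Linear(d_in, d_hidden)
        self.gate = nn.Linear(d_in, d_hidden)
        self.alpha = nn.Parameter(torch.zeros(1))  # trainable range-expansion scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))
        g_expanded = (1.0 + 2.0 * self.alpha) * g - self.alpha
        return self.value(x) * g_expanded
```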

c. Routing and Mixture-of-Gates

PolyGLU incorporates a per-neuron, learnable, input-conditioned mixture over a small set of candidate activation functions (e.g., ReLU, GELU, Tanh, SiLU), via a Gumbel-Softmax "router" trained end-to-end. This mechanism yields emergent, nearly deterministic gate selection with pronounced depth-wise specialization (e.g., GELU early, Tanh deep), at just 0.23% parameter overhead. After training, routing converges to almost zero entropy, with each neuron using a single activation per example (Medeiros, 7 Mar 2026).
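
The sketch below illustrates the routing idea with a Gumbel-Softmax mixture over a small candidate set. For brevity it uses static per-neuron logits, whereas the paper's router is input-conditioned; the candidate list, temperature, and names are likewise illustrative assumptions rather than the exact PolyGLU design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedGate(nn.Module):
    """Per-neuron mixture over candidate activations, selected via Gumbel-Softmax."""
    CANDIDATES = (F.relu, F.gelu, torch.tanh, F.silu)

    def __init__(self, d_hidden: int, tau: float = 1.0):
        super().__init__()
        # One routing logit per neuron and per candidate activation.
        self.logits = nn.Parameter(torch.zeros(d_hidden, len(self.CANDIDATES)))
        self.tau = tau

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # Straight-through Gumbel-Softmax: hard one-hot forward pass, soft gradients.
        weights = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)   # (d_hidden, K)
        acts = torch.stack([fn(u) for fn in self.CANDIDATES], dim=-1)      # (..., d_hidden, K)
        return (acts * weights).sum(dim=-1)
```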

4. Architectural Parameterizations and Efficiency

The two-matrix GLU (or its variants) doubles weight loading within a block. This incurs memory bandwidth and compute overhead, particularly salient during inference in LLMs. Recent work addresses this via:

  • Masked Gated Linear Units (MGLU): replace the two projection matrices with one shared weight tensor and elementwise-learned binary masks (Mixture of Elementwise Gating, MoEG), so distinct gate/value subspaces are drawn from a single weight pool. Combined with optimized "FlashMGLU" CUDA kernels, this cuts inference memory usage by up to 47% and latency by 34% versus standard GLUs, even when several masks are used (Tajima et al., 29 Jun 2025); a schematic sketch follows this list.
  • ReLU-based approximations: IGLU-Approx implements the Cauchy-gated IGLU via rational ReLU-only approximations, eliminating transcendental calls and achieving speed on par with ReLU or HardTanh, while capturing heavy-tailed robustness benefits (Kang et al., 6 Mar 2026).
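
The sketch below is a schematic of the shared-weight masking idea: one weight matrix serves both branches, and learned binary masks carve out the gate and value subspaces. The straight-through mask estimator, the SiLU gate, and the naming are assumptions for illustration; this is not the FlashMGLU kernel itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedGLU(nn.Module):
    """GLU-style block where the gate and value branches share one weight matrix.

    Learned binary masks (trained with a straight-through sigmoid) select the
    gate and value subspaces from a single weight pool, roughly halving the
    weights that must be loaded compared with a two-matrix GLU.
    """
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_hidden, d_in))
        nn.init.xavier_uniform_(self.weight)
        # One mask logit tensor per branch (gate, value).
        self.mask_logits = nn.Parameter(torch.zeros(2, d_hidden, d_in))

    def _binary_masks(self) -> torch.Tensor:
        soft = torch.sigmoid(self.mask_logits)
        hard = (soft > 0.5).float()
        return hard + soft - soft.detach()      # straight-through estimator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_mask, value_mask = self._binary_masks()
        gate = x @ (self.weight * gate_mask).t()
        value = x @ (self.weight * value_mask).t()
        return value * F.silu(gate)             # SwiGLU-style gate on the masked branch
```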

Table 3 summarizes efficiency trade-offs:

Variant | Memory footprint | Speed | Downstream accuracy
GLU | baseline | baseline | baseline
MGLU (FlashMGLU) | up to 47% lower inference memory | up to 34% lower latency | matches or exceeds SwiGLU (Tajima et al., 29 Jun 2025)
IGLU-Approx | comparable to ReLU | on par with ReLU/HardTanh | parity with IGLU, ReLU, GELU (Kang et al., 6 Mar 2026)

5. GLU Placement: Feedforward, Attention, and Residual Modules

While GLUs were first deployed in feedforward MLP blocks, their gating structure enables effective use in other contexts:

  • Attention Value Gating: GLU Attention applies elementwise gating (often SiLU) to the value projection within scaled dot-product attention, yielding improved convergence speed and final accuracy with marginal or zero extra parameters (Wang, 16 Jun 2025); a minimal sketch follows this list.
  • Residual/Filter Layer Integration: Placing GLU-powered Gated Residual Networks (GRN) as intermediate “filters” in transformers, rather than within attention softmax, maximizes noise suppression and signal amplification, as measured by downstream mutual information estimates (Le et al., 2024).
  • SE Attention and CNNs: GLUSE combines channelwise SE recalibration with a parallel GLU-inspired gating branch, improving channel selectivity for very low-footprint CNNs deployed in ultra-resource-constrained environments (Le et al., 16 Apr 2025).
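
The sketch below shows the attention-value gating idea in a single-head scaled dot-product attention block: applying SiLU (a self-gating nonlinearity) to the value projection is the only change from a standard block and adds no parameters. The single-head layout and names are simplifications for illustration, not necessarily the exact construction of (Wang, 16 Jun 2025).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedValueAttention(nn.Module):
    """Single-head scaled dot-product attention with a SiLU-gated value path."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k = self.q(x), self.k(x)
        v = F.silu(self.v(x))  # elementwise (self-)gating of the value projection
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```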

6. Empirical Benchmarks and Deployment Regimes

Extensive evaluations confirm GLU-based blocks—specifically, GEGLU, SwiGLU, their expanded-range and MGLU variants—are competitive or superior to classical activations (ReLU, GELU) across vision, language, and imbalanced classification. Concrete findings include:

  • Vision: IGLU (σ = 0.1) outperforms ReLU and GELU on CIFAR-10/100 with ResNet-20, especially under class imbalance. ResNet-GLUSE delivers 94%+ accuracy on EuroSAT with substantially fewer parameters than MobileViT (Kang et al., 6 Mar 2026, Le et al., 16 Apr 2025).
  • Language modeling: SwiGLU and PolyGLU match or surpass GELU and vanilla GLU on WikiText and SuperGLUE/GLUE, with enhanced convergence and reduced perplexity (Medeiros, 7 Mar 2026, Shazeer, 2020).
  • Efficiency: MGLU’s FlashMGLU kernel attains a substantial speedup over a naive MGLU implementation, while IGLU-Approx recovers full IGLU performance at ReLU cost (Kang et al., 6 Mar 2026, Tajima et al., 29 Jun 2025).
  • Mutual information: Sigmoid-gated GLU maximizes post-filter MI as measured by MINE, providing large discrimination gains in low-data, high-noise regimes (Le et al., 2024).
  • Interpretability: Bilinear GLU variants enable spectral tensor decomposition, producing mono-semantic, interpretable eigenfeatures; large LMs can be finetuned into bilinear form with negligible perplexity penalty (Pearce et al., 2024).

7. Design Principles, Limitations, and Future Directions

The growing GLU variant taxonomy is characterized by a modular design space:

  • Nonlinearity/gate choice: Each new nonlinearity alters representational power and gradient dynamics; experimental verifications remain essential.
  • Gating range expansion: Trainable expansion parameters robustly improve gradient flow and bring first- and second-order variants to parity (Huang, 2024).
  • Efficiency: Shared-matrix masking (MGLU) and rational approximations (IGLU-Approx) are practical solutions to bandwidth and hardware bottlenecks for deployment in LLM and embedded scenarios (Tajima et al., 29 Jun 2025, Kang et al., 6 Mar 2026).
  • Routing: PolyGLU’s per-neuron activation palette introduces a new axis of network flexibility, though empirical gains are context-dependent and may converge to fixed specializations at scale (Medeiros, 7 Mar 2026).
  • Theoretical guarantees: Decoupled gating (GaLU) admits tight memorization and generalization bounds for shallow networks, and MLP-vs-GLU scaling laws elucidate GLU’s structural approximation strengths (Fiat et al., 2019, Queiruga, 16 Feb 2026).

Limitations include possible diminishing returns beyond moderate mixture components or gate expansion, the challenge of deploying GLU-based blocks at extreme scale without custom kernel optimization, and open questions around optimal gating distributions and per-layer specialization in deep transformers.

Continued research directions include adapting GLU routing for sparse/efficient experts, integrating GLU attention into large-scale multi-head contexts, exploring cross-modal GLU deployments, and fusing gating mechanisms with emerging neuromorphic computation platforms. The modularity of the GLU formalism ensures enduring relevance across deep learning frontiers.
