
Gated Linear Unit (GLU) Overview

Updated 17 March 2026
  • GLU is a neural activation module that fuses two parallel linear projections using an elementwise gate, such as a sigmoid, to regulate information flow.
  • It enhances model performance by enabling dynamic, data-dependent gating, offering improved gradient propagation and superior scaling compared to traditional activations.
  • Variants like GEGLU, SwiGLU, and MGLU optimize computational efficiency and performance in Transformers, LLMs, and other deep learning architectures.

A Gated Linear Unit (GLU) is a neural activation module that fuses two parallel linear projections through elementwise multiplication, one branch optionally passed through a nonlinear “gate” function such as sigmoid. This gating architecture grants the network dynamic, data-dependent control over the flow of information in each hidden dimension, and has demonstrated superior expressivity and empirical performance relative to standard activations in a broad set of domains, notably feed-forward blocks in Transformers, convolutional sequence models, mixture-of-experts (MoE) conversions, and modern LLMs (Shazeer, 2020, Zhao et al., 17 Feb 2026, Tajima et al., 29 Jun 2025, Abdullah et al., 2024, Hou et al., 2018). The multiplicative structure fundamentally distinguishes GLUs from pointwise nonlinearities (e.g., ReLU, GELU): it supports both “linear gradient paths” and per-unit dynamic gating, facilitating both optimization and selective routing. The following summary reviews the mathematical formulation, architectural roles, variants, scaling-laws, applications, and computational advances associated with GLUs.

1. Mathematical Formulation and Variants

A standard GLU takes an input vector $x \in \mathbb{R}^d$ and computes

$$\mathrm{GLU}(x) = (A x) \odot \sigma(B x)$$

where $A, B \in \mathbb{R}^{d \times d}$ are learned matrices, $\sigma$ is a gate nonlinearity (commonly sigmoid), and $\odot$ denotes the elementwise product (Shazeer, 2020, Hou et al., 2018). This splits the hidden layer into “information” and “gate” streams, modulating each coordinate.
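As a minimal numerical sketch of this formula (the dimension and random weights below are illustrative, not from any cited model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, A, B):
    """GLU(x) = (A x) ⊙ σ(B x): the value stream, gated elementwise."""
    return (A @ x) * sigmoid(B @ x)

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))   # "information" stream weights
B = rng.normal(size=(d, d))   # "gate" stream weights
x = rng.normal(size=d)
y = glu(x, A, B)
```

Because the sigmoid gate lies strictly in (0, 1), each output coordinate is a damped copy of the corresponding value-stream coordinate.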

Shazeer (2020) systematically investigates several GLU variants; all share the form

$$\mathrm{FFN}_\star(x) = (\phi(x W) \odot x V)\, W_2$$

with distinct gate functions $\phi$:

| Variant | Gate function $\phi$ | Formula |
|---|---|---|
| GLU | $\sigma(u)$ | $\sigma(xW) \odot xV$ |
| Bilinear | $u$ | $xW \odot xV$ |
| ReGLU | $\max(0, u)$ | $\max(0, xW) \odot xV$ |
| GEGLU | $\mathrm{GELU}(u)$ | $\mathrm{GELU}(xW) \odot xV$ |
| SwiGLU | $\mathrm{Swish}(u)$ | $\mathrm{Swish}(xW) \odot xV$ |
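The table maps directly to code: only the gate function $\phi$ changes between variants. A sketch with the tanh approximation of GELU and $\beta = 1$ Swish (SiLU); weights and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):  # tanh approximation
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):  # SiLU, beta = 1
    return z * sigmoid(z)

GATES = {
    "GLU":      sigmoid,
    "Bilinear": lambda z: z,
    "ReGLU":    lambda z: np.maximum(0.0, z),
    "GEGLU":    gelu,
    "SwiGLU":   swish,
}

def ffn_variant(x, W, V, W2, gate):
    """FFN_*(x) = (phi(x W) ⊙ x V) W2, per Shazeer (2020)."""
    return (gate(x @ W) * (x @ V)) @ W2

rng = np.random.default_rng(1)
d, h = 8, 16
x = rng.normal(size=d)
W = rng.normal(size=(d, h))
V = rng.normal(size=(d, h))
W2 = rng.normal(size=(h, d))
outputs = {name: ffn_variant(x, W, V, W2, g) for name, g in GATES.items()}
```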

Additional architectural variants discussed below include the Masked Gated Linear Unit (MGLU), which shares a single weight matrix between the gate and value streams; GLU Attention, which gates the value pathway of multi-head attention; and higher-order gated polynomial units such as the Gated Quadratic Unit (GQU).

2. Theoretical Properties and Scaling Laws

GLU-based architectures possess higher function-approximation capacity and more favorable scaling than conventional MLPs. With ReLU gating, a one-layer GLU network can be written as

$$\mathrm{GLU}(x) = \sum_{i=1}^n D_i\, \sigma(G_i x + g_i)\,(U_i x + u_i)$$

When a neuron is active, $\sigma(G_i x + g_i) = G_i x + g_i$, so each active neuron contributes a quadratic polynomial in $x$, giving rise to a piecewise-quadratic spline. In contrast, standard ReLU-MLPs are piecewise-linear.

Empirical and analytical studies demonstrate that the $L^2$ reconstruction loss $L(P)$ for a GLU network with parameter count $P$ scales as $L(P) \propto P^{-3}$, consistent with piecewise-quadratic approximation, in contrast to $L(P) \propto P^{-2}$ for MLPs (Queiruga, 16 Feb 2026). The “Gated Quadratic Unit” (GQU) achieves an even steeper rate, $L(P) \propto P^{-4}$. These scaling advantages are observed on synthetic approximation tasks and help explain the empirical dominance of GLU-MLPs at large scale.

In the context of convex random feature models, “GaLU” decoupling enables optimal memorization of $m$ samples in $d$ dimensions with only $\tilde{O}(m/d)$ neurons, matching empirical observations and outperforming theoretical ReLU data requirements (Fiat et al., 2019).

3. Integration into Model Architectures

Feed-Forward Networks in Transformers

GLUs replace the conventional two-matrix structure of Transformer FFNs (linear-ReLU-linear) with three-matrix gated blocks. To hold parameter count and compute fixed, the hidden dimension is typically scaled by 2/3 relative to the ungated FFN (Shazeer, 2020). In sequence-to-sequence and language modeling, GLU, GEGLU, and SwiGLU consistently outperform ReLU/GELU baselines in pretraining perplexity and downstream GLUE/SQuAD/SuperGLUE accuracy.
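The 2/3 factor follows from parameter parity: a two-matrix FFN has $2 d \cdot d_{ff}$ parameters, a three-matrix gated block has $3 d \cdot h$, so $h = 2 d_{ff} / 3$. A small helper (the `multiple_of` rounding is a common hardware-alignment convention, not from the cited work):

```python
def glu_hidden_dim(d_ff: int, multiple_of: int = 1) -> int:
    """Hidden width h for a three-matrix GLU FFN that matches the
    parameter count of a two-matrix FFN of hidden width d_ff:
    2*d*d_ff == 3*d*h  =>  h = 2*d_ff/3, optionally rounded up
    to a hardware-friendly multiple."""
    h = (2 * d_ff) // 3
    if multiple_of > 1:
        h = multiple_of * ((h + multiple_of - 1) // multiple_of)
    return h

# e.g. a baseline FFN with d_ff = 3072 maps to a GLU hidden width of 2048
```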

Attention Mechanisms

GLU Attention applies the gating operation to the value projection in multi-head attention, injecting a nonlinearity into the value pathway. This increases head expressivity while preserving compatibility with FlashAttention, rotary embeddings, and grouped-query designs. Empirical results on both vision (CIFAR-10) and text (WikiText) tasks indicate modest but consistent increases in accuracy and convergence speed, with no increase in parameter count or nominal FLOPs (Wang, 16 Jun 2025).
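A minimal single-head sketch of value-path gating. To keep the parameter count unchanged, as the paper reports, the value projection here is split in half into value and gate streams; this layout (and the sigmoid gate) is an assumption for illustration, and the published design may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def glu_attention(X, Wq, Wk, Wv):
    """Single-head attention with a GLU-gated value pathway (sketch)."""
    Q, K = X @ Wq, X @ Wk
    v, g = np.split(X @ Wv, 2, axis=-1)   # value / gate halves share Wv
    V = v * sigmoid(g)                    # nonlinearity on the value path
    scores = softmax((Q @ K.T) / np.sqrt(Q.shape[-1]))
    return scores @ V

rng = np.random.default_rng(2)
n, d = 3, 4
X = rng.normal(size=(n, d))
out = glu_attention(X, rng.normal(size=(d, d)),
                    rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

Note that splitting halves the effective value width per head; widening `Wv` instead would add parameters.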

Convolutional and Recurrent Networks

GLUs have been incorporated into convolutional recurrent models for polyphonic audio tagging. Replacing ReLU with GLU in 2D convolutions (via parallel convolutional kernels), coupled with a CTC loss, yielded a relative AUC increase of ≈5%, producing sharper, event-aligned activations and improving learning in deep stacks (Hou et al., 2018).

Vision Transformers

GLU/GEGLU units have been shown to substitute for both the self-attention and MLP sublayers in ViT-style architectures. The GEGLU-only block reduces parameter count and FLOPs, attains $O(n d^2)$ complexity (linear in the number of tokens $n$), and delivers competitive or improved accuracy on vision benchmarks relative to standard ViT and Mixer baselines (Abdullah et al., 2024).

Mixture-of-Experts Conversion

GLU activation patterns directly reveal a coarse-grained MoE decomposition within dense LLMs. Neurons with consistently high gate activations across tasks (“universal”) are grouped into shared experts, while specialized patterns are clustered into routed experts. The ExpertWeaver framework exploits this property for training-free conversion of dense models to high-quality sparse MoEs, outperforming literature baselines in both dynamic pruning and MoE initialization (Zhao et al., 17 Feb 2026).

4. Anatomy, Interpretability, and Activation Patterns

GLU neurons are characterized by a two-stream computation: a gating stream and an information (“in”) stream. Each neuron's activation is modeled as $u \odot g$, where $u$ is the linear (value-stream) pre-activation and $g$ is the gate after its nonlinearity. The SwiGLU variant allows both terms to be positive or negative, leading to four potential sign regimes: $(+,+)$, $(+,-)$, $(-,+)$, $(-,-)$. GLUScope, an interpretability tool, decomposes neuron behaviors by sign combination, revealing that rare regimes (e.g., $(-,-)$) can encode specific lexical or logical patterns, such as repetitive tokens (“once again”). Monitoring these regimes offers deeper insight into the functional role of GLU-based neurons and differentiates their capacity from one-dimensional activations (Gerstner et al., 27 Feb 2026).
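The four regimes arise because Swish itself goes negative for negative inputs. A small sketch of GLUScope-style regime labeling (the function name and labeling convention are illustrative, not the tool's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    # Swish(z) = z * sigmoid(z); negative for z < 0, so the gate can flip sign
    return z * sigmoid(z)

def sign_regime(u, g_pre):
    """Label a SwiGLU neuron activation u * Swish(g_pre) by the signs of
    the value stream u and the post-nonlinearity gate."""
    g = swish(g_pre)
    return ("+" if u >= 0 else "-", "+" if g >= 0 else "-")
```

For example, a negative value stream combined with a negative gate pre-activation lands in the rare $(-,-)$ regime, since $\mathrm{Swish}(g) < 0$ there.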

5. Computational Efficiency and Memory Optimization

A significant hardware bottleneck for standard GLUs is the doubled high-bandwidth memory (HBM) usage due to separate gate and value matrices. The Masked Gated Linear Unit (MGLU) addresses this by learning binary masks that partition a single weight matrix into gate/value elements. The FlashMGLU kernel implements this via pre-packed mask bits and tilewise fused computation, reducing memory reads by up to 47% and enabling 19.7× lower inference latency compared to naive implementations, while matching or sometimes exceeding SwiGLU in LLM benchmarks (Tajima et al., 29 Jun 2025).

MGLU delivers maximal gains in low-batch, real-time, or memory-bound scenarios. The mechanism generalizes to mixture-of-mask designs (MoEG), balancing expressivity and bandwidth, and is especially suitable for deployment on edge or accelerator-limited platforms.
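The weight-sharing idea behind MGLU can be sketched as a complementary binary mask over a single matrix; the real FlashMGLU kernel packs the mask bits and fuses the tiled computation, so the partitioning below only illustrates the memory-saving principle, not the actual kernel:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mglu(x, W, M):
    """Masked GLU sketch: one shared weight matrix W, partitioned by a
    binary mask M into gate and value elements, so only W (plus 1-bit
    masks) needs to be read from memory."""
    gate_w  = W * M          # elements routed to the gate stream
    value_w = W * (1 - M)    # complementary elements feed the value stream
    return (x @ value_w) * sigmoid(x @ gate_w)

rng = np.random.default_rng(3)
d, h = 6, 8
W = rng.normal(size=(d, h))
M = (rng.random(size=(d, h)) < 0.5).astype(W.dtype)  # learned in practice
y = mglu(rng.normal(size=d), W, M)
```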

6. Empirical Performance and Benchmark Results

In LLM pretraining (on C4, T5-sized models), GLU variants outperform ReLU and GELU on held-out perplexity and a suite of downstream tasks (Shazeer, 2020). Best results are offered by GEGLU, SwiGLU, and ReGLU. In audio tagging, GLU-CTC produced an AUC of 0.882 versus 0.837 for ReLU (CTC) and 0.803 (GMP), confirming the practical impact of gating for sequence labeling (Hou et al., 2018).

In vision, the GEGLU-only Activator achieves 73.20% test accuracy on CIFAR-10, matching or exceeding MLP-Mixer and Synthesizer baselines with considerably less compute (Abdullah et al., 2024). For clinical time series, the sigmoid-gated GLU outperforms 10 alternative activations in gated residual networks, reaching 0.98 test accuracy and high recall/precision in PPG artifact detection (Le et al., 2024).

GLU-based dynamic MoE conversion in LLMs yields MoE variants that exceed the performance of structurally pruned or randomly downcycled models at comparable sparsity levels (e.g., +5.6% on LLaMA3-8B at 25% sparsity) (Zhao et al., 17 Feb 2026).

7. Practical Recommendations and Future Directions

  • Use GEGLU or SwiGLU as the FFN nonlinearity in new Transformer/LLM architectures for improved generalization and gradient flow (Shazeer, 2020).
  • In resource-conscious or memory-bound engineering, employ MGLU (with FlashMGLU) to maximize hardware efficiency without sacrificing accuracy (Tajima et al., 29 Jun 2025).
  • When converting dense to MoE models, exploit intrinsic GLU activation patterns for expert selection and routing (Zhao et al., 17 Feb 2026).
  • For clinical or scarce-data environments, default to bounded, smooth gates (e.g., sigmoid), and apply gating as an external filter rather than inside attention (Le et al., 2024).
  • Theoretically, further exploration of higher-order “gated polynomial units” (e.g., GQU) may offer even steeper scaling and better extrapolation in future architectures (Queiruga, 16 Feb 2026).

A plausible implication is that the piecewise-quadratic nature of GLUs should guide ongoing architecture search and scaling design in large models. Gating not only boosts representational capacity and gradient propagation but also facilitates conditional computation and model sparsification, suggesting broad utility across sequential, vision, audio, and multimodal deep learning (Shazeer, 2020, Wang, 16 Jun 2025, Hou et al., 2018, Tajima et al., 29 Jun 2025, Zhao et al., 17 Feb 2026, Abdullah et al., 2024, Queiruga, 16 Feb 2026, Fiat et al., 2019, Le et al., 2024, Gerstner et al., 27 Feb 2026).
