
Gated Linear Units (GLUs) in Neural Networks

Updated 17 March 2026
  • GLUs are neural mechanisms that combine two linear projections with a nonlinear gating function via elementwise multiplication, enhancing expressivity.
  • They achieve superior function approximation, demonstrated by improved RMSE scaling laws (e.g., RMSE ∝ P⁻³) compared to traditional activations like ReLU.
  • Variants such as SwiGLU and hardware-efficient MGLU optimize computational and memory efficiency, making them integral in state-of-the-art language and vision models.

A Gated Linear Unit (GLU) is a neural network mechanism that introduces a learned, data-dependent gating structure into feed-forward computations via the elementwise multiplication of two linear projections, typically with a nonlinearity applied to at least one branch. Originally proposed in the context of convolutional networks for language modeling, GLUs have since evolved into a broad family of architectures and activations—now dominant in state-of-the-art LLMs—encompassing multiple functional forms, hardware-optimized variants, and deep theoretical analysis of their superior approximation properties.

1. Mathematical Formulations and Variants

Fundamentally, a GLU module maps an input vector $x \in \mathbb{R}^d$ to

$$\mathrm{GLU}(x) = A(x) \odot g\bigl(B(x)\bigr)$$

where $A$ and $B$ are independent affine (or convolutional) projections, $g(\cdot)$ is a nonlinear gating function (typically sigmoid, SiLU, or GELU), and $\odot$ denotes elementwise multiplication (Dauphin et al., 2016, Shazeer, 2020, Kang et al., 6 Mar 2026). This definition generalizes naturally across feed-forward, convolutional, and attention-based architectures.
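
A minimal sketch of this general form in PyTorch (module and variable names are illustrative, not from the cited papers): two independent linear projections for the value and gate branches, a pluggable gate nonlinearity $g$, and an elementwise product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLU(nn.Module):
    """Generic gated linear unit: GLU(x) = A(x) * g(B(x))."""

    def __init__(self, d_in: int, d_out: int, gate=torch.sigmoid):
        super().__init__()
        self.value_proj = nn.Linear(d_in, d_out)  # A: value branch
        self.gate_proj = nn.Linear(d_in, d_out)   # B: gate branch
        self.gate = gate  # g: sigmoid -> GLU, F.silu -> SwiGLU, F.gelu -> GEGLU, F.relu -> ReGLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.value_proj(x) * self.gate(self.gate_proj(x))

# Example: a SwiGLU-style unit
swiglu = GLU(d_in=512, d_out=1024, gate=F.silu)
y = swiglu(torch.randn(8, 512))  # y has shape (8, 1024)
```

Swapping the `gate` argument reproduces the variants listed below without changing the module structure.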

Prominent GLU variants include:

  • Original GLU: $A(x) \odot \sigma(B(x))$, with $g$ as the logistic sigmoid (Dauphin et al., 2016).
  • SwiGLU: $A(x) \odot \mathrm{Swish}(B(x))$, where $\mathrm{Swish}(z) = z\,\sigma(z)$ (Shazeer, 2020).
  • GEGLU: $A(x) \odot \mathrm{GELU}(B(x))$, with GELU as the Gaussian-error gate (Shazeer, 2020).
  • ReGLU: $A(x) \odot \mathrm{ReLU}(B(x))$ (Shazeer, 2020).
  • Masked GLU (MGLU): A hardware-efficient variant using a shared weight matrix and learned binary gating masks to reduce memory bandwidth (Tajima et al., 29 Jun 2025).

In convolutional contexts, GLUs operate as

$$H = (X * W + b) \odot \sigma(X * V + c)$$

where $X$ is a sequence or feature map, $W$ and $V$ are convolutional kernels, and $*$ denotes convolution (Dauphin et al., 2016).
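
A hedged sketch of this convolutional form for 1D sequences, assuming causal left-padding so position $t$ only sees inputs up to $t$ (padding details in the original paper may differ):

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated convolution: H = (X * W + b) ⊙ σ(X * V + c)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.causal_pad = nn.ConstantPad1d((kernel_size - 1, 0), 0.0)  # pad on the left only
        self.value_conv = nn.Conv1d(channels, channels, kernel_size)   # X * W + b
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size)    # X * V + c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); output has the same shape
        x = self.causal_pad(x)
        return self.value_conv(x) * torch.sigmoid(self.gate_conv(x))
```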

2. Expressive Power, Scaling Laws, and Approximation Order

GLUs exhibit fundamentally superior approximation capabilities relative to standard MLPs and ReLU-MLPs. For 1D function regression, a single-hidden-layer GLU with $n$ neurons constitutes a sum of piecewise quadratic polynomials over regions defined by the gate activations. This contrasts with the piecewise linearity of ReLU-MLPs.
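
As a concrete special case (a ReLU gate, so the pieces are exact polynomials), a single 1D gated unit expands to

$$
(w_1 x + b_1)\,\mathrm{ReLU}(w_2 x + b_2) =
\begin{cases}
0, & w_2 x + b_2 \le 0,\\[2pt]
w_1 w_2\, x^2 + (w_1 b_2 + w_2 b_1)\, x + b_1 b_2, & w_2 x + b_2 > 0,
\end{cases}
$$

i.e., a quadratic on the active side of the breakpoint $x = -b_2/w_2$ and zero elsewhere; a sum of $n$ such units is therefore piecewise quadratic with at most $n$ breakpoints.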

A seminal result is that for smooth function approximation over $[0,1]$, GLUs achieve a root-mean-square error (RMSE) scaling law $\mathcal{L}(P) \propto P^{-3}$, where $P$ is the parameter count, whereas ReLU-MLPs incur only $\mathcal{L}(P) \propto P^{-2}$ scaling (Queiruga, 16 Feb 2026). The improvement is a direct manifestation of the native $x^2$ basis embedded within each gated unit. Empirically, these scaling exponents are realized with log–log slopes steeper than $-3$ for GLUs (versus $-2$ for MLPs) under full-batch Newton updates on canonical function-reconstruction tasks (Queiruga, 16 Feb 2026).
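
A standard spline-approximation argument is consistent with these exponents (this is a heuristic reading, not a restatement of the cited derivation): a piecewise polynomial of degree $k$ with $n$ uniform pieces approximates a smooth function on $[0,1]$ with error $O(n^{-(k+1)})$, and the parameter count scales as $P \propto n$, so

$$
\mathcal{L}(P) \propto P^{-(k+1)}
\quad\Longrightarrow\quad
k = 1~(\text{ReLU}): P^{-2},
\qquad
k = 2~(\text{GLU}): P^{-3}.
$$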

Further, the Gated Quadratic Unit (GQU) demonstrates that augmenting the gating structure to yield piecewise cubic bases increases the convergence rate to approximately $\mathcal{L}(P) \propto P^{-3.5}$, suggesting that gating order systematically lifts the achievable function-approximation regime (Queiruga, 16 Feb 2026).

3. Architectural and Theoretical Considerations

GLUs operate by factorizing the linear transformation of standard FFN or convolutional layers into parallel "value" and "gate" streams, whose interaction via elementwise product permits data-dependent modulation at the granularity of individual hidden units (Dauphin et al., 2016, Shazeer, 2020). This mechanism differs fundamentally from LSTM-style gating, which modulates memory cells across multiple recurrent states, and from ReLU, which applies a non-parametric hard threshold to all preactivations jointly.

In theoretical research, decoupling gating from linearity, as in the GaLU formulation where the gate and linear components have independent random and trained weights, respectively, yields strong memorization and generalization bounds: GaLU networks can memorize $m$ samples with width $\tilde{\Omega}(m/d)$ (versus the $\tilde{\Omega}(m^2/d)$ bound for ReLU networks) and provide $O(1/m)$ generalization rates in overparameterized settings (Fiat et al., 2019).

A key mechanistic insight is that the GLU gradient possesses a linear bypass path, mitigating vanishing gradient issues prevalent with deeper architectures or when using saturating activations (Dauphin et al., 2016, Kang et al., 6 Mar 2026). The gating function's nature (e.g., sigmoid, GELU, SiLU, or even heavy-tailed gates as in IGLU) modulates both representational capacity and the optimization landscape.
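
Concretely, for the sigmoid-gated form the product rule gives (following the gradient analysis in Dauphin et al., 2016)

$$
\nabla\bigl[A(x) \odot \sigma(B(x))\bigr]
= \nabla A(x) \odot \sigma\bigl(B(x)\bigr)
+ A(x) \odot \sigma'\bigl(B(x)\bigr) \odot \nabla B(x),
$$

where the first term scales the incoming gradient only by the gate value $\sigma(B(x))$, with no extra derivative factor that could saturate; this is the linear bypass path.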

4. Integration in Modern Neural Models

GLUs and their variants are now a core primitive in transformer-based architectures and LLMs. The standard implementation in transformer FFNs projects inputs to disjoint “value” and “gate” subspaces, activates the gate (often with Swish/SiLU or GELU), multiplies the two elementwise, and projects back to the model dimension (Shazeer, 2020, Tajima et al., 29 Jun 2025).
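
A minimal sketch of such a feed-forward block with a SwiGLU gate (dimensions and names are illustrative; in practice the hidden width is often reduced relative to a non-gated FFN so the total parameter count stays comparable):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Transformer FFN: project to value/gate subspaces, gate with SiLU, project back."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)   # gate branch
        self.w_value = nn.Linear(d_model, d_hidden, bias=False)  # value branch
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)    # back to model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

ffn = SwiGLUFFN(d_model=768, d_hidden=2048)
out = ffn(torch.randn(4, 128, 768))  # (batch, seq, d_model) is preserved
```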

The empirical superiority of GLU-style FFNs over ReLU and GELU baselines is robust across pretraining (C4 perplexity), fine-tuning (GLUE, SuperGLUE, SQuAD), vision (CIFAR-10, CIFAR-100), and language (WikiText-103) tasks, with consistent gains on both convergence speed and final accuracy (Shazeer, 2020, Kang et al., 6 Mar 2026, Tajima et al., 29 Jun 2025).

GLUs have been generalized to attention mechanisms ("GLU-Attention"), where the value projection in multi-head attention is gated, yielding faster convergence and improved performance on text (WikiText-2, WikiText-103) and vision (CIFAR-10) benchmarks, with no increase in parameter count or substantial computational overhead (Wang, 16 Jun 2025).
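
The exact formulation of GLU-Attention is given in the cited paper; the sketch below is only one plausible reading, in which the value projection of single-head scaled dot-product attention is split into value and gate halves so the value-side parameter count stays roughly unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedValueAttention(nn.Module):
    """Illustrative single-head attention with a gated value projection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        # split the value projection into value and gate halves ("re-sharding")
        self.v = nn.Linear(d_model, d_model // 2)
        self.v_gate = nn.Linear(d_model, d_model // 2)
        self.out = nn.Linear(d_model // 2, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k = self.q(x), self.k(x)
        v = self.v(x) * F.silu(self.v_gate(x))  # gated values
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return self.out(F.softmax(scores, dim=-1) @ v)
```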

Hardware-motivated GLU variants, such as MGLU, achieve similar or superior accuracy to classic SwiGLU at notably reduced memory bandwidth and inference-time latency, as shown by sub-millisecond kernel timings and decreased bits transferred per token on high-end GPUs (Tajima et al., 29 Jun 2025).

5. Activation Function Innovations

GLU-inspired activations, such as IGLU ($\mathrm{IGLU}(x;\sigma) = x\left(\tfrac{1}{2} + \tfrac{1}{\pi}\arctan(\sigma x)\right)$), provide a principled continuum between identity and hard-threshold gating via a Cauchy CDF gate. Unlike GELU's Gaussian tail, IGLU's polynomially decaying gate guarantees a nonzero gradient everywhere, conferring significant robustness to vanishing gradients and improved performance on imbalanced datasets (Kang et al., 6 Mar 2026). A rational approximation, IGLU-Approx, recovers this performance with only ReLU operations, enabling efficient deployment.
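
The formula above translates directly into code; a minimal sketch (the coefficients of the IGLU-Approx rational form are not given here, so only the exact activation is shown):

```python
import math
import torch

def iglu(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """IGLU(x; sigma) = x * (1/2 + (1/pi) * arctan(sigma * x)).

    The arctan (Cauchy-CDF) gate decays only polynomially toward 0 and 1,
    so the gate value, and hence the gradient, never vanishes for finite x.
    """
    return x * (0.5 + torch.atan(sigma * x) / math.pi)
```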

Empirical results on standard benchmarks (CIFAR-10, CIFAR-100, WikiText-103) demonstrate that both IGLU and IGLU-Approx match or exceed ReLU and GELU, with pronounced advantages emerging in heavy-tailed or class-imbalanced domains (Kang et al., 6 Mar 2026).

6. Practical Implementations and Efficiency

Standard GLUs require two full-rank weight matrices, which doubles the memory load compared to classic FFNs, with each matrix read dominating runtime in bandwidth-bound regimes. MGLUs improve efficiency via learned binary masks over a shared matrix, reducing inference-time weight reads by nearly half and achieving up to 1.51× (single-mask) or 19.7× (vs. naïve kernel) speedup on large LLMs without compromising accuracy (Tajima et al., 29 Jun 2025). Multi-mask MGLUs permit additional routing capacity with negligible parameter and memory cost, positioning GLUs as hardware-performant primitives for future architectures.
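
The precise MGLU parameterization is given in the cited paper; the sketch below only illustrates the core idea of sharing one weight matrix between the value and gate branches through an elementwise binary mask (the mask here is fixed at random for illustration, whereas MGLU learns it end-to-end).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGLUSketch(nn.Module):
    """Illustrative masked GLU: value and gate branches share one weight matrix,
    differing only by a binary mask, so only one matrix is read at inference."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.shared = nn.Linear(d_in, d_hidden, bias=False)
        # fixed random binary mask for illustration; MGLU learns the mask end-to-end
        self.register_buffer("mask", (torch.rand(d_hidden, d_in) > 0.5).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value = self.shared(x)                                # full shared matrix
        gate = F.linear(x, self.shared.weight * self.mask)    # masked view of the same matrix
        return value * F.silu(gate)
```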

GLU-Attention integrates the GLU mechanism into value projections in self-attention, preserving parameter and asymptotic computational budgets by appropriately re-sharding weights (Wang, 16 Jun 2025). This intervention is compatible with FlashAttention, RoPE, and multi-query variants, and can be implemented as a drop-in replacement in state-of-the-art Transformer codebases.

7. Impact, Limitations, and Future Trajectories

GLUs, by virtue of their piecewise-quadratic expressivity, enable more rapid error decline in large-scale function approximation tasks and empirically deliver superior generalization and convergence across modalities. They underlie architectural choices in modern LLMs and recommendation models and illuminate a principled path for further improvements via polynomial-order gating (e.g., cubic GQUs) (Queiruga, 16 Feb 2026).

A practical implication is that model scaling exponents can be predicted a priori by tracking the polynomial order induced by the gating structure, suggesting that architecture search can be guided numerically rather than by empirical trial alone.

Limitations include the increased memory and computational demands of classic GLUs—mostly mitigated by masked and fused variants—and the fact that observed asymptotic scaling can be dampened in high-dimensional or non-smooth real-world data regimes (Queiruga, 16 Feb 2026, Tajima et al., 29 Jun 2025). Interactions with attention, embeddings, and normalization components influence effective scaling exponents and must be empirically tested per use case.

Future work points toward higher-order gating extensions, principled convergence-based architecture design in higher dimensions, and broader validations across tasks and modalities (Queiruga, 16 Feb 2026, Kang et al., 6 Mar 2026, Tajima et al., 29 Jun 2025). The gating paradigm, interpreted through the lens of functional approximation theory, provides a compelling theoretical and practical foundation for the continued evolution of neural architectures.
