GLU-style MLP: Gating in Deep Learning

Updated 22 June 2026

GLU-style MLPs are neural architectures that use a two-branch gating mechanism to dynamically modulate feature flow and improve optimization in deep learning models.
They employ one branch to compute linear projections and another to generate gates, enabling dynamic feature selection, efficient scaling, and enhanced performance.
Enhanced variants like ReGLU, GEGLU, and SwiGLU offer practical improvements in optimization speed, sparsity, and interpretability over conventional single-branch feed-forward networks.

A Gated Linear Unit (GLU)-style MLP is a feed-forward neural network architecture that replaces the standard single-branch nonlinearity with a two-branch structure: one linear projection acts as a “gate” (usually after passing through a nonlinear activation) and modulates the other (“value”) projection via an element-wise product. Originally introduced in convolutional sequence models, GLU-style MLPs are now a widely adopted core component of transformers, LLMs, vision transformers, interpretable models, and high-efficiency neural architectures. Key instantiations include the basic GLU, variants such as ReGLU, GEGLU, and SwiGLU, the state-conditional PolyGLU, and bilinear or mask-compressed forms. This architectural family, which underpins the FFN block in many modern deep learning models, is associated with improved optimization, more favorable scaling behavior, dynamic feature selection, and opportunities for sparsity and interpretability through its gating structure.

1. Mathematical Definition and Variants

A standard GLU-style MLP replaces the conventional “single linear + activation” FFN by splitting the expansion into two (or more) branches. For a vector input $x\in\mathbb{R}^d$, GLU activations take the generic form $y = (W_1 x + b_1) \odot \sigma(W_2 x + b_2)$, where $W_1, W_2 \in \mathbb{R}^{m \times d}$, $b_1, b_2\in \mathbb{R}^m$, $\sigma$ is a gating function (e.g., sigmoid), and $\odot$ denotes element-wise multiplication (Shazeer, 2020, Saoud et al., 2023).

Major variants include:

GLU (original): $\sigma$ is sigmoid.
ReGLU: $\sigma(z) = \max(0, z)$.
GEGLU: $\sigma(z) = \text{GELU}(z)$.
SwiGLU: $\sigma(z) = z\,\text{sigmoid}(z)$ (i.e., SiLU/Swish).
Bilinear: drop $\sigma$ entirely; $y = (W_1 x + b_1) \odot (W_2 x + b_2)$ (Pearce et al., 2024, Pearce et al., 2024).
PolyGLU: learn dynamic mixtures of K possible nonlinearities per neuron via Gumbel-Softmax routing (Medeiros, 7 Mar 2026).
Masked/Compressed GLUs: exploit learned masks to reduce parameter/memory count (e.g., MGLU) (Tajima et al., 29 Jun 2025).

Feed-forward expansion typically increases the number of hidden channels (e.g., from model dimension $d$ to feed-forward dimension $m$), then projects back to $d$.
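The generic form and the gate choices listed above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than any paper's reference implementation; the helper names (`affine`, `glu_activation`, `GATES`) and the toy weights are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# Gate functions for the variants listed above (hypothetical lookup table).
GATES = {
    "glu": sigmoid,
    "reglu": lambda z: max(0.0, z),
    "geglu": gelu,
    "swiglu": lambda z: z * sigmoid(z),  # SiLU / Swish
    "bilinear": lambda z: z,             # no nonlinearity on the gate branch
}

def affine(W, b, x):
    """Return W x + b for a list-of-rows matrix W."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def glu_activation(x, W1, b1, W2, b2, variant="glu"):
    """y = (W1 x + b1) * gate(W2 x + b2), taken element-wise."""
    gate = GATES[variant]
    value = affine(W1, b1, x)
    pre_gate = affine(W2, b2, x)
    return [v * gate(g) for v, g in zip(value, pre_gate)]

# Toy example with d = 2 inputs and m = 3 hidden units (arbitrary values).
x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.5]
W2 = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
b2 = [0.0, 0.0, 0.0]

for name in GATES:
    print(name, glu_activation(x, W1, b1, W2, b2, name))
```

Only the gate function changes between variants; the value branch and the element-wise product are shared, which is why these variants are usually treated as drop-in replacements for one another.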

2. Structural Integration in Modern Architectures

In transformers and advanced deep learning models, GLU-style MLPs are used as direct replacements for conventional FFN blocks:

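Standard Transformer: FFN is $\text{FFN}(x) = W_3\,\phi(W_1 x + b_1) + b_3$, with a single expansion branch and a pointwise activation $\phi$ (e.g., ReLU or GELU).
GLU-enhanced FFN: $\text{FFN}_{\text{GLU}}(x) = W_3\,[(W_1 x + b_1) \odot \sigma(W_2 x + b_2)] + b_3$, where $W_3 \in \mathbb{R}^{d \times m}$ projects back to the model dimension (Shazeer, 2020, Abdullah et al., 2024).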
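To make the structural difference concrete, the sketch below places a single-branch FFN next to a two-branch GLU FFN. It assumes bias-free layers, a ReLU activation, and a down-projection named `W3`; these names and the toy weights are illustrative choices, not notation taken from the cited papers.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(z):
    return max(0.0, z)

def ffn_standard(x, W1, W3):
    """Single-branch FFN: W3 @ relu(W1 @ x)."""
    hidden = [relu(h) for h in matvec(W1, x)]
    return matvec(W3, hidden)

def ffn_glu(x, W1, W2, W3, gate=relu):
    """Two-branch FFN: W3 @ ((W1 @ x) * gate(W2 @ x))."""
    value = matvec(W1, x)
    gates = [gate(g) for g in matvec(W2, x)]
    hidden = [v * g for v, g in zip(value, gates)]
    return matvec(W3, hidden)

# d = 2 model channels, m = 3 hidden channels (arbitrary toy weights).
x = [2.0, 1.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # value / expansion, shape m x d
W2 = [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]]  # gate, shape m x d
W3 = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]    # down-projection, shape d x m

print(ffn_standard(x, W1, W3))  # [6.0, 1.0]
print(ffn_glu(x, W1, W2, W3))   # [10.0, 5.0]
```

The GLU version needs one extra up-projection (`W2`), so at equal hidden width it carries three weight matrices instead of two.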

GLU-style MLPs are typically interleaved with residual connections and normalization:

Pre-normalize the input, apply the GLU block, apply dropout, add the residual, and repeat for the subsequent FFN or GLU block (Saoud et al., 2023, Abdullah et al., 2024).
In GLU+Residual hybrids (“RankGLU”, “Gated Residual Network”), a linear path is added directly to the gated path, allowing a stable direct information flow (“residual route”) and a bounded nonlinear correction (Xiao et al., 8 Jun 2026, Le et al., 2024).
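The pre-normalization and residual pattern described above can be sketched as follows. This is a simplified, inference-time view under stated assumptions: LayerNorm has no learnable scale or shift, dropout is omitted, the gate is SiLU, and all function names and weights are hypothetical.

```python
import math

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable scale/shift (a simplifying assumption)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def silu(z):
    return z / (1.0 + math.exp(-z))

def glu_ffn(x, W1, W2, W3):
    """SwiGLU-style FFN: W3 @ ((W1 @ x) * silu(W2 @ x)), biases omitted."""
    hidden = [v * silu(g) for v, g in zip(matvec(W1, x), matvec(W2, x))]
    return matvec(W3, hidden)

def pre_norm_glu_block(x, W1, W2, W3):
    """Residual block: x + GLU_FFN(LayerNorm(x)); dropout is omitted here."""
    update = glu_ffn(layer_norm(x), W1, W2, W3)
    return [xi + ui for xi, ui in zip(x, update)]

x = [1.0, 3.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]]
W3 = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.2]]
print(pre_norm_glu_block(x, W1, W2, W3))
```

Because the gated update is added to the untouched input, the block reduces to the identity whenever the down-projection is zero, which is one reason the residual route is regarded as a stable path for information flow.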

Mixture-of-experts (MoE) architectures further leverage fine-grained activation patterns from GLUs to partition units into shared vs. specialist experts, providing model-efficient “blueprints” for downstream MoE conversion (Zhao et al., 17 Feb 2026).

3. Functional Properties: Scaling, Capacity, and Optimization

Expressivity and Scaling Laws

GLU-style MLPs introduce a piecewise quadratic nonlinearity; with ReLU gates, each neuron implements a quadratic spline over its input. Queiruga (16 Feb 2026) reports that this yields better approximation properties than standard MLPs, expressed as a steeper scaling of RMSE with parameter count, and that Gated Quadratic Units (GQUs) can push the empirical scaling slope even further. This reflects a qualitative difference: GLUs are “outer product” architectures capturing all pairwise hidden interactions.
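The piecewise quadratic claim can be checked on a single scalar ReLU-gated neuron, $f(x) = (ax + b)\,\max(0, cx + d)$. Where the gate is open, $f$ is the quadratic $ac\,x^2 + (ad + bc)\,x + bd$, so its second finite difference with step $h$ is the constant $2ach^2$; where the gate is closed, $f$ is zero. The coefficients below are arbitrary values chosen for illustration.

```python
def reglu_neuron(x, a=1.0, b=0.5, c=2.0, d=-1.0):
    """Scalar ReLU-gated unit: value branch (a x + b) times gate relu(c x + d)."""
    return (a * x + b) * max(0.0, c * x + d)

def second_difference(f, x, h=0.25):
    """Central second difference f(x+h) - 2 f(x) + f(x-h)."""
    return f(x + h) - 2.0 * f(x) + f(x - h)

# With c = 2 and d = -1 the gate opens at x = 0.5.
for x in [-2.0, -1.0, 1.0, 2.0, 3.0]:
    print(x, second_difference(reglu_neuron, x))
# Closed side (x < 0.5): 0.  Open side: 2 * a * c * h**2 = 0.25 at every point.
```

A constant second difference on each side of the kink is the signature of a quadratic spline; a single-branch ReLU unit would instead give zero on both sides away from its kink.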

Spectrum and Optimization

Analyses in the NTK regime indicate that GLU gating contracts the spectrum of the neural tangent kernel compared to non-gated MLPs: the condition number is reduced, resulting in a more compact eigenvalue spread and faster asymptotic convergence (Lyu et al., 20 May 2026). Early in training, non-GLU models may match or exceed GLU convergence, but GLUs overtake in the tail, producing a “loss crossing” effect.

GLU does not necessarily improve the generalization gap over non-gated architectures; the main observed benefit is in optimization speed and stability (Lyu et al., 20 May 2026).

4. Architectural and Activation Trade-offs

Gating Nonlinearity

Empirical benchmarks indicate that all major GLU-style gates (sigmoid, ReLU, GELU, SiLU/Swish) outperform single-branch ReLU/GELU/Swish FFNs on pretraining loss, downstream accuracy, and information-theoretic metrics (Shazeer, 2020, Le et al., 2024).

Gate Variant	Best Pretraining Perplexity	Best Fine-tune Task	Notes
GEGLU	Yes	GLUE, SuperGLUE, SQuAD	Strong overall, drop-in
SwiGLU	Slightly	SuperGLUE	Robust, preferred for some
ReGLU	Simple	GLUE	Simpler, matches GEGLU
Bilinear	Competitive	-	Best for interpretability

Nonlinearity choice interacts with statistical properties of the task: bounded (sigmoid) gates are preferable under noisy ranking or cross-sectional regimes (Xiao et al., 8 Jun 2026).

Enhanced Gates

Expanded gating ranges (e.g., a trainable $\alpha$ to allow gates beyond $(0, 1)$) further improve gradient flow and reduce perplexity, even closing the gap between simple GLU and second-order gates (SwiGLU, GEGLU) (Huang, 2024).
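One way to picture an expanded gating range is an affine rescaling of the sigmoid so that the gate spans $(-\alpha, 1 + \alpha)$ instead of $(0, 1)$. The sketch below assumes that particular parameterization, treats $\alpha$ as a plain number rather than a trained parameter, and uses a hypothetical function name; it may differ in detail from the formulation in Huang (2024).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expanded_gate(z, alpha):
    """Rescale a (0, 1) sigmoid gate to the open interval (-alpha, 1 + alpha).

    alpha = 0 recovers the ordinary sigmoid gate. In a real model alpha
    would be a trainable parameter (assumed here to be a fixed float).
    """
    return (1.0 + 2.0 * alpha) * sigmoid(z) - alpha

for z in [-30.0, 0.0, 30.0]:
    print(z, expanded_gate(z, alpha=0.0), expanded_gate(z, alpha=0.5))
```

With a positive $\alpha$ the gate can go slightly negative or exceed one, so the gated branch can flip sign or amplify the value branch instead of only attenuating it.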

State-conditional activation routing (PolyGLU) allows neurons to select among multiple activation functions at runtime, with emergent specialization and negligible overhead (Medeiros, 7 Mar 2026).

5. Computational Efficiency and Hardware-Aware Design

GLU-style MLPs deliver both computational expressivity and competitive efficiency:

Memory footprint: Standard GLU doubles matrix reads (gate + value), but MGLU compresses both into a single weight matrix plus binary mask(s), yielding 47% lower memory transfer and a reported inference speed-up on RTX5090 (FlashMGLU) (Tajima et al., 29 Jun 2025).
FLOPs: Theoretical operation counts remain $O(n\,d\,m)$ for $n$ tokens, input dimension $d$, and expansion dimension $m$, with little overhead compared to single-branch MLPs.
Latency: Linear complexity in sequence length vs. quadratic for self-attention allows GLU-based Vision Transformers (“Activator”) to match or surpass ViT accuracy while being more suitable for edge deployment (Abdullah et al., 2024).
Sparsity: GLU intermediate activations have highly non-uniform group-norms; dependency-aware semi-structured sparsity (DaSS) uses these to prioritize unstructured pruning while aligning with hardware N:M constraints, outperforming SparseGPT and Wanda on LLaMA2/Mistral (Guo et al., 2024).
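As background for the N:M constraint mentioned in the sparsity item, the sketch below builds a 2:4 mask that keeps the two largest-magnitude weights in every consecutive group of four. Plain weight magnitude is used here as a stand-in importance score; the dependency-aware metric of DaSS is not reproduced, and the function name is hypothetical.

```python
def n_m_mask(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in every consecutive group of m."""
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask

row = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5]
mask = n_m_mask(row)
pruned = [w * k for w, k in zip(row, mask)]
print(mask)    # [0, 1, 1, 0, 1, 0, 0, 1]
print(pruned)
```

Every group of four keeps exactly two nonzeros, which is the regular pattern that sparse hardware kernels expect; a method such as DaSS would change how the importance scores are computed, not the shape of the mask.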

6. Interpretability and Analysis via Bilinear Decomposition

Bilinear MLPs (GLU variants that drop the nonlinearity) can be fully recast as a third-order tensor contraction: $y_k = \sum_{i,j} B_{kij}\, x_i x_j$ with $B_{kij} = (W_1)_{ki} (W_2)_{kj}$ (biases omitted). This allows eigenvalue/SVD decomposition of MLP weights directly: the top eigenvectors correspond to interpretable features (e.g., digit components, semantic circuits), and truncation to top modes yields negligible loss in predictive power (Pearce et al., 2024, Pearce et al., 2024).
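The tensor view can be verified numerically on a tiny layer. For a bias-free bilinear layer $y = (W_1 x) \odot (W_2 x)$, each output $y_k$ is the quadratic form $x^\top Q_k x$, where $Q_k$ is the symmetrized outer product of row $k$ of $W_1$ with row $k$ of $W_2$; this symmetric matrix is the object to which an eigendecomposition would be applied. The function names and weights below are illustrative assumptions.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def bilinear_layer(x, W1, W2):
    """Bias-free bilinear layer: (W1 x) * (W2 x), element-wise."""
    return [v * g for v, g in zip(matvec(W1, x), matvec(W2, x))]

def interaction_matrix(W1, W2, k):
    """Symmetric d x d matrix Q_k such that y_k = x^T Q_k x."""
    w1, w2 = W1[k], W2[k]
    d = len(w1)
    return [[0.5 * (w1[i] * w2[j] + w1[j] * w2[i]) for j in range(d)]
            for i in range(d)]

def quadratic_form(Q, x):
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))

W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[3.0, -1.0], [1.0, 1.0]]
x = [2.0, 1.0]

y = bilinear_layer(x, W1, W2)
y_from_tensor = [quadratic_form(interaction_matrix(W1, W2, k), x)
                 for k in range(len(W1))]
print(y)              # [20.0, 3.0]
print(y_from_tensor)  # [20.0, 3.0]
```

Symmetrizing loses nothing because $x^\top B x = x^\top B^\top x$ for any square $B$, and it guarantees real eigenvalues and orthogonal eigenvectors for each $Q_k$.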

Mechanistic interpretability is further enhanced by:

Identifying “circuit” structure (e.g., sentiment/negation AND-gates).
Fast extraction of influential interaction patterns.
Pruning eigenfeatures by importance.

Fine-tuning conventional SiLU-based transformers to bilinear activations via Swish-annealing preserves performance (Pearce et al., 2024).

7. Practical Guidelines, Applications, and Ablation Insights

Residual path necessity: Always sum a direct linear path with the nonlinear gated path for stable ordering and robust optimization, as excessive nonlinearity can destabilize in low-signal or low-data regimes (Xiao et al., 8 Jun 2026).
Gating width/bottleneck: Restrict bottleneck size to control capacity and variance for small-scale or ranking tasks.
Activation normalization: Preceding the gate with LayerNorm stabilizes training and aligns gating with final task metrics.
Hyperparameter ratios: Match the parameter and FLOP budget of a ReLU FFN by reducing the hidden width of GLU-style blocks to two-thirds of the FFN width, $m_{\text{GLU}} = \tfrac{2}{3}\, m_{\text{FFN}}$ (Shazeer, 2020).
Data-scarce/Noisy scenarios: GLU MLPs, especially with sigmoid gating, raise mutual information between features and labels (as verified by MINE), yielding strong performance in low-data conditions (Le et al., 2024).
Vision and sequence modeling: GEGLU-only vision transformer blocks (“Activator”) achieve higher accuracy and ~40% parameter/FLOP reduction vs. attention-based blocks on CIFAR-10/100 (Abdullah et al., 2024).
MoE conversion: Intrinsic activation patterns of GLU MLPs naturally reveal universal vs. specialized neurons, enabling robust zero-shot partitioning for MoE instantiation (“ExpertWeaver”) (Zhao et al., 17 Feb 2026).
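The budget-matching guideline above comes down to simple arithmetic: a single-branch FFN has two weight matrices of size $d \times m$, a GLU FFN has three, so shrinking the GLU hidden width to two-thirds keeps the weight count equal. The sketch below ignores biases, and the dimensions and function names are example values rather than settings from any cited model.

```python
def ffn_params(d, m):
    """Weights in a single-branch FFN: up-projection + down-projection."""
    return 2 * d * m

def glu_ffn_params(d, m):
    """Weights in a GLU FFN: value + gate up-projections, plus down-projection."""
    return 3 * d * m

def matched_glu_width(m_ffn):
    """Hidden width giving a GLU FFN the same weight count (rounded down)."""
    return (2 * m_ffn) // 3

d, m_ffn = 768, 3072                 # example sizes, with m_ffn = 4 * d
m_glu = matched_glu_width(m_ffn)

print(m_glu)                         # 2048
print(ffn_params(d, m_ffn))          # 4718592
print(glu_ffn_params(d, m_glu))      # 4718592
```

Matrix-multiply FLOPs per token scale with the same weight counts, so matching parameters this way also matches the dominant compute cost.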

References:

(Shazeer, 2020) GLU Variants Improve Transformer
(Saoud et al., 2023) Improving Knee Joint Angle Prediction through Dynamic Contextual Focus and Gated Linear Units
(Xiao et al., 8 Jun 2026) RankGLU: Residual Gated Score Formation for Cross-Sectional Stock Prediction
(Huang, 2024) Expanded Gating Ranges Improve Activation Functions
(Queiruga, 16 Feb 2026) Divine Benevolence is an $x^2$: GLUs scale asymptotically faster than MLPs
(Lyu et al., 20 May 2026) The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
(Pearce et al., 2024) Bilinear MLPs enable weight-based mechanistic interpretability
(Pearce et al., 2024) Weight-based Decomposition: A Case for Bilinear MLPs
(Guo et al., 2024) Dependency-Aware Semi-Structured Sparsity of GLU Variants in LLMs
(Tajima et al., 29 Jun 2025) Masked Gated Linear Unit
(Medeiros, 7 Mar 2026) PolyGLU: State-Conditional Activation Routing in Transformer Feed-Forward Networks
(Abdullah et al., 2024) Activator: GLU Activation Function as the Core Component of a Vision Transformer
(Le et al., 2024) Transformer Meets Gated Residual Networks To Enhance Photoplethysmogram Artifact Detection Informed by Mutual Information Neural Estimation
(Zhao et al., 17 Feb 2026) ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
(Fiat et al., 2019) Decoupling Gating from Linearity