
Gated Quadratic Unit (GQU)

Updated 24 March 2026
  • Gated Quadratic Units are modules that combine gate and value streams using a single full-rank weight matrix partitioned by learned binary masks.
  • The architecture enables expressive multiplicative interactions while significantly reducing memory bandwidth and latency through optimized kernels like FlashMGLU.
  • Swish-activated variants such as SwiMGLU enhance gradient flow and achieve competitive performance with up to 19.7× faster inference compared to naive implementations.

A Gated Quadratic Unit (more commonly known in the literature as a Gated Linear Unit, or GLU) is a neural network module enabling expressive multiplicative interactions between learned "gate" and "value" streams. Standard GLUs, now routine in LLMs, rely on two independent weight matrices and apply a gating nonlinearity to one stream before element-wise multiplying the result with the other. To address the associated bottleneck of doubled memory reads and parameter counts, recent work has introduced Masked Gated Linear Units (MGLUs), which recover GLU-like flexibility while consolidating the gate and value projections into a single full-rank weight matrix partitioned by learned binary element-wise masks. This design yields significantly reduced inference-time memory bandwidth, hardware-optimized execution via the FlashMGLU kernel, and competitive or improved downstream model quality, particularly in Swish-activated variants (SwiMGLU) (Tajima et al., 29 Jun 2025).

1. Standard Gated Linear Units and Their Bottlenecks

A standard GLU operates on an input $x \in \mathbb{R}^h$ as:

$$\mathrm{GLU}(x) = g(x W_g) \odot (x W_v)$$

where $W_g, W_v \in \mathbb{R}^{h \times d}$ are dense weight matrices and $g(\cdot)$ is a gating nonlinearity (e.g., sigmoid or Swish). This architecture enables multiplicative gating but at the cost of increased memory traffic: at inference, both $W_g$ and $W_v$ (each of size $h \times d$ in FP16) must be fetched from memory, doubling the load compared to single-projection non-gated layers (e.g., a GELU-activated FFN) (Tajima et al., 29 Jun 2025).
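The formula above can be sketched in a few lines of NumPy (a minimal illustration with a sigmoid gate and toy dimensions, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W_g, W_v):
    """Standard GLU: gate stream g(x W_g) multiplied element-wise with value stream x W_v."""
    return sigmoid(x @ W_g) * (x @ W_v)

rng = np.random.default_rng(0)
h, d = 8, 4                       # toy hidden and projection sizes
x = rng.standard_normal(h)
W_g = rng.standard_normal((h, d))
W_v = rng.standard_normal((h, d))

y = glu(x, W_g, W_v)
assert y.shape == (d,)            # one output per projected dimension
```

Note that both `W_g` and `W_v` are read in full for every token, which is exactly the memory-traffic cost MGLU targets.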

2. MGLU Architecture and Mixture of Element-wise Gating

MGLUs replace the two-projection GLU mechanism with a single projection $W \in \mathbb{R}^{h \times d}$ and a learnable binary mask $M \in \{0,1\}^{h \times d}$. The mask determines, at the element level, which entries of $W$ contribute to the gate versus the value computation. The MGLU output is:

$$\mathrm{MGLU}_1(x) = g(x (M \odot W)) \odot (x (\overline{M} \odot W))$$

where $\overline{M} = 1 - M$. During training, $M$ is parameterized via real-valued logits and binarized by a straight-through estimator (STE), maintaining differentiability. To scale model capacity, a "Mixture of Element-wise Gating" (MoEG) generalizes the mechanism, learning $n_m$ complementary masks $\{M_i\}$ and summing over the corresponding routed outputs:
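A NumPy sketch of the single-mask case (the mask here is drawn at random for illustration; in the paper it is learned via STE):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mglu1(x, W, M):
    """MGLU with one mask: M selects the gate entries of W, its complement the value entries."""
    gate = x @ (M * W)
    value = x @ ((1.0 - M) * W)
    return sigmoid(gate) * value

rng = np.random.default_rng(0)
h, d = 8, 4
x = rng.standard_normal(h)
W = rng.standard_normal((h, d))
M = (rng.random((h, d)) < 0.5).astype(W.dtype)  # random binary mask (learned in practice)

y = mglu1(x, W, M)
assert y.shape == (d,)
```

Each weight entry feeds exactly one of the two streams, so a single matrix carries both roles.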

$$\mathrm{MGLU}_{n_m}(x) = \sum_{i=1}^{n_m} g(x (M_i \odot W)) \odot (x ((1-M_i) \odot W))$$

All routes share $W$ while each $M_i$ carves out a distinct gate/value subspace, enabling adaptive allocation of gating capacity (Tajima et al., 29 Jun 2025).
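The MoEG sum can be sketched directly from the formula (again with random stand-in masks; all routes reuse the same `W`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mglu(x, W, masks):
    """Mixture of Element-wise Gating: sum the gated routes over n_m masks, sharing one W."""
    return sum(sigmoid(x @ (M * W)) * (x @ ((1.0 - M) * W)) for M in masks)

rng = np.random.default_rng(0)
h, d, n_m = 8, 4, 4
x = rng.standard_normal(h)
W = rng.standard_normal((h, d))
masks = [(rng.random((h, d)) < 0.5).astype(W.dtype) for _ in range(n_m)]

y = mglu(x, W, masks)
assert y.shape == (d,)
```

Only the masks grow with $n_m$; the full-rank parameter cost stays at a single $h \times d$ matrix.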

3. FlashMGLU: Hardware-Optimized Kernel Implementation

A naive MGLU implementation would still require $2 n_m$ matrix-vector products, one per mask and one per complement, exacerbating memory traffic, especially on GPUs. The FlashMGLU kernel fuses all gating routes into a single split-K matvec, optimizing both memory access and computation:

  • Packs all $n_m$ mask bits per matrix entry into a single integer for efficient loading.
  • Chunks $W$ and the input $x$ for buffer-local operations.
  • Performs on-chip accumulation of ungated and gated partial sums.
  • Reduces results across threads and flushes outputs in a single coalesced memory write.

This approach ensures each weight and mask entry is accessed once per token, achieving a significant reduction in global memory traffic and improved arithmetic intensity. FlashMGLU realizes up to 19.7× speed-up over naive PyTorch MGLU and remains faster than standard GLU, even for multiple masks (Tajima et al., 29 Jun 2025).
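The bit-packing idea in the first bullet can be mimicked in plain Python (this illustrates the data layout only, not the GPU kernel; names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, n_m = 8, 4, 4
masks = [rng.random((h, d)) < 0.5 for _ in range(n_m)]

# Pack the n_m mask bits for each weight entry into one integer,
# so a single load recovers every route's gate/value assignment.
packed = np.zeros((h, d), dtype=np.uint8)
for i, M in enumerate(masks):
    packed |= (M.astype(np.uint8) << i)

# Unpack route i by testing bit i: each mask entry is stored and read once,
# mirroring FlashMGLU's one-access-per-entry property.
for i, M in enumerate(masks):
    recovered = ((packed >> i) & 1).astype(bool)
    assert np.array_equal(recovered, M)
```

With the masks packed alongside each weight entry, all $2 n_m$ partial products can be accumulated in one pass over $W$.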

4. Quantitative Analysis: Memory Bandwidth, Latency, and Parameter Efficiency

Memory Bandwidth and Parameter Counts

| Layer type | FP16 weight load (bits) | Additional mask (bits) | Total (bits) | Reduction vs. GLU |
|---|---|---|---|---|
| GLU | $32\,hd$ | 0 | $32\,hd$ | baseline |
| Non-gated FFN (e.g., GELU) | $16\,hd$ | 0 | $16\,hd$ | 50% |
| MGLU ($n_m$ masks) | $16\,hd$ | $n_m\,hd$ | $(16 + n_m)\,hd$ | $1 - (16 + n_m)/32$ |

For $n_m = 1$, MGLU cuts weight-memory reads by roughly 47%; for $n_m = 4$, the reduction is 37.5%. Parameter counts drop similarly, with masks stored as FP16 logits only during training (Tajima et al., 29 Jun 2025).
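The percentages follow directly from the bit counts in the table; a quick check:

```python
def mglu_read_reduction(n_m: int) -> float:
    """Fraction of GLU's 32*h*d-bit weight load that MGLU with n_m masks saves."""
    return 1.0 - (16 + n_m) / 32.0

print(mglu_read_reduction(1))  # 0.46875, the ~47% figure for n_m = 1
print(mglu_read_reduction(4))  # 0.375, the 37.5% figure for n_m = 4
```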

Inference Latency

Empirical evaluation (RTX 5090, $h = 8192$, $d = 2048$, $n_m = 8$) reports:

  • Naive PyTorch MGLU: 0.5210 ms
  • Triton MGLU: 0.0834 ms (6.25× speed-up)
  • FlashMGLU: 0.0265 ms (19.66× speed-up)
  • Standard GLU (PyTorch): 0.0306 ms

FlashMGLU maintains its speed advantage across higher $n_m$ and larger model scales (e.g., for $h = 14336$, $d = 4096$, it remains >18× faster than the naive PyTorch MGLU and 1.24× faster than GLU) (Tajima et al., 29 Jun 2025).
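The quoted speed-ups are simple ratios of the measured latencies above:

```python
# Measured latencies in ms from the RTX 5090 benchmark quoted above.
naive, triton, flash, glu = 0.5210, 0.0834, 0.0265, 0.0306

print(round(naive / triton, 2))  # 6.25x for the Triton MGLU kernel
print(round(naive / flash, 2))   # 19.66x for FlashMGLU
print(flash < glu)               # FlashMGLU also undercuts standard GLU latency
```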

5. SwiMGLU: Swish-activated Masked Gated Linear Units

All experimental variants utilize the Swish gating function:

$$g(z) = z \cdot \mathrm{sigmoid}(z)$$

This choice, determined via automated search, provides superior gradient flow and downstream performance relative to other gating activations (e.g., sigmoid, GELU). SwiMGLU is constructed by replacing each SwiGLU block's two projections with a single masked up-projection. FlashMGLU's fusion ensures that even with the mask overhead, SwiMGLU achieves up to a 1.51× speed-up over GLU with $n_m = 1$, remains faster up to $n_m = 8$, and reduces up-projection compute and memory reads by 29.1% and 37.5%, respectively, for typical settings (Tajima et al., 29 Jun 2025).
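The Swish gate itself is one line; a sketch of how it would replace the sigmoid gate in the earlier MGLU formula:

```python
import numpy as np

def swish(z):
    """Swish (SiLU): g(z) = z * sigmoid(z); used as the MGLU gate in SwiMGLU."""
    return z * (1.0 / (1.0 + np.exp(-z)))

z = np.array([-4.0, 0.0, 4.0])
print(swish(z))  # small negative dip, zero at 0, near-identity for large positive z
```

Unlike sigmoid, Swish is unbounded above and non-monotone near zero, which is commonly credited with the better gradient flow noted here.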

6. Empirical Results: Model Quality and Efficiency

Experimental studies using Llama-3-style decoder-only models that replace SwiGLU with SwiMGLU (with $n_m = 4$) demonstrate:

  • For small models (12 layers, $h = 768$, $d = 3072$), SwiMGLU achieves a 46.48% zero-shot average, surpassing both the GELU-FFN and SwiGLU baselines.
  • For large models (16 layers, $h = 2048$, $d = 8192$), SwiMGLU reaches 56.85%, exceeding SwiGLU's 56.00%.

Perplexities and two-shot evaluations are consistent: SwiMGLU matches or slightly exceeds SwiGLU's quality while reducing full-rank parameter count by 25% and inference memory by approximately 37.5% (Tajima et al., 29 Jun 2025).

7. Conclusion and Significance

The MGLU and its Swish-activated variant SwiMGLU consolidate gate and value projections into a single parameter matrix and binary elementwise masks, attaining GLU-level expressivity with drastically reduced inference memory load and latency. The MoEG extension provides scalable expressiveness without increasing memory bottlenecks, while the FlashMGLU kernel ensures efficient utilization of hardware resources. SwiMGLU maintains or improves downstream performance—matching or surpassing SwiGLU baselines—while reducing per-token bandwidth by up to 47% and latency by up to 34% versus standard GLUs or nearly 20× compared to naive MGLU implementations (Tajima et al., 29 Jun 2025).

References

  • Tajima et al., 29 Jun 2025.
