Gated Quadratic Unit (GQU)
- Gated Quadratic Units are modules that combine gate and value streams using a single full-rank weight matrix partitioned by learned binary masks.
- The architecture enables expressive multiplicative interactions while significantly reducing memory bandwidth and latency through optimized kernels like FlashMGLU.
- Swish-activated variants such as SwiMGLU enhance gradient flow and achieve competitive performance with up to 19.7× faster inference compared to naive implementations.
A Gated Quadratic Unit (more commonly known in the literature as a Gated Linear Unit, or GLU) is a neural network module enabling expressive multiplicative interactions between learned "gate" and "value" streams. Standard GLUs, now routine in LLMs, rely on two independent weight matrices and apply a gating nonlinearity to one stream before element-wise multiplying the result with the other. To address the associated bottleneck of doubled memory reads and parameter counts, recent work has introduced Masked Gated Linear Units (MGLUs), which recover GLU-like flexibility but consolidate the gate and value projections into a single full-rank weight matrix partitioned by learned binary element-wise masks. This design allows for significantly reduced inference-time memory bandwidth, hardware-optimized execution via the FlashMGLU kernel, and competitive or improved downstream model quality, particularly in Swish-activated variants (SwiMGLU) (Tajima et al., 29 Jun 2025).
1. Standard Gated Linear Units and Their Bottlenecks
A standard GLU operates on an input $x \in \mathbb{R}^{d_{\text{model}}}$ as:

$$\mathrm{GLU}(x) = \phi(W_g x) \odot (W_v x),$$

where $W_g, W_v \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ are dense weight matrices, $\odot$ denotes element-wise multiplication, and $\phi$ is a gating nonlinearity (e.g., sigmoid or Swish). This architecture enables multiplicative gating but at the cost of increased memory traffic: at inference, both $W_g$ and $W_v$ (each of size $d_{\text{ff}} \times d_{\text{model}}$ in FP16) must be fetched from memory, doubling the load compared to single-projection non-gated layers (e.g., a GELU-activated FFN) (Tajima et al., 29 Jun 2025).
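For reference, a minimal PyTorch sketch of a standard two-projection GLU (illustrative only; the module and parameter names are not from the paper):

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Standard GLU: two independent projections, one gated by a nonlinearity."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # gate projection W_g
        self.w_value = nn.Linear(d_model, d_ff, bias=False)  # value projection W_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both W_g and W_v must be read from memory at inference time.
        return torch.sigmoid(self.w_gate(x)) * self.w_value(x)
```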
2. MGLU Architecture and Mixture of Element-wise Gating
MGLUs replace the two-projection GLU mechanism with a single projection $W \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and a learnable binary mask $M \in \{0,1\}^{d_{\text{ff}} \times d_{\text{model}}}$. The mask determines at the element level which entries of $W$ contribute to the gate versus the value computation. The MGLU output is:

$$\mathrm{MGLU}(x) = \phi\big((M \odot W)\,x\big) \odot \big((\bar{M} \odot W)\,x\big),$$

where $\bar{M} = \mathbf{1} - M$ is the element-wise complement of the mask. During training, $M$ is parameterized via real-valued logits and binarized by a straight-through estimator (STE), maintaining differentiability. To scale model capacity, a "Mixture of Element-wise Gating" (MoEG) generalizes the mechanism, learning $K$ complementary masks $M_1, \dots, M_K$ and summing over the corresponding routed outputs:

$$\mathrm{MoEG}(x) = \sum_{i=1}^{K} \phi\big((M_i \odot W)\,x\big) \odot \big((\bar{M}_i \odot W)\,x\big).$$

All routes share $W$ while each carves out a distinct gate/value subspace, enabling adaptive allocation of gating capacity (Tajima et al., 29 Jun 2025).
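A minimal PyTorch sketch of this mechanism, assuming the gate/value split described above and a sigmoid-based straight-through estimator for the mask logits (module name, parameterization, and initialization are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGLU(nn.Module):
    """Sketch of an MGLU with K element-wise gating routes (MoEG).

    A single weight matrix W is shared by all routes; each route i learns a
    binary mask M_i that splits W element-wise into a gate part (M_i * W)
    and a value part ((1 - M_i) * W).
    """
    def __init__(self, d_model: int, d_ff: int, num_masks: int = 1, act=F.silu):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_ff, d_model))
        nn.init.xavier_uniform_(self.weight)
        # Real-valued mask logits, binarized with a straight-through estimator.
        self.mask_logits = nn.Parameter(torch.zeros(num_masks, d_ff, d_model))
        self.act = act

    def binary_masks(self) -> torch.Tensor:
        hard = (self.mask_logits > 0).float()
        soft = torch.sigmoid(self.mask_logits)
        # STE: forward pass uses the hard 0/1 mask, backward follows the sigmoid.
        return hard + soft - soft.detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = 0.0
        for mask in self.binary_masks():                       # one route per mask
            gate = F.linear(x, mask * self.weight)             # gate stream
            value = F.linear(x, (1.0 - mask) * self.weight)    # value stream
            out = out + self.act(gate) * value
        return out
```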
3. FlashMGLU: Hardware-Optimized Kernel Implementation
A naive MGLU implementation would still require matrix-vector products for each mask and its complement, exacerbating memory traffic, especially on GPUs. The FlashMGLU kernel fuses all gating routes into a single split-K matvec, optimizing both memory access and computation:
- Packs all mask bits per matrix entry into a single integer for efficient loading.
- Chunks and input for buffer-local operations.
- Performs on-chip accumulation of ungated and gated partial sums.
- Reduces results across threads and flushes outputs in a single coalesced memory write.
This approach ensures each weight and mask entry is accessed once per token, achieving a significant reduction in global memory traffic and improved arithmetic intensity. FlashMGLU realizes up to 19.7× speed-up over naive PyTorch MGLU and remains faster than standard GLU, even for multiple masks (Tajima et al., 29 Jun 2025).
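The bit-packing idea can be illustrated with a small, purely illustrative Python reference (not the CUDA/Triton kernel): all $K$ mask bits for a given weight entry are stored in a single integer, so a kernel can fetch every route's gate/value assignment with one load per entry.

```python
import torch

def pack_masks(masks: torch.Tensor) -> torch.Tensor:
    """Pack K binary masks of shape (K, d_ff, d_model) into one uint8 per entry.

    Bit i of the packed value holds mask i for that weight element.
    Illustrative only; assumes K <= 8.
    """
    num_masks = masks.shape[0]
    packed = torch.zeros(masks.shape[1:], dtype=torch.uint8)
    for i in range(num_masks):
        packed |= (masks[i].to(torch.uint8) << i)
    return packed

def unpack_mask(packed: torch.Tensor, i: int) -> torch.Tensor:
    """Recover route i's binary mask from the packed representation."""
    return ((packed >> i) & 1).float()
```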
4. Quantitative Analysis: Memory Bandwidth, Latency, and Parameter Efficiency
Memory Bandwidth and Parameter Counts
| Layer Type | FP16 Weight Load (bits per weight entry) | Additional Mask (bits per weight entry) | Total | Relative Reduction (vs. GLU) |
|---|---|---|---|---|
| GLU (two projections) | 2 × 16 = 32 | 0 | 32 | Baseline |
| GELU/ReLU FFN (single projection) | 16 | 0 | 16 | 50% fewer |
| MGLU (single projection, $K$ masks) | 16 | $K$ | $16 + K$ | ≈47% fewer ($K=1$); 37.5% fewer ($K=4$) |
For $K=1$, MGLU cuts memory reads by roughly 47%; for $K=4$, the reduction is 37.5%. Parameter counts similarly drop, with masks parameterized as FP16 logits only during training (Tajima et al., 29 Jun 2025).
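These figures follow from simple per-entry arithmetic, as the short check below illustrates (assuming 16-bit weights and 1 mask bit per entry per route):

```python
# Back-of-the-envelope check of the per-entry memory-read reduction, assuming
# FP16 weights (16 bits) and 1 bit per mask per weight entry.
def mglu_read_reduction(num_masks: int) -> float:
    glu_bits = 2 * 16               # GLU loads two FP16 weights per entry
    mglu_bits = 16 + num_masks      # MGLU loads one FP16 weight plus K mask bits
    return 1.0 - mglu_bits / glu_bits

print(mglu_read_reduction(1))  # ~0.469, i.e. roughly the 47% quoted above
print(mglu_read_reduction(4))  # 0.375, i.e. the 37.5% quoted above
```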
Inference Latency
Empirical evaluation on an RTX 5090 reports the following per-call latencies:
- Naive PyTorch MGLU: 0.5210 ms
- Triton MGLU: 0.0834 ms (6.25× speed-up)
- FlashMGLU: 0.0265 ms (19.66× speed-up)
- Standard GLU (PyTorch): 0.0306 ms
FlashMGLU maintains its speed advantage at larger mask counts and model scales (e.g., remaining >18× faster than the naive PyTorch MGLU and 1.24× faster than GLU) (Tajima et al., 29 Jun 2025).
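For readers who want to reproduce such comparisons on their own hardware, a generic CUDA-event timing loop along the following lines can be used (a rough benchmarking sketch, not the paper's measurement code):

```python
import torch

def time_layer(layer, x: torch.Tensor, iters: int = 100) -> float:
    """Return the rough average latency per call in milliseconds.

    Assumes `layer` and `x` already live on a CUDA device; numbers will
    differ from the paper's reported setup and kernels.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):          # warm-up iterations
        layer(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```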
5. SwiMGLU: Swish-activated Masked Gated Linear Units
All experimental variants utilize the Swish gating function

$$\phi(z) = z \cdot \sigma(z),$$

where $\sigma$ is the logistic sigmoid (this activation is also known as SiLU).
This choice, determined via automated search, provides superior gradient flow and downstream performance relative to other gating activations (e.g., sigmoid, GELU). SwiMGLU is constructed by replacing each SwiGLU block's two projections with a single masked up-projection. FlashMGLU's fusion ensures that even with the mask overhead, SwiMGLU achieves up to a 1.51× speed-up over GLU and remains faster than GLU as the number of masks grows, while reducing up-projection compute and memory reads by 29.1% and 37.5%, respectively, in typical settings (Tajima et al., 29 Jun 2025).
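Structurally, a SwiMGLU feed-forward block can be sketched by pairing the hypothetical MGLU module shown earlier (with a Swish/SiLU gate) with a standard down-projection; the names and layout below are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiMGLUBlock(nn.Module):
    """Sketch of a transformer FFN block in which the two SwiGLU up-projections
    are replaced by a single masked projection (reuses the MGLU sketch above)."""
    def __init__(self, d_model: int, d_ff: int, num_masks: int = 1):
        super().__init__()
        self.up = MGLU(d_model, d_ff, num_masks, act=F.silu)  # Swish-gated MGLU
        self.down = nn.Linear(d_ff, d_model, bias=False)      # down projection

    def forward(self, x):
        return self.down(self.up(x))
```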
6. Empirical Results: Model Quality and Efficiency
Experimental studies using Llama-3-style decoder-only models in which SwiGLU is replaced with SwiMGLU demonstrate:
- For small models (12 layers), SwiMGLU achieves a 46.48% zero-shot average, surpassing both the GELU-FFN and SwiGLU baselines.
- For large models (16 layers), SwiMGLU yields 56.85%, exceeding SwiGLU's 56.00%.
Perplexities and two-shot evaluations are consistent: SwiMGLU matches or slightly exceeds SwiGLU's quality while reducing full-rank parameter count by 25% and inference memory by approximately 37.5% (Tajima et al., 29 Jun 2025).
7. Conclusion and Significance
The MGLU and its Swish-activated variant SwiMGLU consolidate gate and value projections into a single parameter matrix partitioned by binary element-wise masks, attaining GLU-level expressivity with drastically reduced inference memory load and latency. The MoEG extension provides scalable expressiveness without increasing memory bottlenecks, while the FlashMGLU kernel ensures efficient utilization of hardware resources. SwiMGLU maintains or improves downstream performance, matching or surpassing SwiGLU baselines, while reducing per-token bandwidth by up to 47% and latency by up to 34% versus standard GLUs, and by nearly 20× relative to naive MGLU implementations (Tajima et al., 29 Jun 2025).