Masked Gated Linear Unit (MGLU)
- Masked Gated Linear Unit (MGLU) is a neural network technique that consolidates dual projection matrices into a single weight matrix using binary masks.
- It applies a mixture of element-wise gating to independently route gate and value streams, significantly reducing memory traffic during inference.
- Custom kernel optimizations like FlashMGLU demonstrate notable speedups and improved throughput in LLMs, balancing efficiency with downstream accuracy.
Masked Gated Linear Unit (MGLU) is a neural network architectural variant designed to increase the computational and memory efficiency of gated feed-forward layers in LLMs, while retaining or surpassing the downstream accuracy of conventional Gated Linear Units (GLUs). By introducing mask-based elementwise routing over a single weight matrix, MGLU decouples the gate and value streams traditionally maintained with separate matrices, thereby significantly reducing inference-time memory traffic and enabling highly efficient custom kernels (Tajima et al., 29 Jun 2025).
1. Background and Motivation
Gated Linear Units (GLUs) augment standard feed-forward networks by replacing a single activation with a two-stream gating mechanism. For an input $x \in \mathbb{R}^{d}$, the GLU applies two linear projections, $W_{\mathrm{gate}} \in \mathbb{R}^{d \times h}$ and $W_{\mathrm{value}} \in \mathbb{R}^{d \times h}$, modulated via a nonlinearity $\sigma$:

$$\mathrm{GLU}(x) = \sigma\!\left(x\,W_{\mathrm{gate}}\right) \odot \left(x\,W_{\mathrm{value}}\right)$$
This configuration, prevalent in state-of-the-art LLMs, doubles the memory footprint and bandwidth required for the intermediate projection weights, since both $W_{\mathrm{gate}}$ and $W_{\mathrm{value}}$ must be loaded independently per inference token. The resulting $2hd$ FP16 memory reads per layer (for hidden size $d$ and projection dimension $h$) dominate per-token latency in single-batch LLM inference (Tajima et al., 29 Jun 2025).
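For reference, a minimal PyTorch sketch of such a gated feed-forward block (module and argument names are illustrative, not taken from the paper); the comments mark where the $2hd$ weight reads arise:

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Standard gated FFN: two separate d_model-to-d_ff projections (gate and value)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # h*d FP16 weights
        self.w_value = nn.Linear(d_model, d_ff, bias=False)  # another h*d FP16 weights
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.SiLU()  # Swish activation -> SwiGLU variant

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both projection matrices must be streamed from memory for every token:
        # 2*h*d FP16 reads per layer, the cost MGLU is designed to remove.
        return self.w_down(self.act(self.w_gate(x)) * self.w_value(x))
```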
MGLU addresses this bottleneck by consolidating the two projection matrices into a single weight matrix, exploiting binary masks to recover independent gate/value routing. This design maintains the expressivity of GLUs while mitigating weight load redundancy.
2. Architectural Formulation
The key innovation in MGLU is the Mixture of Element-wise Gating (MoEG) paradigm. Instead of maintaining separate $W_{\mathrm{gate}}$ and $W_{\mathrm{value}}$, MGLU leverages:
- A single weight matrix $W \in \mathbb{R}^{d \times h}$
- A set of learned binary masks $M_1, \dots, M_n \in \{0,1\}^{d \times h}$, with complementary masks $\bar{M}_i = 1 - M_i$

Each mask demarcates gate/value assignments at the parameter level for "route" $i$. The MGLU operator for the single-mask case ($n = 1$) is:

$$\mathrm{MGLU}(x) = \sigma\!\left(x\,(M \odot W)\right) \odot \left(x\,(\bar{M} \odot W)\right)$$

For the general MoEG case with $n$ routes:

$$\mathrm{MGLU}(x) = \sum_{i=1}^{n} \sigma\!\left(x\,(M_i \odot W)\right) \odot \left(x\,(\bar{M}_i \odot W)\right)$$

During training, the masks are optimized jointly with $W$ using a straight-through estimator; at inference, they are frozen bit-patterns.
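A minimal PyTorch sketch of this formulation, assuming the per-route outputs are summed as written above and a sigmoid-based straight-through estimator for mask learning (the class name, initialization, and exact STE parameterization are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGLU(nn.Module):
    """Masked GLU: one weight matrix W plus n learned binary masks (MoEG)."""
    def __init__(self, d_model: int, d_ff: int, n_masks: int = 1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_ff, d_model))
        nn.init.xavier_uniform_(self.weight)
        # Real-valued mask logits, binarized with a straight-through estimator.
        self.mask_logits = nn.Parameter(0.01 * torch.randn(n_masks, d_ff, d_model))
        self.act = nn.SiLU()

    def binary_masks(self) -> torch.Tensor:
        hard = (self.mask_logits > 0).to(self.weight.dtype)  # {0,1} masks
        soft = torch.sigmoid(self.mask_logits)
        # Straight-through: forward uses hard masks, backward uses sigmoid gradients.
        return hard.detach() + soft - soft.detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = 0.0
        for m in self.binary_masks():            # one gate/value route per mask
            gate = F.linear(x, m * self.weight)          # x (M_i ⊙ W)
            value = F.linear(x, (1.0 - m) * self.weight)  # x (M̄_i ⊙ W)
            out = out + self.act(gate) * value
        return out
```

At inference the mask logits would be frozen and pre-binarized, so only $W$ and the bit-packed masks need to be stored and loaded.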
3. Memory, Parameterization, and Efficiency
The introduction of masking allows MGLU to match the parameter count and inference-time memory load of a standard Linear Unit (LU) layer more closely, despite retaining gating expressivity. The following table provides the per-layer parameter and memory-access costs, excerpted from (Tajima et al., 29 Jun 2025):
| Layer Type | #FP16 Params | #Binary Params | Mem. Load per Token (bits) |
|---|---|---|---|
| LU (GELU) | $hd$ | $0$ | $16hd$ |
| GLU | $2hd$ | $0$ | $32hd$ |
| MGLU ($n$ masks) | $hd$ | $n\,hd$ | $16hd + n\,hd$ |
For $n = 1$, MGLU requires $17hd$ bits per token—47% less than standard GLU ($32hd$ bits). All gate/value routing information is encoded via bitwise operations over compact binary masks.
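As a worked check of the memory column, the per-token weight traffic in bits compares as follows:

$$\underbrace{16\,hd}_{\text{FP16 } W} \;+\; \underbrace{n\,hd}_{n\ \text{binary masks}} \quad\text{vs.}\quad \underbrace{32\,hd}_{W_{\mathrm{gate}} +\, W_{\mathrm{value}}}, \qquad 1 - \frac{17hd}{32hd} \approx 47\% \ \text{at } n = 1.$$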
4. Kernel Optimization and FlashMGLU
A hardware-centric kernel, FlashMGLU, is introduced to exploit MGLU's compressed representation. Key implementation strategies include:
- Packing the $n$ binary mask bits for each weight entry into a single 8-bit integer, enabling multi-route mask access with a single read (see the sketch after this list).
- Tiling $W$ and the packed masks along the matrix's "K" dimension for coalesced memory operations.
- On-chip computation of all gated and ungated dot products using fused CUDA or Triton kernels.
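The mask-packing idea can be sketched at the tensor level as follows (plain PyTorch illustrating the data layout only, not the fused CUDA/Triton kernel; function names are illustrative):

```python
import torch

def pack_masks(masks: torch.Tensor) -> torch.Tensor:
    """Pack n binary masks of shape (n, d_ff, d_model), n <= 8, into one uint8 per weight.

    Bit i of each packed byte stores route i's gate/value assignment, so a single
    8-bit read recovers the mask bits for all routes of that weight entry.
    """
    n = masks.shape[0]
    assert n <= 8, "one uint8 holds at most 8 route bits per weight"
    packed = torch.zeros(masks.shape[1:], dtype=torch.uint8)
    for i in range(n):
        packed |= masks[i].to(torch.uint8) << i
    return packed

def unpack_route(packed: torch.Tensor, i: int) -> torch.Tensor:
    """Recover route i's binary mask from the packed byte tensor."""
    return ((packed >> i) & 1).to(torch.float32)
```

In the fused kernel, the analogous unpacking happens on-chip, so each FP16 weight and its packed mask byte are loaded from memory once and reused across all $n$ routes.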
Measured on an RTX 5090, FlashMGLU achieves up to a 19.7× speed-up relative to a naive PyTorch MGLU implementation (0.0265 ms vs 0.5210 ms), and a 1.51× improvement over standard GLU (0.0202 ms vs 0.0306 ms). The memory traffic for intermediate FFN weights per layer is reduced from $32hd$ bits of FP16 weights in SwiGLU to $16hd$ bits of FP16 weights plus $n\,hd$ mask bits in SwiMGLU.
5. Empirical Evaluation in LLMs
Comprehensive ablations replace all SwiGLU FFNs in Llama-3-style decoders with SwiMGLU, trained on FineWeb-Edu. Results on six zero- and two-shot evaluation benchmarks show the following (selected):
| Variant | #Weights | Avg. Accuracy (Small, 159M) | Avg. Accuracy (Large, 1.08B) |
|---|---|---|---|
| SwiGLU | 141M | 46.20% | 56.00% |
| SwiMGLU (smallest $n$) | 113M | 46.06% | — |
| SwiMGLU (intermediate $n$) | 113M | 46.48% | 56.85% |
| SwiMGLU (largest $n$) | 113M | 46.49% | — |
SwiMGLU achieves accuracy that matches or slightly surpasses SwiGLU as the mask count $n$ increases, with a 19–33% reduction in model storage. End-to-end LLM inference throughput rises by roughly 1.3× in token-decode latency, and the speed gains persist at larger $n$.
6. Trade-offs, Sweet Spots, and Limitations
Increasing $n$ improves accuracy slightly, but linearly increases the binary parameter count and the cost of mask learning during training. Inference, however, remains memory-bound, so the reduction in FP16 weight transfer dominates the efficiency gains. In practice, a small mask count is identified as the empirical "sweet spot," balancing accuracy gains against minimal latency overhead. Because the masks are binary and fixed at inference, no sampling variance or stochastic-routing overhead is introduced during deployment.
7. Comparative Analysis and Impact
MGLU generalizes the two-matrix GLU, matching or exceeding its downstream performance at a sharply lower memory footprint. The approach is agnostic to activation choice, with Swish activations (SwiMGLU) demonstrating competitive or superior results on downstream benchmarks. The kernel-level advantages realized by FlashMGLU make MGLU especially suitable for memory-bound, single-batch inference workloads in modern LLMs, where memory traffic, not arithmetic, is the primary bottleneck (Tajima et al., 29 Jun 2025).
No prior work before Tajima et al. (29 Jun 2025) defines or implements a mask-based GLU variant; MGLU represents the first such architecture with end-to-end evaluation and custom kernel realization for LLM-scale models.