
Masked Gated Linear Unit (MGLU)

Updated 24 March 2026
  • Masked Gated Linear Unit (MGLU) is a neural network technique that consolidates dual projection matrices into a single weight matrix using binary masks.
  • It applies a Mixture of Element-wise Gating (MoEG) to route the gate and value streams independently, significantly reducing memory traffic during inference.
  • Custom kernel optimizations like FlashMGLU demonstrate notable speedups and improved throughput in LLMs, balancing efficiency with downstream accuracy.

Masked Gated Linear Unit (MGLU) is a neural network architectural variant designed to increase the computational and memory efficiency of gated feed-forward layers in LLMs, while retaining or surpassing the downstream accuracy of conventional Gated Linear Units (GLUs). By introducing mask-based elementwise routing over a single weight matrix, MGLU decouples the gate and value streams traditionally maintained with separate matrices, thereby significantly reducing inference-time memory traffic and enabling highly efficient custom kernels (Tajima et al., 29 Jun 2025).

1. Background and Motivation

Gated Linear Units (GLUs) augment standard feed-forward networks by replacing a single activation with a two-stream gating mechanism. For input $\bm{x} \in \mathbb{R}^h$, the GLU applies two linear projections, $\bm{x}W_g$ and $\bm{x}W_v$, modulated via a nonlinearity $g(\cdot)$:

$$\mathrm{GLU}(\bm{x}) = g(\bm{x} W_g) \odot (\bm{x} W_v)$$

This configuration, prevalent in state-of-the-art LLMs, doubles the memory footprint and bandwidth required for the intermediate projection weights, since both $W_g$ and $W_v$ must be loaded independently per inference token. The resulting $2hd$ FP16 memory reads per layer (for hidden size $h$ and projection dimension $d$) dominate the cost of single-batch LLM inference, which is memory-bound rather than compute-bound (Tajima et al., 29 Jun 2025).
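
For concreteness, the following is a minimal PyTorch-style sketch of such a two-matrix GLU layer; the module name, dimensions, and choice of SiLU activation are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLU(nn.Module):
    """Conventional two-matrix GLU: g(x W_g) ⊙ (x W_v).

    Both W_g and W_v must be read from memory for every decoded token,
    i.e. 2*h*d FP16 weight loads per layer.
    """
    def __init__(self, h: int, d: int, activation=F.silu):
        super().__init__()
        self.W_g = nn.Linear(h, d, bias=False)  # gate projection
        self.W_v = nn.Linear(h, d, bias=False)  # value projection
        self.activation = activation            # SiLU here gives SwiGLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.activation(self.W_g(x)) * self.W_v(x)

# Usage with hypothetical Llama-like dimensions.
x = torch.randn(1, 4096)
y = GLU(h=4096, d=14336)(x)   # shape (1, 14336)
```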

MGLU addresses this bottleneck by consolidating the two projection matrices into a single weight matrix, exploiting binary masks to recover independent gate/value routing. This design maintains the expressivity of GLUs while mitigating weight load redundancy.

2. Architectural Formulation

The key innovation in MGLU is the Mixture of Element-wise Gating (MoEG) paradigm. Instead of maintaining separate $W_g$ and $W_v$, MGLU leverages:

  • A single weight matrix $W \in \mathbb{R}^{h \times d}$
  • A set of $n_m$ learned binary masks $\{M_i \in \{0,1\}^{h \times d}\}_{i=1}^{n_m}$, with complementary masks $\overline{M}_i = \bm{1} - M_i$

Each mask $M_i$ demarcates gate/value assignments at the parameter level for "route" $i$. The MGLU operator for the single-mask case ($n_m = 1$) is:

$$\mathrm{MGLU}_{1}(\bm{x}) = g\left(\bm{x}(M \odot W)\right) \odot \left(\bm{x}(\overline{M} \odot W)\right)$$

For the general MoEG case with $n_m$ routes:

$$\mathrm{MGLU}_{n_m}(\bm{x}) = \sum_{i=1}^{n_m} g\left(\bm{x}(M_i \odot W)\right) \odot \left(\bm{x}(\overline{M}_i \odot W)\right)$$

During training, the masks are optimized jointly with $W$ using a straight-through estimator; at inference, they are frozen bit patterns.
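
A minimal sketch of the MoEG formulation above, assuming a PyTorch setting; the module name `MGLU`, the initialization, and the particular straight-through trick are illustrative choices, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGLU(nn.Module):
    """Masked GLU: one weight matrix W plus n_m learned binary masks.

    Computes sum_i g(x (M_i ⊙ W)) ⊙ (x ((1 - M_i) ⊙ W)).
    Masks are binarized with a straight-through estimator so gradients
    reach the underlying real-valued mask logits during training.
    """
    def __init__(self, h: int, d: int, n_masks: int = 1, activation=F.silu):
        super().__init__()
        self.W = nn.Parameter(torch.empty(h, d))
        nn.init.xavier_uniform_(self.W)
        # Real-valued logits; thresholding at 0 gives the {0,1} masks.
        self.mask_logits = nn.Parameter(torch.empty(n_masks, h, d).uniform_(-0.01, 0.01))
        self.activation = activation

    def binary_masks(self) -> torch.Tensor:
        hard = (self.mask_logits > 0).to(self.W.dtype)
        # Straight-through estimator: forward uses the hard {0,1} masks,
        # backward passes gradients through to the logits as identity.
        return hard + self.mask_logits - self.mask_logits.detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.zeros(x.shape[:-1] + (self.W.shape[1],), device=x.device, dtype=x.dtype)
        for M in self.binary_masks():                   # one term per route
            gate = self.activation(x @ (M * self.W))    # x (M_i ⊙ W)
            value = x @ ((1.0 - M) * self.W)            # x (M̄_i ⊙ W)
            out = out + gate * value
        return out

# Usage: a drop-in replacement for the gate/up projections of a gated FFN.
x = torch.randn(2, 4096)
y = MGLU(h=4096, d=14336, n_masks=4)(x)   # shape (2, 14336)
```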

3. Memory, Parameterization, and Efficiency

The introduction of masking allows MGLU to approach the parameter count and inference-time memory load of a standard Linear Unit (LU) layer while retaining gating expressivity. The following table gives the per-layer parameter and memory-access costs, excerpted from (Tajima et al., 29 Jun 2025):

Layer Type         | #FP16 Params | #Binary Params | Mem. Load per Token (bits)
LU (GELU)          | $h\,d$       | $0$            | $16\,h\,d$
GLU                | $2\,h\,d$    | $0$            | $32\,h\,d$
MGLU ($n_m$ masks) | $h\,d$       | $n_m\,h\,d$    | $(16 + n_m)\,h\,d$

For $n_m = 1$, MGLU requires $17\,h\,d$ bits per token, roughly 47% less than standard GLU ($32\,h\,d$). All gate/value routing information is encoded via bitwise operations over compact binary masks.
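
The per-token figures above reduce to simple arithmetic; the snippet below reproduces the relative savings (the hidden and projection sizes are hypothetical, chosen only to make the comparison concrete):

```python
h, d = 4096, 14336   # hypothetical hidden / projection dimensions
FP16_BITS = 16       # bits per FP16 weight

lu_bits = FP16_BITS * h * d          # single matrix, no gating
glu_bits = 2 * FP16_BITS * h * d     # W_g and W_v both loaded per token

def mglu_bits(n_m: int) -> int:
    # one FP16 matrix plus n_m one-bit masks per weight entry
    return (FP16_BITS + n_m) * h * d

for n_m in (1, 4, 8):
    saving = 1 - mglu_bits(n_m) / glu_bits
    print(f"n_m={n_m}: {saving:.0%} less weight traffic than GLU")
# n_m=1 -> ~47% less, n_m=4 -> ~38% less, n_m=8 -> 25% less
```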

4. Kernel Optimization and FlashMGLU

A hardware-centric kernel, FlashMGLU, is introduced to exploit MGLU's compressed representation. Key implementation strategies include:

  • Packing the $n_m$ binary mask bits for each weight entry into a single 8-bit integer, enabling multi-route mask access with a single read (see the packing sketch after this list).
  • Tiling $W$ and $\bm{x}$ along the matrix's "K" dimension for coalesced memory operations.
  • On-chip computation of all gated and ungated dot products using fused CUDA or Triton kernels.
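
A minimal sketch of the mask-packing idea from the first bullet, assuming at most eight routes so that all $n_m$ mask bits for a weight entry fit in one uint8; this illustrates the storage layout only and is not the FlashMGLU kernel itself:

```python
import torch

def pack_masks(masks: torch.Tensor) -> torch.Tensor:
    """Pack binary masks of shape (n_m, h, d) into one uint8 per weight entry.

    Bit i of the packed value stores mask i's gate/value assignment, so a
    single 8-bit read recovers the routing for up to 8 routes.
    """
    n_m = masks.shape[0]
    assert n_m <= 8, "a uint8 holds at most 8 mask bits"
    packed = torch.zeros(masks.shape[1:], dtype=torch.uint8)
    for i in range(n_m):
        packed |= masks[i].to(torch.uint8) << i
    return packed

def unpack_mask(packed: torch.Tensor, i: int) -> torch.Tensor:
    """Recover mask i as a boolean tensor from the packed representation."""
    return ((packed >> i) & 1).bool()

# Example: four random binary masks over a small (h, d) weight.
masks = torch.rand(4, 16, 32) > 0.5
packed = pack_masks(masks)
assert torch.equal(unpack_mask(packed, 2), masks[2])
```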

Measured on an RTX 5090, FlashMGLU achieves a 19.7× speed-up (for $n_m = 8$) relative to a naive PyTorch MGLU implementation (0.0265 ms vs 0.5210 ms), and a 1.51× improvement over standard GLU (0.0202 ms vs 0.0306 ms at $n_m = 1$). The memory bandwidth required for intermediate FFN weights per layer is reduced from $2\times$ FP16 arrays in SwiGLU to $1\times$ FP16 array plus $n_m$ mask bits per entry in SwiMGLU.

5. Empirical Evaluation in LLMs

Comprehensive ablations replace all SwiGLU FFNs in Llama-3-style decoders with SwiMGLU, trained on FineWeb-Edu. Selected results, averaged over six zero- and two-shot evaluation benchmarks, are shown below:

Variant             | #Weights | Avg. Accuracy (Small, 159M) | Avg. Accuracy (Large, 1.08B)
SwiGLU              | 141M     | 46.20%                      | 56.00%
SwiMGLU ($n_m = 1$) | 113M     | 46.06%                      | —
SwiMGLU ($n_m = 4$) | 113M     | 46.48%                      | 56.85%
SwiMGLU ($n_m = 8$) | 113M     | 46.49%                      | —

SwiMGLU matches or slightly surpasses SwiGLU accuracy for $n_m \geq 4$, with a 19–33% reduction in model storage. End-to-end token-decode speed improves by roughly 1.3× for $n_m = 1$, with gains persisting up to $n_m = 8$.

6. Trade-offs, Sweet Spots, and Limitations

Increasing $n_m$ improves accuracy slightly, but it linearly increases the binary parameter count and the cost of mask learning during training. Since inference remains memory-bound, the reduction in FP16 weight transfer dominates the efficiency picture, and the extra mask bits add little latency. In practice, $n_m = 4$ is identified as the empirical "sweet spot" balancing accuracy against minimal latency overhead. Because the masks are binary and fixed at inference, no sampling variance or stochastic routing overhead is introduced during deployment.

7. Comparative Analysis and Impact

MGLUs generalize the expressivity of two-matrix GLUs while equalling or exceeding performance at a sharply lower memory footprint. The approach is agnostic to activation choice, with Swish activations (SwiMGLU) demonstrating competitive or superior results on downstream benchmarks. The kernel-level advantages realized by FlashMGLU make MGLU especially suitable for memory-bound, single-batch inference workloads in modern LLMs, where memory traffic, not arithmetic, is the primary bottleneck (Tajima et al., 29 Jun 2025).

No recognized studies prior to (Tajima et al., 29 Jun 2025) define or implement a mask-based GLU variant; MGLU represents the first such architecture with end-to-end evaluation and custom kernel realization for LLM-scale models.

References (1)
