Masked Gated Linear Unit (MGLU)
- Masked Gated Linear Unit (MGLU) is a neural network technique that consolidates dual projection matrices into a single weight matrix using binary masks.
- It applies a mixture of element-wise gating to independently route gate and value streams, significantly reducing memory traffic during inference.
- Custom kernel optimizations like FlashMGLU demonstrate notable speedups and improved throughput in LLMs, balancing efficiency with downstream accuracy.
Masked Gated Linear Unit (MGLU) is a neural network architectural variant designed to increase the computational and memory efficiency of gated feed-forward layers in LLMs, while retaining or surpassing the downstream accuracy of conventional Gated Linear Units (GLUs). By introducing mask-based elementwise routing over a single weight matrix, MGLU decouples the gate and value streams traditionally maintained with separate matrices, thereby significantly reducing inference-time memory traffic and enabling highly efficient custom kernels (Tajima et al., 29 Jun 2025).
1. Background and Motivation
Gated Linear Units (GLUs) augment standard feed-forward networks by replacing a single activation with a two-stream gating mechanism. For an input $x \in \mathbb{R}^{d}$, the GLU applies two linear projections, $W_{\mathrm{gate}} \in \mathbb{R}^{d \times h}$ and $W_{\mathrm{value}} \in \mathbb{R}^{d \times h}$, modulated via a nonlinearity $\sigma$:

$$\mathrm{GLU}(x) = \sigma\!\left(x\,W_{\mathrm{gate}}\right) \odot \left(x\,W_{\mathrm{value}}\right)$$
This configuration, prevalent in state-of-the-art LLMs, doubles the memory footprint and bandwidth required for the intermediate projection weights, since both $W_{\mathrm{gate}}$ and $W_{\mathrm{value}}$ must be loaded independently per inference token. The resulting $2hd$ FP16 memory reads per layer (for hidden size $d$ and projection dimension $h$) dominate per-token latency in single-batch LLM inference (Tajima et al., 29 Jun 2025).
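For reference, a minimal PyTorch sketch of such a gated feed-forward block (module and argument names are illustrative, not taken from the paper); the comments mark where the $2hd$ weight reads arise:

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Standard gated FFN: two separate d_model-to-d_ff projections (gate and value)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # h*d FP16 weights
        self.w_value = nn.Linear(d_model, d_ff, bias=False)  # another h*d FP16 weights
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.SiLU()  # Swish activation -> SwiGLU variant

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both projection matrices must be streamed from memory for every token:
        # 2*h*d FP16 reads per layer, the cost MGLU is designed to remove.
        return self.w_down(self.act(self.w_gate(x)) * self.w_value(x))
```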
MGLU addresses this bottleneck by consolidating the two projection matrices into a single weight matrix, exploiting binary masks to recover independent gate/value routing. This design maintains the expressivity of GLUs while mitigating weight load redundancy.
2. Architectural Formulation
The key innovation in MGLU is the Mixture of Element-wise Gating (MoEG) paradigm. Instead of maintaining separate $W_{\mathrm{gate}}$ and $W_{\mathrm{value}}$, MGLU leverages:
- A single weight matrix $W \in \mathbb{R}^{d \times h}$
- A set of learned binary masks $M_1, \dots, M_n \in \{0,1\}^{d \times h}$, with complementary masks $\bar{M}_i = 1 - M_i$

Each mask demarcates gate/value assignments at the parameter level for "route" $i$. The MGLU operator for the single-mask case ($n = 1$) is:

$$\mathrm{MGLU}(x) = \sigma\!\left(x\,(M \odot W)\right) \odot \left(x\,(\bar{M} \odot W)\right)$$

For the general MoEG case with $n$ routes:

$$\mathrm{MGLU}(x) = \sum_{i=1}^{n} \sigma\!\left(x\,(M_i \odot W)\right) \odot \left(x\,(\bar{M}_i \odot W)\right)$$

During training, the masks are optimized jointly with $W$ using a straight-through estimator; at inference, they are frozen bit-patterns.
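A minimal PyTorch sketch of this formulation, assuming the per-route outputs are summed as written above and a sigmoid-based straight-through estimator for mask learning (the class name, initialization, and exact STE parameterization are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGLU(nn.Module):
    """Masked GLU: one weight matrix W plus n learned binary masks (MoEG)."""
    def __init__(self, d_model: int, d_ff: int, n_masks: int = 1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_ff, d_model))
        nn.init.xavier_uniform_(self.weight)
        # Real-valued mask logits, binarized with a straight-through estimator.
        self.mask_logits = nn.Parameter(0.01 * torch.randn(n_masks, d_ff, d_model))
        self.act = nn.SiLU()

    def binary_masks(self) -> torch.Tensor:
        hard = (self.mask_logits > 0).to(self.weight.dtype)  # {0,1} masks
        soft = torch.sigmoid(self.mask_logits)
        # Straight-through: forward uses hard masks, backward uses sigmoid gradients.
        return hard.detach() + soft - soft.detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = 0.0
        for m in self.binary_masks():            # one gate/value route per mask
            gate = F.linear(x, m * self.weight)          # x (M_i ⊙ W)
            value = F.linear(x, (1.0 - m) * self.weight)  # x (M̄_i ⊙ W)
            out = out + self.act(gate) * value
        return out
```

At inference the mask logits would be frozen and pre-binarized, so only $W$ and the bit-packed masks need to be stored and loaded.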
3. Memory, Parameterization, and Efficiency
The introduction of masking allows MGLU to match the parameter count and inference-time memory load of a standard Linear Unit (LU) layer more closely, despite retaining gating expressivity. The following table provides the per-layer parameter and memory-access costs, excerpted from (Tajima et al., 29 Jun 2025):
| Layer Type | #FP16 Params | #Binary Params | Mem. Load per Token (bits) |
|---|---|---|---|
| LU (GELU) | $hd$ | $0$ | $16hd$ |
| GLU | $2hd$ | $0$ | $32hd$ |
| MGLU ($n$ masks) | $hd$ | $n\,hd$ | $16hd + n\,hd$ |
For $n = 1$, MGLU requires $17hd$ bits per token—47% less than standard GLU ($32hd$ bits). All gate/value routing information is encoded via bitwise operations over compact binary masks.
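As a worked check of the memory column, the per-token weight traffic in bits compares as follows:

$$\underbrace{16\,hd}_{\text{FP16 } W} \;+\; \underbrace{n\,hd}_{n\ \text{binary masks}} \quad\text{vs.}\quad \underbrace{32\,hd}_{W_{\mathrm{gate}} +\, W_{\mathrm{value}}}, \qquad 1 - \frac{17hd}{32hd} \approx 47\% \ \text{at } n = 1.$$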
4. Kernel Optimization and FlashMGLU
A hardware-centric kernel, FlashMGLU, is introduced to exploit MGLU's compressed representation. Key implementation strategies include:
- Packing the $n$ binary mask bits for each weight entry into a single 8-bit integer, enabling multi-route mask access with a single read (see the sketch after this list).
- Tiling $W$ and the packed masks along the matrix's "K" dimension for coalesced memory operations.
- On-chip computation of all gated and ungated dot products using fused CUDA or Triton kernels.
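The mask-packing idea can be sketched at the tensor level as follows (plain PyTorch illustrating the data layout only, not the fused CUDA/Triton kernel; function names are illustrative):

```python
import torch

def pack_masks(masks: torch.Tensor) -> torch.Tensor:
    """Pack n binary masks of shape (n, d_ff, d_model), n <= 8, into one uint8 per weight.

    Bit i of each packed byte stores route i's gate/value assignment, so a single
    8-bit read recovers the mask bits for all routes of that weight entry.
    """
    n = masks.shape[0]
    assert n <= 8, "one uint8 holds at most 8 route bits per weight"
    packed = torch.zeros(masks.shape[1:], dtype=torch.uint8)
    for i in range(n):
        packed |= masks[i].to(torch.uint8) << i
    return packed

def unpack_route(packed: torch.Tensor, i: int) -> torch.Tensor:
    """Recover route i's binary mask from the packed byte tensor."""
    return ((packed >> i) & 1).to(torch.float32)
```

In the fused kernel, the analogous unpacking happens on-chip, so each FP16 weight and its packed mask byte are loaded from memory once and reused across all $n$ routes.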
Measured on an RTX 5090, FlashMGLU achieves up to a 19.7× speed-up relative to a naive PyTorch MGLU implementation (0.0265 ms vs 0.5210 ms), and a 1.51× improvement over standard GLU (0.0202 ms vs 0.0306 ms). The memory traffic for intermediate FFN weights per layer is reduced from $32hd$ bits of FP16 weights in SwiGLU to $16hd$ bits of FP16 weights plus $n\,hd$ mask bits in SwiMGLU.
5. Empirical Evaluation in LLMs
Comprehensive ablations replace all SwiGLU FFNs in Llama-3-style decoders with SwiMGLU, trained on FineWeb-Edu. Results on six zero- and two-shot evaluation benchmarks show the following (selected):
| Variant | #Weights | Avg. Accuracy (Small, 159M) | Avg. Accuracy (Large, 1.08B) |
|---|---|---|---|
| SwiGLU | 141M | 46.20% | 56.00% |
| SwiMGLU (smallest $n$) | 113M | 46.06% | — |
| SwiMGLU (intermediate $n$) | 113M | 46.48% | 56.85% |
| SwiMGLU (largest $n$) | 113M | 46.49% | — |
SwiMGLU achieves accuracy that matches or slightly surpasses SwiGLU as the mask count $n$ increases, with a 19–33% reduction in model storage. End-to-end LLM inference throughput rises by roughly 1.3× in token-decode latency, and the speed gains persist at larger $n$.
6. Trade-offs, Sweet Spots, and Limitations
Increasing $n$ improves accuracy slightly, but linearly increases the binary parameter count and the cost of mask learning during training. Inference, however, remains memory-bound, so the reduction in FP16 weight transfer dominates the efficiency gains. In practice, a small mask count is identified as the empirical "sweet spot," balancing accuracy gains against minimal latency overhead. Because the masks are binary and fixed at inference, no sampling variance or stochastic-routing overhead is introduced during deployment.
7. Comparative Analysis and Impact
MGLU generalizes the two-matrix GLU, matching or exceeding its downstream performance at a sharply lower memory footprint. The approach is agnostic to activation choice, with Swish activations (SwiMGLU) demonstrating competitive or superior results on downstream benchmarks. The kernel-level advantages realized by FlashMGLU make MGLU especially suitable for memory-bound, single-batch inference workloads in modern LLMs, where memory traffic, not arithmetic, is the primary bottleneck (Tajima et al., 29 Jun 2025).
No prior work before Tajima et al. (29 Jun 2025) defines or implements a mask-based GLU variant; MGLU represents the first such architecture with end-to-end evaluation and custom kernel realization for LLM-scale models.