HyperGLU Block Design

Updated 16 February 2026

HyperGLU is a neural network block that reformulates autoregressive self-attention as a dynamic two-layer MLP with context-dependent parameters.
It replaces softmax normalization with gated linear units and employs diagonal-plus-low-rank mixing for flexible, efficient feature and sequence routing.
The design uses reverse-offset layouts to enforce autoregressive consistency while matching or outperforming traditional attention within fixed parameter budgets.

HyperGLU is a neural network block introduced within the HyperMLP framework, which reconceptualizes autoregressive self-attention as a dynamic two-layer multi-layer perceptron (MLP) whose parameters are instantiated from context history. HyperGLU replaces softmax-normalized attention with a gated linear unit (GLU) structure that decouples routing and weighting across a dynamically constructed memory pool, providing greater representational flexibility and dynamically learned mixing in both feature and sequence spaces. The architecture leverages a "reverse-offset" (lag) layout to ensure autoregressive truncation consistency and employs diagonal-plus-low-rank (DPLR) mixing for parameter efficiency and expressivity. This design consistently matches or outperforms classical softmax-attention heads under fixed parameter budgets, with theoretical and empirical advantages in context mixing and boundary structure (Lu et al., 13 Feb 2026).

1. Reformulation of Attention as Dynamic MLP

Traditional self-attention computes a weighted sum of values using softmax-normalized query-key similarities, effectively restricting attention scores to a probability simplex. HyperGLU, following the paradigm proposed by Lu & Yang (2026), interprets an autoregressive (AR) attention head as a depth-two MLP with context-length-dependent width, where the history $X_{1:t}$ dynamically determines first- and second-layer weight matrices through factorized hypernetwork parameterizations. Rather than probabilities, scores $h_t\in\mathbb{R}^t$ are generic pre-activations, allowing for broader forms of input-conditioned selection via nonlinearity.

Unlike softmax attention, which enforces convex combinations, HyperGLU employs GLU gating: ReLU or GLU activations over the slot axis permit selection/routing of tokens independently of normalization, resulting in more flexible and expressive context mixing, including learnable sequence mixing (see Table 1, section 4 below). The activation function design further decouples magnitude modulation from selection, with the gate (ReLU) and scale (Softplus) components acting independently.
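To make the contrast concrete, the following minimal sketch compares softmax weights with gated weights computed from the same scores. The gated form used here (ReLU times Softplus, without the L2 normalization introduced in Section 2) is a simplification for illustration, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Softmax over the slot axis: weights lie on the probability simplex."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated(gate_scores, scale_scores):
    """Simplified gate: ReLU selects slots, Softplus sets their magnitude."""
    relu = [max(g, 0.0) for g in gate_scores]
    softplus = [math.log1p(math.exp(s)) for s in scale_scores]
    return [r * sp for r, sp in zip(relu, softplus)]

scores = [2.0, -1.0, 0.5, -3.0]
p = softmax(scores)          # every slot gets positive weight, weights sum to 1
a = gated(scores, scores)    # slots with negative gate score are dropped exactly
```

Softmax can only down-weight a slot, never remove it, and its weights are tied together by the sum-to-one constraint. The gated weights have neither restriction.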

2. Mathematical Specification

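Let $x_t\in\mathbb{R}^{1\times d}$ denote the input at time $t$, and $X_{t:1}=[x_t;\dots;x_1]\in\mathbb{R}^{t\times d}$ the lag-ordered prefix (newest first). Viewed as a residual update, the HyperGLU block is given below, where $o(\cdot)$ denotes the activation applied along the slot axis (specified later in this section):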

$O_t = x_t + o(h_t) W^{(2)}(X_{t:1}), \qquad h_t = x_t W^{(1)}(X_{t:1}) \in \mathbb{R}^{1\times t}$

HyperGLU factorizes both $W^{(1)}$ and $W^{(2)}$ using context-dependent hypernetworks:

$W^{(1)}(X_{t:1}) = L^{(1)}(x_t)\, X_{t:1}^\top R^{(1)}(x_t) \in \mathbb{R}^{d\times t}$

$W^{(2)}(X_{t:1}) = (R^{(2)}(x_t))^\top\, X_{t:1} L^{(2)}(x_t) \in \mathbb{R}^{t\times d}$

$L^{(1)}(x_t) \in \mathbb{R}^{d\times d}$, of rank at most $d_{qk}$
$L^{(2)}(x_t) \in \mathbb{R}^{d\times d}$, of rank at most $d_{vo}$
$R^{(j)}(x_t) \in \mathbb{R}^{t\times t}$ for $j=1,2$

The DPLR parameterization for each sequence mixer $R^{(j)}(x_t)$ is:

$R^{(j)}(x_t) = D^{(j)} + A^{(j)}\,\mathrm{Diag}(s^{(j)}(x_t)) (B^{(j)})^\top, \qquad s^{(j)}(x_t) = \sigma(x_t W^{(j)}_s)\in \mathbb{R}^r$

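with $A^{(j)}, B^{(j)}\in\mathbb{R}^{t\times r}$, $W^{(j)}_s\in\mathbb{R}^{d\times r}$, and low rank $r\ll t$, typically $r\approx 16$ (this rank is written $r_s$ in Sections 5.4 and 6).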
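The DPLR form means a vector can be multiplied by $R^{(j)}$ without ever building the $t\times t$ matrix. The sketch below checks this against a dense reference. It assumes the diagonal core $D = I + \mathrm{diag}(p)$ given in Section 6; the helper names and random test values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dplr_dense(p, A, B, s):
    """Dense R = I + diag(p) + A diag(s) B^T, built only as a reference."""
    t = p.shape[0]
    return np.eye(t) + np.diag(p) + A @ np.diag(s) @ B.T

def dplr_apply(v, p, A, B, s):
    """Compute v @ R in O(t r) time without forming the t x t matrix R."""
    return v * (1.0 + p) + ((v @ A) * s) @ B.T

rng = np.random.default_rng(0)
t, d, r = 32, 8, 4
x_t = rng.normal(size=d)
W_s = rng.normal(size=(d, r))
p = 0.1 * rng.normal(size=t)
A = rng.normal(size=(t, r))
B = rng.normal(size=(t, r))

s = sigmoid(x_t @ W_s)                 # input-dependent gate, one value per rank
v = rng.normal(size=t)
dense = v @ dplr_dense(p, A, B, s)
fast = dplr_apply(v, p, A, B, s)
```

Both paths give the same vector; only the second scales linearly in $t$.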

The first-layer output $h_t$ is split into gate and scale branches. Activation proceeds as:

$\begin{aligned} [h_{\text{gate}}, h_{\text{scale}}] &= [x_t W^{(1)}_{\text{gate}},\; x_t W^{(1)}_{\text{scale}}] \in \mathbb{R}^{1\times t} \times \mathbb{R}^{1\times t} \\ a_t &= \mathrm{Softplus}(h_{\text{scale}}) \odot \mathrm{ReLU}\left(\frac{h_{\text{gate}}}{\sqrt{\lVert h_{\text{gate}}\rVert_2^2 + \varepsilon}}\right) \\ O_t &= x_t + a_t W^{(2)}(X_{t:1}) \end{aligned}$

The $\mathrm{ReLU}$ gate controls routing by selecting the slots whose normalized gate pre-activation is positive, whereas Softplus modulates the magnitude of each slot.
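As a small worked instance of the activation, the sketch below evaluates $a_t$ on a hand-picked pair of branches. The function name and the example vectors are illustrative; `np.logaddexp(0, z)` is used as a numerically stable Softplus.

```python
import numpy as np

def hyperglu_activation(h_gate, h_scale, eps=1e-12):
    """a_t = Softplus(h_scale) * ReLU(h_gate / sqrt(||h_gate||^2 + eps))."""
    norm = np.sqrt(np.sum(h_gate ** 2) + eps)
    routed = np.maximum(h_gate / norm, 0.0)      # which slots pass the gate
    scale = np.logaddexp(0.0, h_scale)           # Softplus, always positive
    return scale * routed

h_gate = np.array([3.0, -1.0, 0.0, 4.0])
h_scale = np.array([0.0, 5.0, 5.0, -2.0])
a_t = hyperglu_activation(h_gate, h_scale)
```

Slots 1 and 2 have large scale values but are still zeroed, because their gate pre-activation is not positive. Slot 3 passes the gate even though its scale pre-activation is negative.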

AR-truncation consistency is enforced by the canonical top-left extension of $R^{(j)}$ in lag order: applying the prefix-slicing operator $P_{t\leftarrow T}$ to a length-$T$ mixer recovers the length-$t$ mixer, so appending more distant tokens does not change the output on recent windows.
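The top-left property can be checked numerically. The sketch below assumes that lag index 0 is the newest token and that the parameters for a shorter window are the leading rows of the parameters for a longer one; this is one reading of prefix slicing, not a confirmed detail of the reference implementation.

```python
import numpy as np

def dplr(p, A, B, s):
    """Dense DPLR mixer R = I + diag(p) + A diag(s) B^T."""
    return np.eye(p.shape[0]) + np.diag(p) + A @ np.diag(s) @ B.T

rng = np.random.default_rng(1)
T, t, r = 12, 5, 3
p = rng.normal(size=T)
A = rng.normal(size=(T, r))
B = rng.normal(size=(T, r))
s = rng.uniform(size=r)

R_T = dplr(p, A, B, s)                # mixer for the full length T
R_t = dplr(p[:t], A[:t], B[:t], s)    # mixer built from the first t lag slots
top_left = R_T[:t, :t]                # prefix slice of the longer mixer
```

Under these assumptions `top_left` and `R_t` coincide, so the mixing weights among the $t$ most recent slots are unaffected by how many older slots are appended.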

3. Architectural Workflow and Implementation

The overall forward pass for a single HyperGLU head at step $t$ follows:

# DPLR gates, one per layer: s^(j) = sigmoid(x_t W_s^(j))
s1 = sigmoid(x_t @ Ws1)                       # (1×r)
s2 = sigmoid(x_t @ Ws2)                       # (1×r)

# Sequence mixers, written densely here for readability
R1 = I_t + diag(p1) + A1 @ diag(s1) @ B1.T    # (t×t)
R2 = I_t + diag(p2) + A2 @ diag(s2) @ B2.T    # (t×t)

# First layer: slot pre-activations (X is the lag-ordered prefix X_{t:1})
W1 = L1(x_t) @ X.T @ R1                       # (d×t)
h_gate  = x_t @ W1                            # (1×t)
h_scale = ...                                 # analogous computation if using distinct weights

# Activation over the slot axis
routed = ReLU( h_gate / sqrt(||h_gate||^2 + ε) )
scale  = Softplus( h_scale )
a_t    = routed ⊙ scale                       # (1×t)

# Second layer and residual update
W2 = R2.T @ X @ L2(x_t)                       # (t×d)
O_t = x_t + a_t @ W2                          # (1×d)

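The explicit $t\times t$ matrices R1 and R2 above are written out only for readability. In an actual implementation no dense $t\times t$ matrix is materialized; all mixing is performed through low-rank contractions and efficient fused kernels.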
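The workflow above can be turned into a runnable NumPy sketch that never forms a $t\times t$ matrix. Several choices here are assumptions made for illustration and are not taken from the source: $L^{(1)}$ and $L^{(2)}$ are static low-rank matrices rather than functions of $x_t$, the scale branch uses its own matrix `L1_scale` with the same mixer as the gate branch, and all parameter names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mix(v, p, A, B, s):
    """v @ R for R = I + diag(p) + A diag(s) B^T, without forming R."""
    return v * (1.0 + p) + ((v @ A) * s) @ B.T

def mix_T(v, p, A, B, s):
    """v @ R^T for the same R (the roles of A and B are swapped)."""
    return v * (1.0 + p) + ((v @ B) * s) @ A.T

def hyperglu_head(x_t, X, prm, eps=1e-12):
    """One head at step t. X is the lag-ordered prefix (t x d), newest first."""
    t = X.shape[0]
    s1 = sigmoid(x_t @ prm["Ws1"])
    s2 = sigmoid(x_t @ prm["Ws2"])
    m1 = (prm["p1"][:t], prm["A1"][:t], prm["B1"][:t], s1)
    m2 = (prm["p2"][:t], prm["A2"][:t], prm["B2"][:t], s2)
    # First layer: h = x_t L1 X^T R1, for the gate and scale branches.
    h_gate = mix((x_t @ prm["L1_gate"]) @ X.T, *m1)
    h_scale = mix((x_t @ prm["L1_scale"]) @ X.T, *m1)
    routed = np.maximum(h_gate / np.sqrt(np.sum(h_gate ** 2) + eps), 0.0)
    a_t = np.logaddexp(0.0, h_scale) * routed
    # Second layer: a_t R2^T X L2, added to the residual stream.
    return x_t + (mix_T(a_t, *m2) @ X) @ prm["L2"]

rng = np.random.default_rng(0)
d, d_qk, d_vo, r, T = 16, 2, 8, 4, 10

def low_rank(k):
    """Random d x d matrix of rank at most k (stand-in for L1 and L2)."""
    return rng.normal(size=(d, k)) @ rng.normal(size=(k, d)) / d

prm = {
    "Ws1": rng.normal(size=(d, r)), "Ws2": rng.normal(size=(d, r)),
    "p1": np.zeros(T), "p2": np.zeros(T),
    "A1": 0.1 * rng.normal(size=(T, r)), "B1": 0.1 * rng.normal(size=(T, r)),
    "A2": 0.1 * rng.normal(size=(T, r)), "B2": 0.1 * rng.normal(size=(T, r)),
    "L1_gate": low_rank(d_qk), "L1_scale": low_rank(d_qk), "L2": low_rank(d_vo),
}

X_seq = rng.normal(size=(T, d))     # tokens in time order x_1 .. x_T
t = 6
X_lag = X_seq[:t][::-1]             # lag order: x_t first, x_1 last
out = hyperglu_head(X_lag[0], X_lag, prm)
```

Every product is evaluated left to right on row vectors, so the per-step cost is $O(td) + O(tr)$ plus the $O(d^2)$ feature-side products.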

4. Comparison with Softmax-Attention and GLU-MLP

Table 1. Comparison of softmax attention, a standard GLU-MLP, and HyperGLU.

| Aspect | Softmax-Attention | Standard GLU-MLP | HyperGLU |
| --- | --- | --- | --- |
| Normalization | $\mathrm{softmax}$ (slot axis) | None (ReLU+LN on features) | $\mathrm{L2Norm}_t$ + ReLU/GLU (slot axis) |
| Routing | Probabilistic weights | Feature-axis gating | Dynamic two-layer slot routing |
| Parameter Instantiation | Fixed $W_qW_k^\top$ | Static weights | $L^{(1)}X^\top R^{(1)}(x)$ (hypernetwork) |
| Sequence Mixing | Identity ($t\times t$) | None | Learned DPLR in both layers |
| Expressivity | Continuous piecewise-linear (CPWL) in $x$, $\le O(t^d)$ regions | CPWL in features | Piecewise-smooth, warped partitions + bases |
| Budget Allocation | $O(d d_{qk} + d d_{vo})$ | $O(d^2 + d^2)$ | Same; pays for sequence mixing by reducing $d_{qk}$ |
| Key Benefit | Parallel AR & KV cache | Efficient feature gating | Richer context routing, curved boundaries |

HyperGLU generalizes both softmax-attention and standard GLU gating by introducing nonlinear, dynamically warped gating boundaries and parameter-efficient, learned sequence mixing.

5. Expressivity and Theoretical Properties

5.1 Routing Boundary Structure

With static mixing, token-wise ReLU attention induces polyhedral partitions of the input space, segmenting $\mathbb{R}^d$ along the hyperplanes $(xB)_i=0$ with $B=L^{(1)}X^\top$. HyperMLP/HyperGLU, with dynamic $L^{(1)}(x)$ and $R^{(1)}(x)$, replaces these polyhedral regions with regions bounded by piecewise-smooth ($C^1$) curved gating hypersurfaces $\{x : h_i(x;X)=0\}$, where $h(x;X) = xL^{(1)}(x)X^\top R^{(1)}(x)$. This strictly expands the block's functional expressivity.
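The difference between flat and curved boundaries comes down to whether the pre-activation is linear in $x$. The sketch below tests homogeneity, $h(2x) = 2h(x)$, for a static and a dynamic variant. For simplicity only the sequence mixer is made input-dependent here, with $L^{(1)}$ held fixed; that is an assumption of this example, and all names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
t, d, r = 6, 4, 2
X = rng.normal(size=(t, d))      # lag-ordered context
L1 = rng.normal(size=(d, d))
A = rng.normal(size=(t, r))
B = rng.normal(size=(t, r))
Ws = rng.normal(size=(d, r))

def h_static(x):
    """Static mixing: h(x) = x L1 X^T is linear in x."""
    return x @ L1 @ X.T

def h_dynamic(x):
    """Dynamic mixing: R depends on x through the sigmoid gate."""
    R = np.eye(t) + A @ np.diag(sigmoid(x @ Ws)) @ B.T
    return x @ L1 @ X.T @ R

x = rng.normal(size=d)
static_is_homogeneous = np.allclose(h_static(2 * x), 2 * h_static(x))
dynamic_is_homogeneous = np.allclose(h_dynamic(2 * x), 2 * h_dynamic(x))
```

A linear $h$ has zero sets that are hyperplanes through the origin, which is what produces polyhedral cells. Once $R^{(1)}$ varies with $x$, $h$ is no longer linear and its zero sets are free to bend.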

5.2 Decoupled Gating and Scaling

Proposition 2.4 formalizes that, under HyperGLU,

$a_t = \mathrm{Softplus}(h_{\rm scale}) \odot \mathrm{ReLU}(\tfrac{h_{\rm gate}}{p(h_{\rm gate})})$

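with ReLU determining the routed subset and Softplus modulating activation magnitudes independently. Here $p(h_{\rm gate})$ denotes the normalizer, which in the specification of Section 2 is $\sqrt{\lVert h_{\rm gate}\rVert_2^2+\varepsilon}$.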
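One way to read the decoupling claim, offered as a short observation rather than as the proposition's proof: Softplus is strictly positive, so (assuming the normalizer $p(h_{\rm gate})$ is positive) the set of active slots depends on the gate branch alone.

```latex
\mathrm{Softplus}(z) = \log\!\left(1 + e^{z}\right) > 0 \quad \text{for all } z \in \mathbb{R}
\;\;\Longrightarrow\;\;
\{\, i : a_{t,i} > 0 \,\} = \{\, i : h_{\mathrm{gate},i} > 0 \,\}.
```

The scale branch can therefore change how strongly a routed slot contributes, but it cannot add or remove a slot from the routed subset.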

5.3 Budget Considerations

Theorem 2.5 demonstrates that reducing the rank of the first-layer (routing) weights $W^{(1)}$ is less detrimental than reducing the rank of the second-layer (action) weights $W^{(2)}$, since $W^{(2)}$ defines the update subspace. Thus, HyperGLU can trade QK width for richer sequence mixing, matching or surpassing ReLU-attention expressivity at constant parameter cost.
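The statement that the second layer defines the update subspace can be illustrated numerically. In the sketch below, which uses arbitrary random matrices and an arbitrary dense mixer as stand-ins, every update $a_t W^{(2)}$ lies in the row space of $L^{(2)}$, so the updates span at most $d_{vo}$ dimensions regardless of the activations.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d, d_vo = 20, 12, 3
X = rng.normal(size=(t, d))                                    # lag-ordered context
L2 = rng.normal(size=(d, d_vo)) @ rng.normal(size=(d_vo, d))   # rank <= d_vo
R2 = np.eye(t) + 0.1 * rng.normal(size=(t, t))                 # any sequence mixer
W2 = R2.T @ X @ L2                                             # (t x d)

# Stack the updates a_t @ W2 produced by many different activations a_t.
activations = np.abs(rng.normal(size=(50, t)))
updates = activations @ W2
update_rank = np.linalg.matrix_rank(updates)
```

Fifty different activation vectors still produce updates confined to a subspace of dimension at most `d_vo`, which is why shrinking the second-layer rank directly limits what the block can write to the residual stream.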

5.4 Parameter Cost Matching

Given the cost of a classical ReLU-attention head, $P_{\rm att}(d_{qk},d_{vo})=2d\,d_{qk}+2d\,d_{vo}$, HyperGLU stays within this envelope by choosing $d_{qk}^{\rm hypo}<d_{qk}^{\rm base}$ such that the parameters saved, $2d\,(d_{qk}^{\rm base}-d_{qk}^{\rm hypo})$, cover the $O(t\,r_s)$ cost of the temporal DPLR sequence mixing.
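A rough budget check along these lines is sketched below. The head cost follows the formula above. The DPLR cost model (a diagonal vector, two $t\times r_s$ factors, and a $d\times r_s$ gate projection per mixer) is an assumption derived from the parameterization in Section 2, and the sizes are arbitrary example values.

```python
def attention_head_params(d, d_qk, d_vo):
    """P_att = 2 d d_qk + 2 d d_vo for one head."""
    return 2 * d * d_qk + 2 * d * d_vo

def dplr_params(t, d, r_s):
    """Assumed cost of two DPLR mixers: p (t), A and B (t x r_s), W_s (d x r_s)."""
    per_mixer = t + 2 * t * r_s + d * r_s
    return 2 * per_mixer

d, d_vo, t, r_s = 512, 256, 1024, 16
d_qk_base, d_qk_hypo = 256, 64

surplus = (attention_head_params(d, d_qk_base, d_vo)
           - attention_head_params(d, d_qk_hypo, d_vo))
mixing_cost = dplr_params(t, d, r_s)
fits = mixing_cost <= surplus
```

With these example sizes the reduction in QK width frees 196,608 parameters, while the two mixers need 83,968 under the assumed cost model, so the sequence mixing fits inside the original envelope.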

6. Practical Considerations and Hyperparameters

Recommended settings for typical sequence modeling tasks include:

Model width $d$, number of heads $n_{\rm head}=2$.
Feature ranks $d_{qk}=d/8$, $d_{vo}=d/2$ for two heads, preserving $4d^2$ total parameters.
Sequence mixing rank $r_s=16$.
Diagonal cores $D^{(j)}=I+\mathrm{diag}(p^{(j)})$ with learned $p^{(j)}\in\mathbb{R}^t$.
DPLR gates $s^{(j)}$ use $\mathrm{Sigmoid}$; the stabilization parameter in the L2 normalization is $\varepsilon=10^{-12}$.
A depthwise convolution (kernel size 4) can optionally enhance mixing.
The lag-ordered prefix is stored as a single buffer; all DPLR mixing is achieved via two low-rank GEMMs plus a fused "epilogue" kernel.
Training complexity is dominated by $O(T^2 d)$, with an additional $O(T^2 r_s)$ for sequence mixing; per-step inference is $O(td)+O(tr_s)$.
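The settings above can be collected in a small configuration object. The class below is a hypothetical container, not part of any published codebase, and it assumes the feature ranks are per-head values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HyperGLUConfig:
    """Hypothetical container for the recommended settings."""
    d: int                   # model width
    n_head: int = 2          # number of heads
    r_s: int = 16            # sequence-mixing rank
    eps: float = 1e-12       # stabilizer in the L2 normalization
    conv_kernel: int = 4     # kernel size of the optional depthwise convolution

    @property
    def d_qk(self) -> int:
        return self.d // 8

    @property
    def d_vo(self) -> int:
        return self.d // 2

    def feature_params(self) -> int:
        """Parameters in the feature-side factors, summed over heads."""
        return self.n_head * (2 * self.d * self.d_qk + 2 * self.d * self.d_vo)

cfg = HyperGLUConfig(d=512)
```

Under the per-head cost formula from Section 5.4, `cfg.feature_params()` comes to $2.5d^2$, which sits below the $4d^2$ figure quoted above; how the remainder is allocated is not specified here.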

HyperGLU blocks can be efficiently implemented using fused kernels (e.g., Triton), with memory and computation scaling controlled by rank and width hyperparameters, supporting deployment with efficiency similar to or better than that of conventional attention heads (Lu et al., 13 Feb 2026).

References (1)

HyperMLP: An Integrated Perspective for Sequence Modeling (2026)
