HyperGLU Block Design

Updated 16 February 2026
  • HyperGLU is a neural network block that reformulates autoregressive self-attention as a dynamic two-layer MLP with context-dependent parameters.
  • It replaces softmax normalization with gated linear units and employs diagonal-plus-low-rank mixing for flexible, efficient feature and sequence routing.
  • The design uses reverse-offset layouts to enforce autoregressive consistency while matching or outperforming traditional attention within fixed parameter budgets.

HyperGLU is a neural network block introduced within the HyperMLP framework, which reconceptualizes autoregressive self-attention as a dynamic two-layer multi-layer perceptron (MLP) whose parameters are instantiated from context history. HyperGLU replaces softmax-normalized attention with a gated linear unit (GLU) structure that decouples routing and weighting across a dynamically constructed memory pool, providing greater representational flexibility and dynamically learned mixing in both feature and sequence spaces. The architecture leverages a "reverse-offset" (lag) layout to ensure autoregressive truncation consistency and employs diagonal-plus-low-rank (DPLR) mixing for parameter efficiency and expressivity. This design consistently matches or outperforms classical softmax-attention heads under fixed parameter budgets, with theoretical and empirical advantages in context mixing and boundary structure (Lu et al., 13 Feb 2026).

1. Reformulation of Attention as Dynamic MLP

Traditional self-attention computes a weighted sum of values using softmax-normalized query-key similarities, effectively restricting attention scores to a probability simplex. HyperGLU, following the paradigm proposed by Lu & Yang (2026), interprets an autoregressive (AR) attention head as a depth-two MLP with context-length-dependent width, where the history $X_{1:t}$ dynamically determines first- and second-layer weight matrices through factorized hypernetwork parameterizations. Rather than probabilities, the scores $h_t\in\mathbb{R}^t$ are generic pre-activations, allowing for broader forms of input-conditioned selection via nonlinearity.

Unlike softmax attention, which enforces convex combinations, HyperGLU employs GLU gating: ReLU or GLU activations over the slot axis permit selection/routing of tokens independently of normalization, resulting in more flexible and expressive context mixing, including learnable sequence mixing (see Table 1, section 4 below). The activation function design further decouples magnitude modulation from selection, with the gate (ReLU) and scale (Softplus) components acting independently.

2. Mathematical Specification

Let $x_t\in\mathbb{R}^{1\times d}$ denote the input at time $t$, and $X_{t:1}=[x_t;\dots;x_1]\in\mathbb{R}^{t\times d}$ the lag-ordered prefix (newest first). The HyperGLU block, viewed as a residual update, is:

$$O_t = x_t + o(h_t)\, W^{(2)}(X_{t:1}), \qquad h_t = x_t W^{(1)}(X_{t:1}) \in \mathbb{R}^{1\times t}$$

HyperGLU factorizes both $W^{(1)}$ and $W^{(2)}$ using context-dependent hypernetworks:

$$W^{(1)}(X_{t:1}) = L^{(1)}(x_t)\, X_{t:1}^\top R^{(1)}(x_t) \in \mathbb{R}^{d\times t}$$

$$W^{(2)}(X_{t:1}) = (R^{(2)}(x_t))^\top\, X_{t:1} L^{(2)}(x_t) \in \mathbb{R}^{t\times d}$$

  • $L^{(1)}(x_t): \mathbb{R}^{1\times d} \to \mathbb{R}^{d\times d_{qk}}$
  • $L^{(2)}(x_t): \mathbb{R}^{1\times d} \to \mathbb{R}^{d_{vo}\times d}$
  • $R^{(j)}(x_t): \mathbb{R}^{1\times d} \to \mathbb{R}^{t\times t}$ for $j=1,2$

The DPLR parameterization for each sequence mixer $R^{(j)}(x_t)$ is:

$$R^{(j)}(x_t) = D^{(j)} + A^{(j)}\,\mathrm{Diag}(s^{(j)}(x_t))\,(B^{(j)})^\top, \qquad s^{(j)}(x_t) = \sigma(x_t W^{(j)}_s)\in \mathbb{R}^r$$

with low rank $r\ll t$, typically $r\approx 16$.
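
The DPLR form means a vector can be mixed along the sequence axis without ever building the $t\times t$ matrix. The following is a minimal sketch in plain Python, assuming the gate vector `s` has already been computed (in HyperGLU it would come from $\sigma(x_t W^{(j)}_s)$); the function names `dplr_apply` and `dplr_dense` and the toy values are illustrative, not part of the original specification.

```python
def dplr_apply(v, diag, A, B, s):
    """Return v @ (Diag(diag) + A @ Diag(s) @ B.T) without building the t x t matrix.

    v: length-t row vector; diag: length-t diagonal of D;
    A, B: t x r factors given as lists of rows; s: length-r gate vector.
    """
    t, r = len(v), len(s)
    # Step 1: project v onto the r columns of A (O(t r) work).
    proj = [sum(v[i] * A[i][k] for i in range(t)) for k in range(r)]
    # Step 2: scale by the gates, expand through B^T, and add the diagonal term.
    return [
        v[j] * diag[j] + sum(proj[k] * s[k] * B[j][k] for k in range(r))
        for j in range(t)
    ]


def dplr_dense(diag, A, B, s):
    """Reference only: materialize the full t x t mixer for checking."""
    t, r = len(diag), len(s)
    return [
        [(diag[i] if i == j else 0.0)
         + sum(A[i][k] * s[k] * B[j][k] for k in range(r))
         for j in range(t)]
        for i in range(t)
    ]


# Toy example with t = 3 slots and rank r = 1 (values are arbitrary).
v = [1.0, 2.0, 3.0]
diag = [1.0, 1.0, 1.0]
A = [[1.0], [0.0], [1.0]]
B = [[0.0], [1.0], [1.0]]
s = [0.5]

fast = dplr_apply(v, diag, A, B, s)
R = dplr_dense(diag, A, B, s)
slow = [sum(v[i] * R[i][j] for i in range(3)) for j in range(3)]
print(fast, slow)  # the two paths agree
```

The low-rank path costs $O(tr)$ per vector, against $O(t^2)$ for the dense reference.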

The first-layer output $h_t$ is split into gate and scale branches. Activation proceeds as:

\begin{align*}
[h_{\text{gate}}, h_{\text{scale}}] &= [x_t W^{(1)}_{\text{gate}},\; x_t W^{(1)}_{\text{scale}}] \in \mathbb{R}^{1\times t} \times \mathbb{R}^{1\times t} \\
a_t &= \mathrm{Softplus}(h_{\text{scale}}) \odot \mathrm{ReLU}\left(\frac{h_{\text{gate}}}{\sqrt{\|h_{\text{gate}}\|_2^2 + \varepsilon}}\right) \\
O_t &= x_t + a_t W^{(2)}(X_{t:1})
\end{align*}

The $\mathrm{ReLU}$ gate controls routing (binary selection), whereas Softplus modulates magnitude per slot.
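
The activation rule above can be sketched in a few lines of plain Python. The toy score vectors and the helper names (`softplus`, `gated_activation`, `softmax`) are illustrative assumptions rather than part of the HyperGLU specification; the softmax is included only as a point of contrast.

```python
import math


def softmax(scores):
    """Softmax weights: strictly positive and summing to 1."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def gated_activation(h_gate, h_scale, eps=1e-12):
    """a = Softplus(h_scale) * ReLU(h_gate / sqrt(||h_gate||^2 + eps)), per slot."""
    norm = math.sqrt(sum(g * g for g in h_gate) + eps)
    return [softplus(c) * max(g / norm, 0.0) for g, c in zip(h_gate, h_scale)]


h_gate = [2.0, -1.0, 0.5, -3.0]   # toy pre-activations over 4 slots
h_scale = [0.0, 1.0, 2.0, -1.0]

w_soft = softmax(h_gate)
a = gated_activation(h_gate, h_scale)

print(w_soft)  # every slot gets some weight; the weights sum to 1
print(a)       # slots with a negative gate score are exactly zero
```

Because the gate zeroes slots outright and the scale is unnormalized, the resulting weights are not constrained to the probability simplex.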

AR-truncation consistency is enforced by the canonical top-left extension of $R^{(j)}$ in lag order: adding more distant tokens does not change the output on recent windows, which is recovered by applying the prefix slicing $P_{t\leftarrow T}$.
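
One way to read the top-left extension is that the mixer parameters are indexed by lag, so the length-$t$ mixer is the leading $t\times t$ block of the length-$T$ mixer. The sketch below checks that identity on toy numbers; the indexing convention, the helper names (`dplr_dense`, `truncate_params`), and the values are assumptions made for illustration.

```python
def dplr_dense(diag, A, B, s):
    """Materialize Diag(diag) + A @ Diag(s) @ B.T (small sizes only)."""
    t, r = len(diag), len(s)
    return [
        [(diag[i] if i == j else 0.0)
         + sum(A[i][k] * s[k] * B[j][k] for k in range(r))
         for j in range(t)]
        for i in range(t)
    ]


def truncate_params(diag, A, B, t):
    """Keep the t most recent lags (lag order: index 0 is the newest token)."""
    return diag[:t], A[:t], B[:t]


T, t = 4, 2
diag = [1.0, 1.1, 0.9, 1.2]
A = [[1.0], [2.0], [3.0], [4.0]]
B = [[0.5], [0.25], [0.125], [1.0]]
s = [0.5]

R_T = dplr_dense(diag, A, B, s)                          # mixer for the long context
R_t = dplr_dense(*truncate_params(diag, A, B, t), s)     # mixer for the short context
top_left = [row[:t] for row in R_T[:t]]

print(top_left == R_t)  # the short mixer is the leading block of the long one
```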

3. Architectural Workflow and Implementation

The overall forward pass for a single HyperGLU head at step $t$ proceeds as follows:

s1 = sigmoid(x_t @ W_s1)                      # (1×r) gates for the layer-1 mixer
s2 = sigmoid(x_t @ W_s2)                      # (1×r) gates for the layer-2 mixer

R1 = I_t + diag(p1) + A1 @ diag(s1) @ B1.T    # (t×t), written densely for exposition
R2 = I_t + diag(p2) + A2 @ diag(s2) @ B2.T    # (t×t), written densely for exposition

W1 = L1(x_t) @ X.T @ R1                       # (d×t), X is the lag-ordered prefix X_{t:1}
h_gate  = x_t @ W1                            # (1×t)
h_scale = ...                                 # analogous computation if using distinct weights

routed = ReLU(h_gate / sqrt(||h_gate||^2 + ε))   # L2-normalized gate over the slot axis
scale  = Softplus(h_scale)                       # per-slot magnitude
a_t    = routed * scale                          # (1×t), elementwise product

W2 = R2.T @ X @ L2(x_t)                       # (t×d)

O_t = x_t + a_t @ W2                          # (1×d) residual update

In practice, no dense $t\times t$ matrices are materialized: the mixers `R1` and `R2` are written densely above only for exposition, and all mixing is performed through low-rank contractions and efficient fused kernels.
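
The following is a runnable sketch of one step that uses only low-rank contractions, written in plain Python for readability rather than speed. It rests on several simplifying assumptions that are not in the original: `L1_gate`, `L1_scale` and `L2` are fixed $d\times d$ matrices (the input dependence and the factorization through $d_{qk}$ and $d_{vo}$ are omitted), the gate and scale branches use separate first-layer matrices, and the parameter dictionary `P` and all helper names are hypothetical.

```python
import math


def vecmat(v, M):
    """Row vector (length n) times an n x m matrix given as a list of rows."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def matvec(M, v):
    """n x m matrix times a length-m vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def mix(v, p, A, B, s):
    """v @ (I + diag(p) + A diag(s) B^T) using O(t r) work, no t x t matrix."""
    proj = vecmat(v, A)                                    # length r
    low = matvec(B, [a * g for a, g in zip(proj, s)])      # length t
    return [v[j] * (1.0 + p[j]) + low[j] for j in range(len(v))]


def hyperglu_step(x_t, X, P, eps=1e-12):
    """One head, one step; X is the lag-ordered prefix (t x d, newest first)."""
    s1 = [sigmoid(z) for z in vecmat(x_t, P["Ws1"])]       # layer-1 mixer gates
    s2 = [sigmoid(z) for z in vecmat(x_t, P["Ws2"])]       # layer-2 mixer gates

    def first_layer(L1):
        scores = matvec(X, vecmat(x_t, L1))                # (x_t L1) X^T, length t
        return mix(scores, P["p1"], P["A1"], P["B1"], s1)  # then apply R1

    h_gate = first_layer(P["L1_gate"])
    h_scale = first_layer(P["L1_scale"])
    norm = math.sqrt(sum(g * g for g in h_gate) + eps)
    a = [softplus(c) * max(g / norm, 0.0) for g, c in zip(h_gate, h_scale)]

    # a @ R2^T: transposing the mixer swaps the roles of A2 and B2.
    mixed = mix(a, P["p2"], P["B2"], P["A2"], s2)
    update = vecmat(vecmat(mixed, X), P["L2"])             # (a R2^T) X L2
    return [xi + ui for xi, ui in zip(x_t, update)]


# Toy call: d = 2, t = 2, r = 1, with the low-rank terms switched off.
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z = [[0.0], [0.0]]
P = {"Ws1": Z, "Ws2": Z,
     "p1": [0.0, 0.0], "A1": Z, "B1": Z,
     "p2": [0.0, 0.0], "A2": Z, "B2": Z,
     "L1_gate": I2, "L1_scale": I2, "L2": I2}
X = [[1.0, 0.0], [0.0, 1.0]]   # rows: x_t (newest), then x_{t-1}
out = hyperglu_step(X[0], X, P)
print(out)
```

In the toy call only the newest slot passes the gate, so the update adds a Softplus-scaled copy of that slot to the residual stream and leaves the other coordinate untouched.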

4. Comparison with Softmax-Attention and GLU-MLP

Table 1. Comparison of softmax attention, a standard GLU-MLP, and HyperGLU.

| Aspect | Softmax-Attention | Standard GLU-MLP | HyperGLU |
|---|---|---|---|
| Normalization | $\mathrm{softmax}$ (slot axis) | None (ReLU+LN on features) | $\mathrm{L2Norm}_t$ + ReLU/GLU (slot axis) |
| Routing | Probabilistic weights | Feature-axis gating | Dynamic two-layer slot routing |
| Parameter Instantiation | Fixed $W_q W_k^\top$ | Static weights | $L^{(1)}X^\top R^{(1)}(x)$ (hypernetwork) |
| Sequence Mixing | Identity ($t\times t$) | None | Learned DPLR in both layers |
| Expressivity | CPWL in $x$, $\le O(t^d)$ regions | CPWL in features | PW-smooth, warped partitions + bases |
| Budget Allocation | $O(d\,d_{qk} + d\,d_{vo})$ | $O(d^2 + d^2)$ | Same; pays for seq-mix by reducing $d_{qk}$ |
| Key Benefit | Parallel AR & KV cache | Efficient feature gating | Richer context routing, curved boundaries |

HyperGLU generalizes both softmax-attention and standard GLU gating by introducing nonlinear, dynamically warped gating boundaries and parameter-efficient, learned sequence mixing.

5. Expressivity and Theoretical Properties

5.1 Routing Boundary Structure

With static mixing, token-wise ReLU attention induces polyhedral partitions of the input space, segmenting $\mathbb{R}^d$ by the hyperplanes $(xB)_i=0$, for $B=L^{(1)}X^\top$. HyperMLP/HyperGLU with dynamic $L^{(1)}(x)$ and $R^{(1)}(x)$ replaces polyhedral regions with piecewise-smooth ($C^1$) curved gating hypersurfaces $\{x : h_i(x;X)=0\}$, where $h(x;X) = xL^{(1)}(x)X^\top R^{(1)}(x)$. This strictly expands the functional expressivity.

5.2 Decoupled Gating and Scaling

Proposition 2.4 formalizes that, under HyperGLU,

$$a_t = \mathrm{Softplus}(h_{\text{scale}}) \odot \mathrm{ReLU}\left(\tfrac{h_{\text{gate}}}{p(h_{\text{gate}})}\right)$$

where $p(h_{\text{gate}})$ is the normalizer (in Section 2, $\sqrt{\|h_{\text{gate}}\|_2^2+\varepsilon}$), with ReLU determining the routed subset and Softplus modulating activation magnitudes independently.

5.3 Budget Considerations

Theorem 2.5 demonstrates that reducing the rank of the first-layer (routing) weights $W^{(1)}$ is less detrimental than shrinking the rank of the second-layer (action) weights $W^{(2)}$, since $W^{(2)}$ defines the update subspace. Thus, HyperGLU can trade QK width for sequence mixing richness, matching or surpassing ReLU-attention expressivity at constant parameter cost.

5.4 Parameter Cost Matching

Given a classical ReLU-attention head cost $P_{\rm att}(d_{qk},d_{vo})=2d\,d_{qk}+2d\,d_{vo}$, HyperGLU maintains this envelope by choosing $d_{qk}^{\rm hypo}<d_{qk}^{\rm base}$ such that $2d\,(d_{qk}^{\rm base}-d_{qk}^{\rm hypo}) \ge O(t\,r_s)$, allocating the surplus to temporal DPLR sequence mixing.
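
The budget inequality can be checked with simple arithmetic. The sketch below is illustrative only: the sizes are hypothetical, the baseline $d_{qk}^{\rm base}=d/2$ is an assumed configuration, and `seq_mix_params` is an assumed accounting of the $O(t\,r_s)$ term (a diagonal, two $t\times r_s$ factors, and a $d\times r_s$ gate projection per mixer), not a count taken from the paper.

```python
def attention_head_params(d, d_qk, d_vo):
    """P_att = 2 d d_qk + 2 d d_vo for one head."""
    return 2 * d * d_qk + 2 * d * d_vo


def qk_surplus(d, d_qk_base, d_qk_hypo):
    """Parameters freed per head by narrowing the QK width."""
    return 2 * d * (d_qk_base - d_qk_hypo)


def seq_mix_params(d, t, r_s, n_mixers=2):
    """ASSUMED per-head cost of DPLR mixing: for each mixer, a length-t
    diagonal, two t x r_s factors, and a d x r_s gate projection."""
    return n_mixers * (t + 2 * t * r_s + d * r_s)


d, t, r_s = 1024, 4096, 16               # hypothetical sizes
d_qk_base, d_qk_hypo = d // 2, d // 8    # assumed baseline vs. reduced QK width

surplus = qk_surplus(d, d_qk_base, d_qk_hypo)
cost = seq_mix_params(d, t, r_s)
print(surplus, cost, surplus >= cost)
```

Under these assumptions the freed QK parameters comfortably cover the sequence-mixing parameters; the margin shrinks as the context length $t$ grows relative to $d$.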

6. Practical Considerations and Hyperparameters

Recommended settings for typical sequence modeling tasks include:

  • Model width $d$, number of heads $n_{\rm head}=2$.
  • Feature ranks $d_{qk}=d/8$, $d_{vo}=d/2$ for two heads, preserving $4d^2$ total parameters.
  • Sequence mixing rank $r_s=16$.
  • Diagonal cores $D^{(j)}=I+\mathrm{diag}(p^{(j)})$ with $p^{(j)}\in\mathbb{R}^t$ learned.
  • Gates use $\mathrm{Sigmoid}$; the stabilization parameter is $\varepsilon=10^{-12}$.
  • Depthwise convolution (kernel size 4) can optionally enhance mixing.
  • Lag-ordered prefix is stored as a single buffer; all DPLR mixing is achieved via two low-rank GEMMs plus a fused ‘epilogue’ kernel.
  • Training complexity is dominated by $O(T^2 d)$, with an additional $O(T^2 r_s)$ for sequence mixing; per-step inference is $O(td)+O(tr_s)$.

HyperGLU blocks can be efficiently implemented using fused kernels (e.g., Triton), with memory and computation scaling controlled by the rank and width hyperparameters, supporting deployment with efficiency similar to or better than that of conventional attention heads (Lu et al., 13 Feb 2026).

References (1)
