HyperGLU Block Design
- HyperGLU is a neural network block that reformulates autoregressive self-attention as a dynamic two-layer MLP with context-dependent parameters.
- It replaces softmax normalization with gated linear units and employs diagonal-plus-low-rank mixing for flexible, efficient feature and sequence routing.
- The design uses reverse-offset layouts to enforce autoregressive consistency while matching or outperforming traditional attention within fixed parameter budgets.
HyperGLU is a neural network block introduced within the HyperMLP framework, which reconceptualizes autoregressive self-attention as a dynamic two-layer multi-layer perceptron (MLP) whose parameters are instantiated from context history. HyperGLU replaces softmax-normalized attention with a gated linear unit (GLU) structure that decouples routing and weighting across a dynamically constructed memory pool, providing greater representational flexibility and dynamically learned mixing in both feature and sequence spaces. The architecture leverages a "reverse-offset" (lag) layout to ensure autoregressive truncation consistency and employs diagonal-plus-low-rank (DPLR) mixing for parameter efficiency and expressivity. This design consistently matches or outperforms classical softmax-attention heads under fixed parameter budgets, with theoretical and empirical advantages in context mixing and boundary structure (Lu et al., 13 Feb 2026).
1. Reformulation of Attention as Dynamic MLP
Traditional self-attention computes a weighted sum of values using softmax-normalized query-key similarities, effectively restricting attention scores to a probability simplex. HyperGLU, following the paradigm proposed by Lu & Yang (2026), interprets an autoregressive (AR) attention head as a depth-two MLP with context-length-dependent width, where the history dynamically determines first- and second-layer weight matrices through factorized hypernetwork parameterizations. Rather than probabilities, scores are generic pre-activations, allowing for broader forms of input-conditioned selection via nonlinearity.
Unlike softmax attention, which enforces convex combinations, HyperGLU employs GLU gating: ReLU or GLU activations over the slot axis permit selection/routing of tokens independently of normalization, resulting in more flexible and expressive context mixing, including learnable sequence mixing (see Table 1, section 4 below). The activation function design further decouples magnitude modulation from selection, with the gate (ReLU) and scale (Softplus) components acting independently.
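The contrast between simplex-constrained weights and gate-based selection can be illustrated with a minimal sketch. The score values and helper names below are hypothetical and chosen only for illustration; this is not the HyperGLU activation itself, just the difference between softmax and a plain ReLU gate over the slot axis.

```python
import math


def softmax(scores):
    """Softmax over the slot axis: outputs are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_gate(scores):
    """ReLU over the slot axis: slots with non-positive scores are dropped."""
    return [max(0.0, s) for s in scores]


# Hypothetical pre-activation scores for four context slots.
scores = [2.0, -1.0, 0.5, -3.0]

soft = softmax(scores)    # every slot receives some weight
gate = relu_gate(scores)  # only slots with positive scores survive

print("softmax weights:", soft, "sum =", sum(soft))
print("ReLU gate      :", gate)
```

Softmax assigns nonzero weight to every slot and ties the weights together through the normalizer, whereas the ReLU gate zeroes out slots outright and leaves the surviving magnitudes unconstrained.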
2. Mathematical Specification
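Let $x_t$ denote the input at time $t$, and $X$ the lag-ordered prefix (newest first). The HyperGLU block, viewed as a residual update, is:

$$o_t = x_t + a_t W_2,$$

where $a_t$ is the gated activation of the first-layer pre-activations $x_t W_1$, and $W_1$, $W_2$ are the context-instantiated first- and second-layer weights (notation as in the forward pass of Section 3).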
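HyperGLU factorizes both $W_1$ and $W_2$ using context-dependent hypernetworks:
- $W_1 = L_1(x_t)\, X^\top R_1$ and $W_2 = R_2^\top X\, L_2(x_t)$, with feature-side factors $L_j(x_t)$ and sequence mixers $R_j$ for $j \in \{1, 2\}$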
The DPLR parameterization for each sequence mixer $R_j$:

$$R_j = I_t + \operatorname{diag}(p_j) + A_j \operatorname{diag}(s_j) B_j^\top,$$
with low ranks , typically .
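To make the efficiency argument concrete, the following sketch applies a DPLR mixer to a row vector without building the $t \times t$ matrix. It assumes the form used in the Section 3 pseudocode, `R = I_t + diag(p) + A @ diag(s) @ B.T`, applied from the right as in `X.T @ R`. The function names and toy values are hypothetical, and the dense version exists only as a reference check.

```python
def dplr_right_apply(u, p, A, s, B):
    """Compute u @ R for R = I + diag(p) + A @ diag(s) @ B.T without forming R.

    u: length-t row vector; p: length-t diagonal; A, B: t x r factors
    (lists of rows); s: length-r gate vector.
    """
    t, r = len(u), len(s)
    # Contract with A first, then gate: z = (u @ A) * s, costing O(t * r).
    z = [s[k] * sum(u[i] * A[i][k] for i in range(t)) for k in range(r)]
    # Identity and diagonal parts act elementwise; the low-rank part expands via B.
    return [u[j] * (1.0 + p[j]) + sum(z[k] * B[j][k] for k in range(r))
            for j in range(t)]


def dplr_dense(p, A, s, B):
    """Reference only: materialize the full t x t mixer for comparison."""
    t, r = len(p), len(s)
    return [[(1.0 + p[i] if i == j else 0.0)
             + sum(A[i][k] * s[k] * B[j][k] for k in range(r))
             for j in range(t)] for i in range(t)]


# Toy example with t = 3 slots and rank r = 1 (values are illustrative).
u = [1.0, 2.0, 3.0]
p = [0.1, 0.0, -0.1]
A = [[1.0], [0.0], [2.0]]
B = [[1.0], [1.0], [0.0]]
s = [0.5]

fast = dplr_right_apply(u, p, A, s, B)
R = dplr_dense(p, A, s, B)
slow = [sum(u[i] * R[i][j] for i in range(3)) for j in range(3)]
print(fast, slow)
```

The low-rank path costs $O(tr)$ per vector instead of $O(t^2)$, which is why the rank directly controls the cost of sequence mixing.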
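The first-layer output is split into gate and scale branches. Activation proceeds as:

$$a_t = \mathrm{ReLU}\!\left(\frac{h^{\mathrm{gate}}}{\sqrt{\lVert h^{\mathrm{gate}} \rVert^2 + \varepsilon}}\right) \odot \mathrm{Softplus}\!\left(h^{\mathrm{scale}}\right),$$

where $h^{\mathrm{gate}}$ and $h^{\mathrm{scale}}$ are the gate- and scale-branch pre-activations and $\varepsilon$ is a small stabilization constant.
The ReLU gate controls routing by selecting which slots are active, whereas Softplus modulates the magnitude per slot.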
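A minimal sketch of this gate/scale split, following the `routed`, `scale`, and `a_t` lines of the Section 3 pseudocode, might look as follows. The function name, the default `eps=1e-6`, and the input values are assumptions made for illustration; the source does not give the stabilization constant here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gated_activation(h_gate, h_scale, eps=1e-6):
    """Return ReLU(h_gate / sqrt(||h_gate||^2 + eps)) * Softplus(h_scale).

    eps is an assumed stabilization constant, not a value from the source.
    """
    norm = math.sqrt(sum(h * h for h in h_gate) + eps)
    routed = [max(0.0, h / norm) for h in h_gate]  # which slots pass
    scale = [softplus(h) for h in h_scale]         # strictly positive magnitudes
    return [r * s for r, s in zip(routed, scale)]


# Hypothetical pre-activations for three slots.
a = gated_activation([3.0, -4.0, 0.0], [0.0, 5.0, -2.0])
print(a)
```

Slots whose gate pre-activation is non-positive are zeroed regardless of their scale value, while the Softplus branch rescales the surviving slots without ever switching one off.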
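AR-truncation consistency is enforced by the canonical top-left extension of each sequence mixer in lag order: the mixer for a shorter prefix is obtained by prefix slicing (taking the top-left block) of the mixer for a longer one, so adding more distant tokens does not change the output on recent windows.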
3. Architectural Workflow and Implementation
The overall forward pass for a single HyperGLU head at step $t$ follows:
```
G  = x_t @ W_gate                           # (1×r)
S  = x_t @ W_scale                          # (1×r)

s1 = sigmoid(G)                             # (1×r)
R1 = I_t + diag(p1) + A1 @ diag(s1) @ B1.T  # (t×t)
s2 = sigmoid(S)                             # (1×r)
R2 = I_t + diag(p2) + A2 @ diag(s2) @ B2.T  # (t×t)

W1 = L1(x_t) @ X.T @ R1                     # (d×t)
h_gate  = x_t @ W1                          # (1×t)
h_scale = ...                               # analogous computation if using distinct weights

routed = ReLU( h_gate / sqrt(||h_gate||^2 + ε) )
scale  = Softplus( h_scale )
a_t    = routed ⊙ scale                     # (1×t)

W2  = R2.T @ X @ L2(x_t)                    # (t×d)
O_t = x_t + a_t @ W2                        # (1×d)
```
In an actual implementation, the dense $t \times t$ mixers written out above are never materialized; all mixing is performed through low-rank contractions and efficient fused kernels.
4. Comparison with Softmax-Attention and GLU-MLP
| Aspect | Softmax-Attention | Standard GLU-MLP | HyperGLU |
|---|---|---|---|
| Normalization | Softmax (slot axis) | None (ReLU+LN on features) | L2 normalization + ReLU/GLU (slot axis) |
| Routing | Probabilistic weights | Feature-axis gating | Dynamic two-layer slot routing |
| Parameter Instantiation | Fixed | Static weights | (hypernetwork) |
| Sequence Mixing | Identity ($R = I$) | None | Learned DPLR in both layers |
| Expressivity | CPWL in , regions | CPWL in features | PW-smooth, warped partitions + bases |
| Budget Allocation | | | Same as softmax-attention; pays for sequence mixing by reducing the QK (routing) width |
| Key Benefit | Parallel AR & KV cache | Efficient feature gating | Richer context routing, curved boundaries |
HyperGLU generalizes both softmax-attention and standard GLU gating by introducing nonlinear, dynamically-warped gating boundaries and parameter-efficient, learned sequence mixing.
5. Expressivity and Theoretical Properties
5.1 Routing Boundary Structure
With static mixing, token-wise ReLU attention induces polyhedral partitions of the input space, segmenting by , for . HyperMLP/HyperGLU with dynamic and replaces polyhedral regions with piecewise-smooth () curved gating hypersurfaces: , where . This strictly expands the functional expressivity.
5.2 Decoupled Gating and Scaling
Proposition 2.4 formalizes that, under HyperGLU,
with ReLU determining the routed subset and Softplus modulating activation magnitudes independently.
5.3 Budget Considerations
Theorem 2.5 demonstrates that reducing the first-layer (routing) rank is less detrimental than shrinking the second-layer (action) rank, since the second layer defines the update subspace. Thus, HyperGLU can trade QK width for sequence mixing richness, matching or surpassing ReLU-attention expressivity at constant parameter cost.
5.4 Parameter Cost Matching
Given a classical ReLU-attention head cost , HyperGLU maintains this envelope by choosing such that , allocating the surplus to temporal DPLR sequence mixing.
6. Practical Considerations and Hyperparameters
Recommended settings for typical sequence modeling tasks include:
- Model width , number of heads .
- Feature ranks , for two heads, preserving total parameters.
- Sequence mixing rank .
- Diagonal cores with learned.
- Gates use , stabilization parameter .
- Depthwise convolution (kernel size 4) can optionally enhance mixing.
- Lag-ordered prefix is stored as a single buffer; all DPLR mixing achieved via two low-rank GEMMs plus a fused ‘epilogue’ kernel.
- Training complexity is dominated by , with additional for sequence mixing; per-step inference is .
HyperGLU blocks can be implemented efficiently using fused kernels (e.g., Triton), with memory and computation scaling controlled by the rank and width hyperparameters, supporting deployment with efficiency similar to or better than that of conventional attention heads (Lu et al., 13 Feb 2026).