gMLP Architecture Overview
- The gMLP architecture is a neural network model that integrates explicit gating mechanisms with static spatial mixing to model cross-token interactions efficiently across various domains.
- It employs spatial gating units that split, transform, and gate features, enabling fixed yet expressive feature mixing without the dynamic complexity of self-attention.
- gMLP-based models achieve competitive performance in vision, language, graph, and tabular tasks while offering enhanced parameter efficiency and parallelism.
Gated Multilayer Perceptron (gMLP) architectures constitute a family of neural network models that enhance traditional multilayer perceptrons through explicit gating mechanisms and spatial (or feature) mixing operations. Originally proposed as a scalable alternative to self-attention architectures such as Transformers, gMLP and its derivatives have proven effective across a variety of domains, including vision, natural language processing, graph learning, and tabular data, by incorporating learned gating units to model cross-feature or cross-token interactions in a parameter-efficient, highly parallelizable manner.
1. Core Principles and Architectural Design
The canonical gMLP model is a stack of identical blocks, each consisting of several key stages:
- Normalization (e.g., LayerNorm or BatchNorm).
- Channel (Feedforward) Projection, typically expanding the hidden dimension.
- Nonlinear Activation (commonly GeLU or its variants).
- Spatial Gating Unit (SGU), which splits the hidden channels, applies a learned linear projection in the spatial (token) dimension to one half, and gates the other half by multiplication:
  $$ s(Z) = Z_1 \odot (W Z_2 + b), $$
  where $Z_1$, $Z_2$ are splits of the input $Z$ along the channel dimension, $W \in \mathbb{R}^{n \times n}$ is a trainable spatial mixing matrix, $b$ is a bias vector, and $\odot$ denotes element-wise multiplication.
- Projection Back to Model Dimension, followed by a residual connection.
Key distinctions from classic MLPs include the presence of the spatial gating layer, enabling cross-token information flow, and a parameterization in which all “mixing” is statically learned, contrasting with dynamic attention patterns. Unlike self-attention, spatial mixing in gMLP is fixed after training and does not depend on the content of input tokens during inference (Liu et al., 2021).
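The block structure above can be summarized in a short PyTorch-style sketch (a minimal illustration under assumed dimension names `d_model`, `d_ffn`, and `seq_len`; not the authors' reference implementation):

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Splits channels, spatially projects one half, and uses it to gate the other."""
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Static spatial mixing matrix W (seq_len x seq_len) plus bias b.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Initialize the gate near identity (W ~ 0, b = 1) so training starts
        # close to a plain FFN, as suggested in the gMLP paper.
        nn.init.normal_(self.spatial_proj.weight, std=1e-6)
        nn.init.constant_(self.spatial_proj.bias, 1.0)

    def forward(self, x):                      # x: (batch, seq_len, d_ffn)
        z1, z2 = x.chunk(2, dim=-1)            # split along the channel dimension
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return z1 * z2                         # element-wise gating

class GMLPBlock(nn.Module):
    """Norm -> channel expansion -> GeLU -> SGU -> projection back, with residual."""
    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.act = nn.GELU()
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        shortcut = x
        x = self.proj_out(self.sgu(self.act(self.proj_in(self.norm(x)))))
        return x + shortcut
```

Stacking such blocks, together with a patch or token embedding and a task head, yields the full gMLP model.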
2. Gating Mechanisms: Mathematical and Practical Aspects
The gating mechanism central to gMLP architectures operates by learning to modulate (allow or suppress) the contribution of certain features or tokens:
- Channels or tokens are partitioned; a linear projection (or more generally, an MLP) is applied to one half, whose output is used as a gate (often post-activated by a function such as sigmoid, Swish, or SiLU) for the other half.
- A typical formula, as seen in gMLP or its variants (including TKGMLP and giMLP), is:
  $$ g(x) = \phi(W_1 x + b_1) \odot (W_2 x + b_2), $$
  with weights $W_1, W_2$, biases $b_1, b_2$, and a nonlinear function $\phi$ such as SiLU.
- In inhibition-enhanced variants (giMLP), an additional learned inhibition function can dampen or restrict certain activations, further regularizing the gated pathway (Kang et al., 2022).
This design enables highly expressive, content-adaptive transformations, offering improved representational power even without explicit attention mechanisms.
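As a concrete instance of this channel-gating pattern, the SwiGLU-style gate referenced for TKGMLP can be sketched as follows (a hedged illustration of the general formula above, not the exact TKGMLP layer):

```python
import torch
import torch.nn as nn

class SwiGLUGate(nn.Module):
    """Gated channel projection: one branch is passed through SiLU and
    multiplies (gates) the other branch element-wise."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_in, d_hidden)   # branch producing the gate
        self.w_value = nn.Linear(d_in, d_hidden)  # branch being gated
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.w_gate(x)) * self.w_value(x)
```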
3. Spatial and Cross-Feature Mixing
gMLP generalizes token mixing via static spatial projections. The SGU's projection matrix $W$ directly specifies how the output at each spatial position (or feature bin) depends on others. Since $W$ does not change per input, all cross-feature dependencies are learned during training and fixed at inference.
- In vision, with input $Z \in \mathbb{R}^{n \times d}$ ($n$ tokens, $d$ features), the spatial gating unit allows every position to incorporate information from all other tokens, with the degree of influence learned via $W$.
- Extensions such as CycleMLP and Caterpillar replace global spatial mixing with mechanisms that aggregate local context via cyclic or shift-and-concatenate operations, achieving better efficiency and scale flexibility for dense prediction tasks (Chen et al., 2021, Sun et al., 2023).
- In graph domains, feature-message passing gMLP architectures aggregate multi-hop neighbor information through precomputed feature messages, decoupling the update process from the propagation step for greatly enhanced scalability (Zhang et al., 2021).
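The decoupling described in the last bullet can be sketched as follows (a minimal illustration under assumed names `adj_norm` and `num_hops`; the papers' actual propagation operators and normalization may differ):

```python
import torch

def precompute_feature_messages(adj_norm, x, num_hops):
    """Precompute multi-hop feature messages [X, A X, A^2 X, ...] once, so the
    MLP/gMLP update step is decoupled from graph propagation during training."""
    messages = [x]
    for _ in range(num_hops):
        x = adj_norm @ x                      # one propagation step over the graph
        messages.append(x)
    return torch.cat(messages, dim=-1)        # concatenate along the feature axis
```

The concatenated messages are computed once and then fed to an ordinary MLP/gMLP classifier, so the training loop never touches the graph structure itself.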
4. Variants and Extensions Across Domains
Numerous variants of gMLP have been proposed to address specific domain requirements:
| Variant | Domain(s) | Key Innovations |
|---|---|---|
| gMLP (Liu et al., 2021) | Vision, NLP | Spatial gating, static mixing |
| GMLP (Kachuee et al., 2019) | Tabular | Group-wise MLP, adaptive grouping |
| Graph gMLP (Zhang et al., 2021) | Graphs | Message passing–MLP separation |
| CycleMLP (Chen et al., 2021) | Vision (Dense) | Cycle FC, linear spatial complexity |
| S-MLPv2 (Yu et al., 2021) | Vision (ImageNet) | Split-attention fusion, pyramid stages |
| giMLP (Kang et al., 2022) | Vision, Language | Gating with inhibition for regularization |
| Caterpillar (Sun et al., 2023) | Vision | Shifted-pillars-concatenation, window-free locality |
| CgMLP (Rajagopal et al., 2021) | Vision | Pre-conv layers + flexible gating |
| TKGMLP (Zhang et al., 3 Dec 2024) | Tabular (Finance) | SwiGLU gating, quantile linear encoding |
These variants adapt the core gMLP principle—static parameterized feature mixing with gating—and refine it based on computational, statistical, and domain-specific considerations.
5. Computational Complexity and Scaling Considerations
A central advantage of gMLP-based architectures is their computational simplicity:
- Parameter and FLOPs Efficiency: For an input matrix $Z \in \mathbb{R}^{n \times d}$ (e.g., image patches or tokens), spatial mixing is performed via an $n \times n$ matrix multiplication whose size depends on the number of spatial positions. This introduces $O(n^2)$ cost in “vanilla” gMLP, where $n$ is the number of tokens/pixels.
- Scalability: Hierarchical or local mixing variants such as CycleMLP, S-MLPv2, and Caterpillar use windowed or cyclic mixing to reduce complexity to linear in $n$, removing the size bottleneck and enabling efficient processing of high-resolution images and dense prediction tasks (Chen et al., 2021, Yu et al., 2021, Sun et al., 2023); a rough numerical comparison follows this list.
- Parallelism: All mixing operations are statically parameterized and can be parallelized across modern accelerators, unlike dynamic attention mechanisms.
- Graph Domains: In feature-message passing gMLP for GNNs, precomputing all-hop feature propagation reduces distributed communication costs by orders of magnitude versus classic neural message passing (Zhang et al., 2021).
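As a rough numerical illustration of the quadratic-versus-linear point above (the token count of 196 corresponds to a 224×224 image with 16×16 patches; the window size of 7 is an assumed example, not a specific model's setting):

```python
# Spatial-mixing parameter count: global (vanilla gMLP) vs. a windowed/local variant.
n = 196          # tokens for a 224x224 image with 16x16 patches
window = 7       # assumed local neighbourhood size for a local-mixing variant

global_mixing_params = n * n       # dense n-by-n spatial projection: O(n^2)
local_mixing_params = n * window   # each token mixes with a fixed-size neighbourhood: O(n)

print(global_mixing_params)   # 38416
print(local_mixing_params)    # 1372
```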
6. Empirical Performance and Application Scope
- Vision: gMLP-B achieves top-1 accuracy of approximately 81.6% on ImageNet-1K, competitive with DeiT-B while using fewer parameters than alternatives like MLP-Mixer (Liu et al., 2021). CycleMLP-B2 matches this accuracy with only 3.9G FLOPs versus gMLP-B’s 15.8G, exemplifying the benefit of local mixing (Chen et al., 2021). S-MLPv2 achieves 83.6% top-1 accuracy at 16.3G FLOPs with a split-attention, spatial-shift-based MLP block (Yu et al., 2021).
- Tabular Data: GMLP architectures outperform traditional MLPs and regularized networks even as parameter count decreases. For example, Group-Connected MLP (GMLP) achieves state-of-the-art performance on permutation-invariant CIFAR-10 with substantial parameter savings (Kachuee et al., 2019). In large-scale financial data, TKGMLP’s gMLP backbone combined with quantile linear encoding (QLE) yields improved AUC and KS scores over tree-based models (Zhang et al., 3 Dec 2024).
- Graph Learning: Feature-message passing gMLP outperforms APPNP and GAT on multiple node classification benchmarks and achieves up to 74.4× speedups in large-scale distributed settings (Zhang et al., 2021).
- Language: gMLP achieves masked language modeling perplexity similar to Transformers in BERT-like settings (e.g., 4.35 vs. 4.26 for a Transformer baseline) and is only marginally behind or on par on downstream NLP tasks, with model scale closing any remaining gap (Liu et al., 2021).
7. Theoretical and Algebraic Frameworks
- The design and combination of gMLP and related architectures can be formalized via “MLP algebra” (Peng, 2017), which provides the mathematical tools (sum, difference, complement, product, and identical extension) for composing, decomposing, and aligning MLPs (and by extension, gMLPs) at the network level.
- Such algebraic formulations inform modular construction, supporting, for example, the combination of variable-depth blocks and the design of complementary gating patterns.
- Group-wise and split-attention mechanisms within gMLPs can be interpreted as extensions of these principles, enabling interpretable and modular representations (Kachuee et al., 2019).
8. Limitations and Open Directions
- Fixed Mixing Patterns: The static structure of the spatial gating mechanism may underperform on tasks demanding adaptive cross-token alignment; introducing lightweight self-attention or dynamically parameterized gates can mitigate this shortfall (e.g., aMLP enhancements) (Liu et al., 2021).
- Resolution Dependencies: Vanilla gMLP’s spatial projection matrix $W$ is tied to the sequence/image size, complicating transfer to variable-sized inputs; this limitation is specifically alleviated by hierarchical or cycle-based variants (Chen et al., 2021, Yu et al., 2021). The sketch after this list makes the size coupling concrete.
- Locality Modeling: Early gMLP models lacked explicit local feature modeling, which is essential for dense prediction tasks (segmentation, detection). Window-free local mixing (Caterpillar) and convolutional front-ends (CgMLP) address this gap (Sun et al., 2023, Rajagopal et al., 2021).
- Activation Functions: In inhibition-based gating (giMLP), the choice of activation (favoring functions with a short and smooth negative tail) is critical to maintaining effective gradient flow and robust regularization (Kang et al., 2022).
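The resolution-dependency limitation can be made concrete with a minimal, self-contained sketch (token counts 196 and 324 are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

# The spatial projection of a vanilla SGU is an n-by-n map tied to the training length n.
seq_len = 196
spatial_proj = nn.Linear(seq_len, seq_len)     # static token-mixing matrix

z = torch.randn(2, 64, seq_len)                # (batch, channels, tokens), as trained
mixed = spatial_proj(z)                        # fine: token count matches

z_other = torch.randn(2, 64, 324)              # a different (e.g. higher-resolution) input
# spatial_proj(z_other)                        # RuntimeError: the mixing matrix expects 196 tokens
```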
9. Summary
gMLP architectures and their extensions constitute a technically rich class of models that harness static but learnable feature or spatial mixing via gating—a concept generalizing and in some settings replacing dynamic attention. They offer competitive performance, resource efficiency, and scaling properties in vision, language, graph, and tabular domains. The design is underpinned by mathematical principles that promote compositionality and modularity, and is extensible both through architectural innovations (cycle, shift, attention fusion) and domain-specific pre/post-processing (e.g., quantile linear encoding for finance). These properties make gMLP and related MLP-based models significant both as an alternative and as a complement to classical attention-based architectures in contemporary deep learning research and deployment (Peng, 2017, Kachuee et al., 2019, Zhang et al., 2021, Liu et al., 2021, Chen et al., 2021, Yu et al., 2021, Rajagopal et al., 2021, Kang et al., 2022, Sun et al., 2023, Zhang et al., 3 Dec 2024).