
gMLP: Gated Multilayer Perceptron

Updated 7 March 2026
  • gMLP is a deep learning model that replaces traditional self-attention with a spatial gating mechanism embedded in an MLP block.
  • It achieves efficient cross-token interaction with lower computational cost, yielding competitive image classification and language understanding results.
  • Variants like CgMLP introduce convolutional operations to capture local spatial features and improve generalization on tasks such as CIFAR-100.

Gated Multilayer Perceptron (gMLP) is a deep learning architecture that incorporates a spatial gating mechanism within a multilayer perceptron (MLP) block, explicitly designed to replace or augment self-attention in transformer models for both vision and language tasks. The gMLP block achieves cross-token interaction without the computational complexity of self-attention and demonstrates competitive performance in large-scale classification and natural language understanding settings. Key variants and extensions—such as convolutional gMLPs (CgMLPs) and hybrids with attention—advance this family of models while providing insights into parameter efficiency, inductive biases, and scalability (Liu et al., 2021, Rajagopal et al., 2021).

1. Architecture and Mathematical Formulation

The core of gMLP is the replacement of the traditional self-attention block with a Spatial Gating Unit (SGU) inside an MLP-based residual block. Consider an input matrix $X \in \mathbb{R}^{n \times d}$ (where $n$ is the number of tokens/patches and $d$ is the hidden dimension). The standard gMLP block proceeds through the following stages (Liu et al., 2021):

  • Layer Normalization: $\hat{X} = \mathrm{LayerNorm}(X)$
  • Channel Expansion: $Z = \hat{X} U$, with $U \in \mathbb{R}^{d \times e}$ ($e \gg d$)
  • Nonlinearity: $Z' = \mathrm{GeLU}(Z)$
  • Split: divide $Z'$ into $[Z_1, Z_2]$, each $\in \mathbb{R}^{n \times (e/2)}$
  • Spatial Gating Unit (SGU):
    • Normalize the gating stream: $V = \mathrm{LayerNorm}(Z_2)$
    • Linear spatial projection across tokens (an $n \times n$ matrix $W$): $G = W V + b$
    • Element-wise gating: $\tilde{Z} = Z_1 \odot G$
  • Channel Reduction: $Y = \tilde{Z} V'$, with $V' \in \mathbb{R}^{(e/2) \times d}$
  • Residual Connection: $\mathrm{Output} = X + Y$

This spatial gating—multiplying one stream with a spatially-mixed transformation of the other—constitutes the fundamental means by which information is propagated across tokens, without requiring explicit self-attention (Liu et al., 2021, Rajagopal et al., 2021).
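
For concreteness, the block above can be written out in code. The following is a minimal PyTorch sketch of a single gMLP block following the stages listed above; the module and argument names (SpatialGatingUnit, gMLPBlock, d, e, n) are illustrative rather than taken from a reference implementation.

import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    # Splits channels, normalizes the gating half, mixes it across tokens with a
    # learned n x n projection W (plus per-token bias b), and gates the other half.
    def __init__(self, e, n):
        super().__init__()
        self.norm = nn.LayerNorm(e // 2)
        self.proj = nn.Linear(n, n)                  # spatial projection over tokens
        nn.init.normal_(self.proj.weight, std=1e-6)  # W near zero at initialization
        nn.init.ones_(self.proj.bias)                # b = 1, so the gate starts ~identity

    def forward(self, z):                            # z: (batch, n, e)
        z1, z2 = z.chunk(2, dim=-1)                  # each: (batch, n, e/2)
        v = self.norm(z2)
        g = self.proj(v.transpose(1, 2)).transpose(1, 2)  # G = W V + b, mixed along tokens
        return z1 * g                                # element-wise gating

class gMLPBlock(nn.Module):
    def __init__(self, d, e, n):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.expand = nn.Linear(d, e)                # U: d -> e
        self.act = nn.GELU()
        self.sgu = SpatialGatingUnit(e, n)
        self.reduce = nn.Linear(e // 2, d)           # V': e/2 -> d

    def forward(self, x):                            # x: (batch, n, d)
        y = self.act(self.expand(self.norm(x)))
        y = self.reduce(self.sgu(y))
        return x + y                                 # residual connection

A full gMLP model stacks many such blocks on top of patch or token embeddings, followed by a task-specific head.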

2. Spatial Gating Unit: Mechanisms and Role

The SGU is the sole component responsible for token-to-token interaction. The gating is typically realized via a learnable projection across all spatial (token) positions:

$s(Z) = Z_1 \odot (W V + b)$

with $W \in \mathbb{R}^{n \times n}$ a learnable spatial projection and $b$ a bias broadcast across the channel dimension. $W$ is initialized near zero and $b$ to ones, ensuring that spatial mixing is introduced progressively during training. The SGU can be ablated: linear, additive, and multiplicative forms have been tested, but the multiplicative split approach yields the highest accuracy and best matches transformer perplexity on masked language modeling benchmarks (Liu et al., 2021).

This gating mechanism allows static spatial mixing, mimicking aspects of convolution and attention, while maintaining parameter and computational efficiency.
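
Because $W$ starts near zero and $b$ at one, the gate $G$ is approximately all-ones at initialization, so each block initially behaves like an ordinary per-token MLP, and spatial mixing appears only as $W$ moves away from zero. A quick sanity check, reusing the hypothetical SpatialGatingUnit sketch from Section 1 (sizes chosen for illustration):

import torch

# Illustrative sizes; SpatialGatingUnit comes from the sketch in Section 1.
sgu = SpatialGatingUnit(e=512, n=196)
z = torch.randn(2, 196, 512)
z1, _ = z.chunk(2, dim=-1)

out = sgu(z)
# With W ~ 0 and b = 1 the gate is ~1 everywhere, so the freshly initialized
# SGU passes the first stream z1 through almost unchanged.
print(torch.allclose(out, z1, atol=1e-3))            # expected: True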

3. Complexity, Parameters, and Scaling Properties

gMLP blocks exhibit favorable parameter and computational scaling compared to standard transformers. For input length nn and hidden size dd (Liu et al., 2021):

  • gMLP block: $O(nde + n^2 e)$ multiply-adds per layer
  • Transformer block: $O(nd^2 + n^2 d)$ multiply-adds per layer

For typical configurations (with $e \approx 4d$), both blocks scale quadratically with sequence length, but gMLP's quadratic term is roughly half that of self-attention when $e \approx d$. Parameter counts per block are as follows (a worked numerical example follows the list):

  • gMLP: $2de + n^2/2$ (the $n^2/2$ term comes from the SGU)
  • Transformer: $6d^2$
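
To make these counts concrete, the short script below evaluates the leading terms above for an illustrative configuration ($n = 256$ tokens, $d = 512$, $e = 4d$); the shapes are assumptions chosen for intuition, not figures reported in the papers.

# Leading per-block terms from the text above, for an illustrative configuration.
n, d = 256, 512          # tokens and hidden size (assumed for illustration)
e = 4 * d                # expanded channel dimension

gmlp_flops = n * d * e + n ** 2 * e      # O(nde + n^2 e) multiply-adds
xfmr_flops = n * d ** 2 + n ** 2 * d     # O(nd^2 + n^2 d) multiply-adds

gmlp_params = 2 * d * e + n ** 2 / 2     # 2de + n^2/2 parameters
xfmr_params = 6 * d ** 2                 # 6d^2 parameters

print(f"gMLP block:        ~{gmlp_flops / 1e6:.0f}M mult-adds, ~{gmlp_params / 1e6:.2f}M params")
print(f"Transformer block: ~{xfmr_flops / 1e6:.0f}M mult-adds, ~{xfmr_params / 1e6:.2f}M params")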

Empirical results demonstrate that gMLP scales similarly to transformers under increases in data, depth, and model size. Pretraining perplexity on masked language modeling and top-1 accuracy on ImageNet track the scaling laws previously established for transformer-based architectures (Liu et al., 2021).

4. Extension: Convolutional Gated MLP (CgMLP)

CgMLP extends gMLP by introducing convolutional operators into both the input stage and within the gating mechanism (Rajagopal et al., 2021). The modifications are:

  • Front-end CNN Backbone: One or two $5 \times 5$ 2D convolutional layers (with 64 filters, stride 2, ReLU, and max-pooling) preprocess the raw image, yielding a $16 \times 16 \times 64$ feature map from $256 \times 256$ RGB input.
  • Convolutional Gating: Inside the gMLP block, the dense spatial projection $W_s$ is replaced by a $1 \times 5$ convolution along spatial dimensions, enforcing locality and weight sharing (a sketch of this convolutional gate follows the block below).
  • Overall Block:

def CgMLP_Block(X):
    # Expand channels with a dense projection (as in gMLP)
    Z = Dense_expand(X)
    # Split into the identity stream X1 and the gating stream X2
    X1, X2 = split(Z, axis=-1)
    # Convolutional spatial projection (1x5) replaces the dense n x n matrix
    G = Conv1x5(X2)
    # Element-wise gating
    S = X1 * G
    # Contract channels back to the model dimension
    Y = Dense_contract(S)
    # Residual connection
    return X + Y
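
The Conv1x5 step above is the key departure from gMLP. Below is a minimal PyTorch sketch of such a convolutional gate, assuming a depthwise 1D convolution over the token axis with "same" padding (both are illustrative choices; the source only specifies a 1×5 convolution along the spatial dimensions).

import torch
import torch.nn as nn

class ConvSpatialGate(nn.Module):
    # Gating branch of CgMLP: the dense n x n spatial projection is replaced by a
    # 1x5 convolution along the token/spatial axis (locality + weight sharing).
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        # Depthwise 1D convolution over tokens; groups/padding are illustrative choices.
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def forward(self, x2):                     # x2: (batch, n, channels)
        v = self.norm(x2).transpose(1, 2)      # -> (batch, channels, n)
        g = self.conv(v).transpose(1, 2)       # convolve along tokens, back to (batch, n, channels)
        return g

An instance such as ConvSpatialGate(channels=e // 2) would take the place of Conv1x5 in the CgMLP_Block pseudocode above.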

These changes provide additional inductive bias, favoring the learning of local spatial patterns, and substantially alleviate overfitting on small datasets. For instance, on CIFAR-100, CgMLP variants (e.g., 1CNN-4gMLP) match or outperform the original gMLP, continue to improve with more epochs, and generalize better than the standard gMLP, which tends to overfit early (Rajagopal et al., 2021).

5. Empirical Results and Comparative Performance

gMLP achieves ImageNet and masked language modeling performance comparable to transformer-based baselines:

  • Vision: gMLP-B (73M parameters) achieves 81.6% top-1 on ImageNet, matching DeiT-B (86M, 81.8%). gMLP-S (20M) attains 79.6% (comparable to ResMLP-24 at 79.4%) (Liu et al., 2021).
  • Language: gMLP nearly matches BERT in pretraining perplexity and finetuning on various GLUE tasks. On SQuAD v1.1, gMLP_large reaches 89.5% F1 versus BERT_base at 88.5%. Increasing model size narrows the gap on more challenging sentence-pair tasks. Augmenting gMLP with a “tiny-attention” head (aMLP) can recover or surpass transformer performance on tasks benefiting from dynamic token alignment (Liu et al., 2021).
Model        Params (M)   ImageNet Top-1 (%)   SQuAD v1.1 F1 (%)
gMLP-B       73           81.6                 —
DeiT-B       86           81.8                 —
gMLP_large   368          —                    89.5
BERT_base    110          —                    88.5

CgMLP on CIFAR-100 further demonstrates that incorporating convolutions increases sample efficiency and spatial localization, leading to improved generalization. Specifically, 1CNN-4gMLP achieves 0.80 top-1 validation accuracy on CIFAR-100 versus 0.79 for the vanilla gMLP-4gMLP (Rajagopal et al., 2021).

6. Inductive Bias, Visualization, and Interpretability

gMLP’s SGU imparts global spatial modeling but lacks explicit inductive biases for locality typically provided by convolutions. Visual analysis of activations in CgMLP shows that learned feature maps focus on salient object regions and outlines (e.g., flowers, animal faces), with crisp contours indicating local pattern formation (Rajagopal et al., 2021). This suggests convolutional gating enhances spatial structure capture, which vanilla gMLP cannot isolate as cleanly.

gMLP’s learned spatial filters (when examined post-training) display locality and shift-invariance properties similar to data-driven convolutions, confirming that the SGU is capable of modeling useful spatial dependencies if sufficiently regularized and trained at scale (Liu et al., 2021).
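
One way to reproduce such an inspection, assuming a trained gMLPBlock from the earlier sketch (the attribute path block.sgu.proj comes from that sketch, not from a published implementation), is to reshape rows of the learned $n \times n$ projection $W$ onto the 2D patch grid and plot them:

import matplotlib.pyplot as plt

def plot_spatial_filters(block, grid=(14, 14), rows=4, cols=4):
    # Reshape evenly spaced rows of the learned SGU projection W (n x n) onto the
    # 2D patch grid and display them as spatial filters.
    W = block.sgu.proj.weight.detach().cpu()   # shape (n, n), with n = grid[0] * grid[1]
    step = W.shape[0] // (rows * cols)
    _, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows))
    for i, ax in enumerate(axes.flat):
        ax.imshow(W[i * step].reshape(grid), cmap="RdBu")
        ax.axis("off")
    plt.show()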

7. Limitations, Applications, and Variant Architectures

Despite competitive benchmarks, gMLP exhibits:

  • Minimal inherent inductive bias: Its dense gating must learn all spatial structure, leading to overfitting in low-data regimes and absence of local priors (edges, textures) (Rajagopal et al., 2021).
  • Patch-based constraints: Processing fixed-size patches may force relearning of low-level features, which convolutions provide efficiently.

Research has demonstrated that hybrid architectures (e.g., aMLP, CgMLP) can mitigate these issues through the addition of lightweight attention modules or CNN-based gating. gMLP and its variants are applicable to both large-scale vision and language pretraining, with transfer learning and finetuning tasks on datasets such as ImageNet, SQuAD, and CIFAR.
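
As a rough illustration of the lightweight-attention idea, the sketch below implements a single-head attention with a small head size; in an aMLP-style hybrid, its output (projected to the gating width) is added to the spatial gate inside the SGU. The class name, head size, and wiring details here are assumptions for illustration rather than the exact aMLP recipe.

import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    # A single-head attention with a small head size, meant to be added to the
    # spatial gate of a gMLP block to form an aMLP-style hybrid.
    def __init__(self, d_in, d_out, d_attn=64):
        super().__init__()
        self.qkv = nn.Linear(d_in, 3 * d_attn)
        self.out = nn.Linear(d_attn, d_out)
        self.scale = d_attn ** -0.5

    def forward(self, x):                          # x: (batch, n, d_in)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out(attn @ v)                  # (batch, n, d_out)

In such a hybrid, the output of TinyAttention(d, e // 2) applied to the block input would be added to the gate G before the element-wise multiplication in the SGU.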

The wider implication is that explicit self-attention is not strictly necessary for scalable transformer-style models. Given appropriate capacity and regularization, gating-based architectures can match or surpass self-attention for a variety of tasks—although tiny attention heads or added convolutional priors may be required for challenging scenarios involving alignment or small data (Liu et al., 2021, Rajagopal et al., 2021).

