
Squeeze-Excitation MLP Architecture

Updated 5 January 2026
  • SE-MLP is a neural network module that unifies channel attention with multi-branch dense layers to enhance feature recalibration.
  • It generalizes the conventional squeeze-and-excitation mechanism by employing parallel MLP branches, resulting in measurable gains with modest parameter increases.
  • The architecture leverages residual connections and global context pooling to boost accuracy in tasks such as image classification, pose forecasting, and signal regression.

The squeeze and excitation multi-layer perceptron (SE-MLP) is a neural network architectural module that unifies channel attention and global context integration through multi-branch dense layers embedded within residual blocks. SE-MLP generalizes the original squeeze-and-excitation (SE) mechanism by expanding its bottleneck excitation via parallel MLP branches, thereby enriching channel-wise recalibration with minimal parameter overhead. This design has been deployed effectively in convolutional backbones for vision tasks, temporal modeling for sequential data, and regression problems with physical signals. Across these domains, SE-MLP consistently yields measurable performance improvements over standard SE, basic MLPs, and transformer alternatives.

1. Architectural Foundations and Block Integration

In convolutional architectures, SE-MLP is commonly inserted after the final convolution within a residual block. Consider a bottleneck block receiving $X \in \mathbb{R}^{C \times H \times W}$:

  • The block applies three convolutions:
    • $1 \times 1$, $C \rightarrow C/r_1$ (reduce)
    • $3 \times 3$, $C/r_1 \rightarrow C/r_1$
    • $1 \times 1$, $C/r_1 \rightarrow C$ (restore)
  • After the last convolution, SE-MLP proceeds as follows:

    1. Global Average Pooling: $X \rightarrow z \in \mathbb{R}^C$, where $z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{c,i,j}$
    2. Multi-Branch MLP: $B$ parallel two-layer FC networks independently process $z$ (see Section 2).
    3. Channel-Wise Gating: Outputs are summed, then passed through sigmoid gating $\sigma(\cdot)$ to generate $g \in (0,1)^C$.
    4. Scaling and Residual Addition: Channels of the residual features are reweighted, then added to the identity path: $Y = \widetilde{X} + X$, where $\widetilde{X}_{c,i,j} = g_c X_{c,i,j}$.
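As a concrete illustration of steps 1–4, the following is a minimal PyTorch sketch. The module name `SEMLP`, the split of the forward pass into `residual` (post-convolution features) and `identity` (shortcut path), and the defaults $B=4$, $r=32$ are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class SEMLP(nn.Module):
    """Multi-branch squeeze-and-excitation gate for a residual branch (sketch)."""

    def __init__(self, channels: int, reduction: int = 32, branches: int = 4):
        super().__init__()
        hidden = channels // reduction
        # B parallel two-layer bottleneck MLPs (step 2)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Linear(channels, hidden),
                nn.ReLU(inplace=True),
                nn.Linear(hidden, channels),
            )
            for _ in range(branches)
        ])

    def forward(self, residual: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        # Step 1: squeeze via global average pooling over H and W -> (N, C)
        z = residual.mean(dim=(2, 3))
        # Steps 2-3: sum the branch outputs, then apply sigmoid gating -> g in (0, 1)^C
        g = torch.sigmoid(sum(branch(z) for branch in self.branches))
        # Step 4: channel-wise rescaling of the residual features, then identity addition
        return identity + residual * g[:, :, None, None]
```

In a bottleneck block this would be invoked as `y = se_mlp(conv_out, block_input)`, replacing the plain addition at the end of the block.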

This "squeeze-and-excitation MLP" mechanism has also been adapted for sequential and tabular tasks. In regression contexts, such as predicting physical acceleration signals, the SE-MLP operates on 1D feature vectors: fully connected layers are augmented with SE blocks that perform attention-based rescaling per channel, with residual connections across blocks to stabilize gradient flow (Li et al., 29 Dec 2025).

2. Mathematical Formulation of Multi-Branch SE-MLP

The central operation in SE-MLP is the multi-branch excitation. For incoming pooled features $z \in \mathbb{R}^{C}$:

$$
\begin{aligned}
&\text{Branch } b \in \{1, \ldots, B\}: \\
&\qquad a_b = W_{1,b}\, z \in \mathbb{R}^{C/r}, \qquad u_b = \delta(a_b) \in \mathbb{R}^{C/r}, \qquad v_b = W_{2,b}\, u_b \in \mathbb{R}^{C}, \\
&s = \sum_{b=1}^{B} v_b \in \mathbb{R}^{C}, \\
&g = \sigma(s) \in (0,1)^{C}
\end{aligned}
$$

Here, $\delta$ is ReLU (or GELU in non-vision domains), and $\sigma$ is sigmoid (Narayanan, 2023, N, 2023, Li et al., 29 Dec 2025). Standard SE uses $B=1$; SE-MLP multiplies parameters and expressive capacity by $B$. This formulation is used to rescale either spatial feature maps (vision), historical time steps (sequence modeling; Bouazizi et al., 2022), or hidden dimensions (signal regression).
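These equations can be transcribed directly; the following function is a sketch that assumes bias-free weight matrices with the shapes stated in the formulation above.

```python
import torch

def multibranch_excitation(z, W1, W2, activation=torch.relu):
    """Compute g = sigmoid(sum_b W2_b @ delta(W1_b @ z)) for pooled features z.

    z:  (N, C) pooled features
    W1: list of B matrices, each of shape (C/r, C)
    W2: list of B matrices, each of shape (C, C/r)
    """
    s = sum(activation(z @ w1.T) @ w2.T for w1, w2 in zip(W1, W2))  # sum over branches
    return torch.sigmoid(s)  # gate g in (0, 1)^C
```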

3. Gating Functions, Activations, and Residual Structures

  • Branch Activation: ReLU ($\delta$) is used in vision and bio-signal stacking; GELU is employed in pose forecasting and physical signal regression.

  • Gating Mechanism: The final excitation vector is obtained by passing the aggregated branch outputs through a channel-wise sigmoid, enforcing values in $(0,1)$ for multiplicative rescaling.
  • Attention Application: The weights $g_c$ recalibrate their respective channels or feature dimensions.
  • Residual Connection: Outputs of each block are added to their inputs, ensuring stable optimization and preserving global and low-level features (Li et al., 29 Dec 2025).

4. Global Context Flow and Representational Capacity

All global spatial statistics in visual architectures propagate through global average pooling. In sequential tasks (human pose), squeezing aggregates across spatial or temporal dimensions, enabling SE-MLP to recalibrate frames or steps through time-sensitive attention (Bouazizi et al., 2022). Multi-branch MLPs act as internal ensembles, each branch capturing distinct global interdependencies. Aggregation by summation prevents path-wise redundancy while boosting representational scope.

A plausible implication is that distributing the bottleneck excitation over multiple branches increases robustness to over-parameterization, as observed when $B > 4$ yields diminishing returns (Narayanan, 2023).

5. Parameterization and Computational Complexity

Relative parameter and FLOP costs can be summarized as:

Variant | Param/Block | FLOPs/Block | Typical $B$, $r$
SE | $2C^2/r$ | $2C^2/r$ | $B=1$, $r=32$
SE-MLP | $2BC^2/r$ | $2BC^2/r$ | $B=4$, $r=32$

Despite a $4\times$ increase in excitation parameters, total model size grows by only 5–10%, since convolutional layers dominate (Narayanan, 2023, N, 2023). In practical setups (e.g., ResNet-50 on CIFAR/ImageNet), the total parameter count increases from 24.9M to 28.7M, and FLOPs rise by 0.5–1 GFLOPs. In signal regression, SE-MLP employs a bottleneck ratio of $r=8$ (hidden dimension $D=128$), still negligible compared to the dense FC computation (Li et al., 29 Dec 2025).
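A back-of-the-envelope check of the per-block excitation cost using the table's formulas is shown below; the channel count C = 2048 is an illustrative value for a late ResNet-50 stage, and whole-model totals depend on how many blocks carry the module.

```python
# Excitation parameters per block, biases ignored (formulas from the table above).
C, r, B = 2048, 32, 4
se_params = 2 * C * C // r       # SE:     2*C^2/r   = 262,144   (~0.26 M)
semlp_params = B * se_params     # SE-MLP: 2*B*C^2/r = 1,048,576 (~1.05 M)
print(se_params, semlp_params)
```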

6. Training Protocols and Empirical Performance

SE-MLP modules have been rigorously tested on diverse tasks:

  • Vision Classification: On CIFAR-10/100 and downsized ImageNet, SE-MLP consistently improves top-1 accuracy by 0.5–1.3% over vanilla SE backbones. For example, SE-ResNetV2-50 achieves 63.52% top-1 vs. 62.26% for SE-ResNet-50 (Narayanan, 2023). Comparable trends are reported for SaEnet and aggregated excitation approaches (N, 2023).
  • Sequential Forecasting: In 3D human pose forecasting, inclusion of SE modules reduces MPJPE by ~2 mm at the 1000 ms horizon versus MLP-only blocks (Bouazizi et al., 2022).
  • Physical Signal Regression: For penetration acceleration features, SE-MLP reaches a MAPE of 8.46% (peak) and 4.46% (width), outperforming MLP, Transformers, and tree-based models by 15–30% RMSE margins (Li et al., 29 Dec 2025).

Common hyperparameters include Adam or SGD optimizers, reduction ratios $r = 8$–$32$, and batch sizes of 16–256, depending on dataset requirements.
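An illustrative training configuration consistent with these ranges follows; the concrete values are assumptions for a single run, not settings prescribed by any of the cited papers.

```python
# Hypothetical run configuration within the ranges quoted above.
config = {
    "optimizer": "Adam",     # Adam or SGD
    "reduction_ratio": 32,   # r in the range 8-32
    "branches": 4,           # B; gains stagnate beyond 4 (see Section 7)
    "batch_size": 128,       # chosen within the 16-256 range per dataset
}
```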

7. Ablation Studies and Module Optimization

Ablation analyses in all major reports address two axes: branch count $B$ and excitation placement. Increasing $B$ up to 4 enhances accuracy; higher cardinalities may induce overfitting or stagnant gains (Narayanan, 2023, N, 2023). Placement of SE-MLP after the final $1 \times 1$ convolution consistently outperforms pre-conv or split-phase integration.

Isolating the SE block and residual connections demonstrates each component's independent contribution: in regression, SE-MLP with residual connections achieves 12% lower RMSE and 3.4% higher $R^2$ than MLP+SE (Li et al., 29 Dec 2025). Similar trends hold in sequential attention ablations (Bouazizi et al., 2022).

8. Application Domains and Theoretical Rationale

SE-MLP has been adopted in image classification, sequential modeling, and tabular or signal regression. Its theoretical advantage lies in adaptive channel attention that simultaneously preserves global context and enables finer recalibration via multi-path excitation. Residual shortcuts ensure stable optimization and retention of low-level representations. The multi-branch structure provides increased representational richness per bottleneck at low computational cost.

In sum, SE-MLP generalizes SE by internal ensembling through parallel dense excitation, yielding robust improvements across tasks, backbones, and domains at moderate parameter increase (Narayanan, 2023, N, 2023, Bouazizi et al., 2022, Li et al., 29 Dec 2025).
