Squeeze-Excitation MLP Architecture
- SE-MLP is a neural network module that unifies channel attention with multi-branch dense layers to enhance feature recalibration.
- It generalizes the conventional squeeze-and-excitation mechanism by employing parallel MLP branches, resulting in measurable gains with modest parameter increases.
- The architecture leverages residual connections and global context pooling to boost accuracy in tasks such as image classification, pose forecasting, and signal regression.
The squeeze and excitation multi-layer perceptron (SE-MLP) is a neural network architectural module that unifies channel attention and global context integration through multi-branch dense layers embedded within residual blocks. SE-MLP generalizes the original squeeze-and-excitation (SE) mechanism by expanding its bottleneck excitation via parallel MLP branches, thereby enriching channel-wise recalibration with minimal parameter overhead. This design has been deployed effectively in convolutional backbones for vision tasks, temporal modeling for sequential data, and regression problems with physical signals. Across these domains, SE-MLP consistently yields measurable performance improvements over standard SE, basic MLPs, and transformer alternatives.
1. Architectural Foundations and Block Integration
In convolutional architectures, SE-MLP is commonly inserted after the final convolution within a residual block. Consider a bottleneck block receiving an input $x \in \mathbb{R}^{H \times W \times C}$:
- The block applies three convolutions:
- $1 \times 1$ conv, $C \to C/4$ (reduce)
- $3 \times 3$ conv, $C/4 \to C/4$
- $1 \times 1$ conv, $C/4 \to C$ (restore)
- After the last convolution produces $u \in \mathbb{R}^{H \times W \times C}$, SE-MLP proceeds as follows (a code sketch follows this list):
- Global Average Pooling: $z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} u_{c,i,j}$, where $z \in \mathbb{R}^{C}$
- Multi-Branch MLP: $K$ parallel two-layer FC networks independently process $z$ (see Section 2).
- Channel-Wise Gating: Branch outputs are summed, then passed through sigmoid gating to generate $s \in (0,1)^{C}$.
- Scaling and Residual Addition: Channels of the residual features are reweighted, then added to the identity path: $y = x + s \odot u$, where $\odot$ denotes channel-wise multiplication.
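The listing below is a minimal PyTorch sketch of such a block, assuming a standard ResNet-style bottleneck; the channel widths, $K = 4$ branches, and $r = 16$ are illustrative choices rather than values prescribed by the cited papers.

```python
import torch
import torch.nn as nn

class SEMLPBottleneck(nn.Module):
    """Sketch of a ResNet-style bottleneck with an SE-MLP module inserted after
    the last convolution. Branch count K, reduction r, and layer widths are
    illustrative assumptions, not values fixed by the cited papers."""
    def __init__(self, channels: int, reduction: int = 16, num_branches: int = 4):
        super().__init__()
        mid = channels // 4
        self.convs = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        # K parallel two-layer excitation branches (C -> C/r -> C)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )
            for _ in range(num_branches)
        ])

    def forward(self, x):
        u = self.convs(x)                                     # residual-path features
        z = u.mean(dim=(2, 3))                                # global average pooling -> (B, C)
        s = torch.sigmoid(sum(b(z) for b in self.branches))   # summed branches, sigmoid gate
        return x + u * s[:, :, None, None]                    # scale channels, add identity path

# Usage: a single block preserving the input shape
block = SEMLPBottleneck(channels=64)
out = block(torch.randn(2, 64, 32, 32))   # -> (2, 64, 32, 32)
```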
This "squeeze-and-excitation MLP" mechanism has also been adapted for sequential and tabular tasks. In regression contexts, such as predicting physical acceleration signals, the SE-MLP operates on 1D feature vectors: fully connected layers are augmented with SE blocks that perform attention-based rescaling per channel, with residual connections across blocks to stabilize gradient flow (Li et al., 29 Dec 2025).
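A comparable sketch for the 1D regression setting, assuming a dense layer whose output features are gated by a (single-branch) SE block and wrapped in a residual connection; the layer sizes and activations here are assumptions rather than the cited model's exact configuration.

```python
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Sketch of a 1D SE-augmented dense block for signal regression:
    a fully connected layer whose outputs are rescaled channel-wise by an SE
    gate, with a residual connection across the block (assumed layout)."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.act = nn.GELU()
        # SE gate over the hidden feature dimension (single branch for brevity)
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, dim)
        h = self.act(self.fc(x))          # dense feature transform
        s = self.gate(h)                  # channel-wise attention weights in (0, 1)
        return x + s * h                  # residual connection across the block

y = SEResidualBlock(dim=32)(torch.randn(8, 32))   # -> (8, 32)
```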
2. Mathematical Formulation of Multi-Branch SE-MLP
The central operation in SE-MLP is the multi-branch excitation. For incoming pooled features $z \in \mathbb{R}^{C}$, the gate is computed as

$$s = \sigma\!\left(\sum_{k=1}^{K} W_2^{(k)}\,\delta\!\left(W_1^{(k)} z\right)\right), \qquad W_1^{(k)} \in \mathbb{R}^{(C/r) \times C}, \quad W_2^{(k)} \in \mathbb{R}^{C \times (C/r)}.$$

Here, $\delta$ is ReLU (or GELU in non-vision domains), and $\sigma$ is the sigmoid (Narayanan, 2023, N, 2023, Li et al., 29 Dec 2025). Standard SE uses $K = 1$; SE-MLP multiplies excitation parameters and expressive capacity by $K$. This formulation is used to rescale either spatial feature maps (vision), historical time steps (sequence modeling (Bouazizi et al., 2022)), or hidden dimensions (signal regression).
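A minimal functional sketch of this excitation, written directly from the formula above; the shapes, random weights, and the $K = 1$ reduction check are purely illustrative assumptions.

```python
import torch

def multi_branch_excitation(z, W1, W2):
    """s = sigmoid( sum_k W2[k] @ relu(W1[k] @ z) ).

    z  : pooled features, shape (C,)
    W1 : K weight matrices of shape (C/r, C)
    W2 : K weight matrices of shape (C, C/r)
    """
    s = sum(W2_k @ torch.relu(W1_k @ z) for W1_k, W2_k in zip(W1, W2))
    return torch.sigmoid(s)

C, r, K = 64, 16, 4
z = torch.randn(C)
W1 = [torch.randn(C // r, C) for _ in range(K)]
W2 = [torch.randn(C, C // r) for _ in range(K)]

s = multi_branch_excitation(z, W1, W2)            # (C,) gate values in (0, 1)
s_se = multi_branch_excitation(z, W1[:1], W2[:1]) # K = 1 recovers standard SE
```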
3. Gating Functions, Activations, and Residual Structures
- Branch Activation: ReLU ($\delta(x) = \max(0, x)$) is used in vision and bio-signal stacking; GELU is employed in pose forecasting and physical signal regression.
- Gating Mechanism: The final excitation vector is obtained by passing the aggregated branch outputs through a channel-wise sigmoid, enforcing values in $(0,1)$ for multiplicative rescaling (illustrated numerically after this list).
- Attention Application: The resulting weights $s_c$ recalibrate their respective channels or feature dimensions.
- Residual Connection: Outputs of each block are added to their inputs, ensuring stable optimization and preserving global and low-level features (Li et al., 29 Dec 2025).
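A short numerical illustration of these two properties, using toy tensors rather than any cited configuration: the sigmoid keeps every gate value strictly inside $(0,1)$, and the residual path returns the input unchanged if the gate collapses toward zero.

```python
import torch

s = torch.sigmoid(torch.randn(8))   # gate values strictly in (0, 1)
u = torch.randn(8)                  # residual-path features
x = torch.randn(8)                  # identity-path input

y = x + s * u                       # multiplicative rescaling + residual add
assert s.min() > 0 and s.max() < 1
assert torch.allclose(x + torch.zeros(8) * u, x)  # gate -> 0 recovers the identity path
```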
4. Global Context Flow and Representational Capacity
All global spatial statistics in visual architectures propagate through global average pooling. In sequential tasks (human pose), squeezing aggregates across spatial or temporal dimensions, enabling SE-MLP to recalibrate frames or steps through time-sensitive attention (Bouazizi et al., 2022). Multi-branch MLPs act as internal ensembles, each branch capturing distinct global interdependencies. Aggregation by summation prevents path-wise redundancy while boosting representational scope.
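A sketch of this temporal variant, in which frames of a pose sequence rather than channels are recalibrated; the tensor layout (batch, frames, flattened joint coordinates), single-branch gate, and reduction ratio are assumptions, not the exact design in Bouazizi et al. (2022).

```python
import torch
import torch.nn as nn

class TemporalSE(nn.Module):
    """Hypothetical temporal squeeze-and-excitation: squeezes over the pose
    dimensions and recalibrates time steps of a sequence (B, T, D)."""
    def __init__(self, num_frames: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(num_frames, num_frames // reduction)
        self.fc2 = nn.Linear(num_frames // reduction, num_frames)
        self.act = nn.GELU()

    def forward(self, x):                                    # x: (B, T, D)
        z = x.mean(dim=-1)                                   # squeeze over pose dims -> (B, T)
        s = torch.sigmoid(self.fc2(self.act(self.fc1(z))))   # per-frame gate -> (B, T)
        return x * s.unsqueeze(-1)                           # reweight frames

se_t = TemporalSE(num_frames=25)
y = se_t(torch.randn(4, 25, 66))   # e.g. 22 joints x 3 coordinates (illustrative)
```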
A plausible implication is that distributing the bottleneck excitation over multiple branches increases robustness to over-parameterization, as observed when branch counts beyond $K = 4$ yield diminishing returns (Narayanan, 2023).
5. Parameterization and Computational Complexity
Relative parameter and FLOP costs of the excitation path can be summarized as:

| Variant | Param/Block | FLOPs/Block | Typical $r$, $K$ |
|---|---|---|---|
| SE | $2C^2/r$ | $\approx 2C^2/r$ | $r = 16$, $K = 1$ |
| SE-MLP | $2KC^2/r$ | $\approx 2KC^2/r$ | $r = 16$, $K = 4$ |
Despite a $K\times$ increase in excitation parameters, total model size grows by only about $5$–$15\%$, since convolutional layers dominate (Narayanan, 2023, N, 2023). In practical setups (e.g., ResNet-50 on CIFAR/ImageNet), the total parameter count increases from $24.9$M to $28.7$M, and FLOPs rise by $0.5$–$1$ GFLOPs. In signal regression, SE-MLP employs small bottleneck ratios (hidden dimension $C/r$), whose cost remains negligible compared to the dense FC computation (Li et al., 29 Dec 2025).
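The $K\times$ scaling of excitation parameters can be checked with a few lines of arithmetic; the channel width, $r$, and $K$ below are illustrative values, not figures taken from the cited papers.

```python
def excitation_params(C: int, r: int = 16, K: int = 1) -> int:
    """Parameters of K two-layer excitation branches (C -> C/r -> C), biases ignored."""
    return K * 2 * C * (C // r)

# Illustrative: deepest ResNet-50 stage width C = 2048 (assumed values)
se     = excitation_params(2048, r=16, K=1)   # standard SE excitation
se_mlp = excitation_params(2048, r=16, K=4)   # four-branch SE-MLP excitation
print(se, se_mlp, se_mlp / se)                # 524288 2097152 4.0
```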
6. Training Protocols and Empirical Performance
SE-MLP modules have been rigorously tested on diverse tasks:
- Vision Classification: On CIFAR-10/100 and downsized ImageNet, SE-MLP consistently improves top-1 accuracy by $0.5$ percentage points or more over vanilla SE backbones; SE-ResNetV2-50, for example, reports higher top-1 accuracy than SE-ResNet-50 (Narayanan, 2023). Comparable trends are reported for SaEnet and aggregated excitation approaches (N, 2023).
- Sequential Forecasting: In 3D human pose forecasting, inclusion of SE modules reduces MPJPE at the $1000$ ms horizon versus MLP-only blocks (Bouazizi et al., 2022).
- Physical Signal Regression: For penetration acceleration features, SE-MLP attains low MAPE on both peak and width prediction, outperforming MLP, Transformer, and tree-based models by RMSE margins of $15\%$ or more (Li et al., 29 Dec 2025).
Common hyperparameters include Adam or SGD optimizers, reduction ratios up to $32$, and batch sizes of $16$–$256$ depending on dataset requirements.
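A hedged sketch of such a training setup; the stand-in model, learning rate, schedule, and batch size below are assumptions within the reported ranges, not the cited papers' exact configurations.

```python
import torch
import torch.nn as nn

# Illustrative optimizer/schedule setup (Adam, batch size 128); all concrete
# values are assumptions, and the tiny model is a stand-in, not a cited backbone.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

x, y = torch.randn(128, 64), torch.randn(128, 1)   # one dummy batch of size 128
loss = nn.functional.mse_loss(model(x), y)          # regression objective
loss.backward()
optimizer.step()
scheduler.step()
```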
7. Ablation Studies and Module Optimization
Ablation analyses in all major reports address two axes: branch count and excitation placement. Increasing $K$ up to $4$ enhances accuracy; higher cardinalities may induce overfitting or stagnant gains (Narayanan, 2023, N, 2023). Placement of SE-MLP after the final convolution consistently outperforms pre-convolution or split-phase integration.
Isolating the SE block and residual connections demonstrates each component's independent contribution: in regression, SE-MLP with residual connections achieves lower RMSE and higher $R^2$ than MLP+SE (Li et al., 29 Dec 2025). Similar trends hold in sequential attention ablations (Bouazizi et al., 2022).
8. Application Domains and Theoretical Rationale
SE-MLP has been adopted in image classification, sequential modeling, and tabular or signal regression. Its theoretical advantage lies in adaptive channel attention that simultaneously preserves global context and enables finer recalibration via multi-path excitation. Residual shortcuts ensure stable optimization and retention of low-level representations. The multi-branch structure provides increased representational richness per bottleneck at low computational cost.
In sum, SE-MLP generalizes SE by internal ensembling through parallel dense excitation, yielding robust improvements across tasks, backbones, and domains at moderate parameter increase (Narayanan, 2023, N, 2023, Bouazizi et al., 2022, Li et al., 29 Dec 2025).