Reduced MLP (RMLP): Efficient MLP Architectures
- RMLP is a family of MLP-based networks optimized by reducing parameters and computational complexity while retaining predictive capacity.
- Techniques include gradient-based pruning with self-distillation for language models and efficient token mixing in vision MLPs using relative positional encodings.
- Applications in signal processing deploy FIR-inspired architectures, balancing reduced hardware costs with maintained processing accuracy.
A reduced MLP (RMLP) denotes any multilayer perceptron (MLP) or MLP-based network whose structure and/or parameterization has been systematically altered to lower the parameter count, memory footprint, or computational complexity, while striving to preserve most of its original predictive capacity. RMLP approaches span pruning techniques in LLMs, relative positional encoding–driven reductions in vision MLPs, and FIR-inspired low-complexity architectures for signal processing. The term thus encapsulates a family of architectures or strategies that substitute or constrain MLP layers to yield a more resource-efficient model.
1. MLP Pruning in LLMs
The Reduced-MLP (RMLP) regime has emerged as a structural pruning strategy for large transformer-based LLMs, with particular attention paid to feed-forward (MLP) sublayers, which account for a substantial fraction (typically 70–80%) of total parameters. In models such as LLaMA-3.2-1.2B, the three-layer MLP blocks exceed five times the parameter count of attention heads. Sensitivity analyses using a first-order Taylor expansion reveal that the average importance of MLP neurons is markedly lower than that of attention parameters, which motivates targeting MLPs for pruning to maximize compression with minimal degradation in generative performance (Zhu et al., 10 Jun 2025).
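As a minimal sketch of the first-order Taylor criterion mentioned above, the snippet below scores each neuron by the absolute value of the summed weight-gradient products over its incoming weights. The row-per-neuron layout, the function name `taylor_importance`, and the toy numbers are illustrative assumptions; the source does not state how per-weight terms are aggregated.

```python
def taylor_importance(weights, grads):
    """First-order Taylor score per neuron: |sum_j w_ij * g_ij|.

    weights, grads: lists of rows, one row per neuron (assumed layout).
    """
    return [
        abs(sum(w * g for w, g in zip(w_row, g_row)))
        for w_row, g_row in zip(weights, grads)
    ]


# Toy projection with 3 neurons and 2 inputs (made-up numbers).
W = [[0.5, -1.0], [0.1, 0.2], [2.0, 0.0]]
G = [[0.2, 0.3], [-0.3, 0.4], [0.0, 1.5]]  # dLoss/dW, same shape as W
scores = taylor_importance(W, G)
```

A neuron whose weights are large but whose gradients are near zero (the third row here) receives a low score, which is the sense in which the criterion measures sensitivity of the loss rather than weight magnitude.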
Gradient-based pruning is performed in two stages: (1) a cold-start stage using hard-label cross-entropy loss gradients to identify an initial subset of important neurons, and (2) a self-distillation phase employing a composite loss of the form $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{KL}}$,
where $\mathcal{L}_{\mathrm{CE}}$ is the standard one-hot loss, $\mathcal{L}_{\mathrm{KL}}$ is the Kullback–Leibler divergence between the teacher (pre-pruned) and student (pruned) output distributions (softmax at temperature $T$), and $\lambda$ weights the distillation term. The incorporation of the self-distillation term exposes the gradient computation to all plausible tokens, enriching the parameter importance signals.
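The composite objective can be sketched as follows for a single token position. This is an illustration under stated assumptions, not the authors' implementation: the weighting `lam`, the default `temperature`, and all function names are hypothetical, and the cross-entropy is computed at temperature 1.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Hard-label (one-hot) loss on the student's prediction."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(teacher_logits, student_logits, temperature):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def composite_loss(teacher_logits, student_logits, target_index,
                   lam=1.0, temperature=2.0):
    """Hard-label loss plus a weighted distillation term (assumed form)."""
    ce = cross_entropy(student_logits, target_index)
    kl = kl_divergence(teacher_logits, student_logits, temperature)
    return ce + lam * kl
```

Because the KL term compares full output distributions, its gradient depends on every vocabulary entry the teacher assigns probability to, whereas the hard-label term only sees the single target token.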
After neuron scoring, a binary mask is computed for the up-projection, gated activation, and down-projection matrices: $M_L^{(i)} = \begin{cases} 1 & \text{if neuron } i \text{ is among the } k \text{ most important neurons} \\ 0 & \text{otherwise} \end{cases}$ which is applied synchronously after both stages. Typically, a uniform pruning ratio per MLP block is adopted.
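The masking step can be illustrated with a short sketch. It assumes one importance score per hidden neuron and a matrix layout in which the up and gate projections store one row per hidden neuron while the down projection stores one column per hidden neuron; the helper names and toy values are hypothetical.

```python
def top_k_mask(scores, keep):
    """Return a 0/1 mask that keeps the `keep` highest-scoring neurons."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


def prune_mlp(up, gate, down, mask):
    """Remove masked neurons from all three projections with one shared mask."""
    idx = [i for i, m in enumerate(mask) if m]
    up_pruned = [up[i] for i in idx]                      # rows of up-projection
    gate_pruned = [gate[i] for i in idx]                  # rows of gate projection
    down_pruned = [[row[i] for i in idx] for row in down]  # columns of down-projection
    return up_pruned, gate_pruned, down_pruned


scores = [0.2, 0.05, 0.9, 0.4]   # made-up importance scores
pruning_ratio = 0.5              # uniform ratio for this block
keep = len(scores) - int(pruning_ratio * len(scores))
mask = top_k_mask(scores, keep)
```

Sharing one mask across the three matrices keeps the pruned block dimensionally consistent, since each hidden neuron owns one row in the up and gate projections and one column in the down projection.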
2. Efficient Token-Mixing in Vision MLPs
Quadratic-complexity token-mixing is a bottleneck in standard spatial gating units (SGUs) for vision MLPs. The Positional Spatial Gating Unit (PoSGU) replaces the $O(N^2)$-parameter token-mixing matrix, where $N$ is the number of tokens, with parameter-efficient modules based on relative positional encoding (RPE). Two key variants are employed (Wang et al., 2022):
- Learnable RPE (LRPE): The dense weight is replaced by a learned table covering all 2D position offsets $(\Delta x, \Delta y)$, yielding $O(N)$ parameters.
- Generalized Quadratic PE (GQPE): A fixed number (6) of learned scalars represents a non-isotropic Gaussian over spatial offsets. Channel grouping extends the model to multi-granular contexts, with $O(s)$ parameters for $s$ groups, i.e., still $O(1)$ if $s \ll N$.
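The two variants can be sketched as follows. This is an illustration under assumptions rather than the PoSGU implementation: tokens are taken to lie on an `height` x `width` grid in row-major order, the LRPE table is indexed by shifted offsets, and the GQPE weight is modeled as the exponential of a negated general quadratic in the offset (a bivariate quadratic has exactly six coefficients). All names are hypothetical.

```python
import math


def lrpe_mixing(table, height, width):
    """Expand a (2H-1) x (2W-1) offset table into an N x N mixing matrix."""
    n = height * width
    mix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        yi, xi = divmod(i, width)
        for j in range(n):
            yj, xj = divmod(j, width)
            # Shift offsets so that indices start at zero.
            mix[i][j] = table[yi - yj + height - 1][xi - xj + width - 1]
    return mix


def gqpe_weight(dy, dx, coeffs):
    """Gaussian-like weight from six scalars (assumed parameterization)."""
    a, b, c, d, e, f = coeffs
    quad = a * dx * dx + b * dy * dy + c * dx * dy + d * dx + e * dy + f
    return math.exp(-quad)


# 2 x 2 grid: 4 tokens, but only 3 x 3 = 9 distinct offsets.
table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
mix = lrpe_mixing(table, height=2, width=2)
```

Every pair of tokens with the same relative offset shares one table entry, which is what makes the mixing weights translation-invariant and cuts the parameter count.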
These constructions reduce the overall parameter complexity of the token-mixing layer, typically from $O(N^2)$ to $O(N)$ or even $O(1)$. When these PoSGU blocks are integrated into the PosMLP backbone, various architecture depths and widths are realized with substantial reductions in memory and computation.
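To make the scaling concrete, the sketch below counts token-mixing parameters per layer for the three designs. The 14 x 14 token grid, the choice of 4 groups, and the assumption that each GQPE group carries its own six scalars are illustrative, not figures from the source.

```python
def token_mixing_params(height, width, groups=1):
    """Per-layer token-mixing parameter counts under the stated assumptions."""
    n = height * width
    return {
        "dense_sgu": n * n,                           # full N x N matrix
        "lrpe": (2 * height - 1) * (2 * width - 1),   # one entry per 2D offset
        "gqpe": 6 * groups,                           # six scalars per group
    }


counts = token_mixing_params(14, 14, groups=4)
for name, value in counts.items():
    print(f"{name}: {value}")
```

For this grid the dense matrix needs 38,416 weights, the offset table 729, and the grouped quadratic form 24, and only the first grows quadratically as the image resolution increases.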
3. Reduced Complexity MLPs in Signal Processing
In high-throughput signal processing, as in equalization for two-dimensional magnetic recording (TDMR), reduced complexity MLP (RC-MLP) architectures substitute large dense weight matrices with combinations of finite-impulse response (FIR) filters, element-wise nonlinearity, and hidden delay lines (Aboutaleb et al., 2021).
Four RC-MLP variants exist:
- RC-MLP1: Parallel FIRs per input stream (head), local nonlinearity, multiple delay lines, and FIR post-combination.
- RC-MLP2: FIR summation across heads, single nonlinearity and delay line, single post-FIR.
- RC-MLP3: Like RC-MLP2, but with a parallel skip (linear) path added to the output.
- RC-MLP4: Two-branch version of RC-MLP3 with linear skip per branch.
The key reduction is replacing a fully connected weight matrix with a set of FIR filters plus a few scalar biases and mixture coefficients, while maintaining nonlinearity via tanh activation.
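A minimal sketch of an RC-MLP3-style signal path is given below: per-head FIR filters are summed, passed through a single tanh, filtered again by a post-FIR (which implicitly reads from a hidden delay line), and combined with a linear skip path. The exact wiring, the placement of the bias, and all names are one reading of the descriptions above and should be treated as assumptions rather than the published architecture.

```python
import math


def fir(signal, taps):
    """Causal FIR filter: y[n] = sum_k taps[k] * x[n - k], zero initial state."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def rc_mlp3_like(heads, head_taps, bias, post_taps, skip_gain):
    """FIR per head -> sum -> tanh -> post-FIR, plus a linear skip path."""
    filtered = [fir(x, h) for x, h in zip(heads, head_taps)]
    linear = [sum(vals) for vals in zip(*filtered)]       # summation across heads
    hidden = [math.tanh(v + bias) for v in linear]        # single nonlinearity
    nonlinear = fir(hidden, post_taps)                    # post-FIR over delayed hidden samples
    return [nl + skip_gain * lin for nl, lin in zip(nonlinear, linear)]


def parameter_count(head_taps, post_taps):
    """Taps plus one bias and one skip gain."""
    return sum(len(h) for h in head_taps) + len(post_taps) + 2


heads = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two toy input streams
head_taps = [[1.0, 0.5], [0.25]]
output = rc_mlp3_like(heads, head_taps, bias=0.0, post_taps=[1.0], skip_gain=0.0)
```

The parameter count grows with the total number of FIR taps rather than with the product of input and hidden widths, which is the source of the savings relative to a dense layer.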
4. Quantitative Benchmarks and Parameter Savings
Representative metrics and results for model compression approaches are summarized below.
LLM Pruning (LLaMA-3.2-1.2B, LoRA finetuned)
| Pruning Ratio | Model | Perplexity ↓ | Zero-Shot Accuracy ↑ |
|---|---|---|---|
| 0% | Baseline | 12.98 | 51.47 |
| 20% | RMLP | 26.96 | 48.94 |
| 30% | RMLP | 39.70 | 46.40 |
| 40% | RMLP | 70.12 | 44.21 |
RMLP pruning achieves superior perplexity and accuracy trade-offs compared to prior magnitude-based and structured pruning approaches (Zhu et al., 10 Jun 2025).
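The relative cost of each pruning ratio can be read off the table with a few lines of arithmetic. The snippet below uses only the tabulated values; the function name and output format are illustrative.

```python
# (perplexity, zero-shot accuracy) per pruning ratio, copied from the table.
results = {
    0: (12.98, 51.47),
    20: (26.96, 48.94),
    30: (39.70, 46.40),
    40: (70.12, 44.21),
}


def degradation(results, baseline=0):
    """Perplexity ratio and accuracy drop of each setting versus the baseline."""
    base_ppl, base_acc = results[baseline]
    return {
        ratio: (ppl / base_ppl, base_acc - acc)
        for ratio, (ppl, acc) in results.items()
        if ratio != baseline
    }


for ratio, (ppl_factor, acc_drop) in degradation(results).items():
    print(f"{ratio}%: perplexity x{ppl_factor:.2f}, accuracy -{acc_drop:.2f} points")
```

At 20% pruning the accuracy falls by about 2.5 points while perplexity roughly doubles, so the two metrics tell noticeably different stories about how well the pruned model holds up.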
Vision MLPs (ImageNet, gMLP-S/PosMLP-T)
| Variant | Params | Top-1 Accuracy (%) | Token-Mixing Complexity |
|---|---|---|---|
| gMLP-S (SGU) | 20M | 79.6 | $O(N^2)$ |
| gMLP-S + LRPE | ~20M | 79.9 | $O(N)$ |
| gMLP-S + GQPE | 18.2M | 74.02 (50% subset) | $O(1)$ |
| PosMLP-T | 21M | 82.1 | 3 |
On a 50% ImageNet subset, gMLP-S + GQPE raises top-1 accuracy from 72.14% (vanilla) to 74.02% with a simultaneous reduction in parameters (Wang et al., 2022).
RC-MLP Equalizers (TDMR, Real HDD Data)
| Equalizer | Params | Rel. Complexity | BER | BER Reduction vs. LE |
|---|---|---|---|---|
| Full MLP | 145 | 6.6× | 0.02056 | 10.91% |
| RC-MLP3 | 35 | 1.59× | 0.02115 | 8.23% |
| RC-MLP2 | 34 | 1.54× | 0.02162 | 6.06% |
| 2D-LECE | 22 | 1.0× | 0.02302 | – |
RC-MLP3 achieves a substantial BER reduction at only 1.59× the complexity of a linear equalizer (Aboutaleb et al., 2021).
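The trade-off can be quantified directly from the table. The sketch below expresses each reduced equalizer as a fraction of the full MLP in parameters, relative complexity, and reported BER reduction; it uses the tabulated figures as given, and the dictionary layout and function name are illustrative.

```python
# Values copied from the table (BER reduction in percent, relative to the LE).
equalizers = {
    "Full MLP": {"params": 145, "complexity": 6.6, "ber_reduction": 10.91},
    "RC-MLP3": {"params": 35, "complexity": 1.59, "ber_reduction": 8.23},
    "RC-MLP2": {"params": 34, "complexity": 1.54, "ber_reduction": 6.06},
}


def relative_to_full(name):
    """Express one equalizer's cost and gain as fractions of the full MLP's."""
    full = equalizers["Full MLP"]
    eq = equalizers[name]
    return {
        "param_fraction": eq["params"] / full["params"],
        "complexity_fraction": eq["complexity"] / full["complexity"],
        "gain_retained": eq["ber_reduction"] / full["ber_reduction"],
    }


for name in ("RC-MLP3", "RC-MLP2"):
    r = relative_to_full(name)
    print(f"{name}: {r['param_fraction']:.0%} of params, "
          f"{r['complexity_fraction']:.0%} of complexity, "
          f"{r['gain_retained']:.0%} of BER gain")
```

On these numbers RC-MLP3 retains roughly three quarters of the full MLP's BER gain at about a quarter of its parameters and complexity.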
5. Architectural and Methodological Underpinnings
Key architectural and methodological principles underlying RMLP approaches are:
- Structural Pruning: Neuron importance is estimated using Taylor expansions of the loss with respect to MLP weights. Binary masks are applied per layer to enforce sparsity, and uniform per-layer ratios often suffice.
- Self-Distillation: Pruned models are trained in a teacher-student regime, leveraging KL-divergence on output distributions to boost the informativeness of pruning gradients.
- Relative Positional Encoding: Vision RMLPs use position-dependent parameter tables or analytic forms to encode locality and non-locality, dramatically lowering the parameter count of token-mixing transformations.
- FIR-Derived Architectures: In signal processing, RMLPs implement FIR-based frontends, localized nonlinearity, and short temporal buffers rather than dense MLPs, thus minimizing hardware burden and retaining most nonlinear processing capacity.
6. Application Domains and Trade-Offs
RMLP frameworks are prominent across:
- Language Modeling: For LLM compression, MLP-only pruning yields better parameter reduction per accuracy loss than attention pruning, as MLP parameters exhibit lower average salience per parameter.
- Vision: Complexity reduction in token-mixing permits larger spatial windows, deeper models, or finer-scale grouping under constant resource constraints.
- Signal Processing: RC-MLPs deliver the bulk of MLP-driven BER improvements with modest hardware cost increments, well-suited to resource-limited ASIC/FPGA deployments.
A recurring trade-off is between absolute parameter reduction and functional fidelity. Moderate sparsity (20–30%) often incurs only a minor loss in zero-shot accuracy or BER, but aggressive sparsity (40% or more) may trigger steep accuracy degradation. Hardware friendliness and ease of implementation also improve as models shift to groupwise, FIR, or analytic-parameter constructions.
7. Limitations and Prospects
RMLP strategies may require additional memory for parallel teacher-student inference (as in self-distillation), and the choice of distillation coefficients (such as the weight on the distillation term and the softmax temperature) impacts the final trade-off. In pruning contexts, allocation of sparsity budgets remains mostly uniform, suggesting room for layer-adaptive or bilevel optimization schemes. Lighter-weight surrogate teachers or compressed teacher models may further lower in-loop resource demands. In vision, the expressive bounds of $O(1)$ or groupwise token-mixing architectures for very large images or long sequences are open to further scrutiny.
RMLPs thus constitute a coherent family of parameter and complexity reduction approaches that are domain-specific in their design yet unified by the goal of achieving maximal efficiency with minimal cost to accuracy or deployability (Zhu et al., 10 Jun 2025; Wang et al., 2022; Aboutaleb et al., 2021).