
MLP Gate & Up Tuning: Efficient Neural Adaptation

Updated 13 October 2025
  • MLP Gate & Up Tuning is a model adaptation strategy that selectively updates the gating and up-projection components to improve task learning while limiting output drift and catastrophic forgetting.
  • It employs gradient masking to freeze the down-projection layers, achieving near-maximal performance gains with minimal forgetting as validated by empirical studies.
  • The approach extends to quantum circuits and structured sparsity in mixer architectures, offering a versatile and parameter-efficient fine-tuning method for diverse neural models.

MLP Gate & Up Tuning refers to a set of targeted model adaptation strategies that selectively update the gating and up-projection components within multilayer perceptron (MLP) blocks of neural architectures, particularly transformers and large multimodal models (LMMs). Instead of conventional fine-tuning, which modifies all parameters or entire projection matrices, Gate & Up tuning restricts updates to specific parameter subsets—those responsible for activating and transforming features—while freezing the down-projection or output re-projection layers. This approach aims to achieve strong learning of new tasks, minimize output distribution drift, and limit catastrophic forgetting on held-out tasks. Recent works extend these principles to quantum neural network circuits, sparse-structured MLPs, and efficient fine-tuning methods.

1. Architectural Definitions and Theoretical Foundations

In transformer MLP modules, weights typically partition into a gating mechanism (W_gate), an up-projection (W_up), and a down-projection (W_down). Formally, a feed-forward block acts as

\mathbf{y} = W_\text{down} \cdot f(\mathbf{x};\, W_\text{gate}, W_\text{up}),

where $W_\text{gate}$ and $W_\text{up}$ modulate activation and increase the feature dimension, and $W_\text{down}$ brings the representation back to the residual stream. "Gate & Up" tuning freezes $W_\text{down}$, updating only $W_\text{gate}$ and $W_\text{up}$. This targeted approach is predicated on findings that drift in $W_\text{down}$ significantly shifts the output token distribution—e.g., biasing numeric token occurrence and causing forgetting (Zhu et al., 9 Oct 2025).
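A minimal PyTorch sketch of such a block, assuming SwiGLU-style gating as used in many recent transformers (the module and attribute names are illustrative, not taken from any of the cited implementations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Gated feed-forward block: y = W_down · (act(W_gate x) ⊙ (W_up x))."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # W_gate
        self.up_proj   = nn.Linear(d_model, d_hidden, bias=False)  # W_up
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gating activation modulates the up-projected features,
        # then W_down maps the result back to the residual-stream dimension.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```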

Analogous gate-and-projection mechanisms exist in quantum circuit architectures, where CRX-gate parameterization enables adaptive entanglement, and in classical sparse MLPs, where Kronecker factorization yields naturally gated, sparse connectivity structures (Hayase et al., 2023).

2. Selective Tuning: Methodologies and Empirical Outcomes

Gate & Up tuning is implemented by masking parameter gradients or optimizer states such that only WgateW_\text{gate} and WupW_\text{up} are trainable. Empirically, large multimodal models using this method demonstrate robust sequential learning. As reported, sequential adaptation to five skills yields target gains of +30.5 points (relative improvement) with only –4.2 points of forgetting on held-out benchmarks, whereas full MLP tuning achieves a marginally higher gain (+31.1) but with dramatically more forgetting (–15.7) (Zhu et al., 9 Oct 2025).
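One simple way to realize this selective trainability is to filter parameters by name before building the optimizer. The sketch below assumes modules named `gate_proj`, `up_proj`, and `down_proj` as in the block sketched above (naming varies across model families) and is not the authors' released training code:

```python
import torch

def configure_gate_up_tuning(model: torch.nn.Module, lr: float = 1e-5):
    """Freeze all parameters except the MLP gate and up projections."""
    for name, param in model.named_parameters():
        # Only W_gate and W_up receive gradient updates; W_down and every
        # other weight stay frozen, limiting output-distribution drift.
        param.requires_grad = ("gate_proj" in name) or ("up_proj" in name)

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```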

In quantum MLP modules (QMLP), error-tolerant input embedding is realized with single-qubit RX gates (applied per qubit rather than through global entanglement), ensuring local gradient flow and suppressed error propagation. Nonlinearity arises through re-uploading units that modulate quantum state encodings via nonlinear activation pairs (e.g., RX(ReLU(x))), which further exemplifies selective up-projection tuning (Chu et al., 2022).
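As a toy numerical illustration (plain NumPy, not the QMLP implementation from the cited work), a single-qubit RX rotation whose angle passes through a ReLU can be simulated directly:

```python
import numpy as np

def rx(theta: float) -> np.ndarray:
    """Single-qubit RX rotation matrix."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -1j * s], [-1j * s, c]])

def reupload(x: float, state: np.ndarray) -> np.ndarray:
    """Re-upload a classical feature through RX(ReLU(x)) on one qubit."""
    theta = max(x, 0.0)       # nonlinear activation applied to the rotation angle
    return rx(theta) @ state  # local single-qubit gate; no global entanglement

state = np.array([1.0 + 0.0j, 0.0 + 0.0j])  # |0>
state = reupload(0.7, state)
print(np.abs(state) ** 2)                   # measurement probabilities
```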

3. Structured Sparsity and Gating in Mixer Architectures

The principle of structured gate and up-projection tuning generalizes to nontransformer architectures. In MLP-Mixer models, the “wide-sparse” MLP’s weight tensor is effectively parameterized as a Kronecker product:

\text{vec}(W X V) = (V^{T} \otimes W) \cdot \text{vec}(X),

where gated mixing occurs in block-sparse regions. Implicit sparse regularization ensues, as training over the smaller factor matrices $(V, W)$ is mathematically linked to $\ell_1$ (sparse) regularization of the full matrix. Tuning is thus focused on the factor matrices (i.e., “gates”)—preserving desired up-projection sparsity and model generalization (Hayase et al., 2023).
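The vectorization identity can be verified numerically; the short NumPy check below uses arbitrary shapes and column-major (Fortran-order) vectorization:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, q = 4, 5, 6, 3
W = rng.standard_normal((m, n))   # left factor matrix
X = rng.standard_normal((n, p))   # input block
V = rng.standard_normal((p, q))   # right factor matrix

lhs = (W @ X @ V).reshape(-1, order="F")          # vec(W X V)
rhs = np.kron(V.T, W) @ X.reshape(-1, order="F")  # (V^T ⊗ W) vec(X)
assert np.allclose(lhs, rhs)
```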

4. Parameter-Efficient Fine-Tuning: Compression and Sparse Mechanisms

Recent selective fine-tuning algorithms, including MLP Fusion (Ai et al., 2023) and SparseGrad (Chekalina et al., 9 Oct 2024), further refine Gate & Up principles.

MLP Fusion decomposes the feed-forward block into a sum of sub-MLPs; these are clustered in parameter space (via k-means) and fused into centroids. This process compresses the up-projection space while maintaining the neural tangent kernel (NTK):

\text{NTK}(x, z) = \langle \nabla_\theta f(x;\theta),\, \nabla_\theta f(z;\theta) \rangle,

ensuring that training dynamics and output similarity are preserved under fusion. This selective procedure is directly comparable to updating gate/up parameters, but in a compressed, centroid-based basis.
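The clustering step can be sketched roughly as follows, treating each hidden neuron as a sub-MLP described by its input-projection row and output-projection column. This is a simplified illustration of the idea, not the published MLP Fusion algorithm, and `n_centroids` is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import KMeans

def fuse_hidden_neurons(W_in: np.ndarray, W_out: np.ndarray, n_centroids: int = 64):
    """Cluster per-neuron parameters and replace each cluster by its centroid.

    W_in:  (d_hidden, d_model) -- row i holds neuron i's input weights
    W_out: (d_model, d_hidden) -- column i holds neuron i's output weights
    """
    d_model = W_in.shape[1]
    # Each hidden neuron is represented by the concatenation of its parameters.
    neurons = np.concatenate([W_in, W_out.T], axis=1)  # (d_hidden, 2 * d_model)
    labels = KMeans(n_clusters=n_centroids, n_init=10).fit_predict(neurons)

    W_in_fused = np.zeros((n_centroids, d_model))
    W_out_fused = np.zeros((d_model, n_centroids))
    for k in range(n_centroids):
        members = labels == k
        if not members.any():
            continue  # guard against an empty cluster
        centroid = neurons[members].mean(axis=0)
        W_in_fused[k] = centroid[:d_model]
        # Scale by cluster size so the fused block approximates the sum of sub-MLPs.
        W_out_fused[:, k] = members.sum() * centroid[d_model:]
    return W_in_fused, W_out_fused
```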

SparseGrad identifies a sparse basis for MLP gradients using higher-order SVD, and transforms parameter updates such that only ~1% of the up-projection weights are updated while all down-projections are frozen. This yields competitive or superior downstream performance on tasks such as GLUE (BERT, RoBERTa) and question answering (LLaMa-2), surpassing regular full fine-tuning and LoRA/MeProp with similar memory efficiency (Chekalina et al., 9 Oct 2024).
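As a schematic illustration of sparse gradient updates in this spirit (not the exact SparseGrad procedure, which operates in a basis obtained via higher-order SVD), the sketch below zeroes all but the largest roughly 1% of gradient entries of a weight before the optimizer step:

```python
import torch

def sparsify_grad_(weight: torch.nn.Parameter, keep_ratio: float = 0.01) -> None:
    """In place: keep only the top `keep_ratio` fraction of gradient entries by magnitude."""
    if weight.grad is None:
        return
    g = weight.grad.flatten()
    k = max(1, int(keep_ratio * g.numel()))
    threshold = g.abs().topk(k).values.min()       # magnitude of the k-th largest entry
    mask = (g.abs() >= threshold).to(g.dtype)
    weight.grad = (g * mask).view_as(weight.grad)  # only ~1% of entries get updated
```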

5. Practical Considerations and Comparative Effectiveness

Gate & Up tuning strikes a balance between adaptation efficiency and generalization retention. Its practical advantages can be summarized as follows:

| Tuning Method | Target Gain | Forgetting | Output Drift | Computational Cost |
| --- | --- | --- | --- | --- |
| Full MLP/LLM Tuning | Maximal | High | High | High |
| MLP Gate & Up Tuning | Near-maximal | Low-Moderate | Low | Moderate |
| Self-Attention Proj. Tuning | Substantial | Minimal | Minimal | Moderate |
| Compression/Sparse Methods | High | Low | Low | Low-Moderate |

Gate & Up tuning avoids external adapters, rehearsal buffers, and extra distilled knowledge. By restricting updates to the gate and up-projection pathway, models can learn new skills sequentially while preserving broad behavior. In quantum MLPs, error-tolerant embedding and gate adaptivity likewise minimize hardware-induced output drift, facilitating deployment on NISQ platforms (Chu et al., 2022).

6. Applications, Limitations, and Future Directions

This tuning approach has practical applications in continual learning, medical AI, edge adaptation, quantum visual classification, and parameter-efficient LLM deployment. For instance, LMMs adapted to diagnostic tasks do not sacrifice their ability to interpret general images or text. In quantum settings, scalable, robust QMLP architectures are preferable for NISQ devices with high error rates.

Limitations include additional memory and computation during sparse basis selection (e.g., in SparseGrad for large models), and the potential for modest performance drops if up-projection parameterization is too aggressive or if centroids are insufficiently representative. There is also the practical tradeoff that self-attention-only tuning may sometimes yield greater generality preservation than even Gate & Up.

Future avenues include optimizing sparse basis construction, dynamic gating schedules, and hybrid tuning approaches (combining Gate & Up with attention or adapter-based methods). Quantum circuit designs may benefit from further re-uploading unit (RUU) enhancements and entanglement parameterization. A plausible implication is that fine-grained component selection in architecture-specific gate/up tuning will enable increasingly resilient, adaptive, and resource-aware neural models.

7. Code Availability

Code implementing "MLP Gate & Up" tuning experiments in LMMs is available at https://github.com/jessemelpolio/LMM_CL. Implementations of MLP Fusion and SparseGrad can be found at https://github.com/weitianxin/MLP_Fusion.

In summary, MLP Gate & Up Tuning constitutes a targeted, empirically validated approach for continual and efficient model adaptation across both classical and quantum neural architectures, yielding strong task learning and generalization stability by selectively updating feature activation and transformation pathways.
