
FusionMLP: Efficient Model Fusion Methods

Updated 18 December 2025
  • In LabelFusion, FusionMLP is a learned fusion head that integrates transformer embeddings with LLM per-class scores to boost text classification accuracy.
  • A second line of work uses an NTK-preserving compression approach that clusters sub-MLP components, reducing model parameters by up to 4× with minimal accuracy loss.
  • A third employs optimal transport to compute Wasserstein barycenters for layer-wise fusion of MLP weights, achieving near-original performance without retraining.

FusionMLP refers to a set of methodologies for model fusion, compression, or signal integration in neural architectures, primarily centered on multi-layer perceptrons (MLPs). The term appears prominently in three distinct literature threads: (1) fusion heads in ensemble architectures for text classification, (2) MLP fusion for model compression in LLMs via neural tangent kernel (NTK) approximation, and (3) optimal-transport (OT) fusion of MLP weights via Wasserstein barycenters. Each approach addresses a different class of problems, but all share the underlying principle of fusing information at the level of MLP representations or parameters.

1. FusionMLP as a Fusion Head in Hybrid Text Classification Pipelines

In LabelFusion, FusionMLP denotes the learned fusion “head” that integrates transformer-based classifier embeddings with LLM-derived per-class scores for text classification tasks (Schlee et al., 11 Dec 2025). The architecture is as follows:

  • The supervised backbone is a transformer classifier (e.g., RoBERTa) that produces a pooled hidden embedding $h \in \mathbb{R}^d$.
  • In parallel, one or more LLM “scorers” supply a per-class score vector $s \in \mathbb{R}^c$ obtained via prompt engineering.
  • FusionMLP accepts the concatenated vector $x = [h; s] \in \mathbb{R}^{d+c}$ and maps it through a configurable MLP.
  • For a configuration of $L$ layers (typically $L = 3$: 2 hidden + 1 output layer), the transformation is:

\begin{align*}
z^{(1)} &= \phi(W^{(1)}x + b^{(1)}) \in \mathbb{R}^{H_1} \\
z^{(2)} &= \phi(W^{(2)}z^{(1)} + b^{(2)}) \in \mathbb{R}^{H_2} \\
\mathrm{logits} &= W^{(3)}z^{(2)} + b^{(3)} \in \mathbb{R}^c \\
\hat{y} &= \begin{cases} \mathrm{softmax}(\mathrm{logits}) & \text{multiclass} \\ \sigma(\mathrm{logits}) & \text{multilabel} \end{cases}
\end{align*}

  • Default hyperparameters: hidden layer sizes [512, 256], ReLU activation, 0.1 dropout, AdamW, learning rates $2 \times 10^{-5}$ (transformer) and $1 \times 10^{-3}$ (FusionMLP).
  • End-to-end training updates all parameters, though the MLP head is assigned a higher learning rate for rapid adaptation (a PyTorch sketch follows this list).
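The following PyTorch sketch illustrates a head of this shape under the defaults above. It is a minimal illustration, not the LabelFusion implementation: the class name, constructor arguments, and toy dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Illustrative fusion head: maps [pooled embedding; LLM scores] to class logits."""
    def __init__(self, d_embed: int, n_classes: int, hidden=(512, 256), dropout: float = 0.1):
        super().__init__()
        dims = [d_embed + n_classes, *hidden]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
        layers.append(nn.Linear(dims[-1], n_classes))        # output layer -> logits
        self.mlp = nn.Sequential(*layers)

    def forward(self, h: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # h: (B, d_embed) pooled transformer embedding; s: (B, n_classes) LLM scores
        return self.mlp(torch.cat([h, s], dim=-1))

head = FusionHead(d_embed=768, n_classes=4)                  # e.g., RoBERTa-base, AG News
logits = head(torch.randn(2, 768), torch.rand(2, 4))
probs = logits.softmax(dim=-1)                               # use torch.sigmoid for multilabel
```

The differential learning rates then amount to two optimizer parameter groups, e.g. `torch.optim.AdamW([{"params": backbone.parameters(), "lr": 2e-5}, {"params": head.parameters(), "lr": 1e-3}])`, where `backbone` stands for the transformer classifier.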

FusionMLP consistently yields accuracy gains over both the transformer and LLM-only baselines for AG News and Reuters-21578, especially in low-data regimes, where it dynamically up-weights the most reliable source (Schlee et al., 11 Dec 2025).

2. FusionMLP for One-Shot Compression of Transformer MLP Modules

In "MLP Fusion" (Ai et al., 2023), FusionMLP designates a technique for compressing the feedforward sublayers inside LLMs while preserving training dynamics, as measured by the NTK. The approach is agnostic to the fusion of model predictions and targets specifically the model weights.

Key steps:

  • Each MLP is decomposed into $p_I$ rank-1 sub-MLPs, each parameterized by the concatenation $w_i = [W_1[:,i]^T, b_1[i], W_2[i,:]]^T \in \mathbb{R}^{2p+1}$.
  • The embeddings $\{w_i\}$ are clustered via $K$-means into $K$ centroids.
  • The MLP’s weight matrices are compressed by mapping the original parameters to centroid-weighted versions:

\widetilde{W}_1 = W_1 \bar{C}^T, \quad \widetilde{b}_1 = \bar{C} b_1, \quad \widetilde{W}_2 = \bar{C} W_2

where $\bar{C}$ is the normalized assignment matrix.

  • The compressed MLP (“FusionMLP”) uses a modified forward pass with an additional diagonal scaling matrix $P = CC^T$:

\tilde{H} = \sigma(X\widetilde{W}_1 + \mathbf{1}\widetilde{b}_1^T)\, P\, \widetilde{W}_2 + \mathbf{1} b_2^T

  • This one-shot procedure yields a $4\times$ parameter and FLOP reduction when $K = p$, with only a $\sim 1.3\%$ drop in validation accuracy on tasks such as SST-2, and negligible effect on the NTK approximation (a NumPy sketch follows this list).
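A NumPy sketch of the pipeline under the notation above, assuming hard $K$-means assignments and ReLU as a stand-in for the activation $\sigma$ (the paper compresses whatever feedforward activation the host model uses); function names and toy sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def fuse_mlp(W1, b1, W2, K, seed=0):
    """One-shot fusion sketch. W1: (p, p_I), b1: (p_I,), W2: (p_I, p)."""
    # Each hidden unit i defines a rank-1 sub-MLP: w_i = [W1[:, i]; b1[i]; W2[i, :]].
    sub_mlps = np.hstack([W1.T, b1[:, None], W2])            # (p_I, 2p + 1)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(sub_mlps).labels_
    C = np.eye(K)[labels].T                                  # (K, p_I) hard assignment matrix
    C_bar = C / C.sum(axis=1, keepdims=True)                 # normalized assignment matrix
    P = C @ C.T                                              # diagonal scaling (cluster sizes)
    return W1 @ C_bar.T, C_bar @ b1, C_bar @ W2, P           # compressed parameters

def fused_forward(X, W1_t, b1_t, W2_t, P, b2):
    # H = sigma(X W1_t + 1 b1_t^T) P W2_t + 1 b2^T, with sigma = ReLU here
    return np.maximum(X @ W1_t + b1_t, 0.0) @ P @ W2_t + b2

# Toy usage: compress a width-64 hidden layer down to K = 16 fused units (4x).
rng = np.random.default_rng(0)
p, p_I, K = 8, 64, 16
W1, b1 = rng.normal(size=(p, p_I)), rng.normal(size=p_I)
W2, b2 = rng.normal(size=(p_I, p)), rng.normal(size=p)
out = fused_forward(rng.normal(size=(4, p)), *fuse_mlp(W1, b1, W2, K), b2)   # (4, p)
```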

A central finding is the empirical preservation of the NTK and output, providing theoretical justification for the approach. Minimal post-compression tuning (one to several epochs) further recovers original-task accuracy (Ai et al., 2023).

3. Wasserstein Barycenter-Based FusionMLP for Model Weight Fusion

In "Wasserstein Barycenter-based Model Fusion and Linear Mode Connectivity of Neural Networks" (Akash et al., 2022), FusionMLP signifies a layer-wise, optimal-transport-based method for fusing the weights of multiple fully connected neural networks (MLPs) with identical architecture:

  • At each layer $\ell$, the weight matrices from $K$ pretrained models, $W_i^{(\ell)} \in \mathbb{R}^{k_\ell \times k_{\ell-1}}$, are interpreted as discrete empirical measures over their rows.
  • The fusion objective is to find a barycenter measure $\mu^*$ minimizing the weighted sum of squared 2-Wasserstein costs to the $K$ input measures:

\min_{\mu} \sum_{i=1}^K \lambda_i W_2^2(\mu, \mu_{W_i^{(\ell)}})

  • The fusion alternates between:

    • (i) Solving for optimal couplings $\Pi_i$ via entropic (Sinkhorn) regularization for each model,
    • (ii) Updating the fused weights as barycentric averages of the neurons under these couplings:

    W^{*(\ell)} = k_\ell \frac{\sum_i \lambda_i \Pi_i W_i^{(\ell)}}{\sum_i \lambda_i \Pi_i \mathbf{1}\mathbf{1}^T}

  • Neuron alignment between models is achieved by viewing each neuron as a function over the previous layer, permitting permutation and averaging in a principled, OT-optimal sense.
  • Empirical results establish that after OT-based permutation, the linear interpolation between different SGD solutions exhibits negligible increases in loss, affirming the linear-mode-connectivity conjecture.
  • The method is one-shot, requires no gradient-based retraining, and attains near-original accuracy when fusing multiple MLPs (e.g., $97.92\% \pm 0.12$ MNIST accuracy when fusing two base MLPNet models) (Akash et al., 2022). A simplified single-layer sketch follows.
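A minimal NumPy sketch of the alternating procedure for one layer, assuming uniform measures over the neurons, equal layer widths, and treating the layer in isolation (the full method also propagates alignments through preceding layers); function names are illustrative.

```python
import numpy as np

def sinkhorn(M, eps=0.01, iters=100):
    """Entropic-OT coupling between two uniform discrete measures, cost matrix M (n, m)."""
    M = M / max(M.max(), 1e-12)                  # rescale cost for numerical stability
    n, m = M.shape
    Kmat = np.exp(-M / eps)
    a, b = np.ones(n) / n, np.ones(m) / m        # uniform marginals over neurons
    v = np.ones(m)
    for _ in range(iters):
        u = a / (Kmat @ v)
        v = b / (Kmat.T @ u)
    return u[:, None] * Kmat * v[None, :]        # coupling Pi with marginals (a, b)

def fuse_layer(Ws, lams, eps=0.01, outer_iters=10):
    """Barycenter fusion of one layer across models; each Wi has shape (k, k_prev)."""
    W = sum(l * Wi for l, Wi in zip(lams, Ws))   # initialize at the naive average
    for _ in range(outer_iters):
        num, den = np.zeros_like(W), np.zeros_like(W)
        for lam, Wi in zip(lams, Ws):
            M = ((W[:, None, :] - Wi[None, :, :]) ** 2).sum(-1)   # squared row distances
            Pi = sinkhorn(M, eps)                                 # step (i): coupling
            num += lam * (Pi @ Wi)                                # mass-weighted neuron averages
            den += lam * Pi.sum(axis=1, keepdims=True)            # row masses (broadcast)
        W = num / den   # step (ii): barycentric update; the k_ell prefactor above is
                        # absorbed by the uniform 1/k marginals in this convention
    return W
```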

4. Comparative Overview of FusionMLP Methodologies

| Approach | Primary Use Case | Core Mechanism |
|----------|------------------|----------------|
| Fusion head (LabelFusion) | Prediction fusion in ensembles | MLP on [CLS] embedding + LLM per-class scores |
| NTK-preserving MLP compression (Ai et al., 2023) | Model size reduction, fine-tuning | Cluster sub-MLPs, reconstruct with centroids |
| Wasserstein barycenter fusion (Akash et al., 2022) | Full-model weight fusion | Layer-wise OT alignment, barycenter averaging |

Each path targets a different problem class. The FusionMLP head in LabelFusion is tailored to combining model outputs at inference time; the NTK-preserving method produces smaller, more efficient models; the barycenter approach fuses entire models for distributed learning and averaging.

5. Implementation and Practical Considerations

  • The fusion head in (Schlee et al., 11 Dec 2025) is implemented in the AutoFusionClassifier API, with all architectural details (number of layers, activations, hidden sizes, normalization, dropout) exposed for user configuration.
  • The NTK-based FusionMLP (Ai et al., 2023) requires only extracting weight blocks, clustering by $K$-means, and reconstructing via the formulas above. For practical usage, fusing only the top transformer layers is recommended, with optional distillation for further performance recovery.
  • In the barycenter context (Akash et al., 2022), layer-wise Sinkhorn-OT is computationally efficient for typical MLP widths. In practice, 10–100 iterations per OT step, with entropic regularization $\varepsilon \in [0.001, 0.01]$, suffice for stable fusion; a toy check follows this list. The method is compatible with both homogeneous and heterogeneous (width-varying) MLP networks.
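As a toy check of the coupling behavior (an illustrative example, not an experiment from the paper), the `sinkhorn()` sketch from Section 3 recovers an exact neuron permutation within the iteration budget quoted above; note that $\varepsilon$ here applies to the cost rescaled inside `sinkhorn()`, not the paper's raw cost scale.

```python
import numpy as np

# If model B is an exact neuron permutation of model A, the entropic coupling
# should concentrate on that permutation (reuses sinkhorn() from the Section 3 sketch).
rng = np.random.default_rng(0)
W_a = rng.normal(size=(8, 16))                   # 8 neurons, 16 inputs
perm = rng.permutation(8)
W_b = W_a[perm]                                  # "second model": permuted copy of A
M = ((W_a[:, None, :] - W_b[None, :, :]) ** 2).sum(-1)
Pi = sinkhorn(M, eps=0.01, iters=100)
print(np.array_equal(Pi.argmax(axis=1), np.argsort(perm)))   # True: permutation recovered
```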

6. Empirical Results and Limitations

FusionMLP consistently shows competitive or superior empirical fusion performance versus naive linear averaging or non-OT approaches. In the prediction-fusion context, accuracy metrics on AG News (92.4%) and Reuters-21578 (92.3%) surpass standalone RoBERTa or GPT-style LLMs for the same tasks (Schlee et al., 11 Dec 2025). In NTK-preserving compression, test accuracy remains within 1% of the full model on tasks like SST-2 and MNLI after lightweight tuning (Ai et al., 2023). For Wasserstein barycenter fusion, MNIST fusion accuracy matches or exceeds prior model fusion techniques (Akash et al., 2022).

A plausible implication is that FusionMLP mechanisms, especially those leveraging OT or explicit NTK objectives, offer robustness against the mismatch in learned representations and outperform non-aligned or simple averaging schemes. However, ablations targeting only the FusionMLP design parameters (hidden widths, activation, depth) are limited to codebase notebooks or partial reporting.

7. Theoretical Significance and Open Directions

FusionMLP, in its various incarnations, exemplifies the convergence of machine learning theory (neural tangent kernels, optimal transport, permutation algebras) with practical neural architecture design for ensemble learning, model compression, and distributed fusion. No global theoretical guarantees are established beyond empirical NTK and loss-surface results; formal proofs of full landscape connectivity or generalization under permutation-based fusion remain open. The interplay between neuron alignment, kernel preservation, and transferability across architectures continues to be a subject of ongoing research.
