Model Souping: Weight Averaging in Neural Networks

Updated 26 September 2025
  • Model souping is a technique that aggregates weights of independently fine-tuned models to create a single model with enhanced generalization and robustness.
  • It employs methods like uniform, greedy, and learned averaging to optimize model performance by leveraging properties of the loss landscape and prediction confidence.
  • The approach extends to modular adapters and multi-domain models, achieving significant improvements in accuracy, efficiency, and scalability across various applications.

Model souping is a methodology for synthesizing a single neural network from multiple models—typically fine-tuned or independently trained from a common initialization—by aggregating their parameters (often via averaging). This technique offers the predictive benefits of ensemble methods without incurring increased memory or inference costs, and has demonstrated utility across domains such as vision, language, speech, and generative modeling. The principle is grounded in the observation that, under certain conditions, independently fine-tuned models occupy a shared basin in the loss landscape, allowing for their parameters to be merged into a model with improved generalization and robustness. Model souping frameworks have been extended to modular adapters, graph neural networks, SSM hidden states, objective function composition, and amortized meta-optimization. Recent work explores hierarchical, resource-efficient, and adaptive souping algorithms, and investigates theoretical connections between weight averaging, ensemble predictions, and landscape geometry.

1. Core Principles and Mechanisms

At its foundation, model souping refers to constructing a single model by aggregating the weights of several fine-tuned models rather than selecting only the “best” checkpoint from a hyperparameter sweep (Wortsman et al., 2022). The process is typically performed post-hoc; models are trained independently (differing in seeds, hyperparameters, augmentation, or domains), and their weights are then combined by uniform averaging, greedy selection, or learned mixing coefficients. Uniform souping computes the arithmetic mean of parameter tensors:

$$\theta_{\mathrm{soup}} = \frac{1}{k} \sum_{i=1}^{k} \theta_i$$

where $\theta_i$ denotes each model’s weights. Greedy souping adds models sequentially if their inclusion improves held-out validation performance, mitigating negative interference from poorly aligned checkpoints. Learned souping frames the aggregation as an optimization problem over coefficients $\alpha_i$ with $\sum_i \alpha_i = 1$, minimizing a surrogate loss on a calibration set. These recipes achieve ensemble-like performance—improved accuracy and robustness—but inference cost remains equivalent to a single model.
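
As a concrete illustration of the uniform recipe, here is a minimal PyTorch sketch (not taken from the cited papers) that averages the state dicts of identically structured models fine-tuned from a common initialization; real models with non-float buffers (e.g. batch-norm counters) would need extra handling.

```python
import copy
import torch

def uniform_soup(models):
    """Uniform soup: element-wise mean of the parameters of identically
    structured models fine-tuned from the same initialization."""
    state_dicts = [m.state_dict() for m in models]
    soup_state = {
        key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
    soup_model = copy.deepcopy(models[0])
    soup_model.load_state_dict(soup_state)
    return soup_model

# Toy usage: three "fine-tuned" copies of the same small network.
base = torch.nn.Linear(16, 4)
ingredients = [copy.deepcopy(base) for _ in range(3)]
for model in ingredients:
    with torch.no_grad():
        for p in model.parameters():
            p.add_(0.01 * torch.randn_like(p))  # stand-in for fine-tuning
soup = uniform_soup(ingredients)
```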

The mathematical justification is provided by analysis of loss-landscape flatness and prediction confidence. Given interpolation between weights $\theta_0$ and $\theta_1$:

$$\theta_\alpha = (1-\alpha)\,\theta_0 + \alpha\,\theta_1$$

the difference between the weight-averaged and logit-ensembled losses

$$L_\alpha - L_{\alpha,\mathrm{ens}} \approx \frac{\alpha(1-\alpha)}{2} \left[ -\frac{d^2}{d\alpha^2} L_\alpha + \beta^2\, \mathbb{E}_{x}\, \mathrm{Var}_{Y \sim \mathrm{softmax}(\beta f(x;\,\theta_\alpha))}\!\left( \Delta f_Y(x) \right) \right]$$

shows that, in regions of sufficient loss flatness and confident predictions, the soup approximates ensembling without added compute (Wortsman et al., 2022).
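
To make this comparison concrete, the following is a hedged PyTorch sketch (an illustration under toy assumptions, not the paper’s evaluation protocol) that contrasts, at a single interpolation point, the loss of the weight-averaged model with the loss of the corresponding logit ensemble of the two endpoints.

```python
import copy
import torch
import torch.nn.functional as F

def interpolate(model_a, model_b, alpha):
    """Build a model whose weights are (1 - alpha) * theta_0 + alpha * theta_1."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    mixed = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
    model = copy.deepcopy(model_a)
    model.load_state_dict(mixed)
    return model

# Two toy "fine-tuned" endpoints sharing an initialization, plus a random batch.
torch.manual_seed(0)
base = torch.nn.Linear(32, 5)
model_a, model_b = copy.deepcopy(base), copy.deepcopy(base)
with torch.no_grad():
    for m in (model_a, model_b):
        for p in m.parameters():
            p.add_(0.05 * torch.randn_like(p))  # stand-in for fine-tuning
x, y = torch.randn(64, 32), torch.randint(0, 5, (64,))

alpha = 0.5
with torch.no_grad():
    soup_loss = F.cross_entropy(interpolate(model_a, model_b, alpha)(x), y)
    ens_logits = (1 - alpha) * model_a(x) + alpha * model_b(x)
    ens_loss = F.cross_entropy(ens_logits, y)
print(f"weight-averaged loss {soup_loss.item():.4f} "
      f"vs logit-ensemble loss {ens_loss.item():.4f}")
```

On real checkpoints that share a basin, the two losses track each other closely; a large gap signals that the flatness assumption underlying the approximation does not hold.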

2. Implementation, Selection, and Optimization

Successful model souping requires models to be “soup compatible”: that is, fine-tuned from a shared initialization and situated in a common solution basin. Implementation steps include:

  • Conducting a hyperparameter sweep or varying seeds to generate soup ingredients from a base pretrained model;
  • Evaluating individual models on a held-out validation set;
  • Aggregating models by chosen recipe (uniform, greedy, or learned mixing).

Greedy soup involves sorting models by validation accuracy and sequentially including models only if the soup’s accuracy does not decrease. Learned soup, as in MEHL-Soup (Li et al., 4 Jul 2024), optimizes mixing coefficients in a hyperplane defined by available models via block coordinate gradient descent; this reduces memory overhead (more than 13× savings), supporting soup construction on resource-constrained hardware.
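
A minimal sketch of the greedy recipe described above (simplified from Wortsman et al., 2022); `evaluate` is an assumed user-supplied callable that maps a candidate state dict to held-out validation accuracy.

```python
import copy
import torch

def average(state_dicts):
    """Element-wise mean of a list of state dicts."""
    return {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

def greedy_soup(models, evaluate):
    """Greedy soup: rank models by validation accuracy, then add each one to the
    running average only if the soup's accuracy does not decrease."""
    ranked = sorted(models, key=lambda m: evaluate(m.state_dict()), reverse=True)
    ingredients = [ranked[0].state_dict()]
    best_acc = evaluate(ingredients[0])
    for model in ranked[1:]:
        candidate = average(ingredients + [model.state_dict()])
        acc = evaluate(candidate)
        if acc >= best_acc:          # keep the ingredient only if it helps
            ingredients.append(model.state_dict())
            best_acc = acc
    soup = copy.deepcopy(ranked[0])
    soup.load_state_dict(average(ingredients))
    return soup, best_acc
```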

Recent work extends these recipes:

  • Hierarchical Souping (HS) (Sanjeev et al., 20 Mar 2024) for medical imaging merges local groups of models (generated by cyclical LR scheduling) before a global greedy combination, addressing rough error surfaces;
  • Partition Learned Souping (PLS) (Zuber et al., 14 Mar 2025) for GNNs partitions graphs and computes soup coefficients layer-wise on subgraphs, yielding a 24.5× speedup and 76% memory reduction;
  • Objective Soups (Saif et al., 12 Aug 2025) merge gradient signals in a hierarchical, multi-objective setting, separating highly conflicting objectives, and leveraging layer-selection to reduce computational overhead.

3. Extensions to Modular, Multitask, and Domain-composed Models

Model souping generalizes beyond whole-model weight averaging to modular and multi-domain settings:

  • Adapter Soups (Holtermann et al., 23 Jan 2024) average domain-specific adapter layers in base LLMs, using sophisticated weighting (corpus similarity, entropy, priors) and combination strategies (parameter averaging vs. ensembling); a weighted-averaging sketch follows this list.
  • Soup-Adapters (Roschkowski, 8 Jul 2025) aggregate outputs (or directly concatenate weights) of multiple adapters for robust domain adaptation, including for DINOv2 and CLIP. Reparameterization of the ensemble permits single-branch inference.
  • Cross-dataset or style-mixing soups (Wortsman et al., 2022, Biggs et al., 12 Jun 2024) merge models fine-tuned on disparate data shards, supporting continual learning, unlearning (removal of domains by weight subtraction), and hybrid generative abilities (e.g., “zero-shot” style blending in Diffusion Soup).
  • Soup-of-Experts (Ablin et al., 3 Feb 2025) pretrains a bank of expert parameters and learns a mapping from domain mixture weights to coefficients, instantiating specialist models adaptively and at scale.
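
As referenced in the Adapter Soups item above, weighted parameter averaging over adapters can be sketched as follows; normalizing arbitrary nonnegative relevance scores (e.g. corpus similarities) into mixing coefficients is an illustrative assumption, not the exact weighting scheme of the cited work.

```python
import torch

def weighted_adapter_soup(adapter_state_dicts, scores):
    """Merge domain-specific adapter weights using nonnegative relevance scores
    (e.g. corpus-similarity scores) normalized to sum to one."""
    scores = torch.tensor(scores, dtype=torch.float32)
    coeffs = scores / scores.sum()
    return {
        key: sum(c * sd[key] for c, sd in zip(coeffs, adapter_state_dicts))
        for key in adapter_state_dicts[0]
    }

# Toy usage: two "adapters" represented as small state dicts.
adapter_a = {"down.weight": torch.randn(8, 32), "up.weight": torch.randn(32, 8)}
adapter_b = {"down.weight": torch.randn(8, 32), "up.weight": torch.randn(32, 8)}
merged = weighted_adapter_soup([adapter_a, adapter_b], scores=[0.7, 0.3])
```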

In multi-task and multilingual speech, Objective Soups (Saif et al., 12 Aug 2025) compose losses hierarchically (vectorized, bilevel, or multilevel) and use conflict-dodging layer-selection to coordinate learning across diverse objectives and languages.

4. Empirical Results and Performance

Across domains, model souping yields consistent, measurable improvements over conventional selection and single-model approaches:

  • ImageNet: Greedy soups with large transformers (ViT-G/14) attain 90.94% top-1 accuracy (state of the art at publication), roughly 0.5–0.7 percentage points above the best individual model (Wortsman et al., 2022);
  • NLP: BERT/T5 soups improve GLUE accuracy/F1 by up to ~0.5–0.8 percentage points and reduce predictive uncertainty when souping multiple fine-tuned runs (Frick et al., 2023);
  • Multimodal: SoupLM (Bai et al., 11 Jul 2024), which merges LLaMA, Vicuna, and LLaVA variants, achieves superior results on MMLU, GSM8K, and LLaVA-Bench across language and vision-language tasks, with fine-grained or regularized souping further enhancing generalization;
  • GNNs: Learned Soup and Partition Learned Soup yield a 1.2% accuracy improvement and up to a 24.5× training speedup;
  • Generative Diffusion Models: Diffusion Soup achieves 30% to 59% improvements in Image Reward and TIFA for domain/composite style generation (Biggs et al., 12 Jun 2024);
  • State Space Models: document souping of hidden states yields multi-hop reasoning on HotpotQA that nearly matches the cross-encoder baseline (Jafari et al., 29 May 2025);
  • Unlearning in CLIP: model souping restores zero-shot accuracy after forgetting stages, maintaining performance on retained and unrelated classes while preserving effective forgetting of the targeted subgroup (Zhang et al., 3 Jun 2025).

Meta-regression analysis predicts composition effectiveness with up to 0.97 Pearson/Spearman correlation for adapter soups (Holtermann et al., 23 Jan 2024), enabling practitioners to anticipate soup performance.

5. Theoretical Insights and Analytic Perspective

The theoretical foundation of model souping lies in landscape geometry and ensemble equivalence. Analytical results show weight averaging and logit ensembling nearly coincide when interpolating between solutions in a flat loss region with high-confidence predictions (Wortsman et al., 2022, Menes et al., 31 Jan 2024). First-order Taylor expansion demonstrates the equivalence of ensemble logits and weight-soup performance (formally $d_{w_0} \mathcal{L}_{\mathrm{ens}} = d_{w_0} \mathcal{L}_{\mathrm{soup}}$). In diffusion models, uniform soup yields generative sampling close to the geometric mean over data shards, introducing strong regularization and anti-memorization properties (Biggs et al., 12 Jun 2024).

Neural averaging (Lee et al., 20 Aug 2025) extends this further: model souping is interpreted as a one-step gradient descent in an amortized meta-optimization framework (“AME”), treating model differences as pseudogradients and optimizing the ensemble adaptively (using optimizers like Adam or Adagrad), without access to data. This perspective enables more expressive ensembling and identifies conditions under which adaptive meta-updates outperform standard averaging, notably in out-of-distribution scenarios.
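
A hedged sketch of this pseudogradient view (an interpretation for illustration, not the authors’ released implementation): the mean difference between each fine-tuned model and the shared initialization is fed to an optimizer as if it were a gradient, so a plain SGD step with learning rate 1 recovers the uniform soup, while Adam or Adagrad gives an adaptive, data-free variant.

```python
import copy
import torch

def ame_step(base_model, finetuned_models, optimizer_cls=torch.optim.Adam, lr=1.0):
    """One amortized meta-optimization step: treat the mean (theta_i - theta_0)
    as a pseudogradient on a copy of the base model; no data is required."""
    merged = copy.deepcopy(base_model)
    optimizer = optimizer_cls(merged.parameters(), lr=lr)
    base_params = dict(base_model.named_parameters())
    ft_params = [dict(m.named_parameters()) for m in finetuned_models]

    optimizer.zero_grad()
    for name, p in merged.named_parameters():
        deltas = torch.stack([fp[name].detach() - base_params[name].detach()
                              for fp in ft_params])
        # The negative mean delta plays the role of a gradient: with SGD and
        # lr=1 this step lands exactly on the uniform soup.
        p.grad = -deltas.mean(dim=0)
    optimizer.step()
    return merged
```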

6. Applications, Robustness, and Limitations

Model souping is effective for robustness to distribution shift, improvements in OOD accuracy, and zero-shot generalization:

  • In medical imaging, hierarchical souping surpasses greedy/flat soups with a ~6% gain on HAM10000 and CheXpert, handling rough error surfaces and domain heterogeneity (Sanjeev et al., 20 Mar 2024).
  • Adapter and objective soups cut sensitivity to hyperparameter tuning and offer practical efficiency in few-shot and multi-task adaptation (Roschkowski, 8 Jul 2025, Saif et al., 12 Aug 2025).
  • Compartmentalization in Diffusion Soup supports training-free continual learning, robust unlearning, privacy guarantees against memorization, and flexible style mixing (Biggs et al., 12 Jun 2024).
  • Layer-wise or transition-zone merging (layer swapping) provides modular, interpretable, post hoc cross-lingual transfer in LLMs (Bandarkar et al., 2 Oct 2024).

Scaling model souping presents challenges in memory and compute; recent block coordinate and partitioned strategies (MEHL-Soup, PLS) enable soup construction and optimization on single GPUs or large graphs (Li et al., 4 Jul 2024, Zuber et al., 14 Mar 2025). Limitations include the requirement that fine-tuned models lie in a shared basin; merging models across distant regions is prone to destructive interference.
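
One practical way to screen for soup compatibility (an illustrative diagnostic rather than a procedure prescribed by the cited papers) is to scan the validation loss along the linear path between two checkpoints: a pronounced barrier between the endpoints indicates they sit in different basins and are likely to interfere destructively when averaged.

```python
import copy
import torch
import torch.nn.functional as F

def loss_barrier_scan(model_a, model_b, x, y, steps=11):
    """Cross-entropy loss at evenly spaced points on the linear path between two
    checkpoints; a spike between the endpoints signals a loss barrier."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            probe.load_state_dict(
                {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a})
            losses.append(F.cross_entropy(probe(x), y).item())
    return losses
```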

7. Code Availability and Future Directions

Many of the cited papers provide open-source implementations.

Ongoing directions include: principled meta-optimization for neural averaging (Lee et al., 20 Aug 2025), further modularization (layer/adapter/task/objective soups), scalability to hundreds or thousands of domains, learning adaptive soup coefficients via meta-learning, theory of soup compatibility, and extensions to continual, federated, and privacy-preserving learning. Model souping remains an active, rapidly evolving field with applications spanning vision, language, speech, graph, and generative domains.
