
Model Soups: Efficient Weight-Averaging

Updated 7 March 2026
  • Model soups are weight-space ensembles that average independently fine-tuned neural networks to leverage hyperparameter diversity within a shared low-loss basin.
  • They provide improved generalization, robustness to distribution shifts, and efficient zero-inference-cost integration across applications like vision, language, and diffusion models.
  • Advanced variants such as greedy, learned, and layer-wise soups optimize mixing coefficients to fine-tune performance and support multi-domain or multi-objective tasks.

Model soups are weight-space ensembles obtained by averaging the parameters of multiple independently fine-tuned neural networks with the same architecture, typically initialized from the same pre-trained checkpoint. This technique provides a method to leverage the diversity generated by hyperparameter sweeps and stochastic training, resulting in improved generalization, robustness to distribution shift, and—in certain settings—extensible properties such as zero-shot continual learning or controllable trade-offs. Model soups incur no additional inference cost, as the result is a single model with a standard parameter set. The methodology has been extended from vision transformers and LLMs to diffusion generative models, graph neural networks, foundation model adapters, sparse/pruned networks, federated learning agents, and multi-objective RLHF policies.

1. Formalism and General Principles

The canonical model soup procedure, introduced by Wortsman et al. (Wortsman et al., 2022), constructs a new set of weights by computing an affine (typically convex) combination of $K$ independently fine-tuned models:

$$\theta_{\text{soup}} = \sum_{k=1}^{K} \lambda_k\, \theta^{(k)},$$

with coefficients $\lambda_k \geq 0$ and $\sum_k \lambda_k = 1$. The two principal strategies are:

  • Uniform soup: $\lambda_k = 1/K$ for all $k$.
  • Greedy soup: Iteratively add candidate models in order of descending validation accuracy, including a model if its addition improves validation performance when averaged with the current soup.

Empirical investigation across large-scale vision and language tasks showed that these soups frequently outperform the single best model from a hyperparameter sweep in both in-distribution and out-of-distribution settings, provided that all candidate models reside in a common, connected low-loss basin in parameter space (i.e., strong mode connectivity arising from shared initialization and architecture) (Wortsman et al., 2022, Dansereau et al., 2023). Parameter-space averaging, in contrast to output-level ensembling, introduces no extra computational or memory cost at inference.
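Concretely, the uniform and greedy procedures can be sketched in a few lines of NumPy, treating each checkpoint as a dict of weight arrays. The names `uniform_soup`, `greedy_soup`, and the `val_score` callback are illustrative stand-ins, not the paper's reference implementation:

```python
import numpy as np

def uniform_soup(checkpoints):
    """Average candidate weight dicts with equal coefficients 1/K."""
    K = len(checkpoints)
    return {name: sum(ck[name] for ck in checkpoints) / K
            for name in checkpoints[0]}

def greedy_soup(checkpoints, val_score):
    """Rank candidates by solo validation score, then add each to the
    soup only if the resulting average does not degrade that score."""
    ranked = sorted(checkpoints, key=val_score, reverse=True)
    members = [ranked[0]]
    best = val_score(uniform_soup(members))
    for ck in ranked[1:]:
        score = val_score(uniform_soup(members + [ck]))
        if score >= best:  # keep only non-degrading ingredients
            members.append(ck)
            best = score
    return uniform_soup(members)
```

With a held-out validation metric as `val_score`, the greedy variant automatically excludes outlier checkpoints that would drag the uniform average out of the low-loss basin.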

2. Algorithms and Variants

Model soup construction encompasses several algorithmic extensions:

  • Learned soup: Optimizes the mixing coefficients $\lambda_k$ on a validation set via gradient descent, optionally in a constrained or regularized way (e.g., block coordinate descent on the hyperplane constrained by $\sum_k \lambda_k = 1$) (Li et al., 2024, Zuber et al., 14 Mar 2025).
  • Greedy and pruned soups: Greedy soup selects a minimal subset of models whose addition to the average preserves or improves validation accuracy; pruned soup starts from the uniform average and recursively prunes away models that do not harm the validation metric when removed, using multiple passes for stability (Dansereau et al., 2023).
  • Layer-wise and fine-grained soups: Mixing coefficients may be learned per-layer or per-module, rather than globally, enabling additional expressivity and marginal accuracy gains, especially in deep transformers and ViTs (Li et al., 2024, Bai et al., 2024).
  • Efficient construction: Block coordinate gradient descent and partitioned subgraph optimization allow the exploitation of large model families without exceeding modest computational resource budgets (Zuber et al., 14 Mar 2025, Li et al., 2024).
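A minimal learned-soup sketch, assuming each checkpoint is flattened to a weight vector and the validation loss is a black box. The softmax parameterization keeps $\lambda$ on the simplex by construction, and central finite differences stand in for the backpropagated gradients a real implementation would use; `learned_soup` is an illustrative name:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def learned_soup(thetas, val_loss, steps=1000, lr=0.5, eps=1e-5):
    """Learn simplex-constrained mixing coefficients by gradient descent
    on a validation loss. lambda = softmax(z) guarantees lambda_k >= 0
    and sum_k lambda_k = 1 without explicit projection."""
    thetas = np.stack(thetas)        # (K, D) flattened weight vectors
    z = np.zeros(len(thetas))        # logits; start at the uniform soup
    loss_of = lambda zv: val_loss(softmax(zv) @ thetas)
    for _ in range(steps):
        grad = np.zeros_like(z)
        for k in range(len(z)):      # finite-difference gradient estimate
            dz = np.zeros_like(z); dz[k] = eps
            grad[k] = (loss_of(z + dz) - loss_of(z - dz)) / (2 * eps)
        z -= lr * grad
    lam = softmax(z)
    return lam, lam @ thetas
```

In practice the coefficients are learned with automatic differentiation (and often per layer or module, as noted above), but the simplex parameterization and validation-driven objective are the same.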

The following table summarizes common soup construction strategies:

| Strategy | Description | Constraints |
|----------|-------------|-------------|
| Uniform | Equal weights over all candidate models | $\lambda_k = 1/K$ |
| Greedy | Incrementally add a model if validation accuracy improves | $\lambda_k \geq 0$ |
| Learned | Optimize $\lambda_k$ via validation loss | $\sum_k \lambda_k = 1$ |
| Layer-wise | $\lambda_k$ vector per layer or module | $\sum_k \lambda_k^{(\ell)} = 1$ per $\ell$ |
| Pruned | Start from the uniform average, remove any non-harmful model | Support subset may shrink |

3. Theoretical Foundations

The effectiveness of model soups is theoretically grounded in the geometry of the neural loss landscape and the concentration of solutions in wide, connected basins when fine-tuning from shared pre-trained initialization. Weight-averaged models land at the barycenter of these modes, benefiting from:

  • Variance reduction: Averaging reduces the variance of predictions across models, as formalized via the classical bias-variance decomposition (Tran et al., 2 Mar 2026).
  • Flatness and connectivity: Souped weights inherit the flatness properties of the containing basin; convex paths in weight space between independently fine-tuned models remain in a low-loss region (Wortsman et al., 2022, Dansereau et al., 2023).
  • Pareto optimality under trade-offs: In multi-objective or adversarial settings, convex or affine combinations of extremal fine-tuned models trace Pareto fronts for robustness under multiple threats or objectives (Croce et al., 2023, Xie et al., 15 Feb 2025).

Negative results occur if candidate models occupy distinct, disconnected minima, as in training-from-scratch or when fine-tuning from highly divergent hyperparameters; in such cases, averaging can destroy useful structure, leading to collapse (Dansereau et al., 2023).
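A cheap diagnostic for this failure mode is to scan the loss along the straight line between two candidate checkpoints before souping them: a near-zero barrier is consistent with a shared basin, while a large spike warns that the average will land in a high-loss region. The `loss_barrier` helper below is an illustrative sketch, not taken from the cited papers:

```python
import numpy as np

def loss_barrier(theta_a, theta_b, loss_fn, n=21):
    """Maximum loss along the linear path between two weight vectors,
    minus the worse endpoint loss. Near zero suggests both checkpoints
    sit in one connected low-loss basin; a large value warns against
    averaging them."""
    alphas = np.linspace(0.0, 1.0, n)
    losses = [loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas]
    return max(losses) - max(losses[0], losses[-1])
```

For example, two minimizers of a convex quadratic show no barrier, while the two wells of $(\theta^2 - 1)^2$ are separated by a barrier of height 1 at $\theta = 0$, exactly the disconnected-minima situation in which averaging collapses.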

4. Application Domains

Model soups have been extended well beyond their origin in vision transformer fine-tuning:

  • Sparse and Pruned Networks: Sparse Model Soups (SMS) preserve pruning masks by enforcing identical connectivity during retraining splits, yielding substantial accuracy/OOD performance gains at high sparsity rates (e.g., +1.1% on CIFAR-100 at 98% sparsity), without densifying the network after averaging (Zimmer et al., 2023).
  • Diffusion Models: Soups constructed from independently trained specialist diffusion models on dataset shards enable continual training-free learning and unlearning, and provably avoid memorization by geometric-mean sampling distributions (Biggs et al., 2024).
  • Adapters and Domain Adaptation: Soup-Adapters average adapter module parameters, reparameterizing multiple adapters into a single efficient module that smooths the sensitivity to hyperparameters and improves both in-distribution and OOD performance under few-shot adaptation (Roschkowski, 8 Jul 2025).
  • Graph Neural Networks: Learned souping and partitioned learned souping reduce wall-time and memory overhead for GNNs, with up to 1.2% accuracy improvement and 24.5× speedup over exhaustive interpolation (Zuber et al., 14 Mar 2025).
  • Federated Learning: Local Superior Soups (LSS) interpolate local model checkpoints via randomized affine combinations, imposing affinity and diversity regularizers that lead to dramatic improvements in global accuracy under strict round constraints (Chen et al., 2024).
  • Multi-modal Model Fusion: SoupLM merges contrasting LLM or VLM checkpoints (Vicuna, LLaVA, etc.) into integrated, well-generalized models via whole-model or module-wise interpolation of weights, recovering 1–1.5% additional accuracy in language and vision-language tasks (Bai et al., 2024).
  • Multi-objective RLHF: Bone Soups construct a structured set of backbone models via MORLHF, then merge (“soup”) them via a circulant basis to produce controllably optimal multi-objective policies at inference (Xie et al., 15 Feb 2025).

5. Empirical Findings and Best Practices

Empirical studies across applications converge on several robust patterns:

  • Best performance requires pretrained initialization and shared architectures: Success hinges on all “ingredients” (candidate models) being aligned in both initialization and final label/feature dimensions (Wortsman et al., 2022, Dansereau et al., 2023).
  • Hyperparameter diversity among ingredients: Gains are typically highest for candidate models fine-tuned with different learning rates, seeds, augmentations, weight decays, or data orders, provided these do not cause basin separation (Wortsman et al., 2022, Zimmer et al., 2023).
  • Greedy/pruned selection consistently outperforms uniform in presence of outliers: Uniform soups risk collapse if poor checkpoints are included; greedy and pruned variants provide robustness (Dansereau et al., 2023).
  • Layer-wise and learned soups offer small but measurable improvements over global mixing: Layer-wise coefficients $\lambda_k^{(\ell)}$ can capture fine-grained adaptation not possible with a single global coefficient, especially in deep models (Li et al., 2024).
  • Model soups outperform or match conventional ensembles with no additional inference cost: Standard output- or logit-based ensembles require as many forward passes as ingredients; soups perform a single forward pass at inference (Wortsman et al., 2022, Biggs et al., 2024).

Model soups have been shown to deliver gains of 0.6–1.5 percentage points in top-1 accuracy on ImageNet (Wortsman et al., 2022), 0.7 points of macro-F1 on challenging low-resource ICH datasets (Tran et al., 2 Mar 2026), and superior robustness to distribution shifts, adversarially perturbed data, and compositional tasks in diffusion and RL (see individual paper metrics for architecture- and benchmark-specific deltas).

6. Limitations, Extensions, and Open Problems

Several limitations and frontiers for model soups are recognized:

  • Disconnected minima: When fine-tuned models are not in a shared basin, parameter averaging is ineffective or detrimental, as observed with ResNet or EfficientNet trained from scratch (Dansereau et al., 2023).
  • Calibration of batch/layer normalization: Averaged models may retain inconsistent running statistics, requiring recalibration after construction (Dansereau et al., 2023).
  • Mask and sparsity matching: In sparse and pruned soups, mask mismatches can densify the model; solutions require phase-wise mask sharing or specialized pruning-aware averaging (Zimmer et al., 2023).
  • Efficient construction for large $K$: Conventional learned-soup approaches do not scale to large $K$ due to memory and compute demands; MEHL-Soup and PLS methods are emerging solutions (Zuber et al., 14 Mar 2025, Li et al., 2024).
  • Automated coefficient selection and extrapolation: Grid search over affine weights for multi-objective or adversarial settings is viable at low dimension, but extension to high-dimensional spaces or automated coefficient learning is an open area (Croce et al., 2023, Xie et al., 15 Feb 2025).
  • MonoSoup for single-checkpoint post-hoc balancing: When only one fine-tuned checkpoint is available, SVD-based “MonoSoup” allows decomposition into high- and low-rank update directions, achieving much of the OOD benefit of full soups with trivial memory cost (Abdollahpoorrostam et al., 10 Feb 2026).
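The normalization-recalibration point above can be sketched as one streaming pass of calibration data through the frozen souped model, refreshing each BatchNorm layer's running statistics. The helper below is a simplified, framework-free stand-in; its name and exponential-moving-average update are illustrative rather than a specific library's API:

```python
import numpy as np

def recalibrate_bn_stats(batches, momentum=0.1):
    """Re-estimate one BatchNorm layer's running mean/variance from
    pre-normalization activations, batch by batch, after weight
    averaging has invalidated the old statistics."""
    mean = var = None
    for x in batches:                       # x: (batch, features)
        bm, bv = x.mean(axis=0), x.var(axis=0)
        if mean is None:
            mean, var = bm, bv
        else:                               # exponential moving average
            mean = (1.0 - momentum) * mean + momentum * bm
            var = (1.0 - momentum) * var + momentum * bv
    return mean, var
```

In a framework such as PyTorch, the equivalent effect is obtained by putting the souped model in train mode and forwarding a few calibration batches without taking optimizer steps.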

Future work includes extending model soups beyond homogeneous families (heterogeneous architectures), combining soups with distillation or Bayesian methods, and theoretical analysis of loss-landscape connectivity predictors for soup success.

7. Impact and Outlook

Model soups have reshaped best-practice protocols for both production and research pipelines involving large-scale model adaptation. They enable systematic exploitation of otherwise wasted diversity in hyperparameter sweeps, yield robust, low-variance models for deployment, and support new strategies for multi-domain or multi-objective model integration without retraining or complex ensemble infrastructure. Recent innovations such as hierarchical souping (Sanjeev et al., 2024), real-time controllable generation (Xie et al., 15 Feb 2025), and continual diffusion soup updating (Biggs et al., 2024) continue to expand both the theoretical and practical scope of the approach. Soups' zero-inference-cost profile and extensibility position them as a central tool in scalable, robust deep learning.
