Model Soups: Parameter Averaging

Updated 1 July 2025
  • Model soup averages weights of multiple trained neural networks to create a single model with improved performance and robustness at no extra inference cost.
  • Model soups use strategies such as uniform, greedy, or pruned averaging; greedy and pruned methods select ingredients based on validation performance to deliver robust gains.
  • Model soups improve accuracy and robustness on vision, NLP, and out-of-distribution (OOD) tasks, offering a computationally efficient alternative to traditional output ensembling.

Model-Soup Parameter Averaging refers to a family of techniques for combining multiple trained neural networks by averaging their parameters (“weights”) to form a new network that often achieves improved predictive performance—frequently rivaling or even surpassing classical ensembling—without incurring any increase in inference or memory costs. Unlike traditional model selection, which picks a single best model from a set, or ensembling, which aggregates model predictions at inference time, model soup methods produce a single, merged model by averaging selected weights offline, maintaining exactly the same computational footprint as a standard model at deployment.

1. Foundations and Core Principles

Model-soup parameter averaging emerged from the observation that, when large pre-trained models are fine-tuned for downstream tasks, their optimization trajectories often remain within a single “low-error basin” in parameter space. Consequently, averaging the weights of these independently fine-tuned models yields a valid model that often generalizes better than any constituent model (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).

The archetypal model soup is constructed by simple arithmetic averaging of model weights: $\theta_{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{i\in\mathcal{S}} \theta_i$, where $\mathcal{S}$ is a subset of $k$ models (called “ingredients”), each fine-tuned from the same initialization but usually with varied hyperparameters (e.g., optimizer, data augmentation, learning rate), and $\theta_i$ are their parameter vectors. Unlike ensembling, this operation is performed only once, prior to deployment.
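
As a concrete illustration of this averaging step, the sketch below builds a uniform soup from PyTorch modules. It is a minimal sketch under stated assumptions: `models` is a hypothetical list of `nn.Module` instances sharing one architecture and pre-trained initialization, and `uniform_soup` is an illustrative helper, not an API from the cited papers.

```python
import copy


def uniform_soup(models):
    """Return a new model whose parameters are the arithmetic mean of the ingredients."""
    soup = copy.deepcopy(models[0])
    # Accumulate parameters (and buffers) in float32 to avoid integer division issues.
    avg_state = {k: v.detach().clone().float() for k, v in models[0].state_dict().items()}
    for model in models[1:]:
        for key, value in model.state_dict().items():
            avg_state[key] += value.detach().float()
    for key in avg_state:
        avg_state[key] /= len(models)
    # load_state_dict copies values back into each tensor's original dtype;
    # integer buffers (e.g., BatchNorm counters) may need special handling.
    soup.load_state_dict(avg_state)
    return soup
```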

Model soups contrast with:

  • Model selection: Discards all except the top-performing model on validation, forgoing substantial diversity.
  • Ensemble methods: Combine model outputs at inference time, but compute grows linearly with ensemble size (contrasted with a soup in the sketch below).
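
The compute contrast can be made explicit: an output ensemble pays one forward pass per ingredient for every prediction, whereas the soup behaves like a single model. The hedged sketch below reuses the hypothetical `uniform_soup` helper from above; `models`, `soup_model`, and `x` are assumed inputs.

```python
import torch


@torch.no_grad()
def ensemble_predict(models, x):
    # k forward passes per input: inference cost grows linearly with ensemble size.
    probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0)


@torch.no_grad()
def soup_predict(soup_model, x):
    # One forward pass: same latency and memory footprint as an ordinary model.
    return soup_model(x).softmax(dim=-1)
```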

2. Model Soup Construction Strategies

Three primary strategies have been established for model soup construction:

  1. Uniform Soup: All candidate models are averaged equally. This approach is effective only when all ingredients are of roughly comparable quality, as inclusion of weak models degrades performance, especially with architectures more sensitive to initialization (Model soups to increase inference without increasing compute time, 2023).
  2. Greedy Soup: Models are sequentially added to the soup in order of their validation accuracy, with a model retained only if its inclusion does not decrease the soup's validation performance. This safeguards against poor ingredients and reliably matches or improves on the best single model's performance (empirically, typical gains are 0.5–1.0 points of top-1 ImageNet accuracy for vision transformers) (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022, Model soups to increase inference without increasing compute time, 2023). A minimal code sketch of this procedure appears after this list.
  3. Learned Soup (or "parameterized soup"): Model weights are averaged with optimized (potentially per-model or per-layer) coefficients, typically found by maximizing accuracy on a validation set. This approach, while yielding slight additional improvements, is more computationally demanding to tune (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).

A related extension is Pruned Soup (Model soups to increase inference without increasing compute time, 2023), which starts from the average of all candidates and iteratively removes models if their exclusion increases validation performance, often producing the best results with fewer ingredients.
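
The greedy recipe above can be sketched in a few lines. The helpers are assumptions made for illustration: `state_dicts` holds checkpoints from the fine-tuning sweep, pre-sorted by decreasing individual validation accuracy, and `evaluate` is a user-supplied callable that loads a state dict into the architecture and returns validation accuracy.

```python
def average_state_dicts(dicts):
    """Arithmetic mean of a list of state dicts with identical keys."""
    return {k: sum(d[k].float() for d in dicts) / len(dicts) for k in dicts[0]}


def greedy_soup(state_dicts, evaluate):
    """state_dicts must be pre-sorted by individual validation accuracy (best first)."""
    ingredients = [state_dicts[0]]                 # start from the best single model
    best_acc = evaluate(average_state_dicts(ingredients))
    for candidate in state_dicts[1:]:
        trial_acc = evaluate(average_state_dicts(ingredients + [candidate]))
        if trial_acc >= best_acc:                  # keep only if validation accuracy holds or improves
            ingredients.append(candidate)
            best_acc = trial_acc
    return average_state_dicts(ingredients)
```

A pruned soup runs the same validation check in the opposite direction: it starts from the average of all candidates and drops an ingredient whenever its removal raises validation accuracy.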

3. Empirical Results and Application Domains

Model soup methods have been validated across multiple domains:

Vision: State-of-the-art top-1 accuracy on ImageNet has been achieved with greedy soups of ViT-G/14 models fine-tuned from a shared initialization, surpassing any single model from the hyperparameter sweep (e.g., 90.94% top-1 for the greedy soup vs. 90.78% for the best individual model; Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).

Distributional Robustness: Model soups offer marked improvements under distribution shift, with soups outperforming singles and sometimes even output ensembles on OOD datasets like ImageNet-R, ImageNet-Sketch, and ObjectNet.

Natural Language Processing: When applied to large pre-trained language models (BERT, T5), greedy soups often matched or exceeded the best single model on GLUE benchmarks; uniform soup performance could degrade when hyperparameter diversity was extreme.

Zero-shot Transfer: Soups composed of models fine-tuned on different datasets can improve zero-shot generalization on novel tasks.

Feature Ranking: Averaging the parameters of neural “mask” models yields robust feature-importance estimates for tabular data, providing a consistency across random seeds that single-model approaches (and other neural feature-ranking methods) lack (Parameter Averaging for Feature Ranking, 2022).

Sparse Model Soups: When all pruned models share an identical sparsity mask, parameter averaging preserves sparsity and offers better OOD generalization and fairness compared to standard sparsification or dynamic sparse training methods (Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging, 2023).
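
The role of the shared mask can be seen in a toy example (an illustrative sketch, not code from the paper): when every ingredient is zero at exactly the pruned positions, the average is zero there too, so the soup inherits the sparsity pattern.

```python
import torch

mask = torch.tensor([1., 0., 1., 0.])               # shared binary sparsity mask
w_a = torch.tensor([0.8, 0.5, -0.2, 0.3]) * mask    # two pruned "ingredients"
w_b = torch.tensor([0.6, -0.1, 0.4, 0.9]) * mask

soup = (w_a + w_b) / 2                              # tensor([0.7000, 0.0000, 0.1000, 0.0000])
assert torch.equal(soup == 0, mask == 0)            # pruned positions remain pruned
```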

4. Theoretical Underpinnings and Limitations

The theoretical validity of model soups relies on the geometry of the loss surface: if the models to be merged occupy the same local minimum (i.e., are in the same “error basin”), the interpolation path between their parameters is low-loss. Analytical formulations relate the expected benefit of weight averaging to the flatness (convexity) of the loss landscape and the confidence of model predictions:

$$L_\alpha^{\mathrm{soup}} - L_\alpha^{\mathrm{ensemble}} \approx \frac{\alpha(1-\alpha)}{2}\left( -\frac{d^2}{d\alpha^2} L_\alpha + \beta^2\, \mathbb{E}_x \operatorname{Var}_{Y\sim \mathrm{softmax}(\beta f(x;\theta_\alpha))}\!\left[\Delta f_Y(x)\right]\right)$$

(Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022). This expression decomposes the performance difference into the curvature of the loss along the interpolation path and the uncertainty of the model's predictions.
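
For readers unfamiliar with the notation, the block below restates the two-model interpolation setting that the expression assumes; it is a sketch of the standard setup, and the precise loss definitions should be taken from the cited paper.

```latex
% Two-model interpolation setting assumed by the decomposition above (sketch).
% \theta_0, \theta_1: two fine-tuned solutions sharing one pre-trained initialization.
\[
  \theta_\alpha = (1-\alpha)\,\theta_0 + \alpha\,\theta_1, \qquad \alpha \in [0,1]
\]
% L_\alpha^{soup}: expected loss of the weight-interpolated model f(\,\cdot\,;\theta_\alpha);
% L_\alpha^{ensemble}: expected loss of the \alpha-weighted ensemble of the two models' outputs.
% \Delta f(x): difference between the two models' outputs (logits) at input x,
% with \Delta f_Y(x) its component for class Y:
\[
  \Delta f(x) = f(x;\theta_1) - f(x;\theta_0)
\]
% \beta: the scaling (inverse temperature) applied inside the softmax.
```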

Limitations include:

  • Divergent Initializations: Averaging models trained from different random seeds or along divergent optimization paths often performs worse than the best individual model unless the networks are explicitly aligned (Model soups to increase inference without increasing compute time, 2023).
  • Architecture Sensitivity: Some architectures (e.g., ViTs) are more “soupable” than others (residual networks or EfficientNet can fail unless models closely agree).
  • Calibration: Model soups do not guarantee improved or preserved calibration; they retain calibration of constituent models only approximately.
  • Incomplete Theoretical Guarantees: The benefits are less predictable when models are not in overlapping basins or have been fine-tuned in non-compatible ways.

5. Best Practices and Practical Guidelines

Effective use of model soups requires several considerations:

  • Base models should be fine-tuned from the same pre-trained initialization to ensure proximity in weight space.
  • Ingredient selection matters: Uniform soups can be diluted by poor models; greedy or pruned approaches avoid this by validating each addition or removal.
  • Alignment for LLMs: For transformer models with tied input/output embeddings, ensure consistent tying and vocabulary mapping.
  • Validation Set Usage: Always reserve a held-out validation set for ingredient selection and soup construction to avoid overfitting.
  • Scalability: Souping can be applied at modest computational cost (after the fine-tuning sweep); for very large pools, approximation approaches such as RADIN (RADIN: Souping on a Budget, 31 Jan 2024) can reduce the resource requirements.

6. Broader Impact and Extensions

Model-soup parameter averaging has catalyzed a range of extensions, including population-based parameter averaging during training (PAPA), sparse soups built from pruned models, and budget-aware souping procedures such as RADIN; the principal variants and their trade-offs are summarized in the table below.

7. Summary Table: Key Model Soup Variants and Properties

| Variant | Selection/Weighting | Benefits | Limitations |
| --- | --- | --- | --- |
| Uniform Soup | Simple average over all models | Fast, no validation needed | Sensitive to poor models |
| Greedy Soup | Add a model only if validation accuracy holds or improves | Stable, robust improvements | Cost linear in ingredient count |
| Pruned Soup | Remove a model if its exclusion improves validation accuracy | Fewer ingredients, often best accuracy | Multiple passes, more compute |
| PAPA | Population averaging during training | Ensemble-like performance | Added training complexity |
| Sparse Soup | Average pruned models sharing a sparsity mask | Preserves sparsity, better OOD generalization | Requires matched sparsity masks |

8. Conclusion

Model-soup parameter averaging constitutes a robust and computationally efficient strategy for enhancing both accuracy and out-of-distribution robustness in deep learning. Its applicability across domains (vision, language, tabular data), compatibility with large model setups, and flexible integration with ongoing research in ensembling, pruning, and training dynamics underpin its continued impact in machine learning practice.