Model Soups: Parameter Averaging

Updated 1 July 2025
  • Model soup averages weights of multiple trained neural networks to create a single model with improved performance and robustness at no extra inference cost.
  • Model soups use strategies such as uniform, greedy, or pruned averaging; greedy and pruned methods select ingredients based on validation performance to deliver robust gains.
  • Model soups improve accuracy and robustness on vision, NLP, and out-of-distribution (OOD) tasks, offering a computationally efficient alternative to traditional output ensembling.

Model-Soup Parameter Averaging refers to a family of techniques for combining multiple trained neural networks by averaging their parameters (“weights”) to form a new network that often achieves improved predictive performance—frequently rivaling or even surpassing classical ensembling—without incurring any increase in inference or memory costs. Unlike traditional model selection, which picks a single best model from a set, or ensembling, which aggregates model predictions at inference time, model soup methods produce a single, merged model by averaging selected weights offline, maintaining exactly the same computational footprint as a standard model at deployment.

1. Foundations and Core Principles

Model-soup parameter averaging emerged from the observation that, when large pre-trained models are fine-tuned for downstream tasks, their optimization trajectories often remain within a single “low-error basin” in parameter space. Consequently, averaging the weights of these independently fine-tuned models yields a valid model that often generalizes better than any constituent model (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).

The archetypal model soup is constructed by simple arithmetic averaging of model weights: $\theta_{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{i\in\mathcal{S}} \theta_i$, where $\mathcal{S}$ is a subset of $k$ models (called “ingredients”), each fine-tuned from the same initialization but usually with varied hyperparameters (e.g., optimizer, data augmentation, learning rate), and $\theta_i$ are their parameter vectors. Unlike ensembling, this operation is performed only once, prior to deployment.
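
As a concrete illustration of this averaging step, the sketch below builds a uniform soup from PyTorch modules. It is a minimal sketch under stated assumptions: `models` is a hypothetical list of `nn.Module` instances sharing one architecture and pre-trained initialization, and `uniform_soup` is an illustrative helper, not an API from the cited papers.

```python
import copy


def uniform_soup(models):
    """Return a new model whose parameters are the arithmetic mean of the ingredients."""
    soup = copy.deepcopy(models[0])
    # Accumulate parameters (and buffers) in float32 to avoid integer division issues.
    avg_state = {k: v.detach().clone().float() for k, v in models[0].state_dict().items()}
    for model in models[1:]:
        for key, value in model.state_dict().items():
            avg_state[key] += value.detach().float()
    for key in avg_state:
        avg_state[key] /= len(models)
    # load_state_dict copies values back into each tensor's original dtype;
    # integer buffers (e.g., BatchNorm counters) may need special handling.
    soup.load_state_dict(avg_state)
    return soup
```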

Model soups contrast with:

  • Model selection: Discards all except the top-performing model on validation, forgoing substantial diversity.
  • Ensemble methods: Combine model outputs at inference time, but compute grows linearly with ensemble size (contrasted with a soup in the sketch below).
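
The compute contrast can be made explicit: an output ensemble pays one forward pass per ingredient for every prediction, whereas the soup behaves like a single model. The hedged sketch below reuses the hypothetical `uniform_soup` helper from above; `models`, `soup_model`, and `x` are assumed inputs.

```python
import torch


@torch.no_grad()
def ensemble_predict(models, x):
    # k forward passes per input: inference cost grows linearly with ensemble size.
    probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0)


@torch.no_grad()
def soup_predict(soup_model, x):
    # One forward pass: same latency and memory footprint as an ordinary model.
    return soup_model(x).softmax(dim=-1)
```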

2. Model Soup Construction Strategies

Three primary strategies have been established for model soup construction:

  1. Uniform Soup: All candidate models are averaged equally. This approach is effective only when all ingredients are of roughly comparable quality, as inclusion of weak models degrades performance, especially with architectures more sensitive to initialization (Model soups to increase inference without increasing compute time, 2023).
  2. Greedy Soup: Models are sequentially added to the soup in order of their validation accuracy, with a model retained only if its inclusion does not decrease the soup's validation performance. This safeguards against poor ingredients and reliably matches or improves on the best single model's performance (empirically, typical gains are 0.5–1.0 points of top-1 ImageNet accuracy for vision transformers) (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022, Model soups to increase inference without increasing compute time, 2023). A minimal code sketch of this procedure appears after this list.
  3. Learned Soup (or "parameterized soup"): Model weights are averaged with optimized (potentially per-model or per-layer) coefficients, typically found by maximizing accuracy on a validation set. This approach, while yielding slight additional improvements, is more computationally demanding to tune (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).

A related extension is Pruned Soup (Model soups to increase inference without increasing compute time, 2023), which starts from the average of all candidates and iteratively removes models if their exclusion increases validation performance, often producing the best results with fewer ingredients.
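
The greedy recipe above can be sketched in a few lines. The helpers are assumptions made for illustration: `state_dicts` holds checkpoints from the fine-tuning sweep, pre-sorted by decreasing individual validation accuracy, and `evaluate` is a user-supplied callable that loads a state dict into the architecture and returns validation accuracy.

```python
def average_state_dicts(dicts):
    """Arithmetic mean of a list of state dicts with identical keys."""
    return {k: sum(d[k].float() for d in dicts) / len(dicts) for k in dicts[0]}


def greedy_soup(state_dicts, evaluate):
    """state_dicts must be pre-sorted by individual validation accuracy (best first)."""
    ingredients = [state_dicts[0]]                 # start from the best single model
    best_acc = evaluate(average_state_dicts(ingredients))
    for candidate in state_dicts[1:]:
        trial_acc = evaluate(average_state_dicts(ingredients + [candidate]))
        if trial_acc >= best_acc:                  # keep only if validation accuracy holds or improves
            ingredients.append(candidate)
            best_acc = trial_acc
    return average_state_dicts(ingredients)
```

A pruned soup runs the same validation check in the opposite direction: it starts from the average of all candidates and drops an ingredient whenever its removal raises validation accuracy.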

3. Empirical Results and Application Domains

Model soup methods have been validated across multiple domains:

Vision: State-of-the-art top-1 accuracy on ImageNet has been achieved with greedy soups of ViT-G/14 models fine-tuned from a shared initialization, surpassing any single model from the hyperparameter sweep (e.g., 90.94% top-1 for the greedy soup vs. 90.78% for the best individual model; Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).

Distributional Robustness: Model soups offer marked improvements under distribution shift, with soups outperforming singles and sometimes even output ensembles on OOD datasets like ImageNet-R, ImageNet-Sketch, and ObjectNet.

Natural Language Processing: When applied to large pre-trained language models (BERT, T5), greedy soups often matched or exceeded the best single model on GLUE benchmarks; uniform soup performance could degrade when hyperparameter diversity was extreme.

Zero-shot Transfer: Soups composed of models fine-tuned on different datasets can improve zero-shot generalization on novel tasks.

Feature Ranking: Averaging the parameters of neural “mask” models yields robust feature-importance estimates for tabular data, providing a consistency across random seeds that single-model approaches (and other neural feature-ranking methods) lack (Parameter Averaging for Feature Ranking, 2022).

Sparse Model Soups: When all pruned models share an identical sparsity mask, parameter averaging preserves sparsity and offers better OOD generalization and fairness compared to standard sparsification or dynamic sparse training methods (Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging, 2023).
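
The role of the shared mask can be seen in a toy example (an illustrative sketch, not code from the paper): when every ingredient is zero at exactly the pruned positions, the average is zero there too, so the soup inherits the sparsity pattern.

```python
import torch

mask = torch.tensor([1., 0., 1., 0.])               # shared binary sparsity mask
w_a = torch.tensor([0.8, 0.5, -0.2, 0.3]) * mask    # two pruned "ingredients"
w_b = torch.tensor([0.6, -0.1, 0.4, 0.9]) * mask

soup = (w_a + w_b) / 2                              # tensor([0.7000, 0.0000, 0.1000, 0.0000])
assert torch.equal(soup == 0, mask == 0)            # pruned positions remain pruned
```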

4. Theoretical Underpinnings and Limitations

The theoretical validity of model soups relies on the geometry of the loss surface: if the models to be merged occupy the same local minimum (i.e., are in the same “error basin”), the interpolation path between their parameters is low-loss. Analytical formulations relate the expected benefit of weight averaging to the flatness (convexity) of the loss landscape and the confidence of model predictions:

$$L_\alpha^{\mathrm{soup}} - L_\alpha^{\mathrm{ensemble}} \approx \frac{\alpha(1-\alpha)}{2}\left( -\frac{d^2}{d\alpha^2} L_\alpha + \beta^2\, \mathbb{E}_x \operatorname{Var}_{Y\sim \mathrm{softmax}(\beta f(x;\theta_\alpha))}\!\left[\Delta f_Y(x)\right]\right)$$

(Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022). This expression decomposes the performance difference into the curvature of the loss along the interpolation path and the uncertainty of the model's predictions.
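
For readers unfamiliar with the notation, the block below restates the two-model interpolation setting that the expression assumes; it is a sketch of the standard setup, and the precise loss definitions should be taken from the cited paper.

```latex
% Two-model interpolation setting assumed by the decomposition above (sketch).
% \theta_0, \theta_1: two fine-tuned solutions sharing one pre-trained initialization.
\[
  \theta_\alpha = (1-\alpha)\,\theta_0 + \alpha\,\theta_1, \qquad \alpha \in [0,1]
\]
% L_\alpha^{soup}: expected loss of the weight-interpolated model f(\,\cdot\,;\theta_\alpha);
% L_\alpha^{ensemble}: expected loss of the \alpha-weighted ensemble of the two models' outputs.
% \Delta f(x): difference between the two models' outputs (logits) at input x,
% with \Delta f_Y(x) its component for class Y:
\[
  \Delta f(x) = f(x;\theta_1) - f(x;\theta_0)
\]
% \beta: the scaling (inverse temperature) applied inside the softmax.
```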

Limitations include:

  • Divergent Initializations: Averaging models trained from different random seeds or along divergent optimization paths often performs worse than the best individual model unless the networks are explicitly aligned (Model soups to increase inference without increasing compute time, 2023).
  • Architecture Sensitivity: Some architectures (e.g., ViTs) are more “soupable” than others (residual networks or EfficientNet can fail unless models closely agree).
  • Calibration: Model soups do not guarantee improved or preserved calibration; they retain calibration of constituent models only approximately.
  • Incomplete Theoretical Guarantees: The benefits are less predictable when models are not in overlapping basins or have been fine-tuned in non-compatible ways.

5. Best Practices and Practical Guidelines

Effective use of model soups requires several considerations:

  • Base models should be fine-tuned from the same pre-trained initialization to ensure proximity in weight space.
  • Ingredient selection matters: Uniform soups can be diluted by poor models; greedy or pruned approaches avoid this by validating each addition or removal.
  • Alignment for LLMs: For transformer models with tied input/output embeddings, ensure consistent tying and vocabulary mapping.
  • Validation Set Usage: Always reserve a held-out validation set for ingredient selection and soup construction to avoid overfitting.
  • Scalability: Souping can be applied at modest computational cost (after the fine-tuning sweep); for very large pools, approximation approaches such as RADIN (RADIN: Souping on a Budget, 31 Jan 2024) can reduce the resource requirements.

6. Broader Impact and Extensions

Model-soup parameter averaging has catalyzed a range of extensions, including population-based parameter averaging during training (PAPA), sparse soups built from pruned models, and budget-aware souping procedures such as RADIN; the principal variants and their trade-offs are summarized in the table below.

7. Summary Table: Key Model Soup Variants and Properties

| Variant | Selection/Weighting | Benefits | Limitations |
| --- | --- | --- | --- |
| Uniform Soup | Simple average over all models | Fast, no validation needed | Sensitive to poor models |
| Greedy Soup | Add a model only if validation accuracy holds or improves | Stable, robust improvements | Cost linear in ingredient count |
| Pruned Soup | Remove a model if its exclusion improves validation accuracy | Fewer ingredients, often best accuracy | Multiple passes, more compute |
| PAPA | Population averaging during training | Ensemble-like performance | Added training complexity |
| Sparse Soup | Average pruned models sharing a sparsity mask | Preserves sparsity, better OOD generalization | Requires matched sparsity masks |

8. Conclusion

Model-soup parameter averaging constitutes a robust and computationally efficient strategy for enhancing both accuracy and out-of-distribution robustness in deep learning. Its applicability across domains (vision, language, tabular data), compatibility with large model setups, and flexible integration with ongoing research in ensembling, pruning, and training dynamics underpin its continued impact in machine learning practice.