Model Soups: Parameter Averaging
- A model soup averages the weights of multiple trained neural networks to create a single model with improved performance and robustness at no extra inference cost.
- Model soups use uniform, greedy, or pruned averaging strategies; the greedy and pruned methods select ingredients on a validation set, giving robust performance gains.
- Model soups improve accuracy and robustness in vision, NLP, and out-of-distribution (OOD) settings, offering a computationally efficient alternative to traditional ensembling.
Model-Soup Parameter Averaging refers to a family of techniques for combining multiple trained neural networks by averaging their parameters (“weights”) to form a new network that often achieves improved predictive performance—frequently rivaling or even surpassing classical ensembling—without incurring any increase in inference or memory costs. Unlike traditional model selection, which picks a single best model from a set, or ensembling, which aggregates model predictions at inference time, model soup methods produce a single, merged model by averaging selected weights offline, maintaining exactly the same computational footprint as a standard model at deployment.
1. Foundations and Core Principles
Model-soup parameter averaging emerged from the observation that, when large pre-trained models are fine-tuned for downstream tasks, their optimization trajectories often remain within a single “low-error basin” in parameter space. Consequently, averaging the weights of these independently fine-tuned models yields a valid model that often generalizes better than any constituent model (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).
The archetypal model soup is constructed by simple arithmetic averaging of model weights,
$$\theta_{\text{soup}} = \frac{1}{|S|} \sum_{i \in S} \theta_i,$$
where $S$ is a subset of models (called "ingredients"), each fine-tuned from the same initialization but, usually, with varied hyperparameters (e.g., optimizer, data augmentation, learning rate), and $\theta_i$ are their parameter vectors. Unlike ensembling, this operation is performed only once, prior to deployment.
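As a concrete illustration, the following PyTorch-style sketch builds a uniform soup from fine-tuned checkpoints. It assumes all checkpoints share an identical architecture and state-dict keys; the checkpoint paths and model construction in the usage comment are placeholders.

```python
import torch

def uniform_soup(state_dicts):
    """Elementwise average of state dicts with identical keys and shapes."""
    soup = {}
    for key, ref in state_dicts[0].items():
        if torch.is_floating_point(ref):
            # Stack the corresponding tensors across ingredients and take their mean.
            soup[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            # Integer buffers (e.g., BatchNorm step counters) are copied, not averaged.
            soup[key] = ref.clone()
    return soup

# Hypothetical usage: checkpoints fine-tuned from one shared initialization.
# soup_sd = uniform_soup([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(soup_sd)  # `model` must match the checkpoint architecture
```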
Model soups contrast with:
- Model selection: Discards all except the top-performing model on validation, forgoing substantial diversity.
- Ensemble methods: Combine model outputs, but increase compute linearly with ensemble size.
2. Model Soup Construction Strategies
Three primary strategies have been established for model soup construction:
- Uniform Soup: All candidate models are averaged equally. This approach is effective only when all ingredients are of roughly comparable quality, as inclusion of weak models degrades performance, especially with architectures more sensitive to initialization (Model soups to increase inference without increasing compute time, 2023).
- Greedy Soup: Models are sequentially added to the soup according to their validation accuracy, with a model retained only if its inclusion does not decrease validation performance. This safeguards against poor ingredients and reliably improves or matches the best single model's performance (empirically, typical performance gains are $0.5$–$1.0$ points in top-1 ImageNet accuracy for vision transformers) (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022, Model soups to increase inference without increasing compute time, 2023).
- Learned Soup (or "parameterized soup"): Model weights are averaged with optimized (potentially per-model or per-layer) coefficients, typically found by maximizing accuracy on a validation set. This approach, while yielding slight additional improvements, is more computationally demanding to tune (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).
A related extension is Pruned Soup (Model soups to increase inference without increasing compute time, 2023), which starts from the average of all candidates and iteratively removes models if their exclusion increases validation performance, often producing the best results with fewer ingredients.
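A minimal sketch of the greedy selection loop described above is shown below, reusing the hypothetical `uniform_soup` helper from Section 1; `evaluate` is an assumed user-supplied function mapping a state dict to held-out validation accuracy.

```python
def greedy_soup(state_dicts, evaluate):
    """Greedy soup: add ingredients in order of individual validation accuracy,
    keeping each one only if the averaged model's validation accuracy does not drop.
    `evaluate` maps a state dict to held-out validation accuracy (assumed helper)."""
    ranked = sorted(state_dicts, key=evaluate, reverse=True)  # best single model first
    ingredients = [ranked[0]]
    best_acc = evaluate(uniform_soup(ingredients))
    for candidate in ranked[1:]:
        trial_acc = evaluate(uniform_soup(ingredients + [candidate]))
        if trial_acc >= best_acc:  # keep only non-degrading ingredients
            ingredients.append(candidate)
            best_acc = trial_acc
    return uniform_soup(ingredients), best_acc
```

A pruned soup can be implemented with the same helpers by starting from the average of all candidates and repeatedly removing the ingredient whose exclusion most improves validation accuracy.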
3. Empirical Results and Application Domains
Model soup methods have been validated across multiple domains:
- Vision: State-of-the-art top-1 accuracy on ImageNet has been achieved with greedy soups of ViT-G/14 models fine-tuned from a shared initialization, surpassing any single model from the hyperparameter sweep (e.g., 90.94% top-1 for the greedy soup vs. 90.78% for the best individual model; Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).
- Distributional Robustness: Model soups offer marked improvements under distribution shift, outperforming single models and sometimes even output ensembles on OOD datasets such as ImageNet-R, ImageNet-Sketch, and ObjectNet.
- Natural Language Processing: When applied to large pre-trained language models (BERT, T5), greedy soups often matched or exceeded the best single model on GLUE benchmarks; uniform-soup performance could degrade when hyperparameter diversity was extreme.
- Zero-shot Transfer: Soups composed of models fine-tuned on different datasets can improve zero-shot generalization on novel tasks.
- Feature Ranking: Averaging the parameters of neural "mask" models yields robust feature-importance estimates for tabular data, with a consistency across random seeds that single-model approaches (and other neural feature-ranking methods) lack (Parameter Averaging for Feature Ranking, 2022).
- Sparse Model Soups: When all pruned models share an identical sparsity mask, parameter averaging preserves sparsity and offers better OOD generalization and fairness than standard sparsification or dynamic sparse training methods (Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging, 2023).
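The sparsity-preservation argument can be seen directly in code. The sketch below assumes every ingredient was pruned with one shared binary mask per weight tensor (`masks`); it is an illustrative simplification, not the full Sparse Model Soups recipe.

```python
import torch

def sparse_soup(state_dicts, masks):
    """Average pruned checkpoints that share one binary mask per weight tensor.
    Every ingredient is zero at the masked positions, so the elementwise mean
    is zero there as well and the soup keeps the same sparsity pattern."""
    soup = {}
    for key, ref in state_dicts[0].items():
        if torch.is_floating_point(ref):
            avg = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
            # Re-applying the shared mask is a no-op if all ingredients truly share it.
            soup[key] = avg * masks[key] if key in masks else avg
        else:
            soup[key] = ref.clone()
    return soup
```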
4. Theoretical Underpinnings and Limitations
The theoretical validity of model soups relies on the geometry of the loss surface: if the models to be merged occupy the same local minimum (i.e., lie in the same "error basin"), the linear interpolation path between their parameters remains low-loss. Analytical comparisons relate the expected benefit of weight averaging over output ensembling to the flatness (local convexity) of the loss landscape and the confidence of model predictions, decomposing the performance difference into a loss-curvature term and a prediction-uncertainty term (Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022).
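A common sanity check for the shared-basin assumption is to scan the validation loss along the linear path between two fine-tuned checkpoints: a roughly flat, low curve suggests weight averaging will behave well. The sketch below is illustrative; `loss_on_validation` is an assumed user-supplied evaluation function.

```python
import torch

@torch.no_grad()
def interpolation_loss_curve(model, sd_a, sd_b, loss_on_validation, steps=11):
    """Evaluate the validation loss of (1 - alpha) * theta_a + alpha * theta_b
    for alpha in [0, 1]; spikes along the path indicate a loss barrier, i.e.,
    the two checkpoints do not share an error basin and souping may hurt."""
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        interp = {}
        for key, val in sd_a.items():
            if torch.is_floating_point(val):
                interp[key] = (1.0 - alpha) * val + alpha * sd_b[key]
            else:
                interp[key] = val.clone()  # copy non-float buffers unchanged
        model.load_state_dict(interp)
        losses.append(loss_on_validation(model))
    return losses
```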
Limitations include:
- Divergent Initializations: Averaging models from different random seeds or optimization paths often yields worse-than-best performance without explicit alignment (Model soups to increase inference without increasing compute time, 2023).
- Architecture Sensitivity: Some architectures (e.g., ViTs) are more “soupable” than others (residual networks or EfficientNet can fail unless models closely agree).
- Calibration: Model soups do not guarantee improved or preserved calibration; they retain calibration of constituent models only approximately.
- Incomplete Theoretical Guarantees: The benefits are less predictable when models are not in overlapping basins or have been fine-tuned in non-compatible ways.
5. Best Practices and Practical Guidelines
Effective use of model soups requires several considerations:
- Base models should be fine-tuned from the same pre-trained initialization to ensure proximity in weight space.
- Ingredient selection matters: Uniform soups can be diluted by poor models; greedy or pruned approaches avoid this by validating each addition or removal.
- Alignment for LLMs: For transformer models with tied input/output embeddings, ensure consistent tying and vocabulary mapping.
- Validation Set Usage: Always reserve a held-out validation set for ingredient selection and soup construction to avoid overfitting.
- Scalability: Souping can be applied at modest computational cost (after the fine-tuning sweep); for very large pools, approximation approaches such as RADIN (RADIN: Souping on a Budget, 31 Jan 2024) can reduce the resource requirements.
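The first two guidelines above can be enforced mechanically before souping. The sketch below checks that candidate checkpoints have identical parameter names and shapes and, optionally, warns when an ingredient is far from the reference checkpoint in weight space (a rough proxy for proximity to the shared initialization); the relative-distance threshold is a hypothetical knob, not a published recommendation.

```python
import torch

def check_soup_compatibility(state_dicts, max_rel_distance=None):
    """Verify that candidate ingredients can be averaged: identical parameter
    names and shapes, and optionally a bounded relative distance in weight
    space from the first (reference) checkpoint."""
    ref = state_dicts[0]
    for sd in state_dicts[1:]:
        assert sd.keys() == ref.keys(), "state dicts must share parameter names"
        for key in ref:
            assert sd[key].shape == ref[key].shape, f"shape mismatch for {key}"
    if max_rel_distance is None:
        return
    ref_norm = torch.sqrt(sum((ref[k].float() ** 2).sum() for k in ref))
    for i, sd in enumerate(state_dicts[1:], start=1):
        dist = torch.sqrt(sum(((sd[k].float() - ref[k].float()) ** 2).sum() for k in ref))
        if dist / ref_norm > max_rel_distance:
            print(f"warning: ingredient {i} is far from the reference; averaging may underperform")
```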
6. Broader Impact and Extensions
Model-soup parameter averaging has catalyzed a range of extensions:
- Population Parameter Averaging (PAPA): Aligns model parameters during training to make averaging feasible even from scratch, achieving near-ensemble performance without post hoc curation (PopulAtion Parameter Averaging (PAPA), 2023); a simplified sketch of this idea appears after this list.
- Sparse Model Soups: Extends the paradigm to pruned models, enabling resource-efficient high-performing sparse networks (Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging, 2023).
- Continual Learning: Sequential averaging with checkpoints mitigates catastrophic forgetting, providing strong performance without replay buffers or explicit penalties (Soup to go: mitigating forgetting during continual learning with model averaging, 9 Jan 2025).
- Cross-lingual and Transfer Tasks: Averaging checkpoints from diverse runs sidesteps the need for source/target validation or extensive hyperparameter sweeps, producing robust transfer models (One For All & All For One: Bypassing Hyperparameter Tuning with Model Averaging For Cross-Lingual Transfer, 2023).
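As a loose illustration of the population-averaging idea behind PAPA (not the published update rule), each member of a population can periodically be nudged toward the current population mean so the models remain close enough in weight space to average; the pull rate and call frequency here are hypothetical.

```python
import torch

@torch.no_grad()
def pull_toward_population_mean(models, rate=0.01):
    """Nudge every member's parameters toward the population mean.
    Calling this every few training steps keeps independently trained members
    close in weight space so that a final soup (or repeated averaging) works."""
    for params in zip(*(m.parameters() for m in models)):
        mean = torch.stack([p.data for p in params]).mean(dim=0)
        for p in params:
            p.data.lerp_(mean, rate)  # p <- (1 - rate) * p + rate * mean
```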
7. Summary Table: Key Model Soup Variants and Properties
| Variant | Selection/Weighting | Benefits | Limitations |
|---|---|---|---|
| Uniform Soup | Simple average over all models | Fast, no validation needed | Sensitive to poor models |
| Greedy Soup | Add if validation score holds/improves | Stable, robust improvements | Linear in ingredient count |
| Pruned Soup | Remove if validation improves | Fewer models, best accuracy | Multiple passes, more compute |
| PAPA | Population averaging during training | Ensemble-like performance | Training complexity |
| Sparse Soup | Average pruned models (shared mask) | Efficient OOD, preserves sparsity | Requires matched sparsity |
References
- Wortsman, M., et al. (2022). "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time."
- Sobral, M., et al. (2023). "Model soups to increase inference without increasing compute time."
- Jolicoeur-Martineau, A., et al. (2023). "PopulAtion Parameter Averaging (PAPA)."
- Zimmer, M., et al. (2023). "Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging."
- Schreiber, J., et al. (2025). "Soup to go: mitigating forgetting during continual learning with model averaging."
- "Parameter Averaging for Feature Ranking" (2022).
- "RADIN: Souping on a Budget" (2024).
- "One For All & All For One: Bypassing Hyperparameter Tuning with Model Averaging For Cross-Lingual Transfer" (2023).
Model-soup parameter averaging constitutes a robust and computationally efficient strategy for enhancing both accuracy and out-of-distribution robustness in deep learning. Its applicability across domains (vision, language, tabular data), compatibility with large model setups, and flexible integration with ongoing research in ensembling, pruning, and training dynamics underpin its continued impact in machine learning practice.