
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (2203.05482v3)

Published 10 Mar 2022 in cs.LG, cs.CL, and cs.CV

Abstract: The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.

Authors (11)
  1. Mitchell Wortsman (29 papers)
  2. Gabriel Ilharco (26 papers)
  3. Samir Yitzhak Gadre (12 papers)
  4. Rebecca Roelofs (19 papers)
  5. Raphael Gontijo-Lopes (7 papers)
  6. Ari S. Morcos (31 papers)
  7. Hongseok Namkoong (40 papers)
  8. Ali Farhadi (138 papers)
  9. Yair Carmon (45 papers)
  10. Simon Kornblith (53 papers)
  11. Ludwig Schmidt (80 papers)
Citations (755)

Summary

  • The paper introduces model soups, which improve performance by averaging weights from diverse fine-tuning runs without extra inference cost.
  • The methodology leverages hyperparameter variation across fine-tuning runs to achieve state-of-the-art results, including 90.94% top-1 accuracy on ImageNet with a ViT-G model.
  • Experimental results show model soups enhance accuracy across image classification and NLP tasks, promoting efficient, robust deployment.

Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time

The paper explores the practice of model selection during the fine-tuning phase of large pre-trained models. Traditional approaches often involve picking the single best model based on validation performance, discarding others. However, this paper questions the efficacy of this strategy, proposing a novel technique termed "model soups," which involves averaging the weights of multiple fine-tuned models to enhance accuracy and robustness without increasing inference time.

Key Contributions

The central claim of the paper is that, unlike conventional ensembling, which requires a forward pass through every member at inference time, averaging the parameters of several fine-tuned models (each with distinct hyperparameter settings) improves accuracy and robustness at no additional inference or memory cost. The method is particularly effective when the fine-tuned models lie in a single low-error basin of the loss landscape, as often happens when they all start from the same pre-trained checkpoint. A minimal sketch of this uniform averaging follows.
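The sketch below shows uniform weight averaging in PyTorch. The checkpoint paths and model object are hypothetical placeholders, not code from the paper's repository.

```python
# A minimal sketch of a uniform model soup: element-wise averaging of
# parameters from models fine-tuned off the same pre-trained checkpoint.
import torch

def uniform_soup(state_dicts):
    """Element-wise mean of a list of state dicts sharing one architecture."""
    return {
        key: torch.mean(
            torch.stack([sd[key].float() for sd in state_dicts]), dim=0
        )
        for key in state_dicts[0]
    }

# Usage (hypothetical paths): load the fine-tuned checkpoints, average them,
# and serve a single model -- inference cost is that of one forward pass.
# checkpoints = [torch.load(p, map_location="cpu") for p in ckpt_paths]
# model.load_state_dict(uniform_soup(checkpoints))
```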

Experimental Findings

The authors validate their approach by fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT. The resulting ViT-G soup attained 90.94% top-1 accuracy on ImageNet, a new state of the art at the time of publication. Model soups also improved performance across additional image classification and NLP tasks, as well as under distribution shift and in zero-shot transfer to new downstream tasks.
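The recipe behind the headline numbers is the paper's greedy variant: candidate models are sorted by held-out accuracy, and each is added to the soup only if doing so improves the soup's held-out accuracy. A sketch under that description, where `evaluate` is a hypothetical helper that loads a state dict and returns validation accuracy:

```python
# Sketch of the greedy soup selection loop. `evaluate` is a hypothetical
# helper, assumed to return held-out accuracy for a given state dict.
import torch

def average(state_dicts):
    """Element-wise mean of compatible state dicts."""
    return {
        k: torch.mean(torch.stack([sd[k].float() for sd in state_dicts]), dim=0)
        for k in state_dicts[0]
    }

def greedy_soup(state_dicts, evaluate):
    # Rank candidates by individual held-out accuracy, best first.
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_acc = evaluate(ranked[0])
    for candidate in ranked[1:]:
        trial_acc = evaluate(average(ingredients + [candidate]))
        if trial_acc >= best_acc:  # keep the model only if the soup improves
            ingredients.append(candidate)
            best_acc = trial_acc
    return average(ingredients)
```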

Theoretical Analysis

The paper also analyzes when weight-averaging approximates logit-ensembling. The authors analytically relate the similarity of the two to the flatness of the loss around the fine-tuned solutions and to the confidence of the predictions, and they validate this relation empirically: the flatter the region between solutions, the closer the soup's outputs are to those of the ensemble.
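A toy illustration of that relation, with synthetic models standing in for fine-tuned solutions in a shared low-error basin (none of this is the paper's code or experimental setup):

```python
# Toy check: two nearby solutions (a shared init plus small perturbations)
# yield a weight-averaged "soup" whose logits nearly match the logit ensemble.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_model():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

base = make_model()
models = []
for _ in range(2):
    m = make_model()
    m.load_state_dict(base.state_dict())
    with torch.no_grad():
        for p in m.parameters():
            p.add_(0.01 * torch.randn_like(p))  # small intra-basin step
    models.append(m)

# Build the soup by averaging the two perturbed state dicts.
soup = make_model()
soup.load_state_dict({
    k: (models[0].state_dict()[k] + models[1].state_dict()[k]) / 2
    for k in base.state_dict()
})

x = torch.randn(8, 16)
with torch.no_grad():
    ensemble_logits = (models[0](x) + models[1](x)) / 2
    soup_logits = soup(x)
# The gap shrinks as the region between the solutions gets flatter.
print((soup_logits - ensemble_logits).abs().max())
```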

Implications and Future Directions

The paper's findings suggest practical advantages for deployment scenarios where inference efficiency is critical. It challenges the convention of single-model selection and proposes an alternative that reuses the otherwise discarded computation of a hyperparameter sweep: every fine-tuning run becomes a potential soup ingredient rather than a rejected candidate. This work opens avenues for further research in parameter-space exploration and loss-landscape analysis to better understand the behavior of fine-tuned pre-trained networks.

Conclusion

The paper presents a compelling argument for adopting model soups in the fine-tuning workflow, underscoring their ability to improve accuracy and robustness without any additional cost at inference time. The method has implications for both theory and practice, particularly in settings that prioritize computational efficiency without sacrificing performance. Future work may extend the approach to further domains and datasets and integrate it into broader training and deployment pipelines.
