Weighted Ensemble Models Are Strong Continual Learners (2312.08977v4)

Published 14 Dec 2023 in cs.LG, cs.AI, and cs.CV

Abstract: In this work, we study the problem of continual learning (CL) where the goal is to learn a model on a sequence of tasks, such that the data from the previous tasks becomes unavailable while learning on the current task data. CL is essentially a balancing act between being able to learn on the new task (i.e., plasticity) and maintaining the performance on the previously learned concepts (i.e., stability). Intending to address the stability-plasticity trade-off, we propose to perform weight-ensembling of the model parameters of the previous and current tasks. This weighted-ensembled model, which we call Continual Model Averaging (or CoMA), attains high accuracy on the current task by leveraging plasticity, while not deviating too far from the previous weight configuration, ensuring stability. We also propose an improved variant of CoMA, named Continual Fisher-weighted Model Averaging (or CoFiMA), that selectively weighs each parameter in the weights ensemble by leveraging the Fisher information of the weights of the model. Both variants are conceptually simple, easy to implement, and effective in attaining state-of-the-art performance on several standard CL benchmarks. Code is available at: https://github.com/IemProg/CoFiMA.


Summary

  • The paper introduces CoFiMA, a method that averages model weights across tasks, weighting each parameter by its Fisher information to balance adaptation with retention.
  • It demonstrates that weight-ensembling (parameter averaging) consistently delivers higher accuracy than prior pre-trained-model-based methods on standard continual learning benchmarks.
  • The study also shows that performance depends on the choice of pre-trained backbone, with gains observed for both supervised and self-supervised pre-training.

Understanding the Balance Between Stability and Plasticity in Continual Learning

Continual learning (CL) is a critical area of AI research that tackles the challenge of enabling models to learn from a sequence of tasks without forgetting previously acquired knowledge. This requires a delicate balance between plasticity—adapting to new tasks—and stability—retaining existing knowledge.

Novel Weight-Ensemble Methods for Continual Learning

The paper introduces two related methods to help models strike this balance. The first, Continual Model Averaging (CoMA), averages the parameters of the model fine-tuned on the current task with those of the model from the previous tasks. This fosters plasticity on the new task while anchoring the weights near the previous configuration to preserve past knowledge. For example, a model fine-tuned on task A and a model subsequently fine-tuned on task B (starting from the task-A weights) often lie along a linear path in weight space whose interpolated points remain proficient on both tasks.
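The snippet below is a minimal sketch of this weight-averaging step in PyTorch; the helper name coma_average, the interpolation coefficient alpha, and the state-dict handling are illustrative assumptions rather than the authors' exact implementation:

```python
import copy
import torch

def coma_average(prev_model: torch.nn.Module,
                 curr_model: torch.nn.Module,
                 alpha: float = 0.5) -> torch.nn.Module:
    """Sketch of continual model averaging: per-parameter linear interpolation
    between the weights learned up to the previous task and the weights
    fine-tuned on the current task. Larger alpha favors plasticity (the
    current task); smaller alpha favors stability (previous tasks)."""
    averaged = copy.deepcopy(curr_model)
    prev_sd, curr_sd = prev_model.state_dict(), curr_model.state_dict()
    new_sd = {}
    with torch.no_grad():
        for name, curr_param in curr_sd.items():
            if curr_param.is_floating_point():
                new_sd[name] = alpha * curr_param + (1.0 - alpha) * prev_sd[name]
            else:
                # Integer buffers (e.g. BatchNorm counters) are copied, not averaged.
                new_sd[name] = curr_param
    averaged.load_state_dict(new_sd)
    return averaged
```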

The second approach, Continual Fisher-weighted Model Averaging (CoFiMA), refines CoMA by weighting each parameter in the ensemble according to its task-specific importance, as estimated by Fisher information. Fisher information measures how sensitive the model's predictions are to each parameter, so parameters that matter more for a task contribute more to the combined model. This addresses a shortcoming of plain averaging, which treats all weights equally and can therefore be suboptimal.
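As a rough sketch of how such Fisher-weighted merging could be implemented (the estimator below is a simple empirical diagonal Fisher built from squared gradients; the function names, the eps smoothing term, and the batch budget are assumptions and may differ from the paper's exact procedure):

```python
import copy
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, device="cpu", max_batches=50):
    """Estimate a diagonal Fisher matrix as the average squared gradient of the
    log-likelihood with respect to each parameter (empirical-Fisher sketch).
    Assumes a classification model that returns logits."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    n_batches = 0
    for x, y in data_loader:
        if n_batches >= max_batches:
            break
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=-1)
        F.nll_loss(log_probs, y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def cofima_average(prev_model, curr_model, prev_fisher, curr_fisher, eps=1e-8):
    """Per-parameter average weighted by each model's Fisher values, so that
    parameters deemed important for a task dominate the merged model."""
    merged = copy.deepcopy(curr_model)
    merged_sd = merged.state_dict()
    prev_sd, curr_sd = prev_model.state_dict(), curr_model.state_dict()
    with torch.no_grad():
        for n in prev_fisher:  # only learnable parameters carry Fisher weights
            # eps acts as a uniform prior: if both Fisher estimates are zero,
            # the merge falls back to a plain average of the two weights.
            f_prev, f_curr = prev_fisher[n] + eps, curr_fisher[n] + eps
            merged_sd[n] = (f_prev * prev_sd[n] + f_curr * curr_sd[n]) / (f_prev + f_curr)
    merged.load_state_dict(merged_sd)
    return merged
```

In practice, the Fisher estimate for the previous-task model would be computed on that task's data before it becomes unavailable and carried forward, since continual learning forbids revisiting old data.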

Performance Advantages of CoFiMA

CoFiMA's effectiveness is demonstrated through extensive experiments on standard CL benchmarks, where it consistently outperforms other continual learning methods built on pre-trained models (PTMs), with significant accuracy gains across various datasets. Notably, CoFiMA is robust not only with supervised pre-trained backbones but also with self-supervised ones, underscoring its flexibility and general applicability.

Insights from Model Comparisons

Competing methods, including sequential fine-tuning, prompt-based approaches (L2P and DualPrompt), and experience-replay strategies (DER++), were analyzed alongside CoFiMA. While these methods offer their own advantages, CoFiMA's integration of Fisher information into the weight-ensembling step provides a distinct edge. Its reported accuracy in the continual learning setting comes close to the joint-training upper bound, in which a model is trained on all tasks simultaneously, a setting usually unattainable in real-world applications.

Conclusion

In conclusion, CoFiMA represents a step forward in reconciling the stability-plasticity dilemma central to continual learning. By discerning how relevant each parameter is to different tasks, it produces models that excel on new tasks while resisting catastrophic forgetting of earlier ones. These properties make CoFiMA a strong candidate framework for practical continual learning applications.

It is also worth noting that performance depends on the choice of pre-trained model (PTM) used as the backbone. The paper thoroughly assesses the impact of different architectures, both supervised and self-supervised, providing valuable insight into the role of pre-training in continual learning.
