Weighted Ensemble Models Are Strong Continual Learners (2312.08977v4)
Abstract: In this work, we study continual learning (CL), where the goal is to learn a model on a sequence of tasks under the constraint that data from previous tasks is unavailable while training on the current task. CL is essentially a balancing act between learning the new task (i.e., plasticity) and maintaining performance on previously learned concepts (i.e., stability). To address this stability-plasticity trade-off, we propose to weight-ensemble the model parameters of the previous and current tasks. The weight-ensembled model, which we call Continual Model Averaging (or CoMA), attains high accuracy on the current task by leveraging plasticity, while not deviating too far from the previous weight configuration, ensuring stability. We also propose an improved variant, Continual Fisher-weighted Model Averaging (or CoFiMA), which selectively weighs each parameter in the weight ensemble using the Fisher information of the model's weights. Both variants are conceptually simple, easy to implement, and attain state-of-the-art performance on several standard CL benchmarks. Code is available at: https://github.com/IemProg/CoFiMA.
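The abstract describes both variants at the level of parameter-space averaging. The sketch below illustrates the general idea under stated assumptions: it is not the authors' implementation (see the linked repository for that), the interpolation weight `alpha`, the use of a diagonal Fisher estimate, and the `eps` stabilizer are illustrative choices, and the state dicts / Fisher dicts are assumed to share the same keys.

```python
# Minimal sketch of parameter-space ensembling in the spirit of CoMA / CoFiMA.
# Assumptions: prev_state / curr_state are state_dicts of the same architecture,
# and the Fisher dicts hold per-parameter diagonal Fisher estimates.
import torch


@torch.no_grad()
def coma_average(prev_state, curr_state, alpha=0.5):
    """CoMA-style averaging: interpolate previous- and current-task weights
    with a single scalar weight alpha (hypothetical default)."""
    return {
        name: alpha * curr_state[name] + (1.0 - alpha) * prev_state[name]
        for name in curr_state
    }


@torch.no_grad()
def cofima_average(prev_state, curr_state, prev_fisher, curr_fisher, eps=1e-8):
    """CoFiMA-style averaging: weigh each parameter by its Fisher information,
    so parameters that matter more for a task stay closer to that task's value."""
    merged = {}
    for name in curr_state:
        f_prev, f_curr = prev_fisher[name], curr_fisher[name]
        merged[name] = (
            f_prev * prev_state[name] + f_curr * curr_state[name]
        ) / (f_prev + f_curr + eps)
    return merged
```

In a sequential setting, the merged state dict produced after task t would serve as the "previous" weights when merging after task t+1; the exact schedule and Fisher estimation procedure follow the paper and repository rather than this sketch.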