A Second-Order Perspective on Model Compositionality and Incremental Learning (2405.16350v2)
Abstract: The fine-tuning of deep pre-trained models has revealed compositional properties, with multiple specialized modules that can be arbitrarily composed into a single multi-task model. However, identifying the conditions that promote compositionality remains an open issue, with recent efforts concentrating mainly on linearized networks. We conduct a theoretical study that attempts to demystify compositionality in standard non-linear networks through the second-order Taylor approximation of the loss function. The proposed formulation highlights the importance of staying within the pre-training basin to obtain composable modules. Moreover, it provides the basis for two dual incremental training algorithms: one takes the perspective of multiple models trained individually, while the other aims to optimize the composed model as a whole. We probe their application in incremental classification tasks and highlight some valuable capabilities. Indeed, the pool of incrementally learned modules not only supports the creation of an effective multi-task model but also enables unlearning and specialization on certain tasks.
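To make the second-order view concrete, the sketch below illustrates the general idea rather than the paper's actual algorithm: each "module" is a weight delta from a shared pre-trained point, each task loss is replaced by its quadratic (second-order Taylor) model, and a diagonal curvature term stands in for the full Hessian. The function names (`taylor2_loss`, `compose`) and the toy quadratic tasks are illustrative assumptions, not artifacts from the paper.

```python
# Minimal sketch of second-order (quadratic) reasoning about module composition.
# Assumptions (not from the paper): a diagonal curvature proxy replaces the full
# Hessian, and "modules" are weight deltas from a shared pre-trained initialization.

import torch

def taylor2_loss(loss0, grad, hess_diag, delta):
    """Second-order Taylor estimate of L(theta0 + delta) around theta0."""
    return loss0 + grad @ delta + 0.5 * (hess_diag * delta) @ delta

def compose(deltas):
    """Naive composition: sum the task-specific weight deltas."""
    return torch.stack(deltas).sum(dim=0)

# Toy example: two "tasks" with quadratic losses around a shared pre-training point.
torch.manual_seed(0)
dim = 4
theta0 = torch.zeros(dim)
tasks = []  # per task: (loss at theta0, gradient at theta0, diagonal curvature)
for _ in range(2):
    tasks.append((torch.rand(1).item(),
                  torch.randn(dim) * 0.1,    # small gradients: theta0 lies near a minimum
                  torch.rand(dim) + 0.5))    # positive diagonal curvature

# Individually "fine-tuned" deltas: exact minimizers of each quadratic task model.
deltas = [-g / h for (_, g, h) in tasks]

merged = compose(deltas)
for t, (l0, g, h) in enumerate(tasks):
    own = taylor2_loss(l0, g, h, deltas[t])   # loss with the task's own module
    comp = taylor2_loss(l0, g, h, merged)     # loss with the composed model
    print(f"task {t}: loss(own delta)={own:.4f}  loss(composed)={comp:.4f}")

# The gap between the two numbers is governed by cross-terms of the form
# 0.5 * delta_j^T H_t delta_j (j != t); keeping each delta small, i.e. staying
# within the pre-training basin, keeps this interference small.
```

In this toy setup, the printed gap between the per-task loss and the composed-model loss shrinks as the deltas shrink, which is the intuition behind the claim that composable modules should stay close to the pre-training basin.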
Authors: Angelo Porrello, Lorenzo Bonicelli, Pietro Buzzega, Monica Millunzi, Simone Calderara, Rita Cucchiara