Continual Learning via Learning a Continual Memory in Vision Transformer (2303.08250v4)

Published 14 Mar 2023 in cs.CV and cs.LG

Abstract: This paper studies task-incremental continual learning (TCL) using Vision Transformers (ViTs). Our goal is to improve overall performance across streaming tasks without catastrophic forgetting by learning task synergies (e.g., a new task learns to automatically reuse or adapt modules from previous similar tasks, to introduce new modules when needed, or to skip some modules when the task is easier). One grand challenge is taming ViTs on streams of diverse tasks: balancing their plasticity and stability in a task-aware way while overcoming catastrophic forgetting. To address this challenge, we propose a simple yet effective approach that identifies a lightweight yet expressive "sweet spot" in the ViT block to serve as the task-synergy memory for TCL. We present a Hierarchical task-synergy Exploration-Exploitation (HEE) sampling-based neural architecture search (NAS) method that learns task synergies by structurally updating the identified memory component with four basic operations (reuse, adapt, new, and skip) as tasks stream in. The proposed method is thus dubbed CHEEM (Continual Hierarchical-Exploration-Exploitation Memory). In experiments, we test CHEEM on the challenging Visual Domain Decathlon (VDD) benchmark and the 5-Dataset benchmark; it consistently outperforms the prior art while learning a sensible memory structure continually.
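
To make the four structural operations concrete, below is a minimal sketch assuming a simple per-block linear module as the memory. The class name TaskSynergyMemory, the grow method, and the task ids are hypothetical illustrations, not the authors' released implementation; in the actual method, the operation chosen per task and per block is selected by the HEE sampling-based NAS rather than by hand.

```python
# A minimal sketch of the four structural operations -- reuse, adapt, new,
# and skip -- applied to a per-block "memory" module grown across tasks.
# Illustrative only: a simple linear module stands in for the memory
# component identified in the ViT block.
import copy

import torch
import torch.nn as nn


class TaskSynergyMemory(nn.Module):
    """One memory module per task at a given ViT block, grown continually."""

    def __init__(self, dim: int):
        super().__init__()
        self.per_task = nn.ModuleDict()  # task_id -> module used by that task
        self.dim = dim

    def grow(self, task_id: str, op: str, source_task: str = None):
        """Structurally update the memory for a new task with one operation."""
        if op == "new":
            # Introduce a fresh, trainable module for a dissimilar task.
            self.per_task[task_id] = nn.Linear(self.dim, self.dim)
        elif op == "reuse":
            # Share a previous task's module as-is; freeze it so training the
            # new task cannot overwrite it (no forgetting on the old task).
            shared = self.per_task[source_task]
            for p in shared.parameters():
                p.requires_grad_(False)
            self.per_task[task_id] = shared
        elif op == "adapt":
            # Start from a copy of a similar task's module and fine-tune the
            # copy; the original stays untouched.
            self.per_task[task_id] = copy.deepcopy(self.per_task[source_task])
        elif op == "skip":
            # Bypass this block's memory for an easier task.
            self.per_task[task_id] = nn.Identity()
        else:
            raise ValueError(f"unknown operation: {op!r}")

    def forward(self, x: torch.Tensor, task_id: str) -> torch.Tensor:
        return self.per_task[task_id](x)


# Usage: grow the memory as tasks stream in, then route by task id.
mem = TaskSynergyMemory(dim=768)
mem.grow("imagenet", op="new")
mem.grow("cifar100", op="adapt", source_task="imagenet")
mem.grow("svhn", op="reuse", source_task="cifar100")
y = mem(torch.randn(4, 197, 768), task_id="cifar100")
```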

