Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks (2401.15275v1)

Published 27 Jan 2024 in cs.CV

Abstract: Transformer neural networks are increasingly replacing prior architectures in a wide range of applications in different data modalities. The increasing size and computational demands of fine-tuning large pre-trained transformer neural networks pose significant challenges for the widespread adoption of these models for applications that demand on-edge computing. To tackle this challenge, continual learning (CL) emerges as a solution by facilitating the transfer of knowledge across tasks that arrive sequentially for an autonomously learning agent. However, current CL methods mainly focus on learning tasks that are exclusively vision-based or language-based. We propose a transformer-based CL framework focusing on learning tasks that involve both vision and language, known as Vision-and-Language (VaL) tasks. Due to the success of transformers in other modalities, our architecture has the potential to be used in multimodal learning settings. In our framework, we benefit from introducing extra parameters to a base transformer to specialize the network for each task. As a result, we enable dynamic model expansion to learn several tasks in a sequence. We also use knowledge distillation to benefit from relevant past experiences to learn the current task more efficiently. Our proposed method, Task Attentive Multimodal Continual Learning (TAM-CL), allows for the exchange of information between tasks while mitigating the problem of catastrophic forgetting. Notably, our approach is scalable, incurring minimal memory and time overhead. TAM-CL achieves state-of-the-art (SOTA) performance on challenging multimodal tasks.
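
The abstract describes two mechanisms: adding a small number of task-specific parameters to a shared base transformer (dynamic expansion) and distilling knowledge from the model trained on past tasks. The sketch below is a minimal illustration of how these two ideas typically combine, assuming a PyTorch-style backbone that maps a (batch, length, dim) token sequence to one of the same shape; the class name, the task-token readout convention, and the loss form are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskAttentiveModel(nn.Module):
    """Shared transformer backbone plus per-task tokens and heads (illustrative)."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                # shared pre-trained transformer
        self.embed_dim = embed_dim
        self.num_classes = num_classes
        self.task_tokens = nn.ParameterList()   # grows by one token per task
        self.heads = nn.ModuleList()            # one classifier head per task

    def add_task(self) -> None:
        # Dynamic expansion: a small number of new parameters specialize
        # the network for the incoming task; the backbone stays shared.
        self.task_tokens.append(nn.Parameter(torch.zeros(1, 1, self.embed_dim)))
        self.heads.append(nn.Linear(self.embed_dim, self.num_classes))

    def forward(self, tokens: torch.Tensor, task_id: int) -> torch.Tensor:
        # tokens: fused vision-and-language token sequence of shape (B, L, D).
        task_tok = self.task_tokens[task_id].expand(tokens.size(0), -1, -1)
        h = self.backbone(torch.cat([task_tok, tokens], dim=1))
        return self.heads[task_id](h[:, 0])     # read out at the task token


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soft-target knowledge distillation: penalize divergence between the
    # current model and a frozen snapshot taken before the new task, which
    # transfers past experience and mitigates catastrophic forgetting.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In a training loop of this shape, one would freeze a snapshot of the model (e.g. via copy.deepcopy, set to eval mode) as the teacher before each new task, then minimize the task's cross-entropy plus a weighted distillation_loss term; the paper's actual attention mechanism and distillation targets may differ from this sketch.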

Authors (2)
  1. Yuliang Cai (5 papers)
  2. Mohammad Rostami (64 papers)
Citations (2)
