
DAM: Dynamic Adapter Merging for Continual Video QA Learning (2403.08755v2)

Published 13 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, (iii) handle inputs from unknown datasets during inference, and (iv) enable knowledge sharing across similar dataset domains. Given a set of continually streaming VidQA datasets, we sequentially train dataset-specific adapters for each dataset while freezing the parameters of a large pretrained video-language backbone. During inference, given a video-question sample from an unknown domain, our method first uses the proposed non-parametric router function to compute a probability for each adapter, reflecting how relevant that adapter is to the current video-question input instance. Subsequently, the proposed dynamic adapter merging scheme aggregates all the adapter weights into a new adapter instance tailored for that particular test sample to compute the final VidQA prediction, mitigating the impact of inaccurate router predictions and facilitating knowledge sharing across domains. Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains. We further extend DAM to continual image classification and image QA and outperform prior methods by a large margin. The code is publicly available at: https://github.com/klauscc/DAM


Summary

  • The paper introduces Dynamic Adapter Merging, a novel method that dynamically merges dataset-specific adapters to mitigate catastrophic forgetting in continual video QA learning.
  • It leverages a frozen pretrained backbone with a non-parametric router to combine cross-domain insights, achieving a 9.1% accuracy improvement and reducing forgetting by 1.9%.
  • The approach demonstrates versatility by extending its robust, parameter-efficient strategy to tasks like image classification and image QA, highlighting its broad potential in continual learning.

Dynamic Adapter Merging for Continual Video Question Answering Learning

Introduction

Continual learning (CL) of video question-answering (VidQA) models faces significant challenges, including catastrophic forgetting, efficient adaptation to newly arriving datasets, and handling inputs from unknown domains during inference. To address these issues, we introduce DAM, a novel, parameter-efficient method built on Dynamic Adapter Merging. The approach is designed for effective continual VidQA learning, enabling the model to adapt to sequentially streaming VidQA datasets without retraining from scratch or retaining previous data, thus significantly reducing catastrophic forgetting.

Approach

DAM comprises several key components:

Freezing the Backbone

Our method uses a large pretrained video-language model as the backbone and keeps it frozen to mitigate catastrophic forgetting.
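
To make this concrete, the sketch below shows how such freezing is typically implemented in a PyTorch-style setup; the helper name and the backbone argument are illustrative assumptions, not the authors' released code.

```python
import torch.nn as nn

def freeze_backbone(backbone: nn.Module) -> nn.Module:
    """Disable gradients for every backbone parameter so that only the
    lightweight adapters added later receive updates during training."""
    for param in backbone.parameters():
        param.requires_grad = False
    backbone.eval()  # keep normalization/dropout layers in inference mode
    return backbone
```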

Dataset-Specific Adapters

For each new dataset, DAM trains a dataset-specific adapter while keeping the backbone and all previously trained adapters frozen. This setup allows for dataset specialization and limits forgetting.
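
A minimal sketch of what a dataset-specific adapter could look like, assuming a standard bottleneck (Houlsby-style) adapter design; the paper's exact adapter architecture and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative bottleneck adapter; dimensions and design are assumptions."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the frozen backbone's features pass through
        # unchanged, and the adapter learns only a small dataset-specific correction.
        return x + self.up(self.act(self.down(x)))

# One adapter per continually arriving dataset; when a new dataset arrives,
# only its adapter is trained while all earlier adapters stay frozen.
adapters = nn.ModuleList([Adapter(hidden_dim=768) for _ in range(6)])
for earlier_adapter in list(adapters)[:-1]:
    for p in earlier_adapter.parameters():
        p.requires_grad = False
```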

Non-Parametric Router

At inference, given a sample whose dataset identity is unknown, a non-parametric router assigns each adapter a probability reflecting its relevance to the input; these probabilities then determine how the adapters are merged dynamically.
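
The summary does not spell out the router computation, so the sketch below is one plausible instantiation under an explicit assumption: each dataset is represented by a stored feature centroid, and the router scores adapters by a softmax over cosine similarities between the test sample's pooled feature and those centroids.

```python
import torch
import torch.nn.functional as F

def route(sample_feat: torch.Tensor,        # (d,) pooled feature of the test sample
          dataset_centroids: torch.Tensor,  # (k, d) one stored feature centroid per dataset
          temperature: float = 0.1) -> torch.Tensor:
    """Return a length-k probability vector, one entry per dataset-specific adapter."""
    sims = F.cosine_similarity(sample_feat.unsqueeze(0), dataset_centroids, dim=-1)  # (k,)
    return F.softmax(sims / temperature, dim=-1)
```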

Dynamic Adapter Merging

Drawing inspiration from recent model merging techniques, we propose a dynamic adapter merging scheme. It aggregates the weights of all adapters based on the router's predictions to generate a new adapter instance tailored to each test sample. This dynamic merging not only lessens the impact of incorrect router predictions but also fosters knowledge sharing across domains, leading to improved VidQA performance.
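
A minimal sketch of the per-sample weight merging, assuming all adapters share an identical architecture (so their state dicts align key by key) and have floating-point parameters; names and details are illustrative rather than the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def merge_adapters(adapters: nn.ModuleList, probs: torch.Tensor) -> nn.Module:
    """Average all adapters' weights using the router probabilities and return
    a fresh adapter instance tailored to the current test sample."""
    state_dicts = [a.state_dict() for a in adapters]
    weights = probs.tolist()
    merged_state = {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }
    merged = copy.deepcopy(adapters[0])
    merged.load_state_dict(merged_state)
    return merged
```

In this sketch, the merged adapter is then plugged into the frozen backbone to produce the prediction for that single test sample.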

Experimental Validation

We validate our approach on a benchmark comprising six VidQA datasets spanning various domains. Our experiments demonstrate that DAM outperforms existing state-of-the-art continual learning techniques by 9.1% in average accuracy while exhibiting 1.9% less forgetting. Moreover, we show that our method can be effectively applied to other tasks such as image classification and image QA, further underscoring its robustness and adaptability.

Analysis

Effectiveness of Adapter Merging

Our in-depth analysis reveals that adapter merging is particularly beneficial when the number of domains is large and router prediction becomes more challenging. Even with partially incorrect router predictions, merging adapters facilitates the use of cross-domain cues, enhancing overall performance.

Router Performance

Comparisons between different router designs indicate that our non-parametric router achieves the highest accuracy, underscoring the importance of an accurate and efficient router for domain-incremental VidQA learning.

Conclusion

Our work introduces a highly effective, generalizable, and parameter-efficient scheme for continual VidQA learning. By innovatively applying dynamic adapter merging, we demonstrate strong performance across various domains and tasks, indicating the method's potential for wider applications in continual learning scenarios.
