
DAM: Dynamic Adapter Merging for Continual Video QA Learning (2403.08755v2)

Published 13 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, (iii) handle inputs from unknown datasets during inference, and (iv) enable knowledge sharing across similar dataset domains. Given a set of continually streaming VidQA datasets, we sequentially train dataset-specific adapters for each dataset while freezing the parameters of a large pretrained video-language backbone. During inference, given a video-question sample from an unknown domain, our method first uses the proposed non-parametric router function to compute a probability for each adapter, reflecting how relevant that adapter is to the current video-question input instance. Subsequently, the proposed dynamic adapter merging scheme aggregates all the adapter weights into a new adapter instance tailored for that particular test sample to compute the final VidQA prediction, mitigating the impact of inaccurate router predictions and facilitating knowledge sharing across domains. Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains. We further extend DAM to continual image classification and image QA and outperform prior methods by a large margin. The code is publicly available at: https://github.com/klauscc/DAM


Summary

  • The paper introduces Dynamic Adapter Merging, a novel method that dynamically merges dataset-specific adapters to mitigate catastrophic forgetting in continual video QA learning.
  • It leverages a frozen pretrained backbone with a non-parametric router to combine cross-domain insights, achieving a 9.1% accuracy improvement and reducing forgetting by 1.9%.
  • The approach demonstrates versatility by extending its robust, parameter-efficient strategy to tasks like image classification and image QA, highlighting its broad potential in continual learning.

Dynamic Adapter Merging for Continual Video Question Answering Learning

Introduction

Continual learning (CL) of video question-answering (VidQA) models faces significant challenges, including catastrophic forgetting, efficient adaptation to newly arriving datasets, and handling inputs from unknown domains during inference. To address these issues, we introduce DAM, a novel, parameter-efficient method built on Dynamic Adapter Merging. The approach is designed for effective continual VidQA learning, enabling the model to adapt to sequentially streaming VidQA datasets without retraining from scratch or retaining previous data, thus significantly reducing catastrophic forgetting.

Approach

DAM comprises several key components:

Freezing the Backbone

Our method uses a large pretrained video-language model as the backbone and keeps it frozen to mitigate catastrophic forgetting.
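
To make this concrete, the sketch below shows how such freezing is typically implemented in a PyTorch-style setup; the helper name and the backbone argument are illustrative assumptions, not the authors' released code.

```python
import torch.nn as nn

def freeze_backbone(backbone: nn.Module) -> nn.Module:
    """Disable gradients for every backbone parameter so that only the
    lightweight adapters added later receive updates during training."""
    for param in backbone.parameters():
        param.requires_grad = False
    backbone.eval()  # keep normalization/dropout layers in inference mode
    return backbone
```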

Dataset-Specific Adapters

For each new dataset, DAM trains a dataset-specific adapter while keeping the backbone and all previously trained adapters frozen. This setup allows for dataset specialization and limits forgetting.
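
A minimal sketch of what a dataset-specific adapter could look like, assuming a standard bottleneck (Houlsby-style) adapter design; the paper's exact adapter architecture and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative bottleneck adapter; dimensions and design are assumptions."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the frozen backbone's features pass through
        # unchanged, and the adapter learns only a small dataset-specific correction.
        return x + self.up(self.act(self.down(x)))

# One adapter per continually arriving dataset; when a new dataset arrives,
# only its adapter is trained while all earlier adapters stay frozen.
adapters = nn.ModuleList([Adapter(hidden_dim=768) for _ in range(6)])
for earlier_adapter in list(adapters)[:-1]:
    for p in earlier_adapter.parameters():
        p.requires_grad = False
```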

Non-Parametric Router

At inference, given a sample whose dataset identity is unknown, a non-parametric router assigns each adapter a probability reflecting its relevance to the input; these probabilities then determine how the adapters are merged dynamically.
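
The summary does not spell out the router computation, so the sketch below is one plausible instantiation under an explicit assumption: each dataset is represented by a stored feature centroid, and the router scores adapters by a softmax over cosine similarities between the test sample's pooled feature and those centroids.

```python
import torch
import torch.nn.functional as F

def route(sample_feat: torch.Tensor,        # (d,) pooled feature of the test sample
          dataset_centroids: torch.Tensor,  # (k, d) one stored feature centroid per dataset
          temperature: float = 0.1) -> torch.Tensor:
    """Return a length-k probability vector, one entry per dataset-specific adapter."""
    sims = F.cosine_similarity(sample_feat.unsqueeze(0), dataset_centroids, dim=-1)  # (k,)
    return F.softmax(sims / temperature, dim=-1)
```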

Dynamic Adapter Merging

Drawing inspiration from recent model merging techniques, we propose a dynamic adapter merging scheme. It aggregates the weights of all adapters based on the router's predictions to generate a new adapter instance tailored to each test sample. This dynamic merging not only lessens the impact of incorrect router predictions but also fosters knowledge sharing across domains, leading to improved VidQA performance.
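
A minimal sketch of the per-sample weight merging, assuming all adapters share an identical architecture (so their state dicts align key by key) and have floating-point parameters; names and details are illustrative rather than the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def merge_adapters(adapters: nn.ModuleList, probs: torch.Tensor) -> nn.Module:
    """Average all adapters' weights using the router probabilities and return
    a fresh adapter instance tailored to the current test sample."""
    state_dicts = [a.state_dict() for a in adapters]
    weights = probs.tolist()
    merged_state = {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }
    merged = copy.deepcopy(adapters[0])
    merged.load_state_dict(merged_state)
    return merged
```

In this sketch, the merged adapter is then plugged into the frozen backbone to produce the prediction for that single test sample.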

Experimental Validation

We validate our approach on a benchmark comprising six VidQA datasets spanning various domains. Our experiments demonstrate that DAM outperforms existing state-of-the-art continual learning techniques by 9.1% in average accuracy while exhibiting 1.9% less forgetting. Moreover, we show that our method can be effectively applied to other tasks such as image classification and image QA, further underscoring its robustness and adaptability.

Analysis

Effectiveness of Adapter Merging

Our in-depth analysis reveals that adapter merging is particularly beneficial when the number of domains is large and router prediction becomes more challenging. Even with partially incorrect router predictions, merging adapters facilitates the use of cross-domain cues, enhancing overall performance.

Router Performance

Comparisons between different router designs indicate that our non-parametric router achieves the highest accuracy, underscoring the importance of an accurate and efficient router for domain-incremental VidQA learning.

Conclusion

Our work introduces a highly effective, generalizable, and parameter-efficient scheme for continual VidQA learning. By innovatively applying dynamic adapter merging, we demonstrate strong performance across various domains and tasks, indicating the method's potential for wider applications in continual learning scenarios.
