Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters (2403.11549v2)

Published 18 Mar 2024 in cs.CV

Abstract: Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. However, mitigating the performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models, we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE Adapter and the original CLIP, respectively. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%. Our code is available at https://github.com/JiazuoYu/MoE-Adapters4CL

Summary of "Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters"

The paper "Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters" presents a parameter-efficient framework specifically designed to enhance the continual learning capabilities of large-scale vision-language models such as CLIP. It targets the long-term forgetting and computational burdens typically associated with continual learning (CL) systems.

The authors introduce a novel architecture featuring Mixture-of-Experts (MoE) adapters, which allow the model to expand dynamically for new tasks while preserving previously learned knowledge, so that it adapts efficiently to both seen and unseen data. A key component of this architecture is the Distribution Discriminative Auto-Selector (DDAS), which performs automatic task recognition and preserves zero-shot capabilities by routing in-distribution inputs to the MoE adapters and out-of-distribution inputs to the original, frozen CLIP model.
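
To make the routing concrete, the following is a minimal, hypothetical sketch of DDAS-style dispatch rather than the authors' implementation. It assumes one in-distribution scorer per learned task (the scoring mechanism itself is an assumption here), one MoE-adapted prediction branch per task, the original zero-shot CLIP head as a fallback, and a hand-chosen threshold.

```python
import torch

def ddas_route(image_feat, task_scorers, moe_branches, frozen_clip_head, threshold):
    """Hypothetical DDAS-style dispatch; names and interface are illustrative.

    task_scorers     -- one callable per learned task, returning a scalar
                        in-distribution score for the input feature
    moe_branches     -- one MoE-adapted prediction branch per learned task
    frozen_clip_head -- the original zero-shot CLIP classifier (fallback)
    threshold        -- scores below this are treated as out-of-distribution
    """
    scores = torch.stack([scorer(image_feat) for scorer in task_scorers])
    best = int(torch.argmax(scores))
    if scores[best] < threshold:
        # No learned task matches the input distribution:
        # keep CLIP's original zero-shot prediction.
        return frozen_clip_head(image_feat)
    # In-distribution: use the fine-tuned MoE branch of the selected task.
    return moe_branches[best](image_feat)
```

Because the fallback path is the untouched CLIP model, routing out-of-distribution inputs this way is what preserves zero-shot behaviour over the course of continual training.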

Key Contributions

  1. Parameter-Efficient MoE-Adapters: By leveraging MoE structures with dynamic router mechanisms, the authors propose a training framework that reduces the parameter-training burden by 60% compared with existing state-of-the-art methods. Training follows a novel activate-freeze strategy in which specific experts are activated based on task-related features, supporting intra-task learning while enabling inter-task knowledge sharing among experts (a minimal sketch of such an adapter layer follows this list).
  2. Distribution Discriminative Auto-Selector (DDAS): The authors propose DDAS to automatically determine the task identity by predicting data distribution variations. This mechanism ensures effective routing of input data, either to exploit the fine-tuned expertise encapsulated in MoE adapters or to retain the zero-shot generalization abilities of the frozen CLIP model.
  3. Extensive Evaluation: Empirical results indicate that the proposed method consistently surpasses prior state-of-the-art solutions across multiple continual learning benchmarks. Notably, the approach remains robust under few-shot settings, significantly improving retention of past tasks while maintaining compelling zero-shot performance.

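As a rough illustration of the first contribution, below is a minimal PyTorch-style sketch of an MoE adapter layer: a handful of low-rank adapter experts gated by a learned router and added residually on top of a frozen backbone feature. The expert count, rank, top-k routing, and feature width are illustrative assumptions, not the paper's configuration; an activate-freeze schedule would additionally freeze experts favoured by earlier tasks and train only the newly activated ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAdapter(nn.Module):
    """Sketch of a Mixture-of-Experts adapter: low-rank experts plus a router.

    Hyperparameters here are placeholders, not the paper's settings.
    """

    def __init__(self, dim, num_experts=4, rank=16, top_k=2):
        super().__init__()
        # Each expert is a small bottleneck adapter (down-project, GELU, up-project).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, rank), nn.GELU(), nn.Linear(rank, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)  # per-expert gating logits
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, dim) features from a frozen CLIP layer.
        gate_logits = self.router(x)                         # (batch, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)  # sparse expert choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return x + out  # residual: frozen feature plus the adapter update

# Example with an assumed feature width:
# adapter = MoEAdapter(dim=512); y = adapter(torch.randn(8, 512))
```
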
Implications and Future Prospects

This paper contributes significantly to the field of continual learning by demonstrating that appropriate architectural modifications can substantially enhance parameter efficiency and task adaptability in large-scale models. The incorporation of MoE-based dynamic adjustment strategies showcases a promising direction towards addressing both the catastrophic forgetting and generalization challenges inherent to lifelong learning settings.

From a theoretical perspective, this work exemplifies the potential of leveraging sparse expert models for balancing task-specific and generalizable knowledge in neural architectures. Practically, such advancements could be vital in deploying AI models across dynamic and resource-constrained environments, where adaptive learning and fast scalability are critical.

Looking forward, potential extensions of this work include investigating the impact of varying the number of experts, exploring alternative selection strategies in the MoE frameworks, and refining the automatic data distribution mechanisms to further streamline task recognition without manual threshold adjustments or identity references. Additionally, the approach's application to domains beyond vision-language tasks, such as NLP and reinforcement learning, could be explored to evaluate its adaptability and effectiveness in different data paradigms.

The framework presented in this paper creates pathways for efficient lifelong learning systems, promoting a shift towards more robust and scalable AI applications capable of seamlessly integrating new information over extended operational lifespans.

Authors (7)
  1. Jiazuo Yu
  2. Yunzhi Zhuge
  3. Lu Zhang
  4. Dong Wang
  5. Huchuan Lu
  6. You He
  7. Ping Hu
Citations (39)