TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter (2306.12642v1)
Abstract: Visual foundation models such as CLIP excel at learning feature representations from extensive datasets through self-supervised methods, demonstrating remarkable transfer and generalization capabilities. A growing number of applications build on visual foundation models, including innovative solutions such as BLIP-2. These applications employ pre-trained CLIP models as upstream feature extractors and train various downstream modules to accomplish diverse tasks. When a system upgrade replaces the upstream foundation model, all downstream modules must be re-trained to adapt to the new model, which is inflexible and inefficient. In this paper, we introduce a parameter-efficient and task-agnostic adapter, dubbed TaCA, which achieves compatibility across distinct foundation models while preserving the performance gains of the new model. TaCA lets downstream applications seamlessly integrate better-performing foundation models without any retraining. We validate TaCA extensively with models of up to one billion parameters on tasks such as video-text retrieval, video recognition, and visual question answering. The results consistently demonstrate the emergent ability of TaCA to enable hot-plugging upgrades of visual foundation models. Code and models will be available at https://github.com/TencentARC/TaCA.
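The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of the core idea: a small bottleneck adapter on top of the new (frozen) backbone, trained to align its features with the old model's embedding space so that frozen downstream modules keep working. `CompatAdapter`, `compatibility_loss`, and all dimensions are illustrative placeholders, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompatAdapter(nn.Module):
    """Hypothetical bottleneck adapter: projects features of a NEW
    foundation model into the embedding space that frozen downstream
    modules expect (the OLD model's space)."""

    def __init__(self, new_dim: int, old_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(new_dim, bottleneck)  # few trainable params
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, old_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(feats)))


def compatibility_loss(adapted_new: torch.Tensor,
                       old_feats: torch.Tensor) -> torch.Tensor:
    # A simple cosine-alignment proxy for a compatibility objective:
    # pull adapted new-model features toward the old model's features.
    return 1.0 - F.cosine_similarity(adapted_new, old_feats, dim=-1).mean()


# Illustrative setup: a larger new backbone (1024-d features) swapped in
# for an old 768-d one. Both backbones and all downstream task heads stay
# frozen; only the adapter is optimized, hence "hot-plugging" upgrades.
adapter = CompatAdapter(new_dim=1024, old_dim=768)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

new_feats = torch.randn(8, 1024)  # stand-in for new_backbone(images)
old_feats = torch.randn(8, 768)   # stand-in for old_backbone(images)

loss = compatibility_loss(adapter(new_feats), old_feats)
loss.backward()
optimizer.step()
```

Under these assumptions, an upgrade would consist of swapping in the frozen new backbone together with its trained adapter as the feature extractor, while every downstream module (retrieval head, video classifier, VQA decoder, etc.) remains untouched.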
Authors: Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou