Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning (2311.15769v1)
Abstract: Large pre-trained vision models have achieved impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly video understanding, can be prohibitively expensive. Recent studies have therefore turned to efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and do not explore transferring larger models to the video domain. In this paper, we present Side4Video, a novel spatial-temporal side network for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and exploits multi-level spatial features from the original image model. This extremely memory-efficient architecture reduces memory usage by 75% compared to previous adapter-based methods, enabling us to transfer a huge ViT-E (4.4B parameters), 14x larger than ViT-L (304M), to video understanding tasks. Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), notably on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%), and VATEX (68.8%). Our code is available at https://github.com/HJYao00/Side4Video.
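The abstract describes a trainable side network that taps multi-level spatial features from a frozen backbone, so backpropagation never enters the pre-trained model. Below is a minimal PyTorch sketch of that idea; all module names, dimensions, and the specific spatial and temporal operators here are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the side-network idea from the abstract (not the paper's
# code). Module names, dimensions, and operators are illustrative assumptions.
import torch
import torch.nn as nn


class FrozenViTBlock(nn.Module):
    """Stand-in for one pre-trained ViT encoder block (kept frozen)."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class SideBlock(nn.Module):
    """Lightweight spatial-temporal block fed by one level of backbone features."""

    def __init__(self, side_dim: int, vit_dim: int):
        super().__init__()
        self.proj = nn.Linear(vit_dim, side_dim)   # taps a multi-level spatial feature
        self.temporal = nn.Conv1d(side_dim, side_dim, 3, padding=1, groups=side_dim)
        self.spatial = nn.MultiheadAttention(side_dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(side_dim)

    def forward(self, s, vit_feat, T):
        s = s + self.proj(vit_feat)                # fuse the frozen backbone feature
        BT, N, C = s.shape
        # depthwise temporal conv across the T frames of each clip
        t = s.reshape(BT // T, T, N, C).permute(0, 2, 3, 1).reshape(-1, C, T)
        t = self.temporal(t).reshape(BT // T, N, C, T).permute(0, 3, 1, 2).reshape(BT, N, C)
        s = s + t
        h = self.norm(s)
        return s + self.spatial(h, h, h, need_weights=False)[0]


class Side4VideoSketch(nn.Module):
    def __init__(self, vit_dim=768, side_dim=320, depth=12, num_classes=174):
        super().__init__()
        self.backbone = nn.ModuleList([FrozenViTBlock(vit_dim) for _ in range(depth)])
        for p in self.backbone.parameters():
            p.requires_grad_(False)                # freeze the image model
        self.side = nn.ModuleList([SideBlock(side_dim, vit_dim) for _ in range(depth)])
        self.stem = nn.Linear(vit_dim, side_dim)
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, tokens, T):
        # tokens: (B*T, N, vit_dim) patch tokens of T frames per clip
        with torch.no_grad():                      # no backprop through the heavy backbone
            feats, x = [], tokens
            for blk in self.backbone:
                x = blk(x)
                feats.append(x)                    # multi-level spatial features
        s = self.stem(tokens)
        for blk, f in zip(self.side, feats):
            s = blk(s, f, T)
        return self.head(s.mean(dim=1).reshape(-1, T, s.shape[-1]).mean(dim=1))


if __name__ == "__main__":
    model = Side4VideoSketch()
    clip = torch.randn(2 * 8, 196, 768)            # 2 clips x 8 frames, 14x14 patches
    print(model(clip, T=8).shape)                  # torch.Size([2, 174])
```

Because the backbone forward pass runs under torch.no_grad() and its parameters are frozen, no activations of the large ViT are retained for the backward pass; only the small side network's activations occupy training memory, which is the source of the memory savings the abstract reports.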