D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition (2312.01431v3)
Abstract: Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. The typical fine-tuning-based adaptation paradigm is prone to overfitting in few-shot learning scenarios and offers little modeling flexibility for learning temporal features in video data. In this work we present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), a novel adapter tuning framework well suited to few-shot action recognition owing to its lightweight design and low parameter-learning overhead. It adopts a dual-pathway architecture that encodes spatial and temporal features in a disentangled manner. In particular, we devise an anisotropic Deformable Spatio-Temporal Attention module as the core component of the D$^2$ST-Adapter; its sampling densities can be tailored anisotropically along the spatial and temporal domains so that each pathway learns the features it specializes in, allowing the D$^2$ST-Adapter to encode features with a global view of the 3D spatio-temporal space while remaining lightweight. Extensive experiments with instantiations of our method on both pre-trained ResNet and ViT demonstrate its superiority over state-of-the-art methods for few-shot action recognition. Our method is particularly well suited to challenging scenarios where temporal dynamics are critical for recognizing the action.
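The dual-pathway idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the bottleneck weights, and the use of plain strided subsampling (standing in for the paper's deformable spatio-temporal attention) are all illustrative assumptions. It only shows the anisotropic principle — the spatial pathway keeps full spatial resolution but samples time sparsely, while the temporal pathway does the opposite — with a lightweight adapter bottleneck and a residual connection.

```python
import numpy as np

def bottleneck(x, w_down, w_up):
    """Lightweight adapter bottleneck: down-project, nonlinearity, up-project.

    ReLU stands in for whatever activation the real adapter uses.
    """
    h = np.maximum(x @ w_down, 0.0)
    return h @ w_up

def d2st_adapter_sketch(feat, r=4, spatial_stride=2, temporal_stride=2, seed=0):
    """Toy dual-pathway adapter over a (T, H, W, C) feature map.

    Anisotropic sampling densities: the spatial pathway is dense in space
    and sparse in time; the temporal pathway is dense in time and sparse
    in space. Strided slicing is a crude stand-in for deformable attention.
    """
    T, H, W, C = feat.shape
    rng = np.random.default_rng(seed)
    w_down = rng.standard_normal((C, C // r)) * 0.02  # bottleneck ratio r
    w_up = rng.standard_normal((C // r, C)) * 0.02

    # Spatial pathway: full spatial resolution, subsampled time, pooled over time.
    spatial = bottleneck(feat[::temporal_stride], w_down, w_up).mean(axis=0)        # (H, W, C)
    # Temporal pathway: full temporal resolution, subsampled space, pooled over space.
    temporal = bottleneck(feat[:, ::spatial_stride, ::spatial_stride],
                          w_down, w_up).mean(axis=(1, 2))                           # (T, C)

    # Fuse both disentangled residual signals back onto the frozen-backbone features.
    return feat + spatial[None] + temporal[:, None, None, :]

out = d2st_adapter_sketch(np.zeros((8, 14, 14, 64)))  # 8 frames, 14x14 tokens, 64 channels
```

Note that only the two small projection matrices would be trainable, which is what keeps the parameter-learning overhead low relative to fine-tuning the whole backbone.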
Authors: Wenjie Pei, Qizhong Tan, Guangming Lu, Jiandong Tian