D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition (2312.01431v3)

Published 3 Dec 2023 in cs.CV

Abstract: Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. The typical fine-tuning-based adaptation paradigm is prone to overfitting in few-shot scenarios and offers little modeling flexibility for learning temporal features in video data. In this work we present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), a novel adapter-tuning framework well suited to few-shot action recognition thanks to its lightweight design and low parameter-learning overhead. It adopts a dual-pathway architecture that encodes spatial and temporal features in a disentangled manner. In particular, we devise the anisotropic Deformable Spatio-Temporal Attention module as the core component of the D$^2$ST-Adapter; its sampling densities can be tailored anisotropically along the spatial and temporal domains so that each pathway learns specifically spatial or temporal features, allowing the D$^2$ST-Adapter to encode features with a global view of the 3D spatio-temporal space while remaining lightweight. Extensive experiments with instantiations of our method on both pre-trained ResNet and ViT demonstrate its superiority over state-of-the-art methods for few-shot action recognition. Our method is particularly well suited to challenging scenarios where temporal dynamics are critical for recognizing the action.
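To make the dual-pathway idea concrete, below is a minimal PyTorch sketch of a disentangled spatio-temporal adapter. It is a hypothetical illustration, not the paper's implementation: the class name `DualPathwayAdapter`, the bottleneck width, and the use of simple anisotropic average pooling (fine-in-space/coarse-in-time for the spatial pathway, and the reverse for the temporal pathway) are all assumptions, and the paper's actual aDSTA module uses deformable attention rather than pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualPathwayAdapter(nn.Module):
    """Hypothetical sketch of a disentangled spatio-temporal adapter.

    Not the paper's aDSTA: deformable attention is replaced by simple
    anisotropic pooling to keep the example self-contained.
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # shared down-projection
        self.up = nn.Linear(bottleneck, dim)    # shared up-projection
        self.act = nn.GELU()
        # Anisotropic "sampling densities": the spatial pathway keeps the
        # full spatial grid but collapses time; the temporal pathway keeps
        # every frame but coarsens the spatial grid (4x4 is an assumption).
        self.spatial_pool = nn.AdaptiveAvgPool3d((1, None, None))
        self.temporal_pool = nn.AdaptiveAvgPool3d((None, 4, 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) features from a frozen pre-trained backbone.
        b, c, t, h, w = x.shape
        z = self.act(self.down(x.permute(0, 2, 3, 4, 1)))   # (B,T,H,W,bn)
        z = z.permute(0, 4, 1, 2, 3)                        # (B,bn,T,H,W)
        s = self.spatial_pool(z).expand(-1, -1, t, -1, -1)  # spatial path
        p = F.interpolate(self.temporal_pool(z), size=(t, h, w))  # temporal path
        z = s + p                                           # fuse pathways
        out = self.up(z.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        return x + out                                      # residual adapter


# Usage: adapt frozen backbone features of shape (B, C, T, H, W).
feats = torch.randn(2, 256, 8, 14, 14)
print(DualPathwayAdapter(dim=256)(feats).shape)  # torch.Size([2, 256, 8, 14, 14])
```

As in standard adapter tuning, only the adapter parameters would be trained while the backbone stays frozen, which is what keeps the parameter-learning overhead low in the few-shot regime.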

Authors (4)
  1. Wenjie Pei (56 papers)
  2. Qizhong Tan (1 paper)
  3. Guangming Lu (49 papers)
  4. Jiandong Tian (15 papers)
Citations (2)