Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling (2403.06978v1)

Published 11 Mar 2024 in cs.CV

Abstract: In this paper, we introduce Attention Prompt Tuning (APT), a computationally efficient variant of prompt tuning for video-based applications such as action recognition. Prompt tuning approaches inject a set of learnable prompts alongside the data tokens during fine-tuning while keeping the backbone frozen, which greatly reduces the number of learnable parameters compared to full fine-tuning. For image-based downstream tasks, a few learnable prompts are normally enough to achieve results close to those of full fine-tuning. However, videos, which contain more complex spatiotemporal information, require hundreds of tunable prompts to achieve reasonably good results. This erodes the parameter efficiency observed for images and significantly increases latency and the number of floating-point operations (FLOPs) during inference. To tackle these issues, we inject the prompts directly into the keys and values of the non-local attention mechanism within the transformer block. Additionally, we introduce a novel prompt reparameterization technique that makes APT more robust to hyperparameter selection. The proposed APT approach greatly reduces FLOPs and latency while achieving a significant performance boost over existing parameter-efficient tuning methods on the UCF101, HMDB51, and SSv2 action recognition datasets. Code and pre-trained models are available at https://github.com/wgcban/apt
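
The key idea, prepending learnable prompt tokens to the attention keys and values instead of to the input token sequence, can be sketched in a few lines. Below is a minimal PyTorch sketch written for this summary, not the authors' released code; the module name, the small MLP used for prompt reparameterization, and the choice of a single prompt tensor shared between keys and values are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class APTAttention(nn.Module):
    """Self-attention with learnable prompts injected into keys/values.

    A minimal sketch of the idea in the abstract: prompts are prepended
    to K and V rather than to the input sequence, so the query length
    (and hence the output length) is unchanged. With the backbone frozen,
    the prompts and the reparameterization MLP are the only trainable
    parts of this module.
    """

    def __init__(self, dim, num_heads=8, num_prompts=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Learnable prompts, shared across the batch. A single tensor is
        # reused for both K and V here for brevity; the paper may learn
        # separate key/value prompts.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

        # Prompt reparameterization: prompts pass through a small MLP,
        # which the paper reports makes tuning more robust to
        # hyperparameter choices (the exact MLP shape is an assumption).
        self.reparam = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, C).unbind(dim=2)  # each (B, N, C)

        # Reparameterize prompts and prepend them to keys and values only.
        p = self.reparam(self.prompts)        # (P, C)
        p = p.unsqueeze(0).expand(B, -1, -1)  # (B, P, C)
        k = torch.cat([p, k], dim=1)          # (B, P+N, C)
        v = torch.cat([p, v], dim=1)

        # Standard multi-head attention; queries attend over prompts + tokens.
        def split(t):
            return t.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the queries, and hence the output sequence, stay at the original length N, the extra attention cost grows only with the number of prompts along the key/value axis. This is the source of the claimed FLOP and latency savings over prepending hundreds of prompts to the token sequence itself.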
