Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition (2311.15619v3)

Published 27 Nov 2023 in cs.CV and cs.AI

Abstract: Large-scale visual-language pre-trained (VLP) models have achieved significant success in various video tasks. However, most existing methods follow an "adapt then align" paradigm, which adapts pre-trained image encoders to model video-level representations and uses one-hot or text embeddings of the action labels for supervision. This paradigm overlooks the challenge of mapping from static images to complex activity concepts. In this paper, we propose a novel "Align before Adapt" (ALT) paradigm. Before adapting to video representation learning, we exploit entity-to-region alignments for each frame. The alignments are obtained by matching region-aware image embeddings against an offline-constructed text corpus. With the aligned entities, we feed their text embeddings into a transformer-based video adapter as queries, which helps distill the semantics of the most important entities of a video into a single vector. This paradigm reuses the visual-language alignment of VLP during adaptation and explains an action through its underlying entities, which bridges the gap to complex activity semantics, particularly when facing unfamiliar or unseen categories. ALT demonstrates competitive performance while maintaining remarkably low computational costs. In fully supervised experiments, it achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs. Moreover, ALT outperforms previous state-of-the-art methods in both zero-shot and few-shot experiments, emphasizing its superior generalizability across various learning scenarios.
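To make the paradigm concrete, below is a minimal PyTorch sketch of the two stages the abstract describes: matching region-aware image embeddings against an offline text corpus to select entities, then feeding those entities' text embeddings as queries to a transformer-based video adapter. The module names, dimensions, hard top-k entity selection, and mean pooling are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of "Align before Adapt" (ALT) as described in the abstract.
# Assumes region-aware frame embeddings and label text embeddings are produced
# elsewhere (e.g., by a frozen CLIP-style image/text encoder).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignBeforeAdaptSketch(nn.Module):
    def __init__(self, embed_dim=512, corpus_size=1000, num_entities=8, num_layers=2):
        super().__init__()
        # Offline-constructed text corpus of entity embeddings (frozen; random stand-in here).
        self.register_buffer("corpus", F.normalize(torch.randn(corpus_size, embed_dim), dim=-1))
        self.num_entities = num_entities
        # Transformer-based video adapter: aligned entity embeddings act as queries,
        # region-aware frame embeddings act as keys/values.
        decoder_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.adapter = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

    def align(self, region_embeds):
        # region_embeds: (B, T, R, D) region-aware embeddings for T frames with R regions each.
        B, T, R, D = region_embeds.shape
        regions = F.normalize(region_embeds, dim=-1).reshape(B, T * R, D)
        # Entity-to-region alignment: score every region against the text corpus.
        sim = regions @ self.corpus.t()                 # (B, T*R, corpus_size)
        best_score = sim.max(dim=1).values              # best-matching region per corpus entity
        topk = best_score.topk(self.num_entities, dim=-1).indices  # (B, num_entities)
        return self.corpus[topk]                        # aligned entity text embeddings

    def forward(self, region_embeds, label_text_embeds):
        B, T, R, D = region_embeds.shape
        entity_queries = self.align(region_embeds)          # (B, num_entities, D)
        frame_tokens = region_embeds.reshape(B, T * R, D)   # keys/values for the adapter
        video_tokens = self.adapter(entity_queries, frame_tokens)
        video_vec = F.normalize(video_tokens.mean(dim=1), dim=-1)   # (B, D) video vector
        # Score against text embeddings of the action labels (zero-shot friendly).
        return video_vec @ F.normalize(label_text_embeds, dim=-1).t()


# Usage with random stand-in features (2 clips, 8 frames, 50 regions, 400 classes):
# model = AlignBeforeAdaptSketch()
# logits = model(torch.randn(2, 8, 50, 512), torch.randn(400, 512))
```

Because classification is a similarity score against label text embeddings rather than a fixed classifier head, the same sketch covers the fully supervised, few-shot, and zero-shot settings the abstract reports.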

Authors (6)
  1. Yifei Chen (58 papers)
  2. Dapeng Chen (33 papers)
  3. Ruijin Liu (13 papers)
  4. Sai Zhou (4 papers)
  5. Wenyuan Xue (4 papers)
  6. Wei Peng (164 papers)