Test-Time Zero-Shot Temporal Action Localization (2404.05426v2)

Published 8 Apr 2024 in cs.CV

Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed by adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed to refine the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
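For intuition, the three-step pipeline can be sketched in a few lines of Python. This is a minimal sketch over toy NumPy arrays standing in for CLIP-style frame, prompt, and caption embeddings; the names (frame_feats, class_text_feats, caption_feats), the mean-pooling aggregation, and the fixed similarity threshold are illustrative assumptions, and the threshold-based grouping in step 2 is a stand-in for the paper's self-supervised adaptation procedure, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def t3al_sketch(frame_feats, class_text_feats, caption_feats, thresh=0.5):
    """frame_feats: (T, D) VLM embeddings of the video frames.
    class_text_feats: (C, D) embeddings of the action-class prompts.
    caption_feats: (T, D) embeddings of per-frame captions."""
    frame_feats = l2_normalize(frame_feats)
    class_text_feats = l2_normalize(class_text_feats)
    caption_feats = l2_normalize(caption_feats)

    # Step 1: video-level pseudo-label, obtained here by mean-pooling
    # all frame embeddings and matching against the class prompts.
    video_feat = l2_normalize(frame_feats.mean(axis=0))
    pseudo_label = int(np.argmax(class_text_feats @ video_feat))

    # Step 2: score each frame against the pseudo-label prompt and
    # group above-threshold frames into contiguous region proposals
    # (a simplification of the self-supervised adaptation step).
    scores = frame_feats @ class_text_feats[pseudo_label]
    proposals, start = [], None
    for t, active in enumerate(scores > thresh):
        if active and start is None:
            start = t
        elif not active and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(scores) - 1))

    # Step 3: refine proposals, keeping those whose frame captions
    # agree on average with the pseudo-label prompt.
    refined = [
        (s, e) for s, e in proposals
        if (caption_feats[s:e + 1] @ class_text_feats[pseudo_label]).mean() > thresh
    ]
    return pseudo_label, refined
```

Note that nothing here is trained: all adaptation happens at test time on the single input video, which is the point of departure from training-based ZS-TAL methods.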

Authors (5)
  1. Benedetta Liberatori (4 papers)
  2. Alessandro Conti (11 papers)
  3. Paolo Rota (29 papers)
  4. Yiming Wang (141 papers)
  5. Elisa Ricci (137 papers)