Test-Time Zero-Shot Temporal Action Localization (2404.05426v2)

Published 8 Apr 2024 in cs.CV

Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this end, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed by adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed to refine the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
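
The abstract sketches a three-step, training-free pipeline. Below is a minimal sketch of that flow in Python, assuming CLIP-style frame and class-prompt embeddings are already available as arrays; the function names, the mean-based threshold, and the caption-agreement rule are illustrative placeholders rather than the authors' exact procedure, and the paper's self-supervised test-time adaptation of the scores is omitted for brevity.

import numpy as np

def video_pseudo_label(frame_feats, class_text_feats):
    """Step 1: aggregate frame-level scores over the whole video to pick
    a single video-level action category (the pseudo-label)."""
    frame_feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    class_text_feats = class_text_feats / np.linalg.norm(class_text_feats, axis=1, keepdims=True)
    sims = frame_feats @ class_text_feats.T        # (T, C) cosine similarities
    video_scores = sims.mean(axis=0)               # aggregate over all frames
    label = int(video_scores.argmax())
    return label, sims[:, label]                   # per-frame scores for that class

def localize(scores, threshold=None):
    """Step 2: group above-threshold frames into contiguous action proposals.
    (The paper adapts the scores with a self-supervised objective at test
    time; only the simple grouping step is shown here.)"""
    if threshold is None:
        threshold = scores.mean()                  # illustrative adaptive threshold
    active = scores > threshold
    proposals, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            proposals.append((start, t))
            start = None
    if start is not None:
        proposals.append((start, len(active)))
    return proposals

def refine(proposals, caption_match, min_ratio=0.5):
    """Step 3: keep proposals whose frames are backed by caption evidence.
    `caption_match` is a per-frame boolean, e.g. whether a captioner's
    description of the frame agrees with the pseudo-label (an assumption)."""
    return [(s, e) for s, e in proposals
            if caption_match[s:e].mean() >= min_ratio]

# Toy usage with random features (T=8 frames, C=3 classes, D=4 dims).
rng = np.random.default_rng(0)
frames, prompts = rng.normal(size=(8, 4)), rng.normal(size=(3, 4))
label, scores = video_pseudo_label(frames, prompts)
proposals = refine(localize(scores), caption_match=scores > scores.mean())
print(label, proposals)

Averaging the frame-to-text similarities over the whole clip before committing to a single class is what makes the later localization a one-vs-background problem, which is why the abstract computes the video-level pseudo-label first.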

Authors (5)
  1. Benedetta Liberatori (4 papers)
  2. Alessandro Conti (11 papers)
  3. Paolo Rota (29 papers)
  4. Yiming Wang (141 papers)
  5. Elisa Ricci (137 papers)
Citations (2)
