Test-Time Zero-Shot Temporal Action Localization (2404.05426v2)
Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos that were unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias in the learned model, which may adversely affect its ability to generalize to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this end, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM) at inference time. T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed by adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are used to refine the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
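The abstract describes the three-step pipeline only at a high level, so the snippet below is a minimal, illustrative sketch of that flow, not the authors' implementation: it stands in for the self-supervised adaptation and caption-based refinement with simple cosine-similarity heuristics over precomputed embeddings. All function names, thresholds, and the assumption of NumPy arrays of frame features, caption embeddings, and class-name text embeddings are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def video_pseudo_label(frame_feats, class_text_feats):
    """Step 1: aggregate frame features over the whole video and pick the
    action class whose text embedding best matches the video-level feature."""
    video_feat = l2_normalize(frame_feats.mean(axis=0))
    sims = l2_normalize(class_text_feats) @ video_feat
    return int(np.argmax(sims)), sims


def localize(frame_feats, class_text_feat, threshold=0.5, min_len=3):
    """Step 2 (simplified stand-in for the self-supervised procedure):
    score every frame against the pseudo-label text embedding and group
    contiguous above-threshold frames into region proposals."""
    scores = l2_normalize(frame_feats) @ l2_normalize(class_text_feat)
    # Min-max normalize per video so a single threshold is meaningful.
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            if t - start >= min_len:
                proposals.append((start, t))
            start = None
    if start is not None and len(scores) - start >= min_len:
        proposals.append((start, len(scores)))
    return proposals, scores


def refine_with_captions(proposals, caption_feats, class_text_feat, keep_ratio=0.8):
    """Step 3 (simplified stand-in): keep proposals whose frame-caption
    embeddings agree, on average, with the pseudo-label text embedding."""
    cls = l2_normalize(class_text_feat)
    agreement = [l2_normalize(caption_feats[s:e]).dot(cls).mean() for s, e in proposals]
    if not agreement:
        return proposals
    cutoff = keep_ratio * max(agreement)
    return [p for p, a in zip(proposals, agreement) if a >= cutoff]


if __name__ == "__main__":
    # Toy usage with random arrays standing in for VLM and captioner features.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(120, 512))    # per-frame visual features
    captions = rng.normal(size=(120, 512))  # per-frame caption embeddings
    classes = rng.normal(size=(20, 512))    # text embeddings of action names

    label, _ = video_pseudo_label(frames, classes)
    props, _ = localize(frames, classes[label])
    props = refine_with_captions(props, captions, classes[label])
    print("pseudo-label:", label, "proposals:", props)
```

The sketch only conveys the control flow (video-level pseudo-labeling, proposal generation, caption-based refinement); the actual T3AL method adapts the VLM at test time rather than applying fixed similarity thresholds.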
- Benedetta Liberatori
- Alessandro Conti
- Paolo Rota
- Yiming Wang
- Elisa Ricci