One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features (2404.19542v1)

Published 30 Apr 2024 in cs.CV

Abstract: Open-vocabulary Temporal Action Detection (Open-vocab TAD) is an advanced video analysis approach that expands Closed-vocabulary Temporal Action Detection (Closed-vocab TAD) capabilities. Closed-vocab TAD is typically confined to localizing and classifying actions based on a predefined set of categories. In contrast, Open-vocab TAD goes further and is not limited to these predefined categories. This is particularly useful in real-world scenarios where the variety of actions in videos can be vast and not always predictable. The prevalent methods in Open-vocab TAD typically employ a 2-stage approach, which involves generating action proposals and then identifying those actions. However, errors made during the first stage can adversely affect the subsequent action identification accuracy. Additionally, existing studies face challenges in handling actions of different durations owing to the use of fixed temporal processing methods. Therefore, we propose a 1-stage approach consisting of two primary modules: Multi-scale Video Analysis (MVA) and Video-Text Alignment (VTA). The MVA module captures actions at varying temporal resolutions, overcoming the challenge of detecting actions with diverse durations. The VTA module leverages the synergy between visual and textual modalities to precisely align video segments with corresponding action labels, a critical step for accurate action identification in Open-vocab scenarios. Evaluations on the widely recognized THUMOS14 and ActivityNet-1.3 datasets showed that the proposed method achieved superior results compared to other methods in both Open-vocab and Closed-vocab settings. This serves as a strong demonstration of the effectiveness of the proposed method in the TAD task.
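The abstract describes the two modules only at a high level. As a rough illustrative sketch (not the authors' implementation), the PyTorch snippet below shows the general pattern behind each idea: a temporal feature pyramid built with strided 1D convolutions, so that short and long actions are represented at different resolutions (the MVA idea), and CLIP-style cosine-similarity matching between segment features and text embeddings of action labels (the VTA idea). The class and function names, the strided-convolution design, and the temperature value are assumptions introduced here for illustration; they are not taken from the paper.

```python
# Illustrative sketch only; names, shapes, and hyperparameters are hypothetical,
# not the authors' actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScalePyramid(nn.Module):
    """Build a temporal feature pyramid by repeatedly downsampling along time."""

    def __init__(self, dim: int, num_levels: int = 4):
        super().__init__()
        # Strided 1D convolutions halve the temporal resolution at each level,
        # so actions of different durations appear at a matching scale.
        self.downsample = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
             for _ in range(num_levels - 1)]
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, dim, T) clip-level video features
        pyramid = [x]
        for conv in self.downsample:
            x = F.relu(conv(x))
            pyramid.append(x)
        return pyramid  # list of (batch, dim, T / 2^level) tensors


def align_segments_to_labels(segment_feats: torch.Tensor,
                             label_embeds: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style alignment: cosine similarity between each temporal segment
    feature and the text embedding of every candidate action label."""
    v = F.normalize(segment_feats, dim=-1)  # (num_segments, dim)
    t = F.normalize(label_embeds, dim=-1)   # (num_labels, dim) from a text encoder
    return v @ t.T / temperature            # (num_segments, num_labels) logits


# Minimal usage with dummy tensors.
feats = torch.randn(2, 512, 256)                 # (batch, dim, T) video features
pyramid = MultiScalePyramid(512)(feats)          # T = 256, 128, 64, 32
logits = align_segments_to_labels(torch.randn(100, 512), torch.randn(20, 512))
```

Because classification reduces to nearest-text-embedding matching rather than a fixed classifier head, the label set at inference can include categories never seen during training, which is what makes the open-vocabulary setting possible.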
