Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding (2401.00901v2)

Published 31 Dec 2023 in cs.CV

Abstract: Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our codes will be publicly released.

Authors (5)
  1. Syed Talal Wasim
  2. Muzammal Naseer
  3. Salman Khan
  4. Ming-Hsuan Yang
  5. Fahad Shahbaz Khan

Summary

Introduction

Spatio-temporal video grounding plays a critical role in interpreting visual content and linking it to descriptive natural language. Traditional models in this domain operate in a closed-set setting, relying on training datasets with a predefined, limited vocabulary. As a result, they often falter when exposed to visual and conceptual variations beyond the scope of their training data, a situation frequently encountered in real-world applications.

Open-Vocabulary Spatio-Temporal Video Grounding

To tackle the limitations posed by the closed vocabulary of existing spatio-temporal video grounding methods, the paper introduces an open-vocabulary video grounding paradigm: models are trained on a set of base categories and are expected to generalize to unseen objects and actions. By incorporating pre-trained representations from spatial grounding models trained on extensive image-text datasets, the approach generalizes to scenarios where traditional closed-set models typically underperform.
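
A rough illustration of this transfer, as a minimal sketch rather than the released code: load a pre-trained image-grounding checkpoint into the video model's spatial modules and leave the newly added temporal modules at their random initialization. The checkpoint path, the assumption that spatial parameter names match between the image and video models, and the strict=False loading strategy are all illustrative assumptions.

```python
# Sketch only: reuse spatial-grounding weights, keep temporal modules random.
# The checkpoint path and matching-key assumption are illustrative, not from the paper.
import torch

def load_spatial_pretraining(video_model, checkpoint_path="spatial_grounding.pth"):
    state = torch.load(checkpoint_path, map_location="cpu")
    own = video_model.state_dict()
    # Keep only weights whose names and shapes match the video model's modules;
    # temporal aggregation layers are absent from the image checkpoint and stay random.
    compatible = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    video_model.load_state_dict(compatible, strict=False)
    # Return the parameters that still need to be trained on video data.
    return sorted(set(own) - set(compatible))
```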

Model Architecture

The proposed model adopts a DETR-like architecture augmented with temporal aggregation modules. The spatial modules are initialized from a pre-trained foundational image grounding model, retaining the nuanced representations that underpin the model's generalization. The architecture comprises vision and text encoders, a cross-modality spatio-temporal encoder that fuses spatial, temporal, and modal information, and a language-guided query selection mechanism that initializes the cross-modal queries. A decoder then processes these queries to predict per-frame bounding boxes and the corresponding temporal tube, leveraging the features extracted by the vision and text encoders.
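
The sketch below is a minimal, hypothetical PyTorch rendering of that data flow, not the authors' implementation: the stand-in encoders, feature dimensions (2048 for vision, 768 for text), layer counts, query budget, and the mean-pooled text similarity used for query selection are placeholders chosen only to show how frame tokens, text tokens, language-guided queries, and the box/temporal heads could fit together.

```python
# Illustrative sketch of the described pipeline (not the paper's code).
import torch
import torch.nn as nn

class VideoGroundingSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, nhead=8):
        super().__init__()
        # Stand-ins for pre-trained vision/text encoders (assumed feature sizes).
        self.vision_proj = nn.Linear(2048, d_model)
        self.text_proj = nn.Linear(768, d_model)
        # Cross-modality spatio-temporal encoder: fuses frame tokens with text tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.cross_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Decoder refines language-selected queries into box/temporal predictions.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.num_queries = num_queries
        self.box_head = nn.Linear(d_model, 4)       # per-frame box (cx, cy, w, h)
        self.temporal_head = nn.Linear(d_model, 2)  # per-frame start/end logits

    def forward(self, frame_feats, text_feats):
        # frame_feats: (T, N, 2048) per-frame visual tokens; text_feats: (L, 768).
        T, N, _ = frame_feats.shape
        v = self.vision_proj(frame_feats)                               # (T, N, d)
        t = self.text_proj(text_feats).unsqueeze(0).expand(T, -1, -1)   # (T, L, d)
        fused = self.cross_encoder(torch.cat([v, t], dim=1))            # (T, N+L, d)
        vis_tokens = fused[:, :N]
        # Language-guided query selection: pick the visual tokens most similar
        # to the (mean-pooled) text and use them to initialize the queries.
        sim = vis_tokens @ fused[:, N:].mean(dim=1, keepdim=True).transpose(1, 2)
        top = sim.squeeze(-1).topk(min(self.num_queries, N), dim=1).indices
        queries = torch.gather(
            vis_tokens, 1, top.unsqueeze(-1).expand(-1, -1, vis_tokens.size(-1)))
        hs = self.decoder(queries, fused)                               # (T, Q, d)
        boxes = self.box_head(hs).sigmoid()                             # per-frame boxes
        temporal = self.temporal_head(hs.mean(dim=1))                   # (T, 2) logits
        return boxes, temporal

# Example with random features for an 8-frame clip and a 12-token query.
model = VideoGroundingSketch()
boxes, temporal = model(torch.randn(8, 50, 2048), torch.randn(12, 768))
```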

Advancements and Contributions

The proposed video grounding model delivers strong results in both closed-set and open-vocabulary settings, consistently surpassing state-of-the-art methods across multiple benchmarks. In the closed-set setting it outperforms prior methods on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2). In the open-vocabulary setting it improves over the best recent models by 4.88 m_vIoU on HC-STVG V1 and 1.83% accuracy on YouCook-Interactions. These results underscore the efficacy of the approach in handling diverse linguistic and visual concepts, leading to improved video understanding.
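
For context on the reported metric, the snippet below sketches a common formulation of vIoU for spatio-temporal grounding: per-frame box IoU summed over the temporal intersection of predicted and ground-truth tubes, normalized by their temporal union; m_vIoU averages this value over the test set. The tube representation and frame-indexing convention here are assumptions for illustration, not taken from the paper's evaluation code.

```python
# Minimal sketch of vIoU; tubes are dicts mapping frame index -> (x1, y1, x2, y2).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def viou(pred_tube, gt_tube):
    # T_i: frames in both tubes; T_u: frames in either tube.
    t_i = set(pred_tube) & set(gt_tube)
    t_u = set(pred_tube) | set(gt_tube)
    if not t_u:
        return 0.0
    return sum(box_iou(pred_tube[f], gt_tube[f]) for f in t_i) / len(t_u)

# Example: prediction covers frames 2-5, ground truth covers frames 1-4.
pred = {f: (10, 10, 50, 50) for f in range(2, 6)}
gt = {f: (12, 12, 52, 52) for f in range(1, 5)}
print(viou(pred, gt))  # per-frame box IoU weighted by the temporal overlap fraction
```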

Conclusion

The paper's contributions to video grounding are twofold: a pioneering evaluation of spatio-temporal video grounding models in an open-vocabulary setting, and a novel model that merges the strengths of spatial grounding with video-specific adaptability. These choices allow the model not only to exceed current closed-set results but also to generalize to open-vocabulary scenarios, a promising step forward in the evolving landscape of video understanding.
