Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding (2401.00901v2)
Abstract: Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation of current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches, which struggle in open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in both closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model that surpasses state-of-the-art results in closed-set evaluations on multiple datasets and demonstrates superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on the VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our code will be publicly released.
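For context on the m_vIoU figure quoted above: in the spatio-temporal video grounding literature (e.g., the VidSTG and HC-STVG benchmarks), vIoU for one video is the spatial IoU between predicted and ground-truth boxes, summed over the frames where the predicted and ground-truth temporal segments intersect and normalized by the size of their temporal union; m_vIoU is the mean of vIoU over the test set. Below is a minimal sketch of that metric under the assumption that boxes arrive as per-frame `(x1, y1, x2, y2)` dicts; the function and variable names (`box_iou`, `viou`, `all_preds`, `all_gts`) are illustrative, not from the paper's released code.

```python
def box_iou(box_a, box_b):
    """Spatial IoU between two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def viou(pred_boxes, gt_boxes):
    """vIoU for one video.

    pred_boxes, gt_boxes: dicts mapping frame index -> (x1, y1, x2, y2);
    the keys implicitly define the predicted / ground-truth temporal segments.
    """
    t_union = set(pred_boxes) | set(gt_boxes)  # S_u: temporal union
    t_inter = set(pred_boxes) & set(gt_boxes)  # S_i: temporal intersection
    if not t_union:
        return 0.0
    # Sum spatial IoU over the temporally overlapping frames,
    # normalized by the length of the temporal union.
    return sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in t_inter) / len(t_union)

# m_vIoU averages vIoU over all test videos (all_preds/all_gts are hypothetical):
# m_viou = sum(viou(p, g) for p, g in zip(all_preds, all_gts)) / len(all_preds)
```

Note that vIoU penalizes both spatial errors (loose boxes) and temporal errors (predicting frames outside the ground-truth segment inflates the temporal union), which is why it is the headline metric for this task.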
Authors: Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan