SnAG: Scalable and Accurate Video Grounding (2404.02257v2)
Abstract: Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability: they are optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
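The scalability argument in the abstract is easy to see in code. Below is a minimal PyTorch sketch (not SnAG's actual implementation; all names, shapes, and the conditioning scheme are hypothetical) of why late fusion scales to many queries: early fusion conditions the video encoder on each query and thus re-encodes the video once per query, while late fusion encodes the video once and fuses each query with the shared representation via a cheap similarity.

```python
# Minimal sketch contrasting early vs. late cross-modal fusion on
# pre-extracted features. Not SnAG's actual code; names are hypothetical.
import torch
import torch.nn as nn

T, Q, D = 512, 64, 256  # video clips, text queries, feature dim

video = torch.randn(1, T, D)   # one long video as a sequence of clip features
queries = torch.randn(Q, D)    # sentence-level embeddings of Q text queries

video_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)

def early_fusion(video, queries):
    # The query is injected before the video encoder, so the video must be
    # re-encoded once per query: O(Q) encoder passes over T clips.
    outs = []
    for q in queries:
        fused = video + q.view(1, 1, D)   # naive query conditioning
        outs.append(video_encoder(fused))
    return torch.stack(outs)              # (Q, 1, T, D)

def late_fusion(video, queries):
    # The video is encoded once and shared by all queries; fusion is a
    # cheap per-clip similarity, so encoder cost is independent of Q.
    enc = video_encoder(video)                          # single pass
    scores = torch.einsum("btd,qd->qbt", enc, queries)  # per-clip relevance
    return scores                                       # (Q, 1, T)

print(early_fusion(video, queries).shape)  # torch.Size([64, 1, 512, 256])
print(late_fusion(video, queries).shape)   # torch.Size([64, 1, 512])
```

Under late fusion, the video-centric sampling described in the abstract becomes natural: a training batch can pair one video with many of its queries, so the single encoder pass over the video is amortized across all of them.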
- Localizing moments in video with natural language. In ICCV, 2017.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Localizing moments in long video via multimodal guidance. In ICCV, 2023.
- Soft-NMS: Improving object detection with one line of code. In ICCV, 2017.
- Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- Temporally grounding natural sentence in video. In EMNLP, 2018.
- Rethinking the bottom-up framework for query-based video localization. In AAAI, 2020.
- Semantic proposal for activity localization in videos via sentence query. In AAAI, 2019.
- Rethinking attention with performers. In ICLR, 2021.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In CVPR, 2023.
- SlowFast networks for video recognition. In ICCV, 2019.
- TALL: Temporal activity localization via language query. In ICCV, 2017.
- Relation-aware video reading comprehension for temporal language grounding. In EMNLP, 2021.
- ExCL: Extractive clip localization using natural language descriptions. In ACL, 2019.
- Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- CONE: An efficient coarse-to-fine alignment framework for long video temporal grounding. In ACL, 2023.
- Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- Dense-captioning events in videos. In ICCV, 2017.
- Detecting moments and highlights in videos via natural language queries. In NeurIPS, 2021.
- G2L: Semantically aligned and uniform video grounding via geodesic and game theory. In ICCV, 2023.
- Compositional temporal grounding with structured variational cross-graph correspondence learning. In CVPR, 2022.
- Proposal-free video grounding with contextual pyramid network. In AAAI, 2021.
- UniVTG: Towards unified video-language temporal grounding. In ICCV, 2023.
- Focal loss for dense object detection. In ICCV, 2017.
- Memory-guided semantic learning network for temporal sentence grounding. In AAAI, 2022.
- Adaptive proposal generation network for temporal sentence localization in videos. In EMNLP, 2021.
- Context-aware biaffine localizing network for temporal sentence grounding. In CVPR, 2021.
- Progressively guide to attend: An iterative alignment framework for temporal sentence grounding. In EMNLP, 2021.
- DEBUG: A dense bottom-up grounding approach for natural language video localization. In EMNLP, 2019.
- Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.
- Local-global video-text interactions for temporal grounding. In CVPR, 2020.
- Interventional video grounding with dual contrastive learning. In CVPR, 2021.
- Uncovering hidden challenges in query-based video moment retrieval. In BMVC, 2020.
- Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In ICCV, 2023.
- GloVe: Global vectors for word representation. In EMNLP, 2014.
- Egocentric video-language pretraining. In NeurIPS, 2022.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Grounding action descriptions in videos. TACL, 2013.
- Proposal-free temporal moment localization of a natural-language query in video using guided attention. In WACV, 2020.
- Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, 2016.
- Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
- A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, 2016.
- MAD: A scalable dataset for language grounding in videos from movie audio descriptions. In CVPR, 2022.
- VLG-net: Video-language graph matching network for video grounding. In ICCVW, 2021.
- FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
- Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
- Attention is all you need. In NeurIPS, 2017.
- Structured multi-level interaction network for video moment localization via language query. In CVPR, 2021.
- Temporally grounding language queries in videos by contextual boundary-aware prediction. In AAAI, 2020.
- Negative sample matters: A renaissance of metric learning for temporal grounding. In AAAI, 2022.
- Explore and match: End-to-end video grounding with transformer. arXiv preprint arXiv:2201.10168, 2022.
- Natural language video localization with learnable moment proposals. In EMNLP, 2021.
- Boundary proposal network for two-stage natural language video localization. In AAAI, 2021.
- Multilevel language and vision integration for text-to-clip retrieval. In AAAI, 2019.
- A closer look at temporal sentence grounding in videos: Dataset and metric. In ACM MM-HUMA Workshop, 2021.
- Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In NeurIPS, 2019.
- To find where you talk: Temporal sentence localization in video with attention based location regression. In AAAI, 2019.
- Dense regression network for video grounding. In CVPR, 2020.
- ActionFormer: Localizing moments of actions with transformers. In ECCV, 2022.
- MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In CVPR, 2019.
- Parallel attention network with sequence matching for video grounding. In Findings of ACL, 2021.
- Span-based localizing network for natural language video localization. In ACL, 2020.
- Natural language video moment localization through query-controlled temporal convolution. In WACV, 2022.
- Multi-stage aggregated transformer network for temporal language localization in videos. In CVPR, 2021.
- Learning 2D temporal adjacent networks for moment localization with natural language. In AAAI, 2020.
- Cascaded prediction network via segment tree for temporal video grounding. In CVPR, 2021.
- Distance-IoU loss: Faster and better learning for bounding box regression. In AAAI, 2020.
- Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. In CVPR, 2021.
- Rethinking the video sampling and reasoning strategies for temporal sentence grounding. In EMNLP Findings, 2022.
Authors: Fangzhou Mu, Sicheng Mo, Yin Li