AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding (2406.07091v1)
Abstract: Temporal Video Grounding (TVG) aims to localize a moment in an untrimmed video given a language description. Since TVG annotation is labor-intensive, TVG under limited supervision has attracted attention in recent years. The great success of vision-language pre-training has led TVG to follow the traditional "pre-training + fine-tuning" paradigm; however, the pre-training process suffers from a lack of temporal modeling and fine-grained alignment due to the difference in data nature between pre-training and testing. Moreover, the large gap between the pretext and downstream tasks makes zero-shot testing impossible for the pre-trained model. To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG that enables the model to learn semantic alignment and boundary regression from automatically annotated untrimmed videos. Specifically, AutoTVG consists of a novel Captioned Moment Generation (CMG) module that generates captioned moments from untrimmed videos, and TVGNet with a regression head to predict localization results. Experimental results on Charades-STA and ActivityNet Captions show that, for zero-shot temporal video grounding, AutoTVG achieves highly competitive performance with in-distribution methods under out-of-distribution testing, and outperforms existing pre-training frameworks with much less training data.
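The abstract describes AutoTVG as a two-stage pipeline: a Captioned Moment Generation (CMG) module that turns untrimmed videos into automatically annotated (moment, caption) pairs, and a grounding network, TVGNet, that regresses moment boundaries from the fused video and text features. The snippet below is a minimal sketch of what such a grounding network with a regression head could look like; the cross-attention fusion design, feature dimensions, and all names other than TVGNet are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal, hedged sketch of a TVGNet-style grounding network with a regression head.
# Fusion design, dimensions, and helper names are illustrative assumptions.

import torch
import torch.nn as nn


class TVGNet(nn.Module):
    """Fuses frame and token features, then regresses normalized (start, end) boundaries."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Assumed fusion choice: language tokens cross-attend to video frames.
        self.fusion = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
        self.regression_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2), nn.Sigmoid()
        )

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T, dim) frame features; text_feats: (B, L, dim) token features.
        fused, _ = self.fusion(query=text_feats, key=video_feats, value=video_feats)
        pooled = fused.mean(dim=1)              # (B, dim)
        return self.regression_head(pooled)     # (B, 2): normalized (start, end) in [0, 1]


if __name__ == "__main__":
    model = TVGNet()
    video = torch.randn(2, 64, 512)   # 2 videos, 64 frame features each
    query = torch.randn(2, 16, 512)   # 2 captions, 16 token features each
    print(model(video, query).shape)  # torch.Size([2, 2])
```

In the paradigm sketched by the abstract, such a network would be trained on the (moment, caption) pairs produced automatically by CMG rather than on manually annotated grounding data, which is what enables zero-shot evaluation on Charades-STA and ActivityNet Captions.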
- Xing Zhang (104 papers)
- Jiaxi Gu (17 papers)
- Haoyu Zhao (41 papers)
- Shicong Wang (2 papers)
- Hang Xu (204 papers)
- Renjing Pei (25 papers)
- Songcen Xu (41 papers)
- Zuxuan Wu (144 papers)
- Yu-Gang Jiang (223 papers)