Contrastive Language-Action Pre-training for Temporal Localization (2204.12293v1)

Published 26 Apr 2022 in cs.CV

Abstract: Long-form video understanding requires approaches that can temporally localize activities or language. End-to-end training for such tasks is limited by compute-device memory constraints and the lack of large-scale temporal annotations. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. As a result, the video encoder does not learn temporal boundaries or unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents capturing the relations between different action categories and the background context in a video clip, which limits generalization capacity. To address these limitations, we propose a novel post-pre-training approach that leverages language and does not freeze the video encoder. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips, and language in the form of captions. Our experiments show that the proposed approach improves the state of the art on temporal action localization, few-shot temporal action localization, and video language grounding tasks.
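
The abstract mentions a masked contrastive learning loss over activities, background clips, and captions, but gives no formula. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style clip-caption loss in which background clips (those without a matching caption) are masked out of the positive pairs; the names `masked_contrastive_loss` and `foreground_mask` and the temperature value are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(video_feats, text_feats, foreground_mask, temperature=0.07):
    """Illustrative sketch of a masked clip-caption contrastive loss.

    video_feats:     (N, D) clip embeddings from the video encoder
    text_feats:      (N, D) caption embeddings from the language encoder
    foreground_mask: (N,) bool; True for clips with a matching caption,
                     False for background clips, which are excluded from the
                     positive set (an assumption, not the paper's exact rule).
    """
    # Cosine-similarity logits between every clip and every caption.
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                      # (N, N)

    # Symmetric InfoNCE terms; the i-th caption is the positive for the i-th clip.
    targets = torch.arange(len(v), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2v = F.cross_entropy(logits.T, targets, reduction="none")
    per_pair = 0.5 * (loss_v2t + loss_t2v)

    # Average only over foreground clips; background clips still act as
    # negatives through the similarity matrix but contribute no positive term.
    mask = foreground_mask.float()
    return (per_pair * mask).sum() / mask.sum().clamp(min=1.0)
```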

Authors (6)
  1. Mengmeng Xu (27 papers)
  2. Erhan Gundogdu (9 papers)
  3. Maksim Lapin (6 papers)
  4. Bernard Ghanem (255 papers)
  5. Michael Donoser (3 papers)
  6. Loris Bazzani (14 papers)
Citations (27)