A Straightforward Framework For Video Retrieval Using CLIP (2102.12443v2)

Published 24 Feb 2021 in cs.CV

Abstract: Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
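The abstract does not say which techniques extend CLIP from single images to videos. As a minimal sketch, assume one simple option: average the per-frame image embeddings into one video embedding, then rank videos by cosine similarity to the text embedding in the shared space. The function names (`mean_pool`, `rank_videos`) and the toy 3-dimensional vectors are hypothetical stand-ins; real embeddings would come from a CLIP image and text encoder, and the paper's actual aggregation method may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding (assumed aggregation)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(text_embedding, videos):
    """Return video ids sorted from most to least similar to the text query.

    `videos` maps a video id to its list of per-frame embeddings.
    """
    scores = {
        video_id: cosine_similarity(text_embedding, mean_pool(frames))
        for video_id, frames in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors standing in for real CLIP embeddings (illustration only).
videos = {
    "cooking": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "soccer": [[0.0, 0.9, 0.2], [0.1, 0.8, 0.3]],
}
query = [1.0, 0.0, 0.0]  # stand-in for the text embedding of a cooking query
ranking = rank_videos(query, videos)
```

Because both modalities live in the same space, the same similarity function also covers the reverse direction (video-to-text retrieval) by swapping which side is the query.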

Authors (3)

Jesús Andrés Portillo-Quintero
José Carlos Ortiz-Bayliss
Hugo Terashima-Marín

Citations (109)

Related Papers