A Joint Video and Image Encoder for End-to-End Retrieval
The research paper "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval" provides a comprehensive exploration of video-text retrieval through the creation of a unified visual encoder. The goal is to bridge the gap between images and videos in an end-to-end trainable model under a unified framework. This paper addresses several core challenges faced in video-text retrieval, particularly the design of the visual architecture and the noise in large-scale training datasets.
Model Architecture
The proposed model adapts the Vision Transformer (ViT) and TimeSformer architectures, applying attention across both spatial and temporal dimensions. It is a dual encoder, pairing a space-time transformer encoder for video with a text encoder, and supports both text-to-video and text-to-image retrieval. The key design choice is flexibility: images are treated as single-frame videos, so image and video datasets can be unified seamlessly during training.
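A minimal sketch of this dual-encoder interface is given below, assuming PyTorch and placeholder backbone modules; the hidden sizes, projection dimension, and module names are illustrative rather than the paper's exact configuration.

```python
# Minimal sketch of a dual encoder for joint image/video-text retrieval.
# The backbones are hypothetical nn.Modules returning (B, 768) features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoder(nn.Module):
    def __init__(self, video_backbone: nn.Module, text_backbone: nn.Module, embed_dim: int = 256):
        super().__init__()
        self.video_backbone = video_backbone  # space-time transformer (assumed interface)
        self.text_backbone = text_backbone    # e.g. a DistilBERT-style encoder (assumed interface)
        # Linear heads project each modality into the shared retrieval space.
        self.video_proj = nn.Linear(768, embed_dim)
        self.text_proj = nn.Linear(768, embed_dim)

    def forward(self, frames: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); an image batch simply uses T == 1.
        v = self.video_proj(self.video_backbone(frames))   # (B, embed_dim)
        t = self.text_proj(self.text_backbone(token_ids))  # (B, embed_dim)
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        # Dot-product similarity between every caption and every video in the batch.
        return t @ v.T                                      # (B, B) similarity matrix
```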
The model's visual backbone is the space-time transformer encoder, which divides the input into spatio-temporal patches processed by a stack of transformer layers. Learned positional embeddings encode spatial and temporal position, allowing the model to contextualize patches across both space and time. The text encoder is a multi-layer bidirectional transformer (DistilBERT), which maps tokenized word sequences to text embeddings. Text and video embeddings are projected into a common space, and retrieval is performed via dot-product similarity.
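The sketch below illustrates how frames might be split into patches and given spatial and temporal positional embeddings; it assumes PyTorch, and the patch size, hidden dimension, and maximum frame count are placeholders, not the paper's settings.

```python
# Minimal sketch of space-time patch embedding with separate spatial and
# temporal positional embeddings (illustrative sizes only).
import torch
import torch.nn as nn


class SpaceTimePatchEmbed(nn.Module):
    def __init__(self, patch: int = 16, dim: int = 768, max_frames: int = 8, img_size: int = 224):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # per-frame patchify
        n_patches = (img_size // patch) ** 2
        self.spatial_pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.temporal_pos = nn.Parameter(torch.zeros(1, max_frames, dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -> token sequence: (B, T * n_patches, dim)
        b, t, c, h, w = frames.shape
        x = self.proj(frames.flatten(0, 1))            # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)               # (B*T, n_patches, dim)
        x = x + self.spatial_pos                       # same spatial embedding for every frame
        x = x.view(b, t, -1, x.size(-1))               # (B, T, n_patches, dim)
        x = x + self.temporal_pos[:, :t].unsqueeze(2)  # frame-index embedding
        return x.flatten(1, 2)                         # sequence fed to the transformer layers
```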
Training Strategy
To address the noisy nature of large-scale datasets such as HowTo100M, the authors propose a curriculum learning strategy. Training begins with images treated as 'frozen', single-frame videos, and the temporal context is gradually expanded when moving to video datasets. This incremental schedule, supported by temporal embedding interpolation, lets the model scale efficiently and improve performance while keeping computational cost down.
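The temporal embedding interpolation step can be sketched as follows, assuming PyTorch and learned temporal positional embeddings stored as a (1, T, dim) parameter; the frame counts are illustrative.

```python
# Minimal sketch of expanding the temporal context during the curriculum:
# temporal positional embeddings trained at a short clip length are
# interpolated to a longer one.
import torch
import torch.nn.functional as F


def expand_temporal_embeddings(temporal_pos: torch.Tensor, new_frames: int) -> torch.Tensor:
    """temporal_pos: (1, T_old, dim) -> (1, new_frames, dim) via linear interpolation."""
    # F.interpolate expects (N, C, L), so move the embedding dim into the channel slot.
    x = temporal_pos.transpose(1, 2)                                        # (1, dim, T_old)
    x = F.interpolate(x, size=new_frames, mode="linear", align_corners=False)
    return x.transpose(1, 2)                                                # (1, new_frames, dim)


# Example: grow from 4-frame clips to 8-frame clips; the same routine applies
# when stepping up from single-frame (image) training to short clips.
pos_4 = torch.randn(1, 4, 768)
pos_8 = expand_temporal_embeddings(pos_4, new_frames=8)  # (1, 8, 768)
```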
For video-text pretraining, the paper introduces WebVid-2M, a dataset of over 2 million videos with weakly aligned captions. Although substantially smaller than HowTo100M, WebVid-2M's cleaner, better-aligned captions allow the model to achieve state-of-the-art results. Complemented by large-scale image-captioning datasets such as Conceptual Captions (CC3M), this lets training exploit the synergy between image and video data.
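One way to realise this image-video synergy in code is to give both kinds of samples a common input layout, with an image promoted to a one-frame clip. The helper below is a hypothetical PyTorch sketch, not the paper's data pipeline.

```python
# Minimal sketch: images and videos share the (T, C, H, W) layout,
# so both can be fed to the same space-time encoder.
import torch


def as_clip(sample: torch.Tensor) -> torch.Tensor:
    """Images (C, H, W) become single-frame clips (1, C, H, W);
    videos (T, C, H, W) pass through unchanged."""
    return sample.unsqueeze(0) if sample.dim() == 3 else sample


image = torch.randn(3, 224, 224)     # e.g. a Conceptual Captions image
video = torch.randn(4, 3, 224, 224)  # e.g. a 4-frame WebVid-2M clip
clips = [as_clip(image), as_clip(video)]  # both now follow the (T, C, H, W) convention
```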
Experimental Results
Empirical validation is conducted on several benchmark datasets, including MSR-VTT, MSVD, DiDeMo, and LSMDC. Notable results include a Recall@1 (R@1) of 32.5% and a Median Rank (MedR) of 3 on MSR-VTT, outperforming prior methods that rely on pre-extracted features from multiple modalities. Zero-shot performance on MSR-VTT further demonstrates the model's generalizability, surpassing earlier approaches that depend on extensive pretraining on noisy datasets such as HowTo100M.
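For reference, these retrieval metrics can be computed from a text-to-video similarity matrix as in the sketch below; it assumes NumPy and that the matched video for each caption sits on the diagonal.

```python
# Minimal sketch of Recall@K and Median Rank from a similarity matrix.
import numpy as np


def retrieval_metrics(sim: np.ndarray, ks=(1, 5, 10)):
    """sim: (n_texts, n_videos) similarity scores with matched pairs on the diagonal."""
    order = np.argsort(-sim, axis=1)  # best-matching video first for each caption
    # Rank of the correct video (1-indexed) for each caption.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) * 100 for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics
```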
An ablation study shows that the choice of video backbone significantly affects performance, with the space-time transformer encoder clearly outperforming traditional 3D convolutional networks (3D CNNs) on retrieval accuracy. Similarly, distilbert-base-uncased as the text backbone offers a good balance between performance and computational cost.
Contributions and Future Directions
The paper makes several contributions:
- Introduction of a versatile end-to-end trainable model for video retrieval without dependence on 'expert' pre-extracted features.
- Development of a large-scale video-text pretraining dataset, WebVid-2M, showing that a smaller, higher-quality dataset can outperform a larger, noisier one.
- Implementation of a curriculum learning mechanism, improving training efficiency and model performance through gradual learning of temporal contexts.
Future work could explore incorporating additional modalities, such as audio and extra text sources, within this joint framework. Further scaling with larger and more diverse datasets such as WebVid-10M could also improve the model's robustness and applicability to more complex video-text retrieval tasks.
This unified approach is a practical step towards efficient and flexible video-text retrieval, and sets a precedent for integrated model architectures for multi-modal grounding.