HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Published 7 Jun 2019 in cs.CV | (1906.03327v2)

Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.

Summary

  • The paper introduces the HowTo100M dataset and a joint text-video embedding model learned from 136M narrated instructional clips.
  • It employs a max-margin ranking loss with innovative intra- and inter-video negative sampling to robustly align visual and textual features.
  • State-of-the-art performance on benchmarks like CrossTask and MSR-VTT demonstrates its strong potential for transfer learning across diverse domains.

The paper presents a comprehensive study on learning joint text-video embeddings from a massive dataset of narrated instructional videos, named HowTo100M. This dataset comprises 136 million video clips sourced from 1.22 million narrated instructional web videos, covering over 23,000 distinct visual tasks.

Contributions

The work makes several critical contributions:

  1. Creation of HowTo100M Dataset: The authors introduce a scalable and fast data collection technique that retrieves instructional videos from platforms like YouTube. These videos, which are accompanied by narrations, provide a naturally aligned source of video and text data without needing any additional manual annotation. This results in a dataset that is orders of magnitude larger than existing ones, including YouCook2 and MSR-VTT.
  2. Text-Video Embedding Model: The authors propose a model that maps videos and their corresponding captions into a shared embedding space. Utilizing a max-margin ranking loss, this model is trained to ensure that related text and video clips are close in this embedding space.
  3. State-of-the-Art Performance: The embeddings learned from HowTo100M notably outperform existing methods on various benchmarks. Evaluations on text-to-video retrieval and action step localization benchmarks (e.g., YouCook2, MSR-VTT, and CrossTask) illustrate the strength of the embeddings, both directly and after fine-tuning on the respective datasets.

Methodology and Model

The proposed model leverages a combination of 2D and 3D CNNs to extract features from video clips. Text features are derived from pretrained word embeddings, further processed through a shallow neural network. The joint embedding space is constructed using a non-linear transformation guided by a max-margin ranking loss. Importantly, the model incorporates both intra- and inter-video negative sampling strategies to ensure robustness in representing fine-grained visual information.
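
To make the training objective concrete, here is a minimal PyTorch sketch of a gated non-linear projection and a max-margin ranking loss in the spirit of the description above. Layer sizes, the margin value, and all names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEmbedding(nn.Module):
    """Projects pre-extracted video or text features into the shared space."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(out_dim, out_dim)

    def forward(self, x):
        h = self.fc(x)
        h = h * torch.sigmoid(self.gate(h))  # learned non-linear gating
        return F.normalize(h, dim=-1)        # unit norm, so dot product = cosine

def max_margin_ranking_loss(v, t, margin=0.1):
    """v, t: (B, D) embeddings of B matching clip-caption pairs; every
    off-diagonal pair in the batch serves as a negative."""
    sim = v @ t.t()                              # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                # (B, 1) positive similarities
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    loss = F.relu(margin + sim - pos)            # rank true caption above others
    loss = loss + F.relu(margin + sim - pos.t()) # rank true clip above others
    return (loss * mask).sum() / mask.sum()
```

If each batch is built so that several clips share a source video, some of the off-diagonal negatives above play the intra-video role the paper relies on, discouraging shortcuts such as matching on background scenery alone.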

Experimental Results

Empirical validation of the proposed method underscores its effectiveness. Notably, the embedding model:

  • Achieves an average recall of 33.6% on CrossTask action step localization, surpassing prior state-of-the-art methods and even outdoing a fully supervised baseline trained on manually annotated segments.
  • On the MSR-VTT text-to-video retrieval benchmark, the HowTo100M pre-trained model attains an R@10 of 52.8% after fine-tuning, exceeding previous best results (R@10 counts a query as correct when the ground-truth clip ranks in the top 10; see the sketch after this list).
  • In domains less represented in the training data, such as the movie clips of LSMDC, fine-tuning the pre-trained embeddings still yields measurable gains over training on those datasets alone.
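
For reference, R@K can be computed from a query-clip similarity matrix as follows. This is a generic evaluation sketch with illustrative names, not the authors' evaluation code.

```python
import numpy as np

def recall_at_k(sim, k=10):
    """sim: (Q, C) similarity matrix between Q text queries and C candidate
    clips, where query i's ground-truth clip is clip i (Q <= C).
    Returns the fraction of queries whose correct clip ranks in the top k."""
    top_k = (-sim).argsort(axis=1)[:, :k]   # best-first clip indices per query
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy example: queries 0 and 2 rank their ground-truth clip first, query 1 does not.
sim = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.1],
                [0.0, 0.1, 0.7]])
print(recall_at_k(sim, k=1))  # 0.666...
```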

Implications and Future Work

The work presents several practical and theoretical implications:

  1. Scalability: The approach demonstrates that leveraging large-scale, weakly-supervised datasets can significantly benefit learning robust multi-modal representations.
  2. Transfer Learning: The fine-tuning results indicate that embeddings pre-trained on HowTo100M generalize well across domains, reducing the number of labeled examples needed for downstream tasks.
  3. Dataset Creation: The data collection process hinges on freely available web resources, highlighting an effective route to circumvent the expensive and time-consuming process of manual annotation.

Future research could explore the following areas:

  • Extended Domains: Investigating the extension of this approach to other domains of web videos or even other forms of instructional media.
  • Enhanced Sampling Techniques: Improving positive and negative pair sampling techniques could further boost performance, especially in noisy data environments.
  • Multi-modal Pre-training: Incorporating more diverse and contextually enriched pre-training objectives could enhance the robustness of the embeddings.

In sum, this paper sets a new benchmark for large-scale learning of text-video embeddings, providing significant insights and resources for the research community in computer vision and natural language processing.

Explain it Like I'm 14

What is this paper about?

This paper introduces HowTo100M, a huge collection of “how-to” videos from YouTube (like cooking, fixing cars, or crafting) and shows how to use them to teach computers to connect what they see in a video with the words people say about it. The goal is to make it easy for a computer to find the right video from a text search (like “how to change a tire”) and to spot where specific actions happen inside long videos (like “add oil to a car” at the right moment).

What questions did the researchers ask?

They focused on simple, practical questions:

  • Can a computer learn to match videos and text by watching millions of narrated “how-to” videos without people writing special captions by hand?
  • Will learning from a massive amount of (sometimes messy) real-world videos make the computer better at tasks like:
    • Finding the right video from a sentence (text-to-video search)?
    • Finding where a specific step happens inside a long video (action step localization)?
  • If we pretrain on HowTo100M and then fine-tune on smaller datasets, do we get better results than training on the small datasets alone?

How did they do it? (Explained simply)

Think of teaching a computer the way you might learn a new skill on YouTube: you watch someone do it while they explain what they’re doing.

  1. Building a giant dataset:
    • They gathered about 1.22 million instructional YouTube videos (over 15 years of total video time!), covering more than 23,000 tasks (like cooking, home repair, crafts, and more).
    • Each video has subtitles—either written by the uploader or generated automatically by speech recognition (ASR). These narrations are not perfect, but they’re good enough at scale.
    • They split each video into short clips and paired each clip with the subtitle line spoken during that moment, ending up with about 136 million clip–caption pairs (sketched in code after this list).
  2. Teaching the computer a shared “language” for video and text:
    • They built a model that puts both video clips and short text captions onto the same “map” (called an embedding). On this map, matching video–text pairs should end up close together, and mismatched pairs should be far apart.
    • Imagine labeling photos and sentences with coordinates so that “crack an egg into a bowl” lands near video clips where someone actually cracks an egg.
  3. How the model learns (everyday analogy):
    • The model plays a “hot or cold” game: it’s rewarded when it puts matching video and text close together (hot) and pushed to separate mismatches (cold).
    • To avoid taking shortcuts (like recognizing a kitchen background instead of the actual action), it also trains with “tricky negatives” from the same video. For example, if the correct line is “stir the sauce,” a negative might be “chop the onions” from the same kitchen scene. This forces the model to focus on the action and objects, not just the setting.
  4. Testing the model:
    • Text-to-video retrieval: Given a sentence, can the model find the right clip in a large set?
    • Action step localization: Given a list of steps for a task (like “jack up car,” “remove wheel,” “add oil”), can the model find when those steps happen in a long video?
    • They tested on standard datasets: YouCook2 (cooking), CrossTask (instructional tasks), MSR-VTT (random YouTube clips), and LSMDC (movie clips).
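
The clip-caption pairing in step 1 is simple enough to sketch in code. The following is an illustrative reconstruction in Python, assuming subtitle records with start, end, and text fields; it is not the paper's actual pipeline code.

```python
def subtitles_to_pairs(video_id, subtitles):
    """subtitles: list of dicts like {"start": 12.4, "end": 15.9, "text": "crack an egg"}.
    Returns one (clip, caption) pair per subtitle line; the narration timing,
    not manual annotation, defines the clip boundaries."""
    pairs = []
    for line in subtitles:
        clip = {
            "video_id": video_id,
            "start": line["start"],
            "end": line["end"],
        }
        pairs.append((clip, line["text"]))
    return pairs
```

Run over 1.22 million videos, a procedure like this yields the roughly 136 million clip-caption pairs used for training.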

What did they find, and why is it important?

Here are the main takeaways:

  • Learning from massive, narrated “how-to” videos works really well:
    • On instructional tasks like YouCook2 and CrossTask, the model trained on HowTo100M achieved state-of-the-art results—often beating models trained on much smaller, carefully labeled datasets.
    • For action step localization on CrossTask, it even outperformed a fully supervised baseline on average, despite not using any manual step annotations.
  • It transfers to other kinds of videos:
    • Pretraining on HowTo100M and then fine-tuning on smaller datasets (like MSR-VTT or LSMDC) produced better results than training on those smaller datasets alone.
    • Even for very different content (like movie clips in LSMDC), pretraining helped once fine-tuned.
  • More data keeps helping:
    • As they increased the amount of HowTo100M training data, performance kept improving, with no signs of leveling off. This suggests that even more narrated videos could make models better.
  • Less manual work needed:
    • Because the narrations come “for free” with the videos, there’s no need for armies of people to write captions. This makes it fast and cheap to build very large training sets.

Why this matters:

  • Better video search: Imagine typing “how to knit a scarf” and instantly getting the most relevant clips.
  • Smarter assistants and robots: Systems that understand what people are doing and saying can learn tasks by watching.
  • Faster progress with fewer labels: Pretraining on large, naturally narrated videos reduces how much expensive manual labeling is required later.

What does this mean for the future?

This research shows a powerful strategy: let computers “learn by watching” at massive scale using everyday videos that already have spoken explanations. It opens the door to:

  • Easier training for many video-and-language tasks without costly human annotations.
  • More accurate and useful video search and summarization tools.
  • Improving AI that needs to understand and describe human actions, from household robots to educational apps.

In short, by turning the world’s how-to videos into training material, the paper demonstrates a practical and scalable way to teach machines to connect words with actions.
