HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Published 7 Jun 2019 in cs.CV | (1906.03327v2)

Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.

Summary

  • The paper introduces the HowTo100M dataset and a joint text-video embedding model learned from 136M narrated instructional clips.
  • It employs a max-margin ranking loss with innovative intra- and inter-video negative sampling to robustly align visual and textual features.
  • State-of-the-art performance on benchmarks like CrossTask and MSR-VTT demonstrates its strong potential for transfer learning across diverse domains.

The paper presents a comprehensive study on learning joint text-video embeddings from a massive dataset of narrated instructional videos, named HowTo100M. This dataset comprises 136 million video clips sourced from 1.22 million narrated instructional web videos, covering over 23,000 distinct visual tasks.

Contributions

The work makes several critical contributions:

  1. Creation of HowTo100M Dataset: The authors introduce a scalable and fast data collection technique that retrieves instructional videos from platforms like YouTube. These videos, which are accompanied by narrations, provide a naturally aligned source of video and text data without needing any additional manual annotation. This results in a dataset that is orders of magnitude larger than existing ones, including YouCook2 and MSR-VTT.
  2. Text-Video Embedding Model: The authors propose a model that maps videos and their corresponding captions into a shared embedding space. Utilizing a max-margin ranking loss, this model is trained to ensure that related text and video clips are close in this embedding space.
  3. State-of-the-Art Performance: The embeddings learned from HowTo100M notably outperform existing methods on various benchmarks. Evaluations on text-to-video retrieval and action step localization benchmarks (e.g., YouCook2, MSR-VTT, and CrossTask) illustrate the strength of the embeddings, both directly and after fine-tuning on the respective datasets.

Methodology and Model

The proposed model leverages a combination of 2D and 3D CNNs to extract features from video clips. Text features are derived from pretrained word embeddings, further processed through a shallow neural network. The joint embedding space is constructed using a non-linear transformation guided by a max-margin ranking loss. Importantly, the model incorporates both intra- and inter-video negative sampling strategies to ensure robustness in representing fine-grained visual information.
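
To make the training objective concrete, here is a minimal PyTorch sketch of a gated non-linear projection and a max-margin ranking loss in the spirit of the description above. Layer sizes, the margin value, and all names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEmbedding(nn.Module):
    """Projects pre-extracted video or text features into the shared space."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(out_dim, out_dim)

    def forward(self, x):
        h = self.fc(x)
        h = h * torch.sigmoid(self.gate(h))  # learned non-linear gating
        return F.normalize(h, dim=-1)        # unit norm, so dot product = cosine

def max_margin_ranking_loss(v, t, margin=0.1):
    """v, t: (B, D) embeddings of B matching clip-caption pairs; every
    off-diagonal pair in the batch serves as a negative."""
    sim = v @ t.t()                              # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                # (B, 1) positive similarities
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    loss = F.relu(margin + sim - pos)            # rank true caption above others
    loss = loss + F.relu(margin + sim - pos.t()) # rank true clip above others
    return (loss * mask).sum() / mask.sum()
```

If each batch is built so that several clips share a source video, some of the off-diagonal negatives above play the intra-video role the paper relies on, discouraging shortcuts such as matching on background scenery alone.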

Experimental Results

Empirical validation of the proposed method underscores its effectiveness. Notably, the embedding model:

  • Achieves an average recall of 33.6% on CrossTask action step localization, surpassing prior state-of-the-art methods and even outdoing a fully supervised baseline trained on manually annotated segments.
  • On the MSR-VTT text-to-video retrieval benchmark, the HowTo100M pre-trained model attains an R@10 of 52.8% after fine-tuning, exceeding previous best results (R@10 counts a query as correct when the ground-truth clip ranks in the top 10; see the sketch after this list).
  • In domains less represented in the training data, such as the movie clips of LSMDC, fine-tuning the pre-trained embeddings still yields measurable gains over training on those datasets alone.
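
For reference, R@K can be computed from a query-clip similarity matrix as follows. This is a generic evaluation sketch with illustrative names, not the authors' evaluation code.

```python
import numpy as np

def recall_at_k(sim, k=10):
    """sim: (Q, C) similarity matrix between Q text queries and C candidate
    clips, where query i's ground-truth clip is clip i (Q <= C).
    Returns the fraction of queries whose correct clip ranks in the top k."""
    top_k = (-sim).argsort(axis=1)[:, :k]   # best-first clip indices per query
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy example: queries 0 and 2 rank their ground-truth clip first, query 1 does not.
sim = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.1],
                [0.0, 0.1, 0.7]])
print(recall_at_k(sim, k=1))  # 0.666...
```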

Implications and Future Work

The work presents several practical and theoretical implications:

  1. Scalability: The approach demonstrates that leveraging large-scale, weakly-supervised datasets can significantly benefit learning robust multi-modal representations.
  2. Transfer Learning: The fine-tuning results indicate that embeddings pre-trained on HowTo100M generalize well across domains, reducing the number of labeled examples needed for downstream tasks.
  3. Dataset Creation: The data collection process hinges on freely available web resources, highlighting an effective route to circumvent the expensive and time-consuming process of manual annotation.

Future research could explore the following areas:

  • Extended Domains: Investigating the extension of this approach to other domains of web videos or even other forms of instructional media.
  • Enhanced Sampling Techniques: Improving positive and negative pair sampling techniques could further boost performance, especially in noisy data environments.
  • Multi-modal Pre-training: Incorporating more diverse and contextually enriched pre-training objectives could enhance the robustness of the embeddings.

In sum, this paper sets a new benchmark for large-scale learning of text-video embeddings, providing significant insights and resources for the research community in computer vision and natural language processing.

Explain it Like I'm 14

What is this paper about?

This paper introduces HowTo100M, a huge collection of “how-to” videos from YouTube (like cooking, fixing cars, or crafting) and shows how to use them to teach computers to connect what they see in a video with the words people say about it. The goal is to make it easy for a computer to find the right video from a text search (like “how to change a tire”) and to spot where specific actions happen inside long videos (like “add oil to a car” at the right moment).

What questions did the researchers ask?

They focused on simple, practical questions:

  • Can a computer learn to match videos and text by watching millions of narrated “how-to” videos without people writing special captions by hand?
  • Will learning from a massive amount of (sometimes messy) real-world videos make the computer better at tasks like:
    • Finding the right video from a sentence (text-to-video search)?
    • Finding where a specific step happens inside a long video (action step localization)?
  • If we pretrain on HowTo100M and then fine-tune on smaller datasets, do we get better results than training on the small datasets alone?

How did they do it? (Explained simply)

Think of teaching a computer the way you might learn a new skill on YouTube: you watch someone do it while they explain what they’re doing.

  1. Building a giant dataset:
    • They gathered about 1.22 million instructional YouTube videos (over 15 years of total video time!), covering more than 23,000 tasks (like cooking, home repair, crafts, and more).
    • Each video has subtitles—either written by the uploader or generated automatically by speech recognition (ASR). These narrations are not perfect, but they’re good enough at scale.
    • They split each video into short clips and paired each clip with the subtitle line spoken during that moment, ending up with about 136 million clip–caption pairs (sketched in code after this list).
  2. Teaching the computer a shared “language” for video and text:
    • They built a model that puts both video clips and short text captions onto the same “map” (called an embedding). On this map, matching video–text pairs should end up close together, and mismatched pairs should be far apart.
    • Imagine labeling photos and sentences with coordinates so that “crack an egg into a bowl” lands near video clips where someone actually cracks an egg.
  3. How the model learns (everyday analogy):
    • The model plays a “hot or cold” game: it’s rewarded when it puts matching video and text close together (hot) and pushed to separate mismatches (cold).
    • To avoid taking shortcuts (like recognizing a kitchen background instead of the actual action), it also trains with “tricky negatives” from the same video. For example, if the correct line is “stir the sauce,” a negative might be “chop the onions” from the same kitchen scene. This forces the model to focus on the action and objects, not just the setting.
  4. Testing the model:
    • Text-to-video retrieval: Given a sentence, can the model find the right clip in a large set?
    • Action step localization: Given a list of steps for a task (like “jack up car,” “remove wheel,” “add oil”), can the model find when those steps happen in a long video?
    • They tested on standard datasets: YouCook2 (cooking), CrossTask (instructional tasks), MSR-VTT (random YouTube clips), and LSMDC (movie clips).
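
The clip-caption pairing in step 1 is simple enough to sketch in code. The following is an illustrative reconstruction in Python, assuming subtitle records with start, end, and text fields; it is not the paper's actual pipeline code.

```python
def subtitles_to_pairs(video_id, subtitles):
    """subtitles: list of dicts like {"start": 12.4, "end": 15.9, "text": "crack an egg"}.
    Returns one (clip, caption) pair per subtitle line; the narration timing,
    not manual annotation, defines the clip boundaries."""
    pairs = []
    for line in subtitles:
        clip = {
            "video_id": video_id,
            "start": line["start"],
            "end": line["end"],
        }
        pairs.append((clip, line["text"]))
    return pairs
```

Run over 1.22 million videos, a procedure like this yields the roughly 136 million clip-caption pairs used for training.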

What did they find, and why is it important?

Here are the main takeaways:

  • Learning from massive, narrated “how-to” videos works really well:
    • On instructional tasks like YouCook2 and CrossTask, the model trained on HowTo100M achieved state-of-the-art results—often beating models trained on much smaller, carefully labeled datasets.
    • For action step localization on CrossTask, it even outperformed a fully supervised baseline on average, despite not using any manual step annotations.
  • It transfers to other kinds of videos:
    • Pretraining on HowTo100M and then fine-tuning on smaller datasets (like MSR-VTT or LSMDC) produced better results than training on those smaller datasets alone.
    • Even for very different content (like movie clips in LSMDC), pretraining helped once fine-tuned.
  • More data keeps helping:
    • As they increased the amount of HowTo100M training data, performance kept improving, with no signs of leveling off. This suggests that even more narrated videos could make models better.
  • Less manual work needed:
    • Because the narrations come “for free” with the videos, there’s no need for armies of people to write captions. This makes it fast and cheap to build very large training sets.

Why this matters:

  • Better video search: Imagine typing “how to knit a scarf” and instantly getting the most relevant clips.
  • Smarter assistants and robots: Systems that understand what people are doing and saying can learn tasks by watching.
  • Faster progress with fewer labels: Pretraining on large, naturally narrated videos reduces how much expensive manual labeling is required later.

What does this mean for the future?

This research shows a powerful strategy: let computers “learn by watching” at massive scale using everyday videos that already have spoken explanations. It opens the door to:

  • Easier training for many video-and-language tasks without costly human annotations.
  • More accurate and useful video search and summarization tools.
  • Improving AI that needs to understand and describe human actions, from household robots to educational apps.

In short, by turning the world’s how-to videos into training material, the paper demonstrates a practical and scalable way to teach machines to connect words with actions.
