HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
Explain it Like I'm 14
What is this paper about?
This paper introduces HowTo100M, a huge collection of “how-to” videos from YouTube (like cooking, fixing cars, or crafting) and shows how to use them to teach computers to connect what they see in a video with the words people say about it. The goal is to make it easy for a computer to find the right video from a text search (like “how to change a tire”) and to spot where specific actions happen inside long videos (like “add oil to a car” at the right moment).
What questions did the researchers ask?
They focused on simple, practical questions:
- Can a computer learn to match videos and text by watching millions of narrated “how-to” videos without people writing special captions by hand?
- Will learning from a massive amount of (sometimes messy) real-world videos make the computer better at tasks like:
  - Finding the right video from a sentence (text-to-video search)?
  - Finding where a specific step happens inside a long video (action step localization)?
- If we pretrain on HowTo100M and then fine-tune on smaller datasets, do we get better results than training on the small datasets alone?
How did they do it? (Explained simply)
Think of teaching a computer the way you might learn a new skill on YouTube: you watch someone do it while they explain what they’re doing.
- Building a giant dataset:
- They gathered about 1.22 million instructional YouTube videos (over 15 years of total video time!), covering more than 23,000 tasks (like cooking, home repair, crafts, and more).
- Each video has subtitles—either written by the uploader or generated automatically by speech recognition (ASR). These narrations are not perfect, but they’re good enough at scale.
- They split each video into short clips and paired each clip with the subtitle line spoken during that moment, ending up with about 136 million clip–caption pairs.
- Teaching the computer a shared “language” for video and text:
- They built a model that puts both video clips and short text captions onto the same “map” (called an embedding). On this map, matching video–text pairs should end up close together, and mismatched pairs should be far apart.
- Imagine labeling photos and sentences with coordinates so that “crack an egg into a bowl” lands near video clips where someone actually cracks an egg.
- How the model learns (everyday analogy):
- The model plays a “hot or cold” game: it’s rewarded when it puts matching video and text close together (hot) and pushed to separate mismatches (cold).
- To avoid taking shortcuts (like recognizing a kitchen background instead of the actual action), it also trains with “tricky negatives” from the same video. For example, if the correct line is “stir the sauce,” a negative might be “chop the onions” from the same kitchen scene. This forces the model to focus on the action and objects, not just the setting.
- Testing the model:
- Text-to-video retrieval: Given a sentence, can the model find the right clip in a large set?
- Action step localization: Given a list of steps for a task (like “jack up car,” “remove wheel,” “add oil”), can the model find when those steps happen in a long video?
- They tested on standard datasets: YouCook2 (cooking), CrossTask (instructional tasks), MSR-VTT (random YouTube clips), and LSMDC (movie clips).
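The "hot or cold" training game above can be made concrete. The sketch below (not the authors' code; their model and embedding dimensions are learned, the toy 3-D vectors here are invented for illustration) shows a max-margin ranking loss with an intra-video "tricky" negative, plus text-to-video retrieval by ranking clips against a query embedding:

```python
# Minimal sketch of a max-margin ranking loss over a shared video-text
# embedding space, with a "tricky" negative drawn from the same video.
# Embeddings are hypothetical toy 3-D vectors; in the paper they come
# from learned video and text networks.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return sum(x * x for x in a) ** 0.5

def cos_sim(a, b):
    # Cosine similarity: how close two points are on the shared "map".
    return dot(a, b) / (norm(a) * norm(b))

def ranking_loss(video, pos_text, neg_texts, margin=0.1):
    """Hinge loss: reward putting the matching caption closer to the clip
    than every negative by at least `margin`. Negatives can come from
    other videos or from other clips of the same video (the "tricky"
    negatives that force the model past background shortcuts)."""
    s_pos = cos_sim(video, pos_text)
    return sum(max(0.0, margin + cos_sim(video, neg) - s_pos)
               for neg in neg_texts)

# Toy example: a clip of someone stirring a sauce.
clip  = [0.9, 0.1, 0.0]
stir  = [1.0, 0.0, 0.0]   # matching caption: "stir the sauce"
chop  = [0.0, 1.0, 0.0]   # intra-video negative: "chop the onions"
other = [0.0, 0.0, 1.0]   # negative caption from a different video

loss = ranking_loss(clip, stir, [chop, other])

# Text-to-video retrieval: rank candidate clips by similarity to a query.
clips = {"stir clip": clip, "chop clip": [0.1, 0.9, 0.0]}
best = max(clips, key=lambda name: cos_sim(clips[name], stir))
```

With well-separated embeddings like these the hinge terms are already zero (the positive beats both negatives by more than the margin), and the query for "stir the sauce" retrieves the stirring clip; during real training the loss is nonzero and its gradient is what pulls matching pairs together.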
What did they find, and why is it important?
Here are the main takeaways:
- Learning from massive, narrated “how-to” videos works really well:
- On instructional tasks like YouCook2 and CrossTask, the model trained on HowTo100M achieved state-of-the-art results—often beating models trained on much smaller, carefully labeled datasets.
- For action step localization on CrossTask, it even outperformed a fully supervised baseline on average, despite not using any manual step annotations.
- It transfers to other kinds of videos:
- Pretraining on HowTo100M and then fine-tuning on smaller datasets (like MSR-VTT or LSMDC) produced better results than training on those smaller datasets alone.
- Even for very different content (like movie clips in LSMDC), pretraining still helped once the model was fine-tuned.
- More data keeps helping:
- As they increased the amount of HowTo100M training data, performance kept improving, with no signs of leveling off. This suggests that even more narrated videos could make models better.
- Less manual work needed:
- Because the narrations come “for free” with the videos, there’s no need for armies of people to write captions. This makes it fast and cheap to build very large training sets.
Why this matters:
- Better video search: Imagine typing “how to knit a scarf” and instantly getting the most relevant clips.
- Smarter assistants and robots: Systems that understand what people are doing and saying can learn tasks by watching.
- Faster progress with fewer labels: Pretraining on large, naturally narrated videos reduces how much expensive manual labeling is required later.
What does this mean for the future?
This research shows a powerful strategy: let computers “learn by watching” at massive scale using everyday videos that already have spoken explanations. It opens the door to:
- Easier training for many video-and-language tasks without costly human annotations.
- More accurate and useful video search and summarization tools.
- Improving AI that needs to understand and describe human actions, from household robots to educational apps.
In short, by turning the world’s how-to videos into training material, the paper demonstrates a practical and scalable way to teach machines to connect words with actions.