YouTube-100M (HowTo100M) Dataset
- The YouTube-100M (HowTo100M) dataset is a large-scale video-text resource comprising 136 million clip-caption pairs from 1.22 million instructional YouTube videos, supporting joint visual–textual learning.
- The dataset is constructed via an automated pipeline using WikiHow queries, YouTube search filtering, caption alignment, and deep feature extraction from both 2D and 3D CNNs.
- Empirical evaluations show state-of-the-art performance in video-text retrieval benchmarks, highlighting its efficacy despite challenges like noisy captions and domain bias.
The YouTube-100M Dataset, formally known as HowTo100M, is a large-scale corpus consisting of 136 million video clips automatically aligned with natural language narrations extracted from 1.22 million YouTube instructional videos. Specifically designed to support research on joint visual–textual learning, the dataset provides massive weak supervision for training powerful text–video embedding models without manual annotation (Miech et al., 2019).
1. Data Collection and Preprocessing
The foundational source for HowTo100M is the set of over 120,000 “how to” articles from WikiHow. Articles involving explicit physical manipulation were selected by restricting to 12 broad domains such as Food, Home & Garden, and Crafts, and by filtering out tasks whose main verb is abstract rather than physical (e.g., “be,” “know”). This procedure produced a set of 23,611 unique visual tasks.
For each task, a “how to <task name>” query was issued to YouTube, retrieving up to 200 top-ranked videos per task. Retained videos satisfied four criteria: presence of English subtitles (either human-provided or YouTube’s ASR), at least 100 views, at least 100 words in the subtitle track, and total duration no greater than 2,000 seconds. Deduplication was performed at the YouTube video ID level.
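The retention rules above can be sketched as a simple predicate followed by ID-level deduplication. This is a minimal illustration only: the `VideoMeta` record and its field names are hypothetical, and only the four thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else here is a hypothetical sketch.
MIN_VIEWS = 100
MIN_SUBTITLE_WORDS = 100
MAX_DURATION_S = 2000


@dataclass(frozen=True)
class VideoMeta:
    video_id: str
    has_english_subtitles: bool
    views: int
    subtitle_words: int
    duration_s: float


def keep_video(v: VideoMeta) -> bool:
    """Return True if the video passes all four retention criteria."""
    return (
        v.has_english_subtitles
        and v.views >= MIN_VIEWS
        and v.subtitle_words >= MIN_SUBTITLE_WORDS
        and v.duration_s <= MAX_DURATION_S
    )


def filter_and_dedupe(videos):
    """Apply the criteria, then keep the first occurrence of each video ID."""
    seen = set()
    kept = []
    for v in videos:
        if keep_video(v) and v.video_id not in seen:
            seen.add(v.video_id)
            kept.append(v)
    return kept


candidates = [
    VideoMeta("a1", True, 5000, 350, 420.0),
    VideoMeta("a1", True, 5000, 350, 420.0),   # same ID returned by another query
    VideoMeta("b2", True, 40, 900, 300.0),     # too few views
    VideoMeta("c3", False, 9000, 0, 600.0),    # no English subtitles
    VideoMeta("d4", True, 800, 1200, 2600.0),  # longer than 2,000 seconds
]
kept_ids = [v.video_id for v in filter_and_dedupe(candidates)]
print(kept_ids)
```

Because the same video can be returned for several task queries, deduplicating on the ID after filtering keeps each video exactly once.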
Caption alignment required no manual annotation. Each video's time-stamped subtitle track was split into its individual lines, with each line interpreted as a (weak) caption and its time interval defining the start and end of the paired video clip. Captions were preprocessed by removing English stop words; the remaining words were converted to 300-dimensional vectors using pre-trained GoogleNews word2vec embeddings. The average post-processed caption contained approximately four content words.
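The pairing step can be illustrated with a short sketch. The stop-word list below is a tiny stand-in (the list actually used is not given here), the record layout is hypothetical, and the word2vec lookup is omitted; dropping lines with no content words is likewise an assumption of this sketch.

```python
# A tiny illustrative stop-word list; the list actually used is not specified.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "you", "i", "we", "it",
              "is", "are", "in", "on", "now", "so", "your", "will", "with"}


def subtitles_to_pairs(video_id, subtitle_lines):
    """Turn (start_s, end_s, text) subtitle lines into clip-caption pairs.

    Each subtitle line becomes one weak caption, and its time interval
    defines the paired clip. Lines left with no content words are dropped
    (an assumption of this sketch).
    """
    pairs = []
    for start_s, end_s, text in subtitle_lines:
        tokens = [t.strip(".,!?").lower() for t in text.split()]
        content = [t for t in tokens if t and t not in STOP_WORDS]
        if content:
            pairs.append({"video_id": video_id, "start": start_s,
                          "end": end_s, "caption": content})
    return pairs


lines = [
    (12.0, 16.1, "Now you whisk the eggs and the sugar"),
    (16.1, 19.8, "so it is"),
]
clip_pairs = subtitles_to_pairs("hypothetical_id", lines)
print(clip_pairs)
```

In this toy input the first line reduces to three content words and the second disappears entirely, which mirrors how short the post-processed captions tend to be.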
2. Dataset Composition and Statistical Properties
HowTo100M comprises 1.22 million unique YouTube videos, yielding a total of 136.6 million clip–caption pairs. Collectively, these represent 134,472 hours (about 15 years) of video data. The 23,611 identified tasks span 12 high-level domains, with the Food and Home & Garden categories providing the largest contributions (e.g., Food: 11,504 tutorials, 54.4M clips; Home & Garden: 5,068 tutorials, 29.5M clips).
Videos average 6.5 minutes, yielding approximately 110 clips per video. Each clip is around four seconds, and its caption averages four content words post-processing. All captions are in English; some are ASR or automatically translated variants.
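The headline figures can be cross-checked against one another with a few lines of arithmetic. The inputs are the totals reported above; the derived averages are approximate because those totals are rounded.

```python
n_videos = 1.22e6       # unique videos (from the text)
n_pairs = 136.6e6       # clip-caption pairs (from the text)
total_hours = 134_472   # total duration in hours (from the text)

years = total_hours / (24 * 365)
minutes_per_video = total_hours * 60 / n_videos
clips_per_video = n_pairs / n_videos

print(f"{years:.1f} years, {minutes_per_video:.1f} min/video, "
      f"{clips_per_video:.0f} clips/video")
```

The results (a little over 15 years, about 6.6 minutes per video, and about 112 clips per video) agree with the stated averages to within rounding.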
Dataset noisiness is non-negligible. Manual inspection of 400 randomly selected clip–caption pairs revealed that ~51% of captions referenced objects/actions actually present in the clip. The remainder were off-topic (“don’t forget to subscribe”), anticipatory (“next I will…”), or ungrammatical. At the video level, ~71% were true instructional content, ~12% vlogs, and ~7% commercial or advertisement material; non-instructional videos were retained due to frequent object–mention correspondence.
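Since the relevance estimate rests on 400 inspected pairs, its sampling uncertainty can be gauged with a normal-approximation interval. This is a rough sketch that assumes the pairs were a simple random sample; it is not part of the original analysis.

```python
import math

n = 400        # inspected clip-caption pairs (from the text)
p_hat = 0.51   # fraction judged visually relevant (from the text)

# Normal-approximation 95% interval; assumes simple random sampling.
std_err = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * std_err, p_hat + 1.96 * std_err
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the true relevance rate plausibly lies between roughly 46% and 56%, so about half of the captions describe something visible in their clip.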
3. Format, Distribution, and Licensing
Distributed materials include, for each of the approximately 136 million clips: YouTube video ID, clip start and end times, preprocessed caption text (stop words removed), and the corresponding WikiHow task category. Additionally, pre-computed deep video features, code for dataset re-creation, model training and evaluation, and pretrained embedding models are supplied.
Only metadata, features, and code are publicly released under an open academic license; raw video cannot be redistributed and must be re-downloaded from YouTube using provided IDs, subject to YouTube’s terms of service and each video’s individual license.
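Because only identifiers and timestamps are distributed, each clip has to be located on YouTube from its metadata. The sketch below uses a hypothetical record layout (the field names are illustrative, not the released schema) and builds a standard watch URL that starts playback near the clip.

```python
from urllib.parse import urlencode


def clip_watch_url(video_id: str, start_s: float) -> str:
    """Build a YouTube watch URL that starts playback near the clip start."""
    query = urlencode({"v": video_id, "t": f"{int(start_s)}s"})
    return f"https://www.youtube.com/watch?{query}"


# Hypothetical record; field names and values are placeholders.
record = {"video_id": "VIDEO_ID", "start": 12.0, "end": 16.1,
          "caption": "whisk eggs sugar", "category": "Food"}
url = clip_watch_url(record["video_id"], record["start"])
print(url)
```

Any actual download must still respect YouTube's terms of service and the individual video's license, as noted above.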
All official materials are hosted at https://www.di.ens.fr/willow/research/howto100m.
4. Joint Text–Video Embedding Framework
HowTo100M supports training of large-capacity joint text–video embedding models. The core approach involves two nonlinear mappings:
- $f$ for video features
- $g$ for caption features
Both map into a common embedding space, where the cosine similarity between $f(v)$ and $g(c)$ is optimized to be high when clip $v$ and caption $c$ are correctly aligned.
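The similarity score between an embedded clip and an embedded caption is the ordinary cosine similarity. A minimal plain-Python version, with toy vectors standing in for real embeddings, might look like this:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


clip_embedding = [1.0, 2.0, 0.0]       # toy values
caption_embedding = [2.0, 4.0, 0.0]    # same direction, different length
print(cosine_similarity(clip_embedding, caption_embedding))
```

The score depends only on direction, not magnitude, so it ranges from -1 to 1 regardless of how large the embedding vectors are.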
Video representations concatenate features from two CNNs:
- 2D ResNet-152 (ImageNet-pretrained) sampled at 1 fps
- 3D ResNeXt-101 (Kinetics-pretrained, 16-frame) sampled at 1.5 fps
Temporal max-pooling within each clip yields a 4,096-dimensional feature vector per clip.
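The aggregation can be sketched as an element-wise maximum over time for each stream, followed by concatenation. The toy dimensions below (3 and 2) stand in for the real per-stream widths, which are presumably 2,048 each given the 4,096-dimensional total.

```python
def temporal_max_pool(frames):
    """Element-wise max over a list of equal-length per-timestep vectors."""
    return [max(column) for column in zip(*frames)]


def clip_feature(feats_2d, feats_3d):
    """Pool each stream over time, then concatenate the two pooled vectors."""
    return temporal_max_pool(feats_2d) + temporal_max_pool(feats_3d)


# Toy inputs: two timesteps per stream, tiny feature widths.
feats_2d = [[1.0, 5.0, 2.0], [4.0, 0.0, 3.0]]
feats_3d = [[0.5, 0.1], [0.2, 0.9]]
pooled = clip_feature(feats_2d, feats_3d)
print(pooled)
```

Pooling over time makes the clip vector independent of clip length, which matters because the two networks emit features at different rates.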
On the text side, the words of each subtitle line are embedded with word2vec and passed through a shallow 1D CNN, which outputs a corresponding 4,096-dimensional vector.
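Both video and caption nonlinear mappings share a similar gated architecture. For an input feature vector $x$:
$$h(x) = (W_1 x + b_1) \circ \sigma\big(W_2 (W_1 x + b_1) + b_2\big)$$
where $W_1, W_2$ and $b_1, b_2$ are learnable weights and biases (a separate set for each modality), $\circ$ denotes element-wise multiplication, and $\sigma$ is a sigmoid gating. The total parameter count is approximately 67 million.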
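A gated mapping of this kind can be sketched in plain Python as a linear projection multiplied element-wise by a sigmoid gate computed from that same projection. The exact layer layout is an assumption of this sketch, and the tiny weights are illustrative only.

```python
import math


def linear(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_embedding(x, W1, b1, W2, b2):
    """Project x, then scale each output unit by a sigmoid gate in (0, 1)."""
    h = linear(W1, b1, x)
    gate = [sigmoid(z) for z in linear(W2, b2, h)]
    return [hi * gi for hi, gi in zip(h, gate)]


# Toy parameters: identity projection and an all-zero gate layer,
# so every gate equals sigmoid(0) = 0.5.
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
embedded = gated_embedding([2.0, -4.0], W1, b1, W2, b2)
print(embedded)
```

With these toy parameters the output is simply half of the input; in a trained model the gate lets each embedding dimension be attenuated depending on the input.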
Training uses a symmetric max-margin ranking loss over clip–caption minibatches, mixing intra- and inter-video negatives with a margin of 0.1. Optimization employs Adam (learning rate $10^{-4}$) for three days on a single Tesla P100 GPU.
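A symmetric max-margin ranking loss can be sketched over a batch similarity matrix whose diagonal holds the matching clip–caption pairs. This simplified version treats every other item in the batch as a negative; it does not reproduce the specific mix of intra- and inter-video negatives described above, and the averaging over the batch is a choice of this sketch.

```python
def max_margin_ranking_loss(sim, margin=0.1):
    """sim[i][j] is the similarity of clip i and caption j; pairs (i, i) match.

    Penalizes any negative that comes within `margin` of the positive score,
    in both the clip-to-caption and caption-to-clip directions.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            total += max(0.0, margin + sim[i][j] - positive)  # wrong caption for clip i
            total += max(0.0, margin + sim[j][i] - positive)  # wrong clip for caption i
    return total / n


well_separated = [[0.9, 0.2], [0.3, 0.8]]
too_close = [[0.5, 0.45], [0.0, 0.5]]
print(max_margin_ranking_loss(well_separated))
print(max_margin_ranking_loss(too_close))
```

The first matrix incurs no loss because every negative sits more than 0.1 below its positive; the second is penalized because one negative score falls inside the margin.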
5. Benchmarking and Empirical Results
Embeddings pretrained on HowTo100M surpass prior results on the instructional benchmarks CrossTask and YouCook2 without any task-specific fine-tuning. On the out-of-domain MSR-VTT and LSMDC benchmarks, fine-tuning on the target data is needed; after it, the model exceeds the prior best on MSR-VTT but remains below it on LSMDC.
Summary of key empirical findings:
| Benchmark (metric) | Prior best | HowTo100M, no fine-tuning | HowTo100M, fine-tuned |
|---|---|---|---|
| CrossTask (Recall) | 22.4% (Zhukov et al.) | 33.6% | — |
| YouCook2 (R@10) | 21.6% (HGLMM-FV+CCA) | 24.8% | 35.3% |
| MSR-VTT (R@10) | 43.2% (JSFusion) | 29.6% | 52.8% |
| LSMDC (R@10) | 34.1% (JSFusion) | 14.0% | 27.9% |
In MSR-VTT retrieval, fine-tuning with just 20% of the target data suffices to reach state-of-the-art results, reflecting substantial data efficiency. On all three retrieval benchmarks (YouCook2, MSR-VTT, and LSMDC), pre-training on HowTo100M and then fine-tuning yields higher performance than training the same model on the target dataset alone (Miech et al., 2019).
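The R@10 figures above are instances of Recall@K: the fraction of text queries whose correct video appears among the top K retrieved results. A minimal sketch, assuming a similarity matrix in which query i matches video i:

```python
def recall_at_k(sim, k):
    """sim[i][j] is the similarity of text query i to video j; video i is correct."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3x3 example: queries 0 and 2 rank their video first, query 1 ranks it last.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1), recall_at_k(sim, 3))
```

Raising K can only keep or increase the score, which is why R@10 values are much higher than R@1 values on the same benchmark.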
6. Limitations and Practical Considerations
Supervision in HowTo100M is inherently noisy: captions originate from subtitle tracks of spoken narration (largely ASR transcripts) rather than independent human annotations, leading to a significant incidence of misaligned, off-topic, or ungrammatical pairings. Nevertheless, models trained at this scale prove robust to the noise, as the benchmark results indicate.
Licensing restricts dataset distribution to non-video artifacts; users must directly re-acquire video data from YouTube, adhering to terms of service and video-specific licenses.
Deduplication is limited to unique YouTube IDs, leaving open the inclusion of semantically duplicate videos with differing IDs.
Intrinsic domain bias is present, given the dominance of instructional categories such as cooking and DIY. Out-of-domain generalization (e.g., to narrative or cinematic data such as LSMDC) requires fine-tuning on the target domain. Fine-tuning closes the gap on MSR-VTT but, by the R@10 figures reported above, leaves LSMDC (27.9%) below the prior best (34.1%).
Filtering heuristics used during dataset construction, such as requiring ≥100 views, ≥100 subtitle words, length ≤2,000 seconds, and restricting to top 200 results per search, bias selection toward popular, higher-quality videos at the expense of uncurated diversity, but nonetheless leave considerable residual noise.
7. Relevance and Significance for Research
HowTo100M, one of the largest video–text corpora at the time of its release, enables direct training of simple but effective joint text–video embedding models that outperform established baselines on instructional video understanding and retrieval. Its construction methodology (web-scale, automated, and annotation-free) has demonstrated both scalability and empirical benefit for downstream multimodal learning tasks. Pretraining on it consistently improves over training on the much smaller target-domain datasets alone, facilitating model transfer and efficient task adaptation for the broader research community (Miech et al., 2019).