
YouTube-8M: A Large-Scale Video Classification Benchmark (1609.08675v1)

Published 27 Sep 2016 in cs.CV

Abstract: Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there are no comparable size video classification datasets. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities. To get the videos and their labels, we used a YouTube video annotation system, which labels videos with their main topics. While the labels are machine-generated, they have high-precision and are derived from a variety of human-based signals including metadata and query click signals. We filtered the video labels (Knowledge Graph entities) using both automated and manual curation strategies, including asking human raters if the labels are visually recognizable. Then, we decoded each video at one-frame-per-second, and used a Deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and make both the features and video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and report them as baselines. Despite the size of the dataset, some of our models train to convergence in less than a day on a single machine using TensorFlow. We plan to release code for training a TensorFlow model and for computing metrics.

Authors (7)
  1. Sami Abu-El-Haija (23 papers)
  2. Nisarg Kothari (2 papers)
  3. Joonseok Lee (39 papers)
  4. Paul Natsev (5 papers)
  5. George Toderici (22 papers)
  6. Balakrishnan Varadarajan (9 papers)
  7. Sudheendra Vijayanarasimhan (15 papers)
Citations (1,224)

Summary

  • The paper introduces the YouTube-8M dataset, comprising ~8 million videos annotated with a vocabulary of 4800 visual entities, and sets a new standard for large-scale video classification research.
  • It details a novel methodology using pre-computed deep CNN frame-level features and evaluates models such as logistic regression, DBoF, and LSTM.
  • The dataset demonstrates significant improvements via transfer learning, with enhanced mAP on benchmarks like ActivityNet and competitive results on Sports-1M.

YouTube-8M: A Large-Scale Video Classification Benchmark

The paper "YouTube-8M: A Large-Scale Video Classification Benchmark" by Google Research introduces a substantial dataset aimed at advancing the underexplored domain of large-scale video classification. This dataset, termed YouTube-8M, comprises approximately 8 million videos spanning 500,000 hours, annotated using a vocabulary of 4800 visual entities. The paper provides a comprehensive overview of the dataset's creation process, the subsequent feature extraction, and benchmark performance under several state-of-the-art techniques.

Dataset Construction and Features

The dataset's construction involved leveraging the YouTube video annotation system, which automatically labels videos by deriving high-precision annotations from human signals like metadata and query clicks. Subsequently, these labels were filtered for visual recognizability through both automated and manual curation, yielding a high-confidence, visual set of entities.

For each video, frame-level features were extracted using a Deep CNN pre-trained on ImageNet, decoding each video at one frame per second. The resulting features were then compressed, leading to a dataset containing frame-level features for over 1.9 billion video frames. The processed dataset allows practical model training on a single machine within feasible time constraints.
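The pipeline described above — sampling at one frame per second, running each frame through an ImageNet-pretrained CNN, keeping the penultimate-layer activations, and compressing them — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual code: the `fake_cnn` stand-in and the random-projection compression are illustrative assumptions (the paper uses a specific Inception-style network and its own compression scheme).

```python
import numpy as np

def extract_frame_features(frames, cnn, stride=1):
    """Sample frames (here: every `stride`-th frame, standing in for
    1-frame-per-second decoding) and run each through the CNN,
    keeping the activations just before the classification layer."""
    sampled = frames[::stride]
    return np.stack([cnn(f) for f in sampled])

def compress_features(features, n_components=256):
    """Stand-in for the paper's feature compression: project the
    high-dimensional activations down with a fixed random projection."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((features.shape[1], n_components))
    proj /= np.linalg.norm(proj, axis=0, keepdims=True)
    return features @ proj

# Toy example: 10 "frames", a fake CNN producing 2048-d activations.
fake_cnn = lambda frame: np.ones(2048) * frame.mean()
frames = [np.full((8, 8, 3), i, dtype=np.float32) for i in range(10)]
feats = extract_frame_features(frames, fake_cnn)
compressed = compress_features(feats)
print(feats.shape, compressed.shape)  # (10, 2048) (10, 256)
```

Pre-computing and compressing features this way is what makes the released dataset tractable: consumers never touch raw video, only the fixed-length per-frame vectors.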

Classification Models and Baselines

The authors experimented with several classification models to establish baselines on the dataset. These models were trained using TensorFlow, a publicly available framework. The paper explores the following approaches:

  1. Logistic Regression with Averaging: Initial frame-level predictions are averaged to yield video-level predictions.
  2. Deep Bag of Frames (DBoF): This technique projects frame-level features into a higher-dimensional space, then max-pools over time to form a fixed-length video representation.
  3. LSTM Networks: Long Short-Term Memory networks were employed to capture temporal dependencies across frames.

Notably, despite the dataset's size, some models converged in under a day on a single machine. This demonstrates the computational practicality enabled by pre-computing deep features.
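The two simplest baselines above can be sketched in a few lines. This is a schematic NumPy version under illustrative assumptions (random weights, arbitrary dimensions; the paper's actual models are trained TensorFlow networks): baseline 1 averages per-frame sigmoid scores over time, and the core of DBoF projects frames up and max-pools over time before a classifier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_avg_predict(frames, W, b):
    """Baseline 1: per-frame logistic regression, with frame-level
    sigmoid scores averaged over time into video-level predictions."""
    return sigmoid(frames @ W + b).mean(axis=0)

def dbof_pool(frames, U):
    """Baseline 2 (DBoF core): project each frame into a higher-
    dimensional space (with a ReLU), then max-pool over time.
    A classifier would be applied on top of this pooled vector."""
    return np.maximum(frames @ U, 0.0).max(axis=0)

rng = np.random.default_rng(0)
T, D, C, K = 120, 1024, 4800, 8192  # frames, feature dim, classes, DBoF dim
frames = rng.standard_normal((T, D))
W, b = rng.standard_normal((D, C)) * 0.01, np.zeros(C)
U = rng.standard_normal((D, K)) * 0.01

video_scores = logistic_avg_predict(frames, W, b)  # shape (4800,)
pooled = dbof_pool(frames, U)                      # shape (8192,)
```

The LSTM baseline replaces this pooling with a recurrent pass over the frame sequence, so it can model temporal order rather than treating frames as an unordered bag.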

Performance and Transfer Learning

The dataset's utility was evaluated not only directly but also via transfer learning to other benchmarks such as Sports-1M and ActivityNet. The results indicate substantial gains:

  • On ActivityNet, transfer learning improved mAP from 53.8% to 77.6%.
  • Models trained on YouTube-8M showed exceptional generalization to Sports-1M, with performances comparable to state-of-the-art approaches utilizing raw video data and motion features.
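For reference, the mAP figures quoted above follow the standard ranking-based definition: per-class average precision, averaged over classes. A minimal sketch (the paper's exact evaluation code may differ in tie-breaking and class filtering):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: rank videos by score, then average the
    precision measured at the rank of each positive example."""
    order = np.argsort(-scores)
    labels = labels[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1)
    return (precision_at_k * labels).sum() / max(labels.sum(), 1)

def mean_average_precision(score_matrix, label_matrix):
    # Mean of per-class APs over classes with at least one positive.
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(label_matrix.shape[1])
           if label_matrix[:, c].any()]
    return float(np.mean(aps))

# Toy example: 3 videos, 2 classes, perfectly ranked predictions.
scores = np.array([[0.9, 0.1], [0.6, 0.8], [0.2, 0.7]])
labels = np.array([[1, 0], [1, 1], [0, 1]])
print(mean_average_precision(scores, labels))  # 1.0: ranking is perfect
```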

Implications and Future Directions

The practical and theoretical implications of YouTube-8M are significant. Practically, its scale and diversity allow training more generalized models, which reduces overfitting and enhances robustness to different video types. Theoretically, the dataset encourages research into handling noisy and incomplete labels, as well as exploring deeper video representation learning techniques that can effectively utilize pre-computed deep features.

Conclusion

"YouTube-8M: A Large-Scale Video Classification Benchmark" establishes a substantial advance in video classification datasets. With its scope, size, and the public availability of pre-computed features, it levels the playing field for researchers, enabling rapid and scalable video understanding research. Future developments may focus on augmenting the dataset with audio and motion features and improving algorithmic architectures to exploit its full potential.

The YouTube-8M dataset is poised to serve as a significant resource for future advancements in video understanding, representation learning, and related fields in AI and machine learning.