Learning Video Representations from Textual Web Supervision (2007.14937v2)

Published 29 Jul 2020 in cs.CV

Abstract: Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We evaluate the model on several downstream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pre-training video representations. Specifically, it outperforms all existing methods for self-supervised and cross-modal video representation learning.
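The abstract describes training a model to pair each video with its associated text. A common way to realize this kind of pairing objective is a symmetric contrastive (InfoNCE-style) loss over a batch of matched video-text embeddings; the sketch below illustrates that idea in NumPy. The function name, temperature value, and embedding shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def video_text_pairing_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired video/text embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    Returns a scalar loss that is low when matched pairs are most similar.
    """
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(v))      # matched pair sits on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax per row, then pick the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video->text and text->video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With orthonormal embeddings the loss is near zero when pairs are aligned and large when they are shuffled, which is the signal that drives the representation learning.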

Authors (7)
  1. Jonathan C. Stroud (5 papers)
  2. Zhichao Lu (52 papers)
  3. Chen Sun (187 papers)
  4. Jia Deng (93 papers)
  5. Rahul Sukthankar (39 papers)
  6. Cordelia Schmid (206 papers)
  7. David A. Ross (27 papers)
Citations (48)
