
Learning Audio-Video Modalities from Image Captions (2204.00679v1)

Published 1 Apr 2022 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image captioning, where datasets are on the order of millions of samples. To close this gap, we propose a new video mining pipeline that transfers captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state-of-the-art results for the task of audio retrieval.

Authors (7)
  1. Arsha Nagrani (62 papers)
  2. Paul Hongsuck Seo (29 papers)
  3. Bryan Seybold (11 papers)
  4. Anja Hauth (6 papers)
  5. Santiago Manen (2 papers)
  6. Chen Sun (187 papers)
  7. Cordelia Schmid (206 papers)
Citations (75)