
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners (2212.04979v3)

Published 9 Dec 2022 in cs.CV, cs.LG, and cs.MM

Abstract: We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
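
The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention without learned projections, and the names `attentional_pool` and `pool_video`, the random stand-in tokens, and the query counts are all hypothetical, chosen only for illustration. It shows that a pooler written for one image's token sequence accepts the concatenated tokens of several frames without modification.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries):
    """Simplified single-head attentional pooling (illustrative assumption).

    tokens:  (N, D) token embeddings
    queries: (Q, D) query vectors (learned in a real model, random here)
    returns: (Q, D) pooled embeddings
    """
    d = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)  # (Q, N) query-token similarity
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ tokens                   # weighted average of tokens

def pool_video(frame_tokens, queries):
    """frame_tokens: (T, N, D) tokens from encoding each of T frames separately."""
    t, n, d = frame_tokens.shape
    flat = frame_tokens.reshape(t * n, d)     # flatten frames into one sequence
    return attentional_pool(flat, queries)

rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(8, 16, 32))   # 8 frames, 16 tokens each, dim 32
contrastive_query = rng.normal(size=(1, 32))  # one query -> one video embedding
generative_queries = rng.normal(size=(4, 32)) # several queries -> decoder inputs

video_embedding = pool_video(frame_tokens, contrastive_query)
caption_inputs = pool_video(frame_tokens, generative_queries)
print(video_embedding.shape, caption_inputs.shape)
```

Because the pooler's output shape depends only on the number of queries, the same function handles a single image (T = 1) and a multi-frame clip; only the length of the attended sequence changes.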

Authors (8)
  1. Shen Yan (47 papers)
  2. Tao Zhu (205 papers)
  3. Zirui Wang (83 papers)
  4. Yuan Cao (201 papers)
  5. Mi Zhang (85 papers)
  6. Soham Ghosh (24 papers)
  7. Yonghui Wu (115 papers)
  8. Jiahui Yu (65 papers)
Citations (36)