OmniVL: One Foundation Model for Image-Language and Video-Language Tasks (2209.07526v2)

Published 15 Sep 2022 in cs.CV

Abstract: This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose vision-language modeling into spatial and temporal dimensions and obtain performance boosts on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
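The abstract's key technical idea is that a single contrastive loss can absorb paired data (image-text, video-text) and labeled data (image-label, video-label) by treating samples that share a class label as mutual positives. The paper does not include code here, so the following is only an illustrative numpy sketch of that unification under assumed conventions: class labels are rendered as text prompts so they enter the same batch as captions, and `group_ids` marks which visual-text pairs share a label. The function name, shapes, and temperature default are hypothetical.

```python
import numpy as np

def univlc_style_loss(vis, txt, group_ids, temperature=0.07):
    """Illustrative sketch of a unified vision-language contrastive loss.

    vis, txt   : (N, D) L2-normalized visual and text embeddings, paired by row.
    group_ids  : (N,) ints; rows sharing an id (e.g. the same class label,
                 written out as a text prompt) are treated as positives for
                 each other, letting image-text, video-text, image-label and
                 video-label examples share one objective.
    NOTE: an assumed reconstruction of the idea, not the paper's exact loss.
    """
    logits = vis @ txt.T / temperature                 # (N, N) similarities
    pos = group_ids[:, None] == group_ids[None, :]     # positive-pair mask

    # vision -> text: average log-softmax over all positives per anchor
    log_p_v2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_v2t = -(log_p_v2t * pos).sum(axis=1) / pos.sum(axis=1)

    # symmetric text -> vision term
    logits_t = logits.T
    log_p_t2v = logits_t - np.log(np.exp(logits_t).sum(axis=1, keepdims=True))
    loss_t2v = -(log_p_t2v * pos.T).sum(axis=1) / pos.T.sum(axis=1)

    return float((loss_v2t.mean() + loss_t2v.mean()) / 2)
```

With all-distinct `group_ids` this reduces to a standard symmetric InfoNCE (CLIP-style) loss over the batch; shared ids generalize it so supervised label data contributes extra positives rather than being discarded.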

Authors (10)
  1. Junke Wang (18 papers)
  2. Dongdong Chen (164 papers)
  3. Zuxuan Wu (144 papers)
  4. Chong Luo (58 papers)
  5. Luowei Zhou (31 papers)
  6. Yucheng Zhao (28 papers)
  7. Yujia Xie (29 papers)
  8. Ce Liu (51 papers)
  9. Yu-Gang Jiang (223 papers)
  10. Lu Yuan (130 papers)
Citations (135)