Training Vision-Language Transformers from Captions (2205.09256v3)

Published 19 May 2022 in cs.CV and cs.MM

Abstract: Vision-Language Transformers can be learned without low-level human labels (e.g. class labels, bounding boxes, etc.). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach (1) outperforms ViLT on standard benchmarks, (2) provides more interpretable and intuitive patch visualizations, and (3) is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes.
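
The abstract describes a patch-based design in which raw image patches are embedded directly (MAE/ViT-style, without an ImageNet-supervised backbone) and jointly encoded with caption tokens in a single transformer. The following is a minimal illustrative sketch of that input pipeline, not the authors' implementation; all module names, dimensions, and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    """Toy sketch: joint encoding of image patches and caption tokens."""

    def __init__(self, img_size=224, patch_size=16, vocab_size=30522, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Linear patch projection as in ViT/MAE, trained from captions alone,
        # with no supervised classification pretraining of the visual backbone.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.txt_pos = nn.Parameter(torch.zeros(1, 512, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

    def forward(self, images, token_ids):
        # images: (B, 3, H, W); token_ids: (B, T)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        patches = patches + self.img_pos[:, : patches.size(1)]
        words = self.word_embed(token_ids) + self.txt_pos[:, : token_ids.size(1)]
        # Single shared transformer over the concatenated patch and word sequence.
        return self.encoder(torch.cat([patches, words], dim=1))

model = VisionLanguageSketch()
out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # torch.Size([2, 212, 768]) -> 196 patches + 16 tokens
```

In the paper's framing, the key point this sketch reflects is that the visual side starts from a masked-autoencoder-style patch embedding rather than a classifier pretrained on ImageNet labels.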

Authors (7)
  1. Liangke Gui (8 papers)
  2. Yingshan Chang (10 papers)
  3. Qiuyuan Huang (23 papers)
  4. Subhojit Som (9 papers)
  5. Alex Hauptmann (7 papers)
  6. Jianfeng Gao (344 papers)
  7. Yonatan Bisk (91 papers)
Citations (11)