Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining (2305.14281v2)

Published 23 May 2023 in cs.CL and cs.CV

Abstract: Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions, and treat them as additional image descriptions. With masked relation prediction, we further encourage relating entities from image regions with visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly-supervised relations data.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Emanuele Bugliarello (27 papers)
  2. Aida Nematzadeh (24 papers)
  3. Lisa Anne Hendricks (37 papers)
Citations (5)

Summary

We haven't generated a summary for this paper yet.