
Learning Object Detection from Captions via Textual Scene Attributes (2009.14558v1)

Published 30 Sep 2020 in cs.CV

Abstract: Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text represents a scene of the image, as described recently in the literature. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets, outperforming recent approaches.

Authors
  1. Achiya Jerbi
  2. Roei Herzig
  3. Jonathan Berant
  4. Gal Chechik
  5. Amir Globerson
Citations (21)
