Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations (2009.04965v3)

Published 10 Sep 2020 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning visual relationships of objects in images, which is however rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection system. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel modules, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets. The source code is available at https://github.com/coldmanck/RVL-BERT.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Meng-Jiun Chiou (5 papers)
  2. Roger Zimmermann (76 papers)
  3. Jiashi Feng (295 papers)
Citations (1)

Summary

We haven't generated a summary for this paper yet.