
Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries (1704.03944v2)

Published 12 Apr 2017 in cs.CV and stat.ML

Abstract: Associating image regions with text queries has recently been explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate that the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection.
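
The idea of training a classifier on positive and negative region-phrase pairs can be illustrated with a minimal sketch. The abstract does not specify the network architecture or the exact loss, so everything below is an assumption made for illustration: the dot-product `pair_score`, the binary cross-entropy in `discriminative_loss`, and the toy feature vectors are hypothetical and are not the authors' DBNet implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def pair_score(region_feat, phrase_feat):
    """Hypothetical compatibility score: dot product of the two feature vectors."""
    return sum(r * p for r, p in zip(region_feat, phrase_feat))


def discriminative_loss(pairs):
    """Mean binary cross-entropy over (region_feat, phrase_feat, label) triples.

    label is 1 for a matching region-phrase pair and 0 for a negative pair.
    This is an assumed stand-in objective, not the loss from the paper.
    """
    total = 0.0
    for region_feat, phrase_feat, label in pairs:
        p = sigmoid(pair_score(region_feat, phrase_feat))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(pairs)


# Toy data: one region feature, one matching phrase, one mismatched phrase.
region = [1.0, 0.0]
matching_phrase = [2.0, 0.0]
mismatched_phrase = [-2.0, 0.0]

well_paired = [(region, matching_phrase, 1), (region, mismatched_phrase, 0)]
badly_paired = [(region, matching_phrase, 0), (region, mismatched_phrase, 1)]

loss_good = discriminative_loss(well_paired)
loss_bad = discriminative_loss(badly_paired)
```

In this toy setup the loss is lower when the labels agree with the scores, which is the behavior a discriminative objective with negative samples is meant to reward.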

Authors (6)
  1. Yuting Zhang (30 papers)
  2. Luyao Yuan (5 papers)
  3. Yijie Guo (31 papers)
  4. Zhiyuan He (15 papers)
  5. I-An Huang (1 paper)
  6. Honglak Lee (174 papers)
Citations (57)