Task-driven Visual Saliency and Attention-based Visual Question Answering (1702.06700v1)

Published 22 Feb 2017 in cs.CV, cs.AI, cs.CL, and cs.NE

Abstract: Visual question answering (VQA), a classic problem that unifies visual and textual data in a single system, has witnessed great progress since May 2015. Many enlightening VQA works explore image and question encodings and their fusion methods in depth, among which attention is the most effective and influential mechanism. Current attention-based methods focus on adequately fusing visual and textual features, but pay little attention to where people actually look when asking questions about an image. Traditional attention-based methods also attach a single scalar value to the feature at each spatial location, which loses much useful information. To remedy these problems, we propose a general method that performs saliency-like pre-selection on overlapped region features via the interrelation of a bidirectional LSTM (BiLSTM), and a novel element-wise multiplication based attention method that captures richer correlation information between visual and textual features. We conduct experiments on the large-scale COCO-VQA dataset and analyze the effectiveness of our model, which is demonstrated by strong empirical results.
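
The abstract describes two components: a BiLSTM-driven saliency-like pre-selection over overlapped region features, and an element-wise multiplication based attention that keeps a per-dimension correlation vector rather than a single scalar per location. The sketch below illustrates these two ideas in PyTorch under assumed shapes and layer sizes (feature dimensions, a sigmoid saliency gate, a tanh-gated fusion); it is a minimal illustration of the described mechanism, not the authors' exact architecture.

```python
# Minimal sketch (PyTorch) of the two ideas in the abstract, under assumed
# shapes and layers -- not the authors' exact model.
import torch
import torch.nn as nn

class SaliencyPreselectAttention(nn.Module):
    def __init__(self, region_dim=2048, q_dim=1024, hidden_dim=512):
        super().__init__()
        # BiLSTM over the sequence of overlapped region features; its per-step
        # outputs yield saliency-like scores used to pre-select regions.
        self.region_bilstm = nn.LSTM(region_dim, hidden_dim,
                                     batch_first=True, bidirectional=True)
        self.saliency = nn.Linear(2 * hidden_dim, 1)
        # Project both modalities to a common space for element-wise fusion.
        self.v_proj = nn.Linear(region_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.att = nn.Linear(hidden_dim, 1)

    def forward(self, regions, question):
        # regions:  (B, R, region_dim) overlapped region features
        # question: (B, q_dim) encoded question vector
        ctx, _ = self.region_bilstm(regions)            # (B, R, 2H)
        sal = torch.sigmoid(self.saliency(ctx))         # (B, R, 1) saliency weights
        regions = regions * sal                         # saliency-like pre-selection

        v = self.v_proj(regions)                        # (B, R, H)
        q = self.q_proj(question).unsqueeze(1)          # (B, 1, H)
        # Element-wise multiplication keeps a per-dimension correlation vector
        # instead of collapsing each region to a single scalar score.
        joint = torch.tanh(v * q)                       # (B, R, H)
        weights = torch.softmax(self.att(joint), dim=1) # (B, R, 1)
        return (weights * v).sum(dim=1)                 # (B, H) attended visual feature

# Usage with dummy tensors (batch of 2, 49 overlapped regions):
model = SaliencyPreselectAttention()
out = model(torch.randn(2, 49, 2048), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 512])
```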

Authors (4)
  1. Yuetan Lin (3 papers)
  2. Zhangyang Pang (1 paper)
  3. Donghui Wang (15 papers)
  4. Yueting Zhuang (164 papers)
Citations (26)