Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering (1801.07853v1)

Published 24 Jan 2018 in cs.CV

Abstract: Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves the state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice.
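The abstract names POS-tag guided attention as one of its mechanisms without giving details. Below is a minimal, hypothetical sketch of what such an attention module could look like in PyTorch: question-word features are reweighted by scores conditioned on both the word embedding and a learned POS-tag embedding. All module names, dimensions, and the additive scoring form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PosGuidedQuestionAttention(nn.Module):
    """Sketch of POS-tag guided attention over question words.
    Dimensions and the scoring network are assumptions for illustration."""

    def __init__(self, word_dim=300, pos_vocab=45, pos_dim=50, hidden=512):
        super().__init__()
        # POS-tag ids (e.g. from an external tagger) get their own embedding table
        self.pos_embed = nn.Embedding(pos_vocab, pos_dim)
        # score each word from its word embedding concatenated with its tag embedding
        self.score = nn.Sequential(
            nn.Linear(word_dim + pos_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, word_feats, pos_tags):
        # word_feats: (batch, seq_len, word_dim); pos_tags: (batch, seq_len) int ids
        pos_feats = self.pos_embed(pos_tags)                          # (B, T, pos_dim)
        logits = self.score(torch.cat([word_feats, pos_feats], -1))   # (B, T, 1)
        attn = F.softmax(logits, dim=1)                               # attention over words
        return (attn * word_feats).sum(dim=1)                         # pooled question vector

# Toy usage with random inputs
q = torch.randn(2, 8, 300)           # 2 questions, 8 words each
tags = torch.randint(0, 45, (2, 8))  # POS-tag ids
pooled = PosGuidedQuestionAttention()(q, tags)
print(pooled.shape)                  # torch.Size([2, 300])
```

In the paper's multiple-choice setting, a pooled question vector like this would then interact with image and candidate-answer features for triplet scoring; that interaction is not sketched here.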

Authors (7)
  1. Zhe Wang (574 papers)
  2. Xiaoyi Liu (41 papers)
  3. Liangjian Chen (10 papers)
  4. Limin Wang (221 papers)
  5. Yu Qiao (563 papers)
  6. Xiaohui Xie (84 papers)
  7. Charless Fowlkes (35 papers)
Citations (14)