Textually Enriched Neural Module Networks for Visual Question Answering (1809.08697v1)

Published 23 Sep 2018 in cs.CL and cs.CV

Abstract: Problems at the intersection of language and vision, such as visual question answering, have recently gained considerable attention in multi-modal machine learning as computer vision research moves beyond traditional recognition tasks. Deep neural network models that use the linguistic structure of a question to dynamically instantiate network layouts have recently been successful at visual question answering. However, converting the question to a network layout simplifies it, so the model loses information. In this paper, we enrich the image information with textual data from image captions and external knowledge bases to generate more coherent answers. We achieve 57.1% overall accuracy on the test-dev open-ended questions from the visual question answering (VQA 1.0) real image dataset.

Authors (4)
  1. Khyathi Raghavi Chandu
  2. Mary Arpita Pyreddy
  3. Matthieu Felix
  4. Narendra Nath Joshi
Citations (6)