Visuo-Linguistic Question Answering (VLQA) Challenge (2005.00330v3)

Published 1 May 2020 in cs.CV, cs.AI, and cs.CL

Abstract: Understanding images and text together is an important aspect of cognition and of building advanced AI systems. As a community, we have achieved good benchmarks over the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and NLP systems. We propose a novel task of deriving joint inference over a given image-text modality and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first explore the best existing vision-language architectures on VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, though it still falls far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code, and leaderboard are available at https://shailaja183.github.io/vlqa/.

Authors (3)
  1. Shailaja Keyur Sampat (10 papers)
  2. Yezhou Yang (119 papers)
  3. Chitta Baral (152 papers)
Citations (1)