OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (1906.00067v2)

Published 31 May 2019 in cs.CV and cs.CL

Abstract: Visual Question Answering (VQA) in its ideal form lets us study reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most VQA benchmarks to date are focused on questions such as simple counting, visual attributes, and object detection that do not require reasoning or knowledge beyond what is in the image. In this paper, we address the task of knowledge-based visual question answering and provide a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources. Our new dataset includes more than 14,000 questions that require external knowledge to answer. We show that the performance of the state-of-the-art VQA models degrades drastically in this new setting. Our analysis shows that our knowledge-based VQA task is diverse, difficult, and large compared to previous knowledge-based VQA datasets. We hope that this dataset enables researchers to open up new avenues for research in this domain. See http://okvqa.allenai.org to download and browse the dataset.

An Essay on "OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge"

The paper "OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge" introduces a benchmark, OK-VQA, that specifically targets questions whose answers require knowledge from outside the image. It addresses a limitation of existing VQA datasets, which predominantly rely on the visual content of the images alone and therefore do not adequately test the reasoning and knowledge-integration capabilities of AI models.

Summary and Methodology

The authors, Kenneth Marino et al., identified a significant gap in the current VQA landscape: the lack of benchmarks that require models to access and reason with external knowledge sources. Traditional VQA datasets mainly consist of questions answerable by direct visual features alone, such as counting objects, identifying colors, and recognizing simple attributes. In contrast, OK-VQA requires answers derived from expansive external knowledge beyond the given image.

OK-VQA contains more than 14,000 questions spanning diverse knowledge categories such as science, history, and sports. The dataset is built on images randomly sampled from the COCO dataset, which provides visually complex scenes. To generate questions, Mechanical Turk (MTurk) workers were asked to write questions that demand information beyond what is visible in the image, and a subsequent filtering stage removed questions that did not genuinely require outside knowledge.
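For readers who want to inspect the data, the sketch below shows one way to pair each question with its human answers. It assumes the released files follow a VQA-style JSON layout; the filenames and field names used here are illustrative assumptions and should be checked against the download page at okvqa.allenai.org.

```python
import json
from collections import Counter

# Illustrative filenames; the actual names on okvqa.allenai.org may differ.
QUESTIONS_FILE = "OpenEnded_mscoco_val2014_questions.json"
ANNOTATIONS_FILE = "mscoco_val2014_annotations.json"

def load_okvqa(questions_path, annotations_path):
    """Pair each question with its human answers, assuming VQA-style JSON."""
    with open(questions_path) as f:
        questions = {q["question_id"]: q for q in json.load(f)["questions"]}
    with open(annotations_path) as f:
        annotations = json.load(f)["annotations"]

    examples = []
    for ann in annotations:
        q = questions[ann["question_id"]]
        examples.append({
            "image_id": ann["image_id"],                      # COCO image id
            "question": q["question"],
            "answers": [a["answer"] for a in ann["answers"]],
            "knowledge_category": ann.get("question_type"),   # e.g. science, sports
        })
    return examples

if __name__ == "__main__":
    data = load_okvqa(QUESTIONS_FILE, ANNOTATIONS_FILE)
    print(len(data), "questions")
    print(Counter(ex["knowledge_category"] for ex in data).most_common(5))
```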

Benchmarking and Results

The benchmark involved evaluating several state-of-the-art VQA models on the OK-VQA dataset, including Multimodal Tucker Fusion (MUTAN) and Bilinear Attention Networks (BAN). Additionally, the authors introduced ArticleNet, a rudimentary knowledge-based model utilizing Wikipedia articles to retrieve relevant information for answering questions.
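The accuracies reported below are computed with a VQA-style soft consensus metric, in which a predicted answer earns partial credit based on how many human annotators gave the same answer. A minimal sketch of that metric follows; the divisor of 3 comes from the standard VQA formula, and any additional answer normalization or annotator subsampling in the official OK-VQA evaluation code is not reproduced here.

```python
def vqa_soft_accuracy(predicted, human_answers):
    """Soft VQA-style accuracy: full credit once 3+ annotators agree with the prediction.

    Mirrors the standard VQA metric, min(#matches / 3, 1). Whether OK-VQA applies
    extra answer normalization should be checked against the official evaluation code.
    """
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: 2 of 5 annotators agree with the prediction -> partial credit of 2/3.
print(vqa_soft_accuracy("beagle", ["beagle", "beagle", "dog", "puppy", "hound"]))
```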

Key findings include:

  • State-of-the-art models such as BAN and MUTAN showed a drastic performance drop on OK-VQA compared to traditional datasets: BAN achieved 25.17% and MUTAN 26.41% accuracy, underscoring the difficulty of the questions posed in OK-VQA.
  • Combining traditional VQA models with ArticleNet's Wikipedia sentence retrieval yielded slight improvements; MUTAN combined with ArticleNet (MUTAN + AN) reached a modest 27.84% accuracy.
  • Importantly, oracle results combining raw predictions from VQA models and ArticleNet hint at a potential upper-bound improvement, underscoring the need for more sophisticated knowledge retrieval mechanisms.
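To make the oracle notion concrete, the sketch below shows one way such an upper bound can be computed: each question is scored by the best answer available from either source. This is an illustrative reconstruction reusing the soft-accuracy function from the earlier sketch, not the paper's exact procedure.

```python
def oracle_accuracy(vqa_preds, articlenet_candidates, human_answers_per_q):
    """Illustrative upper bound: always pick the better of the two answer sources.

    vqa_preds: question_id -> single predicted answer string
    articlenet_candidates: question_id -> list of candidate answer strings
    human_answers_per_q: question_id -> list of human answer strings
    """
    total = 0.0
    for qid, human_answers in human_answers_per_q.items():
        candidates = [vqa_preds[qid]] + articlenet_candidates.get(qid, [])
        total += max(vqa_soft_accuracy(c, human_answers) for c in candidates)
    return total / len(human_answers_per_q)
```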

Implications and Future Developments

The implications of this research are manifold. OK-VQA sets a higher bar for the integration of external knowledge in VQA tasks, moving a step closer to mirroring human-like reasoning and comprehension. This benchmark illuminates the substantial gap in current methodologies concerning knowledge-based reasoning and opens avenues for developing more sophisticated AI systems capable of accessing, retrieving, and integrating vast external information.

The introduction of OK-VQA challenges the research community to rethink and innovate in the areas of information retrieval, multi-modal learning, and knowledge integration. In terms of future research, focus areas likely include:

  • Developing more advanced retrieval models that can efficiently parse, understand, and utilize unstructured data from large-scale knowledge sources.
  • Enhancing multimodal fusion techniques to better integrate visual and retrieved textual data for coherent answer generation.
  • Exploring more extensive and diverse external knowledge databases to ensure comprehensive coverage of possible questions.

In summary, the paper introduces the OK-VQA benchmark, pushing the boundaries of VQA by requiring external knowledge, and thereby laying the groundwork for more intelligent, knowledge-aware AI systems. The benchmark's introduction is a call to action for the research community to innovate at the intersection of vision, language, and external knowledge integration, steering towards more robust and capable AI.

Authors (4)
  1. Kenneth Marino (15 papers)
  2. Mohammad Rastegari (57 papers)
  3. Ali Farhadi (138 papers)
  4. Roozbeh Mottaghi (66 papers)
Citations (862)