Integrating Implicit and Symbolic Knowledge for VQA
The paper "KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA" presents an innovative approach to tackling one of the most challenging aspects of Visual Question Answering (VQA): answering questions that require external, domain-independent knowledge. The authors introduce a model named KRISP, which stands for Knowledge Reasoning with Implicit and Symbolic rePresentations. This model leverages the complementary strengths of implicit knowledge captured via transformer-based models and explicit knowledge encoded in symbolic knowledge bases.
Motivation and Approach
The motivation for KRISP stems from the limitations of existing VQA systems, which often rely heavily on image and question parsing without adequately integrating external knowledge. Relying solely on image-question-answer triplets leads to poor scalability and to learning dataset biases. KRISP addresses these challenges by combining two forms of knowledge representation:
- Implicit Knowledge: Leveraging pre-trained language models such as BERT, KRISP infuses implicit reasoning capabilities into the model, enabling it to learn from large-scale, unsupervised language data and to capture a wide array of linguistic nuances and contextual meanings.
- Symbolic Knowledge: KRISP incorporates symbolic knowledge through knowledge graphs, preserving explicit semantic information. By directly connecting graph symbols to the answer vocabulary, the symbolic module improves generalization to rare answers that often require specific factual knowledge seldom encountered during training.
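To make the symbolic side concrete, the sketch below shows one plausible way to ground a knowledge base against a question and image: keep only the triples whose endpoints match concepts detected in either input. The function name and data shapes are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: extract the subgraph of KB triples whose head and tail
# both appear among concepts detected in the image or question. This mirrors
# the idea of grounding symbolic knowledge in the inputs; it is not KRISP's
# exact implementation.
def relevant_subgraph(triples, detected_concepts):
    """triples: list of (head, relation, tail) strings.
    detected_concepts: concepts found in the image or question.
    Returns the triples fully grounded in the detected concepts."""
    concepts = set(detected_concepts)
    return [(h, r, t) for h, r, t in triples
            if h in concepts and t in concepts]
```

For example, with triples from a ConceptNet-style source, detecting "banana" and "fruit" would retain `("banana", "IsA", "fruit")` while discarding unrelated facts.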
The integration of these two forms of knowledge is achieved through a novel architecture that fuses a multimodal, BERT-pretrained transformer with a graph network operating over symbolic knowledge bases.
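The fusion described above can be sketched in miniature: propagate features over the symbolic graph, score each graph node against the implicit question-image embedding, and combine those scores with the implicit classifier's scores only at the final prediction step. Everything here, including the averaging message-passing rule and the max-based fusion, is a simplified assumption for illustration rather than KRISP's exact architecture.

```python
# Minimal sketch of KRISP-style late fusion (names, shapes, and the update
# rule are illustrative assumptions, not the paper's implementation).
def propagate(node_feats, edges, steps=2):
    """Average-neighbor message passing over a symbolic graph.
    node_feats: {node: [float]}; edges: list of (src, dst) pairs."""
    for _ in range(steps):
        updated = {}
        for node, feat in node_feats.items():
            neighbors = [node_feats[s] for s, d in edges if d == node]
            if not neighbors:
                updated[node] = feat
                continue
            agg = [sum(vals) / len(neighbors) for vals in zip(*neighbors)]
            # Mix each node's own features with its aggregated neighbors.
            updated[node] = [(a + b) / 2 for a, b in zip(feat, agg)]
        node_feats = updated
    return node_feats

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def krisp_style_scores(question_emb, node_feats, edges, implicit_scores):
    """Score answers from both modules; fuse by taking the max per answer."""
    node_feats = propagate(node_feats, edges)
    symbolic = {n: dot(question_emb, f) for n, f in node_feats.items()}
    # Late fusion: symbolic scores stay separate until the prediction stage,
    # so graph nodes can surface answers the implicit head scores poorly.
    fused = dict(implicit_scores)
    for ans, score in symbolic.items():
        fused[ans] = max(fused.get(ans, float("-inf")), score)
    return fused
```

The key design point this sketch illustrates is that the symbolic path reaches the answer vocabulary directly, so a rare answer present in the graph can win even if the implicit classifier has barely seen it.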
Experimental Validation
KRISP demonstrated its efficacy on the Outside Knowledge VQA (OK-VQA) dataset, which is designed to challenge models with questions that depend on external knowledge. The model set a new state of the art, with significant performance improvements over previous methods. Extensive ablation studies confirmed the critical role of the symbolic answer module and its ability to exploit diverse knowledge sources such as DBpedia, ConceptNet, and Visual Genome.
In addition, the paper highlighted the limitations of relying solely on implicit knowledge, emphasizing the need for a hybrid approach that harnesses both implicit and explicit knowledge. Moreover, the authors' choice to retain symbolic knowledge in discrete form until the prediction stage proved crucial to the observed improvements.
Implications and Future Directions
The integration of implicit and symbolic reasoning as demonstrated by KRISP offers a promising direction for future VQA research. By effectively capturing a broader spectrum of knowledge and reasoning capabilities, KRISP lays the groundwork for more robust and adaptable VQA systems. The findings also suggest avenues for further exploration, such as enhancing the knowledge graph with more diverse datasets or refining the integration between symbolic and implicit representations.
Looking ahead, advancements in this domain could lead to VQA systems capable of tackling real-world tasks that demand intricate knowledge navigation and reasoning, such as medical diagnostics or historical data analysis. As AI continues to evolve, the interplay between different types of knowledge representations will be critical in developing systems that approach human-like understanding.
In conclusion, this paper makes a substantial contribution to the field of VQA by establishing a framework that not only improves performance on knowledge-based questions but also enriches the theoretical understanding of multimodal reasoning systems.