- The paper presents a graph-based model that conditions object relationships on the input question, achieving competitive VQA performance.
- It utilizes a graph learner and GCNs to dynamically build adjacency matrices with attention, capturing semantic and spatial interactions.
- Visualizations of the learned graphs offer interpretability, making the decision process in complex visual question answering more transparent.
Learning Conditioned Graph Structures for Interpretable Visual Question Answering
The paper "Learning Conditioned Graph Structures for Interpretable Visual Question Answering" presents a graph-structured approach to Visual Question Answering (VQA). VQA sits at the intersection of Computer Vision and Natural Language Processing: a model must understand both an image and a natural-language question about it, then merge the two sources of information into a coherent answer.
Methodology
The authors introduce a graph-based framework that departs from the conventional two-stream strategies common in VQA, which typically merge image and question information through feature embedding and fusion. The proposed model uses a graph learner module to build a question-specific graph representation of the image. Paired with graph convolutional networks (GCNs), this module lets the model capture the semantic and spatial interactions that are relevant to the question at hand.
The graph learner constructs a dynamic, conditioned graph in which detected image objects are the nodes and edges are inferred from their relevance to the input question. This relevance is captured through attention mechanisms, so that both meaningful nodes and significant object interactions are represented in the learned graph. The adjacency matrix, which defines the graph's topology, is learned by the model and then used in graph convolutions to refine the object feature representations most pertinent to answering the posed question.
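The mechanism described above can be sketched roughly as follows. The array sizes, weight names (`W_joint`, `W_gcn`), and the row-softmax normalisation are illustrative assumptions rather than the paper's exact formulation (the paper, for instance, sparsifies the graph by keeping only the strongest edges per node):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def learn_adjacency(obj_feats, q_emb, W_joint):
    """Question-conditioned graph learner (sketch).

    Each object feature is concatenated with the question embedding,
    projected, and the pairwise inner products of the conditioned
    embeddings are row-softmaxed into a dense adjacency matrix.
    """
    n = obj_feats.shape[0]
    joint = np.concatenate([obj_feats, np.tile(q_emb, (n, 1))], axis=1)
    e = joint @ W_joint            # (n, d_e) question-conditioned embeddings
    scores = e @ e.T               # pairwise relevance logits
    return softmax(scores, axis=1)  # each row sums to 1

def gcn_layer(adj, feats, W):
    """One graph-convolution layer: aggregate neighbours, project, ReLU."""
    return np.maximum(adj @ feats @ W, 0.0)

# toy sizes: 5 detected objects, 8-d visual features, 6-d question embedding
obj = rng.normal(size=(5, 8))
q = rng.normal(size=6)
W_joint = rng.normal(size=(8 + 6, 4))
W_gcn = rng.normal(size=(8, 8))

A = learn_adjacency(obj, q, W_joint)
refined = gcn_layer(A, obj, W_gcn)
print(A.shape, refined.shape)  # (5, 5) (5, 8)
```

Because the question embedding enters before the pairwise scores are computed, a different question yields a different adjacency matrix over the same objects, which is what makes the learned structure question-conditioned.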
Experimental Results
On the VQA v2 dataset, the authors report an accuracy of 66.18%, showing that their graph-based approach performs competitively with state-of-the-art methods while also offering interpretability advantages. Performance across the Yes/No, Number, and Other question categories indicates balanced efficacy on diverse question types. Through visualizations of the learned graph structures, the method provides insight into its decision-making process, adding an interpretable layer that deep learning architectures often lack.
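For context, accuracies on VQA v2 such as the 66.18% above are computed with the standard VQA accuracy metric, under which a predicted answer counts as fully correct when at least three of the ten human annotators gave it. A simplified sketch (the official metric additionally averages over subsets of annotators; the variable names here are illustrative):

```python
from collections import Counter

def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#annotators giving this answer / 3, 1)."""
    counts = Counter(a.lower().strip() for a in human_answers)
    return min(counts[predicted.lower().strip()] / 3.0, 1.0)

# hypothetical set of ten annotator answers for one question
humans = ["yes"] * 8 + ["no"] * 2
print(vqa_accuracy("yes", humans))           # 1.0
print(round(vqa_accuracy("no", humans), 3))  # 0.667
```

The soft scoring explains why answers on which annotators disagree (common in the Number and Other categories) can earn partial credit rather than being scored strictly right or wrong.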
Contributions and Implications
The primary contribution of this work is a graph-based paradigm that conditions object and feature relationships on the question, in contrast with previous methods that largely rely on fixed, pre-defined structures or on object-centric approaches without dynamic contextual adaptation. The resulting interpretability and flexibility make this method a significant step forward for VQA tasks, where understanding the interactions within a scene is often critical.
The implications of this research extend to a broader set of AI applications where both semantic understanding and interpretability are desirable. The notion of learning a context-specific, dynamic graph structure could be extrapolated to other domains requiring relational reasoning and contextual feature evaluation, such as scene understanding and few-shot learning tasks.
Future research could scale this method with more advanced graph learning techniques, or integrate it with stronger object detection frameworks to mitigate the effect of object recognition errors. Adaptive architectures could also refine or augment the current graph convolutional mechanisms to enable more complex reasoning and reduce the model's dependence on the quality of detected objects.
In conclusion, by transcending traditional feature-fusion methods and focusing on learning interpretable and contextually relevant graph structures, this approach contributes a valuable perspective to the VQA domain and beyond, inviting further exploration into adaptive graph reasoning and its applications in AI.