Multi-Modal Answer Validation for Knowledge-Based Visual Question Answering
The paper "Multi-Modal Answer Validation for Knowledge-Based VQA" by Jialin Wu et al. addresses the challenge of visual question answering (VQA) when external knowledge is required beyond the visible content of an image. The authors introduce a framework called Multi-modal Answer Validation using External knowledge (MAVEx), which seeks to refine answer prediction by integrating multi-modal external knowledge, specifically from textual and visual sources.
Overview
Knowledge-based VQA is a setting in which answering accurately requires information that is not explicit in the image itself but is available through external resources such as textual articles or related images. Traditional models often struggle with the noise and irrelevance in the large volume of data they retrieve to boost VQA performance. MAVEx innovates by not merely retrieving external knowledge based on the question-image pair, but by validating a small set of promising answer candidates against external knowledge in a structured manner.
MAVEx comprises three stages (a simplified pipeline sketch follows the list):
- Answer Candidate Generation: Using a strong baseline VQA model like ViLBERT, MAVEx generates a shortlist of plausible answer candidates for the given visual question.
- Answer-Guided Knowledge Retrieval: The framework then performs answer-specific knowledge retrieval from three sources: Wikipedia for factual textual knowledge, ConceptNet for commonsense knowledge, and Google Images for visual evidence. Conditioning retrieval on each candidate answer reduces noise by directing the search toward only the relevant knowledge fragments.
- Answer Candidate Validation: In this stage, each candidate answer is validated against the retrieved multi-modal evidence: the model scores how strongly the knowledge retrieved for a candidate actually supports that candidate, and the best-supported answer is selected.
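To make the pipeline concrete, here is a minimal Python sketch of the three stages. The function names and the `vqa_model` / `kb` / `validator` interfaces are hypothetical stand-ins for illustration; they do not reproduce the authors' released code.

```python
# Minimal sketch of the three-stage MAVEx pipeline. All interfaces here are
# illustrative assumptions, not the authors' API.

from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Answer-specific knowledge retrieved for one candidate."""
    wikipedia: list[str] = field(default_factory=list)   # factual sentences
    conceptnet: list[str] = field(default_factory=list)  # commonsense facts
    images: list[str] = field(default_factory=list)      # retrieved image refs


def generate_candidates(image, question, vqa_model, top_k=5):
    """Stage 1: a baseline VQA model (e.g., ViLBERT) shortlists answers."""
    scores = vqa_model.predict(image, question)           # dict: answer -> score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def retrieve_evidence(question, candidate, kb):
    """Stage 2: answer-guided retrieval from the three external sources."""
    query = f"{question} {candidate}"
    return Evidence(
        wikipedia=kb.search_wikipedia(query),
        conceptnet=kb.search_conceptnet(query),
        images=kb.search_google_images(query),
    )


def answer(image, question, vqa_model, kb, validator):
    """Stage 3: validate each candidate against its own evidence and keep
    the candidate whose retrieved knowledge supports it most strongly."""
    candidates = generate_candidates(image, question, vqa_model)
    scored = {
        cand: validator.support_score(
            image, question, cand, retrieve_evidence(question, cand, kb))
        for cand in candidates
    }
    return max(scored, key=scored.get)
```

The key design point is that retrieval happens once per candidate rather than once per question, so each candidate is judged only by evidence gathered specifically for it.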
Methodological Contributions
- Information Sourcing: MAVEx is distinctive in integrating textual knowledge from Wikipedia and ConceptNet with visual evidence from Google Images search, extending the utility of retrieved knowledge beyond text alone.
- Granular Knowledge Representation: The approach embeds retrieved knowledge at different granularities: query-level, noun-phrase level, and question-level (sketched after this list). This layered representation emphasizes the components of retrieved knowledge that relate most strongly to the question.
- Validation through Consistency Checking: The validation strategy cross-checks each candidate's support across all knowledge sources, preferring answers that are consistently supported by multiple sources and thereby improving the reliability of the final selection (see the second sketch below).
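As a rough illustration of the layered embedding, the sketch below pools encoded knowledge snippets upward through the three granularities, with learned attention at each layer. The grouping of snippets by search query and noun phrase, the module names, and the tensor shapes are assumptions made for illustration, not the paper's exact architecture.

```python
# Illustrative multi-granularity pooling of knowledge embeddings.
# Shapes and module names are assumptions, not the authors' code.

import torch
import torch.nn as nn


class HierarchicalPooler(nn.Module):
    """Pools snippet embeddings up through query-level, noun-phrase-level,
    and question-level representations with learned attention per layer."""

    def __init__(self, dim=768):
        super().__init__()
        self.query_scorer = nn.Linear(dim, 1)     # snippets -> query vector
        self.phrase_scorer = nn.Linear(dim, 1)    # queries  -> phrase vector
        self.question_scorer = nn.Linear(dim, 1)  # phrases  -> question vector

    @staticmethod
    def _pool(vecs, scorer):
        # vecs: (n, dim); softmax attention emphasizes the most relevant items
        weights = torch.softmax(scorer(vecs), dim=0)  # (n, 1)
        return (weights * vecs).sum(dim=0)            # (dim,)

    def forward(self, snippets_by_phrase):
        """snippets_by_phrase: list (per noun phrase) of lists (per query)
        of (n_snippets, dim) tensors of encoded knowledge snippets."""
        phrase_vecs = []
        for queries in snippets_by_phrase:
            query_vecs = torch.stack(
                [self._pool(s, self.query_scorer) for s in queries])
            phrase_vecs.append(self._pool(query_vecs, self.phrase_scorer))
        return self._pool(torch.stack(phrase_vecs), self.question_scorer)


# Usage with dummy data: two noun phrases, three queries, dim 16.
pooler = HierarchicalPooler(dim=16)
snippets = [[torch.randn(3, 16), torch.randn(2, 16)], [torch.randn(4, 16)]]
question_level = pooler(snippets)  # (16,) summary of all retrieved knowledge
```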
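The consistency check can be pictured as gating each candidate's score by its weakest source, so that an answer must be supported across Wikipedia, ConceptNet, and Google Images to win. The aggregation rule below is an illustrative assumption; in the paper the decision is made by a learned validation model.

```python
# Sketch of cross-source consistency checking during validation. `support`
# stands in for a learned scoring function (e.g., a ViLBERT-based head) that
# rates how well one source's evidence backs a candidate; the min-gated
# aggregation is an illustrative assumption, not the paper's exact rule.

def validate_with_consistency(candidates, evidence_by_source, support):
    """Pick the candidate whose support is strong AND consistent across
    the textual and visual knowledge sources."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        per_source = [
            support(cand, evidence_by_source[src][cand])
            for src in ("wikipedia", "conceptnet", "google_images")
        ]
        # Gate the average support by the weakest source so a single
        # confident-but-wrong source cannot dominate the decision.
        score = min(per_source) + sum(per_source) / len(per_source)
        if score > best_score:
            best, best_score = cand, score
    return best
```

Gating by the minimum penalizes answers that only one noisy source scores highly, which mirrors the paper's goal of preferring answers consistently supported by multiple sources.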
Experimental Evaluation
When tested on OK-VQA, MAVEx achieved state-of-the-art performance, surpassing previous models by a notable margin. Including all three knowledge sources yielded roughly a 5% absolute gain in prediction accuracy over the baseline, and an ensemble of models improved performance further, underscoring MAVEx's ability to disambiguate and validate answer candidates with high precision.
Implications and Future Directions
MAVEx's methodology has strong implications for improving VQA systems. By structurally addressing the noise in external knowledge retrieval and focusing on multi-source answer validation, the framework tackles one of the most pressing challenges for AI applications that require knowledge beyond what is directly observable in the input.
Future research could extend this work by exploring additional knowledge sources or incorporating more advanced neural architectures for efficient retrieval and validation. Another promising direction is using generative LLMs to guide the knowledge retrieval process or to supply contextual knowledge directly within a multi-modal VQA setting.
Conclusion
The research presents a significant contribution to knowledge-based VQA through the MAVEx framework. By refining the retrieval and validation processes with multi-modal inputs, it improves the ability of AI systems to produce accurate answers informed by pertinent external knowledge. The results on the challenging OK-VQA dataset attest to the efficacy of this multi-stage approach.