Introspecting VQA Models Through Sub-Questions: An Analytical Overview
The paper "SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions" introduces a methodological advancement in the evaluation and enhancement of Visual Question Answering (VQA) systems. The authors, Selvaraju et al., emphasize the complexity within VQA tasks by distinguishing between perception and reasoning questions; a vital bifurcation for understanding and improving model capabilities.
The primary concern addressed by the paper is the consistency of VQA model responses between complex reasoning questions and their simpler perceptual counterparts. Existing datasets often mix the two without distinction and thus fail to evaluate a model's reasoning abilities comprehensively. The authors propose a "Reasoning split," focusing on questions that require integrating perception with logical reasoning or external knowledge. To support this, they introduce VQA-introspect, a dataset containing 238,000 perception sub-questions aligned with reasoning questions in the VQA Reasoning split.
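To make this pairing concrete, the sketch below shows one way a reasoning question and its aligned perception sub-questions might be represented; the field names and values are hypothetical and do not reflect the dataset's actual schema.

```python
# Illustrative only: field names are hypothetical, not the official VQA-introspect schema.
example = {
    "image_id": 262148,  # hypothetical COCO-style image id
    "reasoning_question": "Is this place open late?",
    "reasoning_answer": "yes",
    "sub_questions": [  # perception sub-questions aligned to the reasoning question
        {"question": "Are the lights on?", "answer": "yes"},
        {"question": "Are there people inside?", "answer": "yes"},
    ],
}

# A model that answers the reasoning question correctly should also answer
# each perception sub-question correctly; mismatches signal inconsistency.
for sub in example["sub_questions"]:
    print(sub["question"], "->", sub["answer"])
```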
Methodology and Evaluation
The authors quantify model inconsistency by analyzing cases where a model answers a reasoning question correctly but errs on the linked perception sub-questions. Such discrepancies suggest that models may rely on faulty logic or dataset biases, rather than genuine image understanding, to achieve high reasoning accuracy. As a remedy, the paper proposes Sub-Question Importance-aware Network Tuning (SQuINT), a fine-tuning approach that encourages the model to attend to the same image regions when answering a reasoning question and its associated perception sub-questions.
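A minimal sketch of this consistency check is given below; the record format and function name are assumptions for illustration, not the authors' evaluation code.

```python
def inconsistency_rate(records):
    """Fraction of correctly answered reasoning questions for which the model
    gets at least one linked perception sub-question wrong.

    Each record is assumed to look like:
        {"reasoning_correct": bool, "sub_questions_correct": [bool, ...]}
    """
    reasoning_correct = [r for r in records if r["reasoning_correct"]]
    if not reasoning_correct:
        return 0.0
    inconsistent = sum(
        1 for r in reasoning_correct if not all(r["sub_questions_correct"])
    )
    return inconsistent / len(reasoning_correct)


# Toy example: 2 of the 3 correctly answered reasoning questions
# have at least one failed perception sub-question.
records = [
    {"reasoning_correct": True, "sub_questions_correct": [True, True]},
    {"reasoning_correct": True, "sub_questions_correct": [True, False]},
    {"reasoning_correct": True, "sub_questions_correct": [False]},
    {"reasoning_correct": False, "sub_questions_correct": [True]},
]
print(inconsistency_rate(records))  # 0.666...
```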
Empirical results show that current state-of-the-art VQA models, despite achieving comparable accuracy on perception and reasoning questions, suffer from significant consistency problems. For example, in 28.14% of cases where the reasoning answer was correct, a corresponding perception sub-question was answered incorrectly. SQuINT yields roughly a 5% improvement in model consistency, slightly boosting performance on reasoning questions and substantially improving attention alignment.
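To illustrate the spirit of SQuINT-style tuning, the following sketch adds an attention-alignment term to the usual answer losses, penalizing divergence between the image-region attention produced for a reasoning question and for its sub-question. This is a simplified PyTorch illustration under assumed tensor shapes, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def squint_style_loss(reasoning_logits, reasoning_labels,
                      sub_logits, sub_labels,
                      reasoning_attn, sub_attn,
                      alignment_weight=1.0):
    """Simplified illustration of an attention-alignment objective.

    reasoning_attn, sub_attn: attention weights over image regions,
        shape (batch, num_regions), each row summing to 1.
    The alignment term pushes the model to look at the same regions
    when answering a reasoning question and its perception sub-question.
    """
    answer_loss = (
        F.cross_entropy(reasoning_logits, reasoning_labels)
        + F.cross_entropy(sub_logits, sub_labels)
    )
    # Mean squared difference between the two attention distributions.
    alignment_loss = F.mse_loss(reasoning_attn, sub_attn)
    return answer_loss + alignment_weight * alignment_loss
```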
Practical and Theoretical Implications
Practically, this work serves as a template for future VQA datasets and models to account for perception-reasoning discrepancies more thoroughly, potentially guiding improvements in applications such as autonomous driving, robotics, and assistive technologies. Theoretically, the paper challenges researchers to build models that are accountable not just for their outputs but for the reasoning paths they traverse, ensuring more robust generalization.
Future Directions
This line of research could be extended by exploring finer subdivisions within reasoning, such as separating logical inference from the use of world knowledge. Advances in the transparency and accountability of model decisions may also emerge from integrating these introspection techniques into AI systems more broadly. Leveraging VQA-introspect and SQuINT could further lead to adaptive systems better suited to context-rich environments, paving the way for more nuanced AI in interactive and dynamic domains.
By insisting that perceptual and reasoning tasks not be conflated, and by demonstrating a method for improving reasoning fidelity, the paper marks an incremental yet significant step toward more capable and trustworthy AI systems.