Analysis of Human Attention in Visual Question Answering
The research presented in the paper "Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?" investigates how closely the attention mechanisms in deep learning models for Visual Question Answering (VQA) correlate with human attention. The paper introduces the VQA-HAT dataset, which contains human attention maps collected through game-like annotation interfaces in which subjects sharpen (deblur) specific regions of blurred images in order to answer questions about them. The primary objective is to assess whether the attention maps produced by current VQA models align with the image regions humans select when answering the same questions.
The researchers collected "human attention maps" at scale to serve as a benchmark for evaluating model-generated attention. Attention is central to VQA because most questions target specific image regions rather than the entire scene, so effective VQA models should benefit from concentrating on the parts of the image relevant to the question.
Methodology and Data Collection
The dataset collection involved designing three variants of an attention-annotation interface, each offering a different way for human subjects to interact with blurred images while answering questions. Among these, the "Blurred Image with Answer" variant yielded the most reliable annotations: when the sharpened regions alone were shown to a separate set of users for validation, those users answered with 78.7% accuracy, indicating that the deblurred regions captured the information needed to answer the question.
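To make the annotation mechanic concrete, here is a minimal sketch of how such an interface could composite sharp patches back into a blurred image while accumulating an attention mask. It is illustrative only: the file name, blur radius, brush radius, and click coordinates are hypothetical, and the actual VQA-HAT interface may differ in its details.

```python
from PIL import Image, ImageFilter
import numpy as np

def deblur_region(original, blurred, mask, cx, cy, radius=40):
    """Reveal a sharp circular patch around (cx, cy), the way an
    annotator's 'brush stroke' would, and record it in the mask."""
    w, h = original.size
    ys, xs = np.ogrid[:h, :w]
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    mask[inside] = 1.0
    # Composite sharp pixels where the mask is set, blurred pixels elsewhere.
    alpha = Image.fromarray((mask * 255).astype(np.uint8), mode="L")
    view = Image.composite(original, blurred, alpha)
    return view, mask

# Hypothetical inputs: any COCO-style image and a question about it.
original = Image.open("coco_image.jpg").convert("RGB")
blurred = original.filter(ImageFilter.GaussianBlur(radius=12))
mask = np.zeros((original.height, original.width), dtype=np.float32)

# Simulated annotator clicks while answering "What color is the dog?"
for cx, cy in [(220, 310), (250, 330)]:
    view, mask = deblur_region(original, blurred, mask, cx, cy)

# `mask` is now the human attention map recorded for this question.
```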
Comparison with Model-Generated Attention
Rank-correlation metrics were used to compare the human attention maps against those generated by state-of-the-art attention-based VQA models, including the Stacked Attention Network (SAN) and the Hierarchical Co-Attention Network (HieCoAtt). The results show that while these models produce attention maps that are positively correlated with human attention, the correlation is weak; task-independent saliency maps actually correlate more strongly with human attention than the VQA models do.
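The comparison could be computed along the lines of the sketch below, which assumes both attention maps are downsampled to a common grid (14x14 is an assumed value) and scored with Spearman rank correlation, then averaged over question-image pairs; the paper's exact resampling protocol may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def block_mean(attn, grid):
    """Downsample a 2-D attention map to grid x grid by block averaging
    (a simple stand-in for whatever resampling the evaluation uses)."""
    h, w = attn.shape
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    return np.array([[attn[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                      for j in range(grid)] for i in range(grid)])

def mean_rank_correlation(human_maps, model_maps, grid=14):
    """Average Spearman rank correlation between paired human and
    model attention maps across question-image pairs."""
    scores = []
    for h_map, m_map in zip(human_maps, model_maps):
        h_vec = block_mean(h_map, grid).ravel()
        m_vec = block_mean(m_map, grid).ravel()
        rho, _ = spearmanr(h_vec, m_vec)
        scores.append(rho)
    return float(np.mean(scores))

# Toy usage with random maps standing in for real attention data.
rng = np.random.default_rng(0)
human = [rng.random((448, 448)) for _ in range(5)]
model = [rng.random((14, 14)) for _ in range(5)]
print(mean_rank_correlation(human, model))
```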
Implications and Future Directions
The findings indicate that current VQA attention models do not closely replicate human attention, as evidenced by the low correlation scores. The paper also notes an encouraging trend: models that achieve higher VQA accuracy tend to produce attention maps that correlate better with human attention. This relationship offers a useful feedback signal for developing attention mechanisms more closely aligned with human visual processing.
The research releases the VQA-HAT dataset as a public resource to support further comparison and evaluation of unsupervised attention mechanisms, creating potential for improved VQA models. From a theoretical angle, the paper stimulates discussion on what constitutes necessary and sufficient attention in relation to human cognition, a question central to bridging the gap between artificial and human-like visual understanding.
The significance of this research lies in opening potential pathways for integrating human attention models into AI systems, which can enhance their interpretative and interactive capabilities, especially in complex visual environments. Future explorations could seek to refine attention models with a focus on semantically grounded features that are closely aligned with human perceptual streams, advancing the state of AI in understanding context and content within visual scenes.