Analyzing the Behavior of Visual Question Answering Models
This paper offers a comprehensive behavioral analysis of Visual Question Answering (VQA) models, dissecting both their capabilities and their shortcomings. The motivation lies in the observation that, despite an abundance of deep-learning models proposed for VQA, with accuracies typically in the 60-70% range, a substantial gap remains relative to human accuracy (83% for open-ended questions and 91% for multiple-choice questions). With the top-9 entries in the VQA Challenge 2016 separated by a mere 5% margin, a careful examination of VQA models is needed to understand how they operate and where they can be improved.
This paper undertakes behavioral analysis of two classes of VQA models: those with attention mechanisms and those without. Specifically, it investigates a CNN+LSTM based model without attention, a hierarchical co-attention model (ATT), and the Multimodal Compact Bilinear pooling (MCB) model, the winner of the VQA Challenge 2016. The researchers employ several techniques to study how these models handle the VQA task, examining their ability to generalize to novel instances, the depth of their question understanding, and the degree to which they comprehend image content.
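To make the non-attention baseline concrete, the sketch below shows a minimal CNN+LSTM-style VQA classifier: pre-extracted CNN image features and the final LSTM state of the question are fused pointwise and fed to an answer classifier. This is an illustrative PyTorch sketch with assumed dimensions and names (e.g. `VqaCnnLstm`, `num_answers`), not the exact architecture analyzed in the paper.

```python
import torch
import torch.nn as nn

class VqaCnnLstm(nn.Module):
    """Illustrative non-attention VQA model: CNN image features + LSTM question encoding.

    Dimensions and module names are hypothetical; the models analyzed in the
    paper differ in their details.
    """
    def __init__(self, vocab_size, num_answers, img_feat_dim=4096,
                 embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)   # project CNN features
        self.classifier = nn.Linear(hidden_dim, num_answers)  # answers treated as classes

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_feat_dim) pre-extracted CNN features
        # question_tokens: (B, T) integer word indices
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_enc = h_n[-1]                           # final hidden state encodes the question
        v_enc = torch.tanh(self.img_proj(img_feats))
        fused = q_enc * v_enc                     # pointwise multiplicative fusion
        return self.classifier(fused)             # logits over candidate answers
```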
The analysis along the dimension of generalization reveals a significant correlation between a model's performance and how similar test instances are to those encountered during training. For the CNN+LSTM and ATT models, the strong negative correlations (-0.41 and -0.42 respectively) indicate a tangible decline in accuracy on increasingly novel instances; the MCB model, with a weaker correlation (-0.14), appears more adaptable. Moreover, many errors stem from the models reproducing answers seen during training rather than synthesizing novel responses: across the models, 67-75% of failure cases can be traced back to the answers of nearest-neighbor training instances.
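A minimal sketch of this kind of novelty analysis, assuming question embeddings for training and test instances are already available (the function and variable names here are hypothetical, not from the paper's code):

```python
import numpy as np

def novelty_correlation(train_embs, test_embs, test_correct):
    """Correlate test-instance novelty with per-instance accuracy.

    train_embs:   (N_train, D) question embeddings for training instances (assumed given)
    test_embs:    (N_test, D)  question embeddings for test instances
    test_correct: (N_test,)    1.0 if the model answered the instance correctly, else 0.0
    """
    # Novelty of a test question = Euclidean distance to its nearest training question.
    dists = np.linalg.norm(test_embs[:, None, :] - train_embs[None, :, :], axis=-1)
    nearest_dist = dists.min(axis=1)

    # Pearson correlation between novelty and correctness; a negative value
    # (e.g. around -0.4, as reported for the non-attention and ATT models)
    # means accuracy drops as test instances become more novel.
    return np.corrcoef(nearest_dist, test_correct)[0, 1]
```

For large datasets the full pairwise distance matrix would be replaced by an approximate nearest-neighbor search, but the measured quantity is the same.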
In assessing complete question understanding, the paper indicates that models often bypass substantive processing of the full question. In particular, the CNN+LSTM model frequently converges on its final prediction after seeing only the first half of the question, implying it places undue weight on the initial query words. Intriguingly, the models display varying sensitivities to different parts of speech, with wh-words exerting the strongest influence on their decisions.
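The sketch below illustrates one way to run such a partial-question probe: feed growing prefixes of the question and record the fraction of the question after which the prediction first matches the full-question answer. It assumes a hypothetical `model.predict(image, words)` interface returning an answer string; the paper's exact evaluation protocol may differ.

```python
def prefix_convergence(model, image, question_words):
    """Fraction of the question after which the model's prediction first
    matches its full-question answer (1.0 if it never matches early).

    `model.predict(image, words)` is a hypothetical interface that returns
    the predicted answer string for a (possibly truncated) list of words.
    """
    full_answer = model.predict(image, question_words)
    n = len(question_words)
    for k in range(1, n + 1):
        if model.predict(image, question_words[:k]) == full_answer:
            return k / n  # e.g. 0.5 means the answer was fixed after half the question
    return 1.0
```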
The third analysis, on image understanding, reveals a tendency among the models, especially those without attention, to produce the same answer to a given question regardless of which image is shown. The ATT and MCB models are less rigid in this respect, which may be attributed to their attention mechanisms, designed to exploit the spatial layout of the image.
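A simple way to quantify this "image blindness" is to hold the question fixed, vary the image, and measure how often the most frequent answer recurs. The sketch below uses the same hypothetical `model.predict` interface as above; a score near 1.0 suggests the model largely ignores the visual input.

```python
from collections import Counter

def answer_consistency(model, question_words, images):
    """For a fixed question, return the fraction of images that receive the
    model's most common answer across the whole image set.
    """
    answers = [model.predict(img, question_words) for img in images]
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)
```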
In conclusion, this rigorous behavioral scrutiny of VQA models uncovers several intrinsic deficiencies: the models are "myopic" in the face of novel instances, often commit to an answer without processing the full question, and leverage visual input inconsistently. Such findings call for innovative architectural strategies or alternative datasets to reshape future VQA research agendas. This critique not only benchmarks current limitations but also points toward strategies that enhance generalization, deepen question engagement, and strengthen image grounding. Moving forward, this form of analysis is indispensable for ensuring that models evolve toward a more nuanced understanding of, and stronger performance on, the VQA task.