Analyzing the Behavior of Visual Question Answering Models
This paper offers a comprehensive behavioral analysis of Visual Question Answering (VQA) models, dissecting both their capabilities and their shortcomings. The motivation lies in the observation that, despite an abundance of deep-learning models proposed for VQA, with accuracies typically in the 60-70% range, a substantial gap remains relative to human accuracy (83% for open-ended questions and 91% for multiple-choice questions). With the top-9 entries in the VQA Challenge 2016 separated by a mere 5% margin, a careful examination of VQA models is needed to understand how they operate and where they can be improved.
This paper undertakes behavioral analysis of two classes of VQA models: those with attention mechanisms and those without. Specifically, it investigates a CNN+LSTM based model without attention, a hierarchical co-attention model (ATT), and the Multimodal Compact Bilinear pooling (MCB) model, the winner of the VQA Challenge 2016. The researchers employ several techniques to study how these models handle the VQA task, examining their ability to generalize to novel instances, the depth of their question understanding, and the degree to which they comprehend image content.
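To make the non-attention baseline concrete, the sketch below shows a minimal CNN+LSTM-style VQA classifier: pre-extracted CNN image features and the final LSTM state of the question are fused pointwise and fed to an answer classifier. This is an illustrative PyTorch sketch with assumed dimensions and names (e.g. `VqaCnnLstm`, `num_answers`), not the exact architecture analyzed in the paper.

```python
import torch
import torch.nn as nn

class VqaCnnLstm(nn.Module):
    """Illustrative non-attention VQA model: CNN image features + LSTM question encoding.

    Dimensions and module names are hypothetical; the models analyzed in the
    paper differ in their details.
    """
    def __init__(self, vocab_size, num_answers, img_feat_dim=4096,
                 embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)   # project CNN features
        self.classifier = nn.Linear(hidden_dim, num_answers)  # answers treated as classes

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_feat_dim) pre-extracted CNN features
        # question_tokens: (B, T) integer word indices
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_enc = h_n[-1]                           # final hidden state encodes the question
        v_enc = torch.tanh(self.img_proj(img_feats))
        fused = q_enc * v_enc                     # pointwise multiplicative fusion
        return self.classifier(fused)             # logits over candidate answers
```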
The analysis along the dimension of generalization reveals a significant correlation between a model's performance and how similar test instances are to those encountered during training. For the CNN+LSTM and ATT models, the strong negative correlations (-0.41 and -0.42 respectively) indicate a tangible decline in accuracy on increasingly novel instances; the MCB model, with a weaker correlation (-0.14), appears more adaptable. Moreover, many errors stem from the models reproducing answers seen during training rather than synthesizing novel responses: across the models, 67-75% of failure cases can be traced back to the answers of nearest-neighbor training instances.
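A minimal sketch of this kind of novelty analysis, assuming question embeddings for training and test instances are already available (the function and variable names here are hypothetical, not from the paper's code):

```python
import numpy as np

def novelty_correlation(train_embs, test_embs, test_correct):
    """Correlate test-instance novelty with per-instance accuracy.

    train_embs:   (N_train, D) question embeddings for training instances (assumed given)
    test_embs:    (N_test, D)  question embeddings for test instances
    test_correct: (N_test,)    1.0 if the model answered the instance correctly, else 0.0
    """
    # Novelty of a test question = Euclidean distance to its nearest training question.
    dists = np.linalg.norm(test_embs[:, None, :] - train_embs[None, :, :], axis=-1)
    nearest_dist = dists.min(axis=1)

    # Pearson correlation between novelty and correctness; a negative value
    # (e.g. around -0.4, as reported for the non-attention and ATT models)
    # means accuracy drops as test instances become more novel.
    return np.corrcoef(nearest_dist, test_correct)[0, 1]
```

For large datasets the full pairwise distance matrix would be replaced by an approximate nearest-neighbor search, but the measured quantity is the same.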
In assessing complete question understanding, the paper indicates that models often bypass substantive processing of the full question. In particular, the CNN+LSTM model frequently converges on its final prediction after seeing only the first half of the question, implying it places undue weight on the initial query words. Intriguingly, the models display varying sensitivities to different parts of speech, with wh-words exerting the strongest influence on their decisions.
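The sketch below illustrates one way to run such a partial-question probe: feed growing prefixes of the question and record the fraction of the question after which the prediction first matches the full-question answer. It assumes a hypothetical `model.predict(image, words)` interface returning an answer string; the paper's exact evaluation protocol may differ.

```python
def prefix_convergence(model, image, question_words):
    """Fraction of the question after which the model's prediction first
    matches its full-question answer (1.0 if it never matches early).

    `model.predict(image, words)` is a hypothetical interface that returns
    the predicted answer string for a (possibly truncated) list of words.
    """
    full_answer = model.predict(image, question_words)
    n = len(question_words)
    for k in range(1, n + 1):
        if model.predict(image, question_words[:k]) == full_answer:
            return k / n  # e.g. 0.5 means the answer was fixed after half the question
    return 1.0
```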
The third analysis, on image understanding, reveals a tendency among the models, especially those without attention, to produce the same answer to a given question regardless of which image is shown. The ATT and MCB models are less rigid in this respect, which may be attributed to their attention mechanisms, designed to exploit the spatial layout of the image.
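A simple way to quantify this "image blindness" is to hold the question fixed, vary the image, and measure how often the most frequent answer recurs. The sketch below uses the same hypothetical `model.predict` interface as above; a score near 1.0 suggests the model largely ignores the visual input.

```python
from collections import Counter

def answer_consistency(model, question_words, images):
    """For a fixed question, return the fraction of images that receive the
    model's most common answer across the whole image set.
    """
    answers = [model.predict(img, question_words) for img in images]
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)
```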
In conclusion, this rigorous behavioral scrutiny of VQA models uncovers several intrinsic deficiencies: the models are "myopic" in the face of novel instances, often commit to an answer without processing the full question, and leverage visual input inconsistently. Such findings call for innovative architectural strategies or alternative datasets to reshape future VQA research agendas. This critique not only benchmarks current limitations but also points toward strategies that enhance generalization, deepen question engagement, and strengthen image grounding. Moving forward, this form of analysis is indispensable for ensuring that models evolve toward a more nuanced understanding of, and stronger performance on, the VQA task.