- The paper introduces VQA-LOL, a new framework and model architecture that significantly improves Visual Question Answering systems' ability to handle logically composed questions.
- It presents two new datasets, VQA-Compose and VQA-Supplement, and the Lens of Logic (LOL) model with a novel Fréchet-Compatibility Loss to embed logical reasoning.
- Results show the LOL model increases accuracy on logical VQA questions from ~50% to over 80% while maintaining performance on standard VQA datasets, demonstrating enhanced robustness and reasoning.
Visual Question Answering under the Lens of Logic
The paper "VQA-LOL: Visual Question Answering under the Lens of Logic" by Tejas Gokhale et al. addresses the limitations of current Visual Question Answering (VQA) systems when confronted with logically composed questions. It identifies a gap in the ability of state-of-the-art VQA models to handle logical operations such as negation, conjunction, and disjunction within the context of question answering associated with images. This research provides a novel approach to enhance VQA systems by embedding logical reasoning capabilities, thus improving their robustness.
Methodology
The authors present two newly constructed datasets, VQA-Compose and VQA-Supplement, which augment the VQA dataset with logically composed questions and serve as benchmarks for evaluating the logical reasoning capabilities of VQA models. Building on the LXMERT baseline, the paper introduces the Lens of Logic (LOL) model architecture, which incorporates question-attention and logic-attention modules to categorize question types and to identify the logical connectives present in a question. A novel Fréchet-Compatibility Loss encourages the model's answer to a composed question to be consistent with its answers to the component questions.
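To make this compatibility idea concrete, here is a minimal PyTorch sketch (not the authors' code) of a Fréchet-inequality-based penalty: the predicted "yes" probability of a conjunctive or disjunctive question is pushed into the interval implied by the component questions' predicted probabilities. The function names and the hinge-style formulation are illustrative assumptions; the paper's exact loss definition may differ.

```python
import torch


def frechet_bounds(p_a: torch.Tensor, p_b: torch.Tensor, op: str):
    """Fréchet bounds on P(A op B) given only the marginals P(A) and P(B)."""
    if op == "and":
        # max(0, P(A) + P(B) - 1) <= P(A and B) <= min(P(A), P(B))
        lower = torch.clamp(p_a + p_b - 1.0, min=0.0)
        upper = torch.minimum(p_a, p_b)
    elif op == "or":
        # max(P(A), P(B)) <= P(A or B) <= min(1, P(A) + P(B))
        lower = torch.maximum(p_a, p_b)
        upper = torch.clamp(p_a + p_b, max=1.0)
    else:
        raise ValueError(f"unsupported connective: {op}")
    return lower, upper


def frechet_compatibility_penalty(p_composed, p_a, p_b, op="and"):
    """Hinge-style penalty applied when the predicted 'yes' probability of the
    composed question falls outside the Fréchet interval implied by the
    component predictions (an illustrative stand-in, not the paper's exact loss)."""
    lower, upper = frechet_bounds(p_a, p_b, op)
    below = torch.clamp(lower - p_composed, min=0.0)
    above = torch.clamp(p_composed - upper, min=0.0)
    return (below + above).mean()


# Example: the model is confident about both components (0.9 and 0.8 for "yes")
# but predicts only 0.3 for their conjunction; the Fréchet interval is [0.7, 0.8].
p1, p2 = torch.tensor([0.9]), torch.tensor([0.8])
p_and = torch.tensor([0.3])
print(frechet_compatibility_penalty(p_and, p1, p2, op="and"))  # tensor(0.4000)
```

The appeal of such a term is that it ties the composed-question prediction to the component-question predictions using only probability theory, without requiring any extra annotation beyond the component answers.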
Results and Analysis
The research demonstrates a significant improvement in answering logically composed questions while maintaining performance on the conventional VQA dataset. Specifically:
- Accuracy on logically composed questions rises from near chance (~50%) to over 80% with the LOL model, compared to traditional models.
- Performance on the standard VQA test set shows no substantial degradation, indicating that logical reasoning is integrated without compromising overall VQA capability.
The authors conduct several experiments to assess the robustness and generalization capability of the proposed model, such as:
- Compositional Generalization: Training on questions containing a single logical operation and testing on questions containing multiple operations (a sketch of how such questions are composed follows this list).
- Inductive Generalization: Testing on questions composed of more than two elements.
- Parser-based vs. End-to-End (E2E) models: Comparing a baseline that parses composed questions into their components and combines the component answers logically against the proposed end-to-end architecture, highlighting the shortcomings of the parser-based approach.
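These generalization tests hinge on how composed questions and their gold answers are built from component yes/no questions. The snippet below is a minimal sketch of that construction; the question texts, templates, and helper names are invented for illustration and are not taken from the released datasets.

```python
from dataclasses import dataclass


@dataclass
class ClosedQuestion:
    text: str     # a yes/no question about the image
    answer: bool  # its ground-truth answer


def negate(q: ClosedQuestion) -> ClosedQuestion:
    # Surface negation is simplified here; the real datasets use more natural phrasings.
    return ClosedQuestion(f"not ({q.text})", not q.answer)


def conjoin(a: ClosedQuestion, b: ClosedQuestion) -> ClosedQuestion:
    return ClosedQuestion(f"({a.text}) and ({b.text})", a.answer and b.answer)


def disjoin(a: ClosedQuestion, b: ClosedQuestion) -> ClosedQuestion:
    return ClosedQuestion(f"({a.text}) or ({b.text})", a.answer or b.answer)


# Single-operation question, as seen during training ...
q1 = ClosedQuestion("is the man wearing a hat", True)
q2 = ClosedQuestion("is it raining", False)
train_q = conjoin(q1, negate(q2))                 # gold answer: True

# ... versus a multi-operation, three-component question used only at test time.
q3 = ClosedQuestion("is there a dog in the picture", True)
test_q = disjoin(conjoin(q1, q2), negate(q3))     # gold answer: False
print(train_q)
print(test_q)
```

Because the gold answer of a composed question is fully determined by the Boolean connectives, composed questions of arbitrary depth can be generated automatically from annotated component questions, which is what makes these compositional and inductive generalization tests possible.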
Implications and Future Work
The paper suggests that incorporating logical reasoning into VQA models can enrich their interpretative capabilities, especially in scenarios involving complex natural language expressions. The approach paves the way for future work in enhancing conversational agents by developing models that understand and react to logical structures embedded in language.
Potential practical applications include improved automatic image annotation and enhanced user-interactive systems requiring visual comprehension. The paper hints at utilizing similar logic-infused strategies for object recognition and scene understanding at higher abstraction levels.
Continued research in this direction could lead to models that go beyond binary logic toward more complex, explanatory interactions akin to human reasoning, enriching AI's ability to interface naturally with human users.
This research is a valuable contribution to the VQA domain, enhancing logical robustness and exemplifying a crucial step towards more sophisticated AI-based image understanding systems.