Fusion of Detected Objects in Text for Visual Question Answering
The paper "Fusion of Detected Objects in Text for Visual Question Answering" introduces the Bounding Boxes in Text Transformer (B2T2), a neural architecture for fusing visual and linguistic information. The study aims to improve Visual Question Answering (VQA) by integrating visual context early, inside the text-based analysis, rather than relying on traditional late-fusion methods.
Model Architecture and Methodology
The B2T2 model interweaves bounding boxes from images with text tokens at the earliest stages of the Transformer architecture. This produces an enriched embedding space in which textual and visual features coexist. Two primary architectures are evaluated: the Dual Encoder, which implements late fusion, and B2T2, which implements early fusion.
- Dual Encoder: This model encodes text and images separately, merging their representations only at the classification stage through a similarity calculation such as a dot product.
- B2T2: Here, bounding boxes, i.e. regions of interest in the image, are injected into the text input so that visual context is processed synchronously at the token-embedding level, strengthening the model's interpretative capabilities for VQA.
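The late-fusion side of this contrast can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: the encoder outputs are random stand-ins for what would come from a text encoder (e.g. a BERT [CLS] vector) and an image encoder (e.g. a pooled ResNet feature), and the dimensions and projection matrices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 768-d text vector, 2048-d image vector,
# projected into a shared 512-d space.
D_TEXT, D_IMG, D_JOINT = 768, 2048, 512

# Stand-ins for the outputs of separately run text and image encoders.
text_vec = rng.standard_normal(D_TEXT)
image_vec = rng.standard_normal(D_IMG)

# Learned projections into the shared space (random stand-ins here).
W_text = rng.standard_normal((D_JOINT, D_TEXT)) * 0.02
W_img = rng.standard_normal((D_JOINT, D_IMG)) * 0.02

def late_fusion_score(text_vec, image_vec):
    """Dual Encoder: the two modalities interact only at this
    final dot product; no cross-modal attention ever happens."""
    t = W_text @ text_vec
    v = W_img @ image_vec
    return float(t @ v)

score = late_fusion_score(text_vec, image_vec)
```

The key limitation this exposes is that each modality is compressed to a single vector before they meet, so no token can attend to any image region.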
Experiments and Results
The paper's experiments focus extensively on the Visual Commonsense Reasoning (VCR) benchmark, noted for its complexity and demand for intricate contextual understanding. Results showed notable gains for B2T2 over established models, reducing error rates by 25% relative to previous best-known systems. Notably, pretraining B2T2 on Conceptual Captions further strengthened its performance on VCR, stabilizing training and reducing variance across runs.
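It is worth distinguishing a relative error reduction from an absolute one. The numbers below are hypothetical, chosen only to illustrate the arithmetic, not the paper's reported scores:

```python
# Hypothetical baseline: wrong on 40% of examples.
baseline_error = 0.40

# A 25% *relative* reduction removes a quarter of those errors,
# giving a 30% error rate (a 10-point absolute improvement),
# not a 15% error rate (which a 25-point absolute drop would give).
new_error = baseline_error * (1 - 0.25)
```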
Critical Findings and Ablations
A series of ablation studies elucidated several aspects of the B2T2 model:
- Inclusion of bounding boxes significantly elevates model performance, underscoring their role in contextual anchoring.
- Early fusion strategies are found to be substantially more effective than late fusion, as they allow for deeper integration of cross-modal cues at the token level.
- The robustness of the model is partly attributed to leveraging large pretrained models (BERT-Large and ResNet-152).
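The early-fusion mechanism favored by these ablations can be sketched as follows. This is a minimal illustration under assumed dimensions, with random stand-ins for the learned projections and for the language model's token embeddings; the paper's actual architecture builds on BERT-Large and ResNet-152 features.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_VISUAL = 768, 2048  # assumed BERT / ResNet feature sizes

# A short text sequence already embedded by the language model
# (random stand-in values for the real token embeddings).
seq_len = 6
token_embeddings = rng.standard_normal((seq_len, D_MODEL))

# One detected region: a pooled visual feature plus its bounding-box
# coordinates (x1, y1, x2, y2) normalized to [0, 1].
region_feature = rng.standard_normal(D_VISUAL)
box = np.array([0.1, 0.2, 0.5, 0.9])

# Learned projections into the model dimension (random stand-ins).
W_visual = rng.standard_normal((D_MODEL, D_VISUAL)) * 0.02
W_box = rng.standard_normal((D_MODEL, 4)) * 0.02

# Early fusion: add the projected region feature and a box-position
# embedding directly to the embedding of the token that refers to the
# region, so every subsequent Transformer layer attends over a mixed
# text-and-vision sequence rather than over text alone.
referring_token = 3  # hypothetical index of the grounded token
fused = token_embeddings.copy()
fused[referring_token] += W_visual @ region_feature + W_box @ box
```

Because the fusion happens before any self-attention layer, cross-modal cues can influence every layer of the encoder, which is the intuition behind the ablation result above.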
Implications and Future Directions
The B2T2 architecture exemplifies how integrating visual information at an earlier stage of textual processing can improve performance on VQA tasks. This opens pathways for future work in multimodal processing, such as embedding features beyond object detection, including activities and expressions, to further diversify contextual understanding. The implications for AI entail more nuanced and comprehensive systems capable of tackling complex real-world scenarios through richer multimodal interactions.
The paper closes by positioning B2T2 within broader efforts to evolve VQA systems, acknowledging contemporary work like VL-BERT and VisualBERT, which suggest directions for adapting and expanding upon the early fusion concept for various vision-language tasks. Understanding these dynamics is critical as the field progresses toward more holistic AI models capable of interpreting multimodal information with greater precision and flexibility.