- The paper introduces VisualBERT, a baseline model that fuses visual features from object detectors with text embeddings using a multi-layer Transformer.
- It employs task-agnostic pre-training on COCO captions followed by task-specific fine-tuning for applications like VQA, VCR, NLVR2, and Flickr30K.
- Experimental results show VisualBERT matching or outperforming prior state-of-the-art models across these vision-and-language benchmarks while remaining conceptually simple.
VisualBERT: A Simple and Performant Baseline for Vision and Language
The paper, "VisualBERT: A Simple and Performant Baseline for Vision and Language" by Liunian Harold Li et al., proposes VisualBERT, a robust yet straightforward model that integrates visual and language data for various tasks. The authors present VisualBERT as a versatile baseline model capable of excelling in diverse vision-and-language tasks.
Model Architecture and Training
VisualBERT extends BERT, a popular Transformer-based model, by incorporating visual embeddings alongside textual embeddings. Each image is processed by a pre-trained object detector, such as Faster R-CNN, to produce a set of object-region features that serve as visual input tokens. Text and visual tokens are fed jointly into a multi-layer Transformer, whose self-attention discovers implicit alignments between textual and visual elements.
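To make the fusion concrete, here is a minimal sketch, assuming a PyTorch implementation, of how detector region features can be projected into BERT's embedding space and concatenated with the text embeddings; the class name, dimensions, and interface are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class VisualBERTEmbeddings(nn.Module):
    """Illustrative sketch: fuse word-piece embeddings with detector region features."""
    def __init__(self, vocab_size=30522, hidden=768, visual_dim=2048, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.seg_emb = nn.Embedding(2, hidden)            # 0 = text token, 1 = image region
        self.visual_proj = nn.Linear(visual_dim, hidden)  # project detector features to BERT size
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word-piece ids; region_feats: (B, R, visual_dim) from the detector
        _, T = token_ids.shape
        R = region_feats.size(1)
        device = token_ids.device
        text = (self.token_emb(token_ids)
                + self.pos_emb(torch.arange(T, device=device))
                + self.seg_emb(torch.zeros(T, dtype=torch.long, device=device)))
        vis = (self.visual_proj(region_feats)
               + self.seg_emb(torch.ones(R, dtype=torch.long, device=device)))
        # Concatenate so every Transformer layer can attend across modalities (early fusion)
        return self.norm(torch.cat([text, vis], dim=1))   # (B, T + R, hidden)
```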
Pre-training uses two visually-grounded language model objectives on image-caption pairs from the COCO dataset (a sketch of the corresponding loss heads follows this list):
- Masked language modeling with the image: parts of the text are masked, and the model must predict the masked words from the remaining text and the visual context.
- Sentence-image prediction: Determines whether a given text matches the associated image.
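A hedged sketch of the two pre-training heads described above follows; the head names, tensor layouts, and loss conventions are assumptions for illustration rather than the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class VisualBERTPretrainingHeads(nn.Module):
    """Illustrative loss heads for the two COCO pre-training objectives."""
    def __init__(self, hidden=768, vocab_size=30522):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)   # predicts each masked word
        self.match_head = nn.Linear(hidden, 2)          # does the caption describe the image?

    def forward(self, seq_out, mlm_labels, match_labels, num_text_tokens):
        # seq_out: (B, T + R, hidden) Transformer outputs over text tokens and image regions
        text_out = seq_out[:, :num_text_tokens]                       # positions holding text
        mlm_logits = self.mlm_head(text_out)                          # (B, T, vocab_size)
        mlm_loss = F.cross_entropy(mlm_logits.transpose(1, 2),
                                   mlm_labels, ignore_index=-100)     # -100 marks unmasked tokens
        match_loss = F.cross_entropy(self.match_head(seq_out[:, 0]),  # [CLS] summary vector
                                     match_labels)
        return mlm_loss + match_loss
```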
The training follows three phases:
- Task-agnostic pre-training on COCO captions using the two objectives.
- Task-specific pre-training on the target task dataset, adjusting the model to the specific domain.
- Fine-tuning with a task-specific output layer and objective, optimizing the model on the downstream application (a minimal fine-tuning sketch follows this list).
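As one example of the third phase, a minimal, hypothetical fine-tuning wrapper for VQA might attach an answer classifier over the [CLS] output of the pre-trained encoder; the wrapper name and encoder interface are assumptions, and the 3,129-answer vocabulary follows common VQA 2.0 practice.

```python
import torch.nn as nn

class VQAFineTuner(nn.Module):
    """Hypothetical phase-3 wrapper: a task-specific answer classifier on top of VisualBERT."""
    def __init__(self, encoder, hidden=768, num_answers=3129):
        super().__init__()
        self.encoder = encoder                            # pre-trained in phases 1 and 2
        self.classifier = nn.Linear(hidden, num_answers)  # task-specific output layer

    def forward(self, token_ids, region_feats):
        states = self.encoder(token_ids, region_feats)    # assumed to return (B, T + R, hidden)
        return self.classifier(states[:, 0])              # answer logits from the [CLS] position
```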
Performance Evaluation
VisualBERT was evaluated on four vision-and-language tasks: Visual Question Answering (VQA 2.0), Visual Commonsense Reasoning (VCR), Natural Language for Visual Reasoning for Real (NLVR2), and Region-to-Phrase Grounding (Flickr30K Entities).
- VQA 2.0: VisualBERT outperformed the comparable Pythia models, achieving an accuracy of 70.80 on the Test-Dev split and 71.00 on Test-Std.
- VCR: VisualBERT surpassed R2C, a strong BERT-based baseline, by a significant margin, reaching 71.6 on Q → A and 73.2 on QA → R on the test set.
- NLVR2: VisualBERT outperformed the previous best model, MaxEnt, by a considerable margin, reaching an accuracy of 67.0 on the Test-P split.
- Flickr30K: The model excelled, obtaining a Recall@1 score of 71.33 on the test set, surpassing the previous state-of-the-art model, BAN.
Ablation Study and Analysis
An extensive ablation study highlighted the importance of several key components:
- Task-agnostic pre-training: Critical for learning useful image-text associations.
- Early fusion of visual and text inputs: allowing the two modalities to interact across many Transformer layers significantly improved performance (see the late-fusion contrast sketch after this list).
- BERT initialization: Found to be beneficial, though training from scratch with COCO pre-training still performed competently.
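To illustrate what early fusion rules out, below is a purely illustrative late-fusion variant, built from PyTorch's standard Transformer layers, in which the text is encoded on its own and visual features join only in a single final layer; this is an assumption for exposition, not the paper's exact ablation configuration.

```python
import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    """Late-fusion contrast case: text never attends to the image until one final layer."""
    def __init__(self, hidden=768, n_heads=12, n_text_layers=11, visual_dim=2048):
        super().__init__()
        text_layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(text_layer, n_text_layers)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.fusion_layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)

    def forward(self, text_emb, region_feats):
        # text_emb: (B, T, hidden); region_feats: (B, R, visual_dim)
        text_only = self.text_encoder(text_emb)             # unimodal text encoding
        fused = torch.cat([text_only, self.visual_proj(region_feats)], dim=1)
        return self.fusion_layer(fused)                      # cross-modal attention happens only once
```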
Further analysis on Flickr30K revealed that VisualBERT's attention heads implicitly track entity grounding and syntactic dependencies without direct supervision. Attention weights become progressively more refined in later layers, aligning words with their corresponding image regions more accurately and capturing syntactic relationships such as connections between verbs and their subjects or objects.
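The kind of probing involved can be sketched as follows: for each word, take the image region it attends to most strongly in every head and measure agreement with ground-truth phrase groundings. The tensor layout and helper name below are assumptions, not the paper's evaluation code.

```python
import torch

def grounding_accuracy(attn, text_len, gold_region):
    # attn: (layers, heads, T + R, T + R) attention weights for one caption-image pair
    # gold_region: (T,) index of the ground-truth region for each word, -1 if ungrounded
    word_to_region = attn[:, :, :text_len, text_len:]   # attention from words to image regions
    predicted = word_to_region.argmax(dim=-1)           # (layers, heads, T) most-attended region
    valid = gold_region >= 0                            # ignore words with no gold grounding
    hits = (predicted[:, :, valid] == gold_region[valid]).float()
    return hits.mean(dim=-1)                            # grounding accuracy per layer and head
```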
Conclusion and Future Directions
The research demonstrates VisualBERT's efficacy in various vision-and-language tasks, emphasizing the simplicity and versatility of integrating visual inputs into a Transformer model. Future work could explore extending VisualBERT to image-only tasks and pre-training on more extensive caption datasets, potentially enhancing its capabilities further.