Overview of ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
The paper presents ViLBERT, an extension of the BERT architecture designed to handle multimodal inputs consisting of both visual and textual data. The authors aim to develop a task-agnostic joint representation of images and natural language that could be transferred across various vision-and-language tasks. This approach introduces a novel two-stream model with co-attentional transformer layers, separating visual and linguistic processing yet allowing interaction at different representation levels.
Methodology
ViLBERT is built upon the architecture of BERT but is tailored to accommodate the distinct needs of visual and textual data processing:
- Two-Stream Architecture: Visual and textual inputs are processed in separate streams that interact through co-attentional transformer layers, allowing the model to fuse visual and linguistic information at multiple depths (a minimal sketch of one such block follows this list).
- Training Tasks: The model is pretrained on the Conceptual Captions dataset using two proxy tasks (see the loss sketch after this list):
  - Masked Multi-Modal Modeling: Inspired by BERT’s masked language modeling, this task reconstructs masked words and masked image regions from the remaining multimodal context.
  - Multi-Modal Alignment Prediction: Predicting whether a given caption describes an image.
- Training Architecture: The visual stream operates on image region features extracted by a pretrained Faster R-CNN object detector, while the linguistic stream is initialized with weights from a pretrained BERT model; the two streams are then trained jointly end-to-end.
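To make the co-attention mechanism concrete, here is a minimal PyTorch sketch of one co-attentional block. This is not the authors' implementation: the class name, hidden sizes (1024 for the visual stream, 768 for the linguistic stream), head count, and sublayer details are illustrative assumptions. The key idea it shows is that each stream's queries attend over the other stream's keys and values, followed by a per-stream feed-forward sublayer.

```python
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Sketch of one co-attentional transformer block (two interacting streams)."""

    def __init__(self, vis_dim=1024, txt_dim=768, num_heads=8):
        super().__init__()
        # Cross-attention: queries from one stream, keys/values from the other.
        self.vis_attends_txt = nn.MultiheadAttention(
            vis_dim, num_heads, kdim=txt_dim, vdim=txt_dim, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(
            txt_dim, num_heads, kdim=vis_dim, vdim=vis_dim, batch_first=True)
        self.vis_ffn = nn.Sequential(
            nn.Linear(vis_dim, 4 * vis_dim), nn.GELU(), nn.Linear(4 * vis_dim, vis_dim))
        self.txt_ffn = nn.Sequential(
            nn.Linear(txt_dim, 4 * txt_dim), nn.GELU(), nn.Linear(4 * txt_dim, txt_dim))
        self.vis_norm1, self.vis_norm2 = nn.LayerNorm(vis_dim), nn.LayerNorm(vis_dim)
        self.txt_norm1, self.txt_norm2 = nn.LayerNorm(txt_dim), nn.LayerNorm(txt_dim)

    def forward(self, vis, txt):
        # Both cross-attentions read the other stream as it enters this block.
        v_ctx, _ = self.vis_attends_txt(vis, txt, txt)  # regions attend to words
        t_ctx, _ = self.txt_attends_vis(txt, vis, vis)  # words attend to regions
        vis = self.vis_norm1(vis + v_ctx)
        txt = self.txt_norm1(txt + t_ctx)
        # Per-stream feed-forward sublayers with residual connections.
        vis = self.vis_norm2(vis + self.vis_ffn(vis))
        txt = self.txt_norm2(txt + self.txt_ffn(txt))
        return vis, txt


# Dummy inputs: 36 region features per image, 20 token embeddings per caption.
regions, tokens = torch.randn(2, 36, 1024), torch.randn(2, 20, 768)
regions, tokens = CoAttentionBlock()(regions, tokens)
```

Keeping the two streams separate in this way lets the visual and linguistic sides use different depths and hidden sizes while still exchanging information at every co-attentional layer.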
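The two proxy objectives can also be summarized in a hedged sketch. The `model` interface, the `alignment_head`, and all tensor names below are assumptions for illustration rather than the authors' API; the supervision signals follow the paper: cross-entropy for masked words as in BERT, a KL divergence against the detector's class distribution for masked regions, and a binary alignment prediction from the product of the pooled stream outputs.

```python
import torch
import torch.nn.functional as F


def pretraining_losses(model, regions, region_cls_probs, token_ids,
                       aligned_labels, mask_prob=0.15, mask_token_id=103):
    """Compute the two proxy losses for one batch (illustrative only)."""
    # --- Masked multi-modal modeling ----------------------------------------
    # Randomly mask ~15% of word tokens and ~15% of image regions; the model
    # must reconstruct what was hidden from the remaining multimodal context.
    word_mask = torch.rand(token_ids.shape) < mask_prob
    region_mask = torch.rand(regions.shape[:2]) < mask_prob
    masked_tokens = token_ids.masked_fill(word_mask, mask_token_id)
    masked_regions = regions * (~region_mask).unsqueeze(-1).float()

    word_logits, region_logits, h_img, h_cls = model(masked_regions, masked_tokens)

    # Masked words: standard cross-entropy over the vocabulary, as in BERT.
    word_loss = F.cross_entropy(word_logits[word_mask], token_ids[word_mask])
    # Masked regions: match the object detector's class distribution via KL,
    # since exact reconstruction of region features is not the objective.
    region_loss = F.kl_div(F.log_softmax(region_logits[region_mask], dim=-1),
                           region_cls_probs[region_mask], reduction="batchmean")

    # --- Multi-modal alignment prediction -----------------------------------
    # Binary prediction of whether the caption describes the image, scored from
    # the element-wise product of the pooled stream outputs (assumed to have
    # been projected to a common size).
    align_logits = model.alignment_head(h_img * h_cls).squeeze(-1)
    align_loss = F.binary_cross_entropy_with_logits(align_logits,
                                                    aligned_labels.float())
    return word_loss + region_loss + align_loss
```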
Experimental Results
ViLBERT was tested on several established vision-and-language tasks, achieving significant improvements:
- Visual Question Answering (VQA): Demonstrated improved performance on the VQA 2.0 dataset, benefiting from the two-stream architecture and pretraining (a sketch of a typical task head follows this list).
- Visual Commonsense Reasoning (VCR): Achieved state-of-the-art results in both question answering and answer justification subtasks.
- Grounding Referring Expressions: Outperformed existing models in the RefCOCO+ task by effectively associating natural language references to specific image regions.
- Caption-Based Image Retrieval: Demonstrated robust performance on the Flickr30k dataset, transferring the pretrained visiolinguistic representations effectively to image retrieval tasks.
- Zero-Shot Caption-Based Image Retrieval: Without task-specific fine-tuning, the model showed impressive results, indicating the generalization capability of the learned representations.
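As an illustration of how little task-specific machinery these transfers require, here is a hedged sketch of a VQA-style head. The pooled output names, hidden size, and answer-vocabulary size are assumptions (the pooled image and text vectors are assumed to share a dimension); the pattern it shows is fusing the two pooled summaries by an element-wise product and classifying over a fixed answer vocabulary.

```python
import torch
import torch.nn as nn


class VQAHead(nn.Module):
    """Task-head sketch: classify fused pooled features over a fixed answer set."""

    def __init__(self, hidden_dim=1024, num_answers=3129):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 2 * hidden_dim),
            nn.GELU(),
            nn.Linear(2 * hidden_dim, num_answers),
        )

    def forward(self, h_img, h_cls):
        # h_img, h_cls: pooled visual/linguistic outputs, assumed same size.
        return self.classifier(h_img * h_cls)


# Dummy pooled features; in practice they come from the pretrained two-stream
# encoder, which is fine-tuned end-to-end together with this small head.
logits = VQAHead()(torch.randn(4, 1024), torch.randn(4, 1024))
```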
Numerical Results
Notable numerical improvements were observed across multiple tasks:
- VQA: Improvement from 68.85 to 70.55 on the test-dev set.
- VCR: 54.04 accuracy on Q→AR compared to 47.27 without pretraining.
- RefCOCO+: Gains of around 4 percentage points in accuracy compared to non-pretrained models.
- Image Retrieval: Significant gains in recall metrics, with a recall@1 improvement from 45.50 to 58.20 compared to non-pretrained models.
Implications and Future Directions
Practical Implications:
- ViLBERT’s task-agnostic pretraining paradigm offers a unified and powerful baseline for a variety of vision-and-language tasks. It simplifies the adaptation to new tasks by requiring minimal architectural modifications.
Theoretical Implications:
- The two-stream architecture with co-attentional transformers provides a novel mechanism for multimodal data fusion, addressing the distinct processing requirements of visual and textual information.
Speculative Future Directions:
- Extending ViLBERT to tasks involving sequences of images (e.g., video captioning, visual dialog) remains an open problem.
- Investigation of multi-task learning approaches where ViLBERT is jointly trained on a spectrum of vision-and-language tasks could further leverage the model's capabilities.
- Adding decoder mechanisms for text generation could expand ViLBERT’s utility to tasks requiring natural language output, such as image captioning or narrative generation.
ViLBERT represents a significant step forward in developing generalized, pretrainable visiolinguistic models. Its architecture and pretraining paradigm set a robust foundation for future research in multimodal AI systems, showing how joint visual and linguistic understanding can be achieved and transferred effectively across diverse tasks.