- The paper presents ViLBERT, which innovatively extends BERT into a two-stream model that integrates visual and linguistic data via co-attentional transformers.
- It is pretrained on roughly 3.3M image-caption pairs, using Faster R-CNN region features and masked multi-modal learning objectives to build robust joint representations.
- ViLBERT delivers improvements of 2 to 10 percentage points across tasks such as VQA, VCR, and image retrieval, setting new performance benchmarks.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Introduction to ViLBERT
ViLBERT, short for Vision-and-Language BERT, is a task-agnostic visiolinguistic model intended to serve as a common foundation for multiple vision-and-language tasks. Departing from conventional task-specific models, ViLBERT extends the BERT architecture into a two-stream design that processes visual and linguistic inputs separately and fuses them through co-attentional transformer layers. This structure lets each modality use a network depth suited to it while keeping cross-modal interaction sparse via co-attention.
Figure 1: Our ViLBERT model consists of two parallel streams for visual (green) and linguistic (purple) processing that interact through novel co-attentional transformer layers. This structure allows for variable depths for each modality and enables sparse interaction through co-attention.
Architecture and Methodology
The architecture of ViLBERT builds on BERT, adapting it to accept visual input alongside text. Image content is represented as a set of region features produced by a pre-trained object detection network, specifically Faster R-CNN with a ResNet-101 backbone trained on Visual Genome. The two-stream model comprises separate transformer blocks for the visual and linguistic streams, connected through co-attentional layers that allow rich cross-modal interaction without excessive computational cost.
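As a concrete illustration, the visual stream can be fed by projecting each region's pooled detector feature, together with a spatial encoding of its bounding box, into the stream's hidden size. The following is a minimal sketch of such an input embedding; the module names, the 5-dimensional spatial encoding, and the hidden size of 1024 are assumptions based on the paper's description, not the released code.

```python
import torch.nn as nn

class VisualInputEmbedding(nn.Module):
    """Sketch: project Faster R-CNN region features plus a spatial encoding
    of each bounding box into the visual stream's hidden size. Dimensions
    and layer names are illustrative assumptions."""

    def __init__(self, region_dim=2048, spatial_dim=5, hidden_dim=1024):
        super().__init__()
        self.feat_proj = nn.Linear(region_dim, hidden_dim)
        self.spatial_proj = nn.Linear(spatial_dim, hidden_dim)
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, region_feats, spatial_boxes):
        # region_feats:  (batch, num_regions, region_dim) pooled ROI features
        # spatial_boxes: (batch, num_regions, spatial_dim) normalized box
        #                coordinates plus the fraction of image area covered
        v = self.feat_proj(region_feats) + self.spatial_proj(spatial_boxes)
        return self.layer_norm(v)
```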

Figure 2: Standard encoder transformer block.
The core computational element in ViLBERT is the co-attentional transformer layer, in which the keys and values of each modality are passed to the multi-headed attention of the other, producing image-conditioned language attention and language-conditioned image attention. Exchanging attended features across streams in this way lets each modality incorporate context from the other, yielding joint representations of image regions and caption tokens.
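A minimal sketch of such a block, assuming standard multi-head attention and omitting the within-stream feed-forward sublayers of the full model, might look like the following; the dimensions and module layout are illustrative, not the paper's exact implementation.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Sketch of a co-attentional block in the spirit of ViLBERT: each
    stream queries the other's keys and values. Hyper-parameters and
    module layout are simplified assumptions."""

    def __init__(self, vis_dim=1024, lang_dim=768, num_heads=8):
        super().__init__()
        # Vision stream attends over language (queries come from vision).
        self.vis_attends_lang = nn.MultiheadAttention(
            embed_dim=vis_dim, kdim=lang_dim, vdim=lang_dim,
            num_heads=num_heads, batch_first=True)
        # Language stream attends over vision (queries come from language).
        self.lang_attends_vis = nn.MultiheadAttention(
            embed_dim=lang_dim, kdim=vis_dim, vdim=vis_dim,
            num_heads=num_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(vis_dim)
        self.lang_norm = nn.LayerNorm(lang_dim)

    def forward(self, vis, lang):
        # Exchange keys/values across streams, keep queries within-stream.
        vis_out, _ = self.vis_attends_lang(query=vis, key=lang, value=lang)
        lang_out, _ = self.lang_attends_vis(query=lang, key=vis, value=vis)
        return self.vis_norm(vis + vis_out), self.lang_norm(lang + lang_out)
```

In the full model, each co-attention exchange is followed by further within-stream transformer blocks, so cross-modal interaction remains sparse while each stream keeps its own processing capacity.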
Pretraining Tasks
ViLBERT’s learning process begins with self-supervised proxy tasks on the Conceptual Captions dataset, approximately 3.3 million image-caption pairs automatically harvested from web images with alt-text. The two pretraining tasks are masked multi-modal learning, in which the model reconstructs masked words and masked image regions, and multi-modal alignment prediction, in which it predicts whether a caption actually describes the image it is paired with.
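The combined objective can be sketched as below. The tensor names, masking details, and equal loss weighting are assumptions for illustration; the paper describes predicting a distribution over the detector's semantic classes for masked regions and supervising it with a KL divergence, alongside standard masked-word prediction and a binary alignment score.

```python
import torch.nn.functional as F

def pretraining_loss(word_logits, word_targets,
                     region_logits, region_class_dists,
                     align_logits, is_aligned):
    # Masked language modelling over masked word positions
    # (unmasked positions carry the ignore label -1).
    mlm = F.cross_entropy(word_logits, word_targets, ignore_index=-1)

    # Masked region prediction: match the detector's class distribution
    # for each masked region via a KL divergence rather than a hard label.
    mrm = F.kl_div(F.log_softmax(region_logits, dim=-1),
                   region_class_dists, reduction='batchmean')

    # Multi-modal alignment: binary prediction of whether the caption
    # actually describes the paired image.
    align = F.binary_cross_entropy_with_logits(align_logits, is_aligned)

    return mlm + mrm + align
```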

Figure 3: Masked multi-modal learning.
These proxy tasks let ViLBERT treat visual grounding as a pretrainable capability rather than something each downstream task must learn in isolation, and they are what drive its gains across vision-and-language tasks.
Transfer to Vision-and-Language Tasks
ViLBERT's versatility is demonstrated by transferring it to several established tasks with only minor architectural additions: visual question answering (VQA), visual commonsense reasoning (VCR), grounding referring expressions, and caption-based image retrieval. Improvements of 2 to 10 percentage points over task-specific baselines validate its efficacy.
Figure 4: Examples for each vision-and-language task we transfer ViLBERT to in our experiments.
ViLBERT sets state-of-the-art benchmarks for tasks like VCR, RefCOCO+, and Flickr30k image retrieval by significant margins, reflecting its robust adaptability and strong representation power.
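As an example of the minor additions needed for transfer, the VQA head fuses the pooled outputs of the two streams with an element-wise product and feeds the result to a small classifier. The sketch below follows that fusion style as described in the paper; the layer sizes and answer-vocabulary size are hypothetical.

```python
import torch.nn as nn

class VQAHead(nn.Module):
    """Illustrative fine-tuning head: fuse the pooled outputs of the two
    streams and classify over a fixed answer vocabulary. Layer sizes and
    the vocabulary size are assumptions."""

    def __init__(self, vis_dim=1024, lang_dim=768,
                 hidden_dim=1024, num_answers=3000):
        super().__init__()
        self.vis_pool = nn.Linear(vis_dim, hidden_dim)
        self.lang_pool = nn.Linear(lang_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, h_img, h_cls):
        # h_img, h_cls: pooled representations from the visual and
        # linguistic streams, respectively.
        fused = self.vis_pool(h_img) * self.lang_pool(h_cls)
        return self.classifier(fused)
```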
Experimental Analysis
Ablation studies and baseline comparisons isolate the sources of ViLBERT's gains. Varying the depth of the co-attentional stack, comparing against a single-stream variant, and changing the amount of pretraining data show that the two-stream design and large-scale pretraining each contribute to downstream performance, with the preferred depth varying by task.
Conclusion
ViLBERT is a significant step forward in vision-and-language representation learning, presenting an efficient, scalable model applicable across tasks. Its co-attentional two-stream architecture surpasses prior task-specific models and provides a solid basis for future work in multi-modal AI, setting the stage for broader applications that require integrated processing of visual and linguistic data. As the field advances, extensions to generation tasks and multi-task learning will be needed to realize the full potential of this design.
Figure 5: Qualitative examples of sampled image descriptions from a ViLBERT model after our pretraining tasks, but before task-specific fine-tuning.
With the foundation laid by ViLBERT, avenues for future research include extending these models to other vision-and-language tasks, particularly those requiring text generation. The complementary strengths of self-supervised pretraining and transformer-based architectures offer promising directions for advancing holistic visiolinguistic understanding.