ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision (2102.03334v2)

Published 5 Feb 2021 in stat.ML and cs.LG

Abstract: Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance. Our code and pre-trained weights are available at https://github.com/dandelin/vilt.

Authors (3)
  1. Wonjae Kim (25 papers)
  2. Bokyung Son (5 papers)
  3. Ildoo Kim (43 papers)
Citations (1,529)

Summary

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

The paper "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" represents a significant step in simplifying Vision-and-Language Pre-training (VLP) by removing the dependency on convolutional neural networks and region supervision for image feature extraction. This research introduces a minimalistic and efficient model named Vision-and-Language Transformer (ViLT), which simplifies the processing of visual inputs to a level comparable to textual inputs through a convolution-free mechanism.

Key Contributions

ViLT distinguishes itself from conventional VLP models with several notable contributions:

  1. Monolithic Architecture: ViLT eliminates the need for deep convolutional visual embedders, relying instead on a linear projection of image patches. This change significantly reduces both parameter size and computation time, making ViLT tens of times faster than its predecessors while maintaining competitive performance.
  2. Pre-training and Fine-tuning: The model is pre-trained with two standard VLP objectives: Image-Text Matching (ITM) and Masked Language Modeling (MLM). In addition, Whole Word Masking (WWM) is applied for MLM so that the model cannot reconstruct a masked word from its own subword pieces and must draw on cross-modal information instead (see the sketch after this list).
  3. Efficiency and Speed: By using a simple linear embedding of visual inputs, the model achieves markedly better runtime efficiency than detector-based VLP models. This is demonstrated across several downstream vision-and-language tasks such as Visual Question Answering (VQA), Natural Language for Visual Reasoning for Real (NLVR2), and various retrieval tasks, both zero-shot and fine-tuned.
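
As a rough illustration of the whole word masking idea mentioned in item 2, the sketch below (a hypothetical `whole_word_mask` helper written for this summary, not the authors' code) groups WordPiece subwords back into whole words and masks each selected word as a unit, so a word cannot be recovered from its own unmasked pieces and the model is pushed toward the paired image:

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Group WordPiece pieces ('##'-prefixed) into whole words and mask
    each selected word in full, recording the original pieces as MLM labels."""
    # Collect token indices belonging to the same word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked, labels = list(tokens), [None] * len(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]   # prediction target for MLM
                masked[i] = mask_token  # every piece of the word is masked
    return masked, labels

# "giraffe" tokenizes to ["gi", "##raf", "##fe"]; it is masked (or kept) as one unit.
print(whole_word_mask(["a", "gi", "##raf", "##fe", "standing", "in", "a", "zoo"]))
```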

Methodology

ViLT employs a straightforward pipeline in which image patches are linearly projected, much as textual tokens are embedded. The transformer, initialized with pre-trained ViT weights, then processes the visual and textual tokens jointly for multimodal interaction. The ViLT architecture contains no deep visual embedder or object-detection components, which have traditionally been used in VLP models.

The proposed approach offers a favorable trade-off, achieving a substantial reduction in computation while still performing effectively on complex vision-and-language benchmarks. These results rest on the following design choices:

  • Patch Projection: Drawing from the Vision Transformer (ViT) approach, ViLT employs a 32x32 patch projection to embed visual inputs before feeding them into the transformer layers. This approach is efficient in terms of both computation and parameter size.
  • Transformer-based Interaction: ViLT uses a single-stream transformer, initialized from ViT-B/32 weights pre-trained on ImageNet, to model inter-modal interactions. This design consolidates visual and textual feature extraction and their interaction within a single unified transformer (a minimal sketch follows below).
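
To make the single-stream design concrete, here is a heavily simplified sketch (an illustrative `MinimalViLT` class written for this summary, not the official implementation): a 32x32 convolution acts as the linear patch projection, text tokens are embedded with a lookup table, modal-type embeddings distinguish the two modalities, and one shared transformer encoder processes the concatenated sequence. Positional embeddings, the [CLS] token, loading of the pre-trained ViT-B/32 weights, and the task heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class MinimalViLT(nn.Module):
    """Illustrative single-stream sketch: linear patch projection for images,
    token embedding for text, one shared transformer over both modalities."""

    def __init__(self, vocab_size=30522, dim=768, patch=32, layers=12, heads=12):
        super().__init__()
        # A conv with kernel = stride = patch size is a linear patch projection.
        self.patch_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.modal_type = nn.Embedding(2, dim)  # 0 = text, 1 = image
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, text_ids, image):
        t = self.text_embed(text_ids)                           # (B, L, dim)
        v = self.patch_proj(image).flatten(2).transpose(1, 2)   # (B, N_patches, dim)
        t = t + self.modal_type(torch.zeros_like(text_ids))
        v = v + self.modal_type(torch.ones(v.shape[:2], dtype=torch.long, device=v.device))
        x = torch.cat([t, v], dim=1)        # single-stream multimodal sequence
        return self.encoder(x)              # contextualized multimodal features

# e.g. a 384x384 image yields (384 / 32)^2 = 144 patch tokens.
out = MinimalViLT()(torch.randint(0, 30522, (1, 16)), torch.randn(1, 3, 384, 384))
```

In the actual model, the ITM and MLM heads are attached on top of these contextualized features.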

Experimental Results

The efficiency and performance of ViLT are extensively evaluated across several tasks:

  • Classification Tasks: On VQA and NLVR2, ViLT achieves competitive performance (VQA test-dev score of 71.26 and NLVR2 test-P score of 76.13) with substantially lower inference latency (approximately 15ms per instance).
  • Retrieval Tasks: Evaluated on the Flickr30K and MSCOCO benchmarks, ViLT shows strong performance in both zero-shot and fine-tuned retrieval. For instance, in zero-shot text retrieval on MSCOCO, ViLT achieves an R@1 score of 56.5, which compares favorably with more complex models that rely on convolutional embeddings (see the R@1 sketch below).
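
For reference, R@1 in these retrieval benchmarks is the fraction of queries whose top-ranked candidate is a correct match. Below is a minimal sketch assuming one positive candidate per query, located on the diagonal of a pairwise score matrix (the real MSCOCO protocol pairs each image with multiple captions, which this simplification ignores):

```python
import torch

def recall_at_1(scores: torch.Tensor) -> float:
    """scores[i, j] = matching score (e.g. an ITM-head logit) of query i
    against candidate j; the correct candidate for query i is assumed at j = i."""
    top1 = scores.argmax(dim=1)
    hits = (top1 == torch.arange(scores.size(0))).float()
    return hits.mean().item()

# Example with random scores over 100 query/candidate pairs.
print(f"R@1 = {recall_at_1(torch.randn(100, 100)):.3f}")
```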

Implications and Future Directions

The implications of this research are substantial, as it challenges the status quo of using heavy visual embedders in VLP. ViLT's simplicity and efficiency suggest potential for real-world applications where speed and resource efficiency are crucial. Furthermore, the research opens several avenues for future exploration:

  • Scalability: Exploring larger ViLT variants (e.g., ViLT-L and ViLT-H) and pre-training on larger datasets could potentially enhance performance.
  • Enhanced Visual Masking Objectives: Developing more sophisticated masked modeling strategies for visual inputs could further improve the ability to capture complex visual semantics without relying on region-based supervision.
  • Augmentation Techniques: Further investigation into suitable augmentation strategies tailored for both textual and visual domains could enhance model generalization.

In summary, ViLT represents a streamlined and efficient approach to VLP, demonstrating that convolution-free, transformer-based models can achieve competitive performance in vision-and-language tasks. Future research building on these findings may continue to innovate in the design of efficient and effective multimodal models.
