Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
The paper "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone" presents FIBER, a model architecture for vision-language tasks that integrates multimodal fusion directly into the backbone. The goal is to handle both coarse-grained, image-level tasks (e.g., VQA, image captioning, retrieval) and fine-grained, region-level tasks (e.g., object detection, phrase grounding) within a single framework.
Model Architecture and Approach
FIBER's architecture is distinguished by where it performs multimodal fusion. Conventional approaches typically fuse modalities in dedicated layers stacked on top of uni-modal backbones. FIBER instead inserts cross-attention modules directly into the image and text backbones (Swin Transformer and RoBERTa, respectively), so multimodal information is integrated earlier in the processing pipeline. This deeper integration is intended to improve memory efficiency and overall performance.
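To make the idea concrete, the sketch below shows one way a gated cross-attention module could be inserted into an existing backbone block. This is an illustrative PyTorch example, not the authors' implementation: the class name, dimensions, and the zero-initialized scalar gate are assumptions consistent with the description above, not details taken from the paper.

```python
from typing import Optional

import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Illustrative backbone block with an inserted cross-attention module.

    The self-attention and feed-forward sub-layers stand in for an existing
    uni-modal backbone layer (e.g., one Swin or RoBERTa block); the
    cross-attention path is the extra fusion module, scaled by a learnable
    gate (zero initialization here is an assumption, not a paper detail).
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        # Scalar gate on the cross-attention output (assumed zero-initialized
        # so the block starts out behaving like the original uni-modal layer).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(
        self, x: torch.Tensor, other: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        # Standard self-attention over this modality's tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Gated cross-attention into the other modality's tokens; skipped
        # entirely when no cross-modal tokens are supplied.
        if other is not None:
            h = self.norm2(x)
            x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        # Feed-forward sub-layer.
        return x + self.ffn(self.norm3(x))
```

Because the gate scales the cross-attention contribution, fusion can be introduced without disturbing the pretrained uni-modal behavior of the backbone at the start of training.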
The paper also introduces a two-stage pre-training strategy, termed coarse-to-fine pre-training. The model is first pre-trained on large-scale image-text pairs (the coarse-grained stage) and then further pre-trained on image-text data with bounding-box annotations (the fine-grained stage). This staging lets the model exploit abundant, cheaply available image-text data while reserving the scarcer box-annotated data for learning region-level grounding.
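The schedule can be pictured as two training loops run back to back, as in the sketch below. The loaders and loss callables are placeholders supplied by the caller (e.g., contrastive/matching/masked-language-modeling objectives for the coarse stage and a grounding objective for the fine stage); this illustrates the staging only and is not the paper's training code.

```python
from typing import Callable, Iterable

import torch


def coarse_to_fine_pretrain(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    image_text_loader: Iterable,       # batches of (images, texts)
    image_text_box_loader: Iterable,   # batches of (images, texts, boxes)
    coarse_loss: Callable,             # e.g., ITC + ITM + MLM objectives (placeholder)
    fine_loss: Callable,               # e.g., a phrase-grounding objective (placeholder)
) -> torch.nn.Module:
    """Illustrative two-stage schedule; losses and loaders are assumptions."""
    # Stage 1: coarse-grained pre-training on large-scale image-text pairs.
    for images, texts in image_text_loader:
        loss = coarse_loss(model, images, texts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: fine-grained pre-training, starting from the stage-1 weights,
    # on image-text data with bounding-box annotations.
    for images, texts, boxes in image_text_box_loader:
        loss = fine_loss(model, images, texts, boxes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```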
Experimental Results and Performance
The experiments demonstrate FIBER's effectiveness across a broad range of vision-language tasks. FIBER consistently matches or surpasses strong baselines, in several cases outperforming methods pre-trained on substantially more data. The gains appear in image-level tasks such as VQA, captioning, and retrieval, and extend to region-level tasks such as phrase grounding and object detection.
Technical Contributions and Implications
The technical contributions of this paper lie in the fusion strategy within the backbones and the two-stage pre-training regimen. Because the inserted cross-attention layers are gated, they can be switched off entirely, letting the model operate either as a dual encoder (image and text encoded independently, convenient for fast retrieval) or as a fusion encoder (jointly attending over both modalities, used for tasks like VQA and captioning), so a single set of backbones serves different task types efficiently.
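Building on the block sketched earlier, the snippet below illustrates the mode switch: the same block acts as part of a dual encoder when no cross-modal tokens are supplied, and as a fusion encoder when they are. The shapes and the other=None convention are illustrative assumptions, not the paper's interface.

```python
import torch

# Reuses the GatedCrossAttentionBlock sketch from above (illustrative only).
block = GatedCrossAttentionBlock(dim=768, num_heads=12)

image_tokens = torch.randn(2, 196, 768)  # e.g., patch tokens from the image backbone
text_tokens = torch.randn(2, 32, 768)    # e.g., token features from the text backbone

# Dual-encoder mode: cross-attention is skipped, so each modality is encoded
# independently and embeddings can be pre-computed for fast retrieval.
image_only = block(image_tokens, other=None)

# Fusion-encoder mode: the same block attends to the other modality's tokens
# through the gated cross-attention path (used for tasks like VQA or captioning).
image_fused = block(image_tokens, other=text_tokens)
```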
Practically, FIBER holds promise for applications requiring detailed image and text analysis, such as automated captioning in content generation or precise object detection in autonomous driving systems. Theoretically, the successful integration and adaptation of transformers in such a multimodal setting push the boundaries for future research in multimodal learning architectures.
Future Directions
As model development and pre-training techniques continue to evolve, future work may explore scaling the FIBER architecture and adapting the approach to other modalities beyond vision and language. Extending these techniques to integrate video data or other sensory inputs could be a promising direction. Moreover, addressing potential biases and ensuring ethical deployment in real-world applications will be crucial as these models advance.
In conclusion, this paper advances the field of vision-language models through its coherent integration of modalities and efficient pre-training strategies, setting a strong foundation for future explorations in multimodal AI systems.