FILIP: Fine-grained Interactive Language-Image Pre-Training
The paper "FILIP: Fine-grained Interactive Language-Image Pre-Training" introduces an innovative approach to vision-language pre-training, addressing limitations of existing models such as CLIP and ALIGN. These models typically focus on global feature similarity and overlook finer-grained interactions, which restricts their ability to capture relationships between visual objects and textual descriptors.
Methodology
FILIP proposes a novel cross-modal late interaction mechanism to model these finer-grained interactions. For each image patch token it takes the maximum similarity over all text tokens (and, symmetrically, for each text token over all image patches), then averages these token-wise maxima to obtain the image-text similarity used in the contrastive objective. Because only the contrastive loss is modified, FILIP retains the ability to pre-compute image and text representations offline, keeping both training and inference efficient. The model adopts a dual-stream architecture with Transformer-based image and text encoders, akin to CLIP's setup, but achieves finer-grained expressiveness and word-patch alignment without the cross/self-attention over combined modalities that makes single-stream approaches costly at inference.
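To make the mechanism concrete, here is a minimal PyTorch sketch of the token-wise late interaction and the resulting symmetric contrastive loss. The tensor shapes, variable names, temperature value, and toy dimensions are illustrative assumptions, not the paper's exact implementation (which, for instance, handles padded text tokens and gathers negatives across devices).

```python
# Minimal sketch of FILIP-style token-wise late interaction (illustrative only).
import torch
import torch.nn.functional as F


def late_interaction_similarity(img_tokens: torch.Tensor,
                                txt_tokens: torch.Tensor) -> torch.Tensor:
    """Token-wise max similarity between every image and every text.

    img_tokens: (B, Np, D) L2-normalized patch embeddings.
    txt_tokens: (B, Nt, D) L2-normalized word embeddings.
    Returns a (B, B) matrix whose (i, j) entry averages, over image i's
    patches, the maximum cosine similarity to any token of text j.
    """
    # Pairwise token similarities: (B_img, B_txt, Np, Nt).
    sim = torch.einsum('ipd,jtd->ijpt', img_tokens, txt_tokens)
    # For each image patch keep its best-matching word, then average patches.
    return sim.max(dim=-1).values.mean(dim=-1)


def filip_contrastive_loss(img_tokens, txt_tokens, temperature=0.07):
    """Symmetric image-to-text / text-to-image contrastive loss."""
    i2t = late_interaction_similarity(img_tokens, txt_tokens) / temperature
    # Text-to-image direction: swap roles so each word attends to patches.
    t2i = late_interaction_similarity(txt_tokens, img_tokens) / temperature
    targets = torch.arange(i2t.size(0), device=i2t.device)
    return 0.5 * (F.cross_entropy(i2t, targets) + F.cross_entropy(t2i, targets))


# Toy usage with random, normalized token embeddings.
B, Np, Nt, D = 4, 49, 16, 512
img = F.normalize(torch.randn(B, Np, D), dim=-1)
txt = F.normalize(torch.randn(B, Nt, D), dim=-1)
print(filip_contrastive_loss(img, txt).item())
```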
Dataset and Experiments
For pre-training, FILIP constructs a new dataset, FILIP300M, and combines it with public datasets such as CC3M and CC12M, totaling around 340 million image-text pairs. The collection pipeline applies rigorous filtering to ensure high-quality image-text alignment. Experiments demonstrate FILIP's strong performance across several downstream tasks:
- Zero-shot Image Classification: FILIP improves over prior models on 12 classification datasets, including ImageNet, indicating that its fine-grained alignment transfers across diverse domains (a zero-shot classification sketch follows this list).
- Image-Text Retrieval: FILIP achieves state-of-the-art retrieval performance on Flickr30K and MSCOCO, improving recall in both zero-shot and fine-tuned settings.
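For intuition, the following sketch shows how such a model could be used for zero-shot classification: each class name is wrapped in a prompt template, encoded into word tokens, and scored against the image's patch tokens with the `late_interaction_similarity` function from the sketch above. The `encode_image` / `encode_text` stubs, the prompt template, and the toy dimensions are hypothetical placeholders, not FILIP's actual API or prompt set.

```python
# Hypothetical zero-shot classification on top of the late-interaction score.
# Reuses late_interaction_similarity from the sketch above; encode_image and
# encode_text are stand-ins for FILIP's dual encoders, not the paper's API.
import torch
import torch.nn.functional as F


def zero_shot_classify(image, class_names, encode_image, encode_text,
                       template="a photo of a {}."):
    # Encode the query image once into normalized patch tokens: (1, Np, D).
    img_tokens = F.normalize(encode_image(image), dim=-1)
    scores = []
    for name in class_names:
        # Each class name becomes a prompted sentence, encoded to word tokens.
        txt_tokens = F.normalize(encode_text(template.format(name)), dim=-1)
        scores.append(late_interaction_similarity(img_tokens, txt_tokens)[0, 0])
    # Predict the class whose prompt best matches the image token-wise.
    return class_names[int(torch.stack(scores).argmax())]


# Random stubs so the sketch runs end to end; a real setup would plug in
# FILIP's trained image and text encoders here.
D = 512
fake_image_encoder = lambda img: torch.randn(1, 49, D)
fake_text_encoder = lambda txt: torch.randn(1, max(len(txt.split()), 1), D)
print(zero_shot_classify(torch.zeros(1, 3, 224, 224),
                         ["dog", "cat", "car"],
                         fake_image_encoder, fake_text_encoder))
```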
Results and Implications
Results show that FILIP surpasses baseline models such as CLIP and ALIGN across these tasks while using less training data, a data efficiency the authors attribute to its fine-grained alignment. Visualizations of word-patch alignments further highlight FILIP’s ability to localize meaningful features, which is pivotal for tasks requiring detailed visual-textual correlation.
Future Directions
The paper hints at potential further optimizations, such as augmenting the architecture with advanced image encoders or incorporating additional language/image modeling objectives to support broader generation tasks. These avenues suggest an ongoing trajectory towards more versatile and generalized vision-language interfaces.
In summary, FILIP advances vision-language pre-training by introducing a fine-grained interaction mechanism that improves accuracy while preserving efficiency. Its approach opens pathways toward a more nuanced understanding and representation of visual-textual data and is likely to inform future vision-language models.