Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
The paper, "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training," presents a comprehensive paper on modality-shared architectures for contrastive language-image pre-training, termed MS-CLIP. The focus is on sharing parameters between language and image encoding transformers to enhance knowledge transfer between modalities, reduce model size, and improve downstream performance. This approach addresses potential inefficiencies in conventional models that utilize separate encoders for each modality, such as the original CLIP, which leverages distinct visual and textual transformers.
Modality-Shared Architecture
The paper proposes that a predominantly unified encoder for both vision and language can yield significant performance gains over models with separate modality-specific encoders. The researchers systematically test various configurations to determine the optimal balance of shared and modality-specific parameters in a transformer-based model. Their findings show that the best configuration shares most of the transformer's layers across both text and image inputs, while the input embeddings, layer normalization (LN), and output projections remain modality-specific. Notably, the shared architecture brings the semantic representations of the two modalities into closer alignment in the embedding space, facilitating improved transfer of learned knowledge.
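Below is a minimal PyTorch sketch of this idea: one transformer block whose attention and feed-forward weights are shared across modalities while the layer norms stay modality-specific. The module and argument names (e.g. `ModalitySharedBlock`) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a modality-shared transformer block, assuming shared
# attention/MLP weights and modality-specific LayerNorms. All names here are
# hypothetical, not taken from the MS-CLIP codebase.
import torch
import torch.nn as nn


class ModalitySharedBlock(nn.Module):
    """One transformer block whose attention and MLP weights are shared
    across modalities, while LayerNorms stay modality-specific."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        # Shared parameters: self-attention and feed-forward network.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )
        # Modality-specific parameters: separate LayerNorms per modality.
        self.ln1 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("image", "text")})
        self.ln2 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("image", "text")})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.ln1[modality](x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2[modality](x))
        return x


# Usage: the same block (same weights) processes both modalities.
block = ModalitySharedBlock(dim=512, num_heads=8)
img_tokens = torch.randn(2, 50, 512)   # e.g. patch tokens
txt_tokens = torch.randn(2, 77, 512)   # e.g. BPE token embeddings
img_out = block(img_tokens, "image")
txt_out = block(txt_tokens, "text")
```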
Modality-Specific Enhancements
The paper introduces two auxiliary modifications that further enhance the MS-CLIP architecture: Early Specialization and an Efficient Parallel Branch. Early Specialization places lightweight modality-specific modules in the initial transformer layer: a residual convolutional network for the vision pathway, providing a spatial inductive bias, and a standard transformer layer for language. This allows representations to specialize early before they are processed by the shared encoder.
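The following sketch illustrates Early Specialization under stated assumptions: a small residual convolutional stem patchifies and refines images, while text passes through a standard modality-specific transformer layer, before both token streams enter the shared stack. Layer sizes and names are illustrative, not taken from the paper's code.

```python
# Hedged sketch of Early Specialization: a residual convolutional stem for
# images and a modality-specific transformer layer for text, applied before
# the shared encoder. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn


class VisionStem(nn.Module):
    """Residual convolutional stem that patchifies and specializes image input."""

    def __init__(self, dim: int = 512, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        self.res = nn.Sequential(  # lightweight residual refinement
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                # (B, dim, H/16, W/16)
        x = x + self.res(x)                  # residual conv specialization
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


# Text keeps a modality-specific transformer layer in the first block.
text_stem = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

vision_stem = VisionStem()
img_tokens = vision_stem(torch.randn(2, 3, 224, 224))  # -> (2, 196, 512)
txt_tokens = text_stem(torch.randn(2, 77, 512))        # -> (2, 77, 512)
# Both token sequences then enter the shared transformer stack.
```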
The Efficient Parallel Branch is a lightweight convolutional module that runs in parallel with the shared transformer, used only in the visual pathway, and incorporates multi-scale image features through depth-wise convolutions. This injects additional spatial information without excessively increasing computational cost.
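A possible reading of this design is sketched below: depth-wise convolutions with different kernel sizes operate on the reshaped patch grid and are fused back into the visual tokens through a residual connection. The specific scales, fusion scheme, and names are assumptions for illustration, not the paper's exact module.

```python
# Minimal sketch of an Efficient Parallel Branch: depth-wise convolutions over
# the patch grid, run alongside the shared transformer and fused back into the
# visual tokens. Kernel sizes, fusion, and names are illustrative assumptions.
import torch
import torch.nn as nn


class ParallelConvBranch(nn.Module):
    """Depth-wise conv branch that adds multi-scale spatial features to patch tokens."""

    def __init__(self, dim: int = 512, grid: int = 14):
        super().__init__()
        self.grid = grid
        # Depth-wise convs of different kernel sizes capture multi-scale context cheaply.
        self.dw3 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.dw5 = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)  # point-wise fusion

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape                        # (B, grid*grid, dim)
        x = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        x = self.fuse(self.dw3(x) + self.dw5(x))      # multi-scale depth-wise features
        return tokens + x.flatten(2).transpose(1, 2)  # residual fusion with tokens


branch = ParallelConvBranch()
patch_tokens = torch.randn(2, 196, 512)
enriched = branch(patch_tokens)  # same shape, spatially enriched
```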
Empirical Evaluation
The proposed MS-CLIP model demonstrates superior performance across various vision tasks compared to the conventional CLIP model. This is evidenced by up to a 13% improvement in zero-shot ImageNet classification when pre-trained on YFCC-100M, and consistently higher linear probing results across 24 downstream tasks. Importantly, MS-CLIP achieves these results with fewer parameters. The paper further reports that sharing parameters across modalities enhances semantic alignment between them, quantified through mutual-information and attention-pattern analyses.
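For context, the zero-shot protocol referenced here follows the standard CLIP recipe: encode class-name prompts and images into the joint embedding space and classify each image by cosine similarity to the prompt embeddings. The sketch below assumes a hypothetical `model` interface with `encode_image`/`encode_text` methods and a `tokenizer`; it is not the paper's released API.

```python
# Hedged sketch of CLIP-style zero-shot classification. The `model` and
# `tokenizer` interfaces are assumptions for illustration only.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(model, tokenizer, images, class_names):
    # Build one text prompt per class and encode both modalities.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(model.encode_text(tokenizer(prompts)), dim=-1)
    image_emb = F.normalize(model.encode_image(images), dim=-1)
    # Cosine similarity between each image and every class prompt.
    logits = image_emb @ text_emb.t()
    return logits.argmax(dim=-1)  # predicted class index per image
```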
Implications and Future Directions
The success of MS-CLIP suggests substantial potential for architectures that unify multiple modalities within shared computational structures, providing a pathway toward more parameter-efficient and semantically robust models. These findings have implications for improving performance on multi-modal tasks and could influence future research on vision-language models and multi-modal fusion. Further work could optimize these architectures for broader datasets or extend the insights to other applications within AI, emphasizing efficient cross-modal learning and resource utilization.