Post-pre-training for Modality Alignment in Vision-Language Foundation Models (2504.12717v1)

Published 17 Apr 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Contrastive language-image pre-training (CLIP) is an essential component of modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, its multi-modal feature space still suffers from a modality gap: a separation between image and text feature clusters that limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they incur heavy training costs on large datasets or degrade zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method applied to CLIP models between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with one epoch of training on small image-text datasets without degrading zero-shot performance. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from that prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels with outputs from the pre-trained CLIP model, which helps retain past knowledge while learning new knowledge for feature alignment. Our extensive experiments on multiple classification and retrieval tasks show that CLIP-Refine mitigates the modality gap and improves zero-shot performance.

Summary

  • The paper introduces CLIP-Refine, a novel post-pre-training method using small datasets to mitigate the modality gap in Vision-Language Foundation Models via Random Feature Alignment (RaFA) and Hybrid Contrastive-Distillation (HyCD).
  • Experimental results show CLIP-Refine effectively narrows the modality gap and enhances zero-shot performance, consistently outperforming naive contrastive and multi-modal mixup approaches in various tasks.
  • CLIP-Refine offers a computationally efficient way to refine pre-trained CLIP models while preserving zero-shot capabilities, suggesting that aligning feature distributions is key to balancing alignment and uniformity.

Overview of "Post-pre-training for Modality Alignment in Vision-Language Foundation Models"

The paper "Post-pre-training for Modality Alignment in Vision-Language Foundation Models" addresses a significant challenge in contrastive language-image pre-training (CLIP) for vision-language models: the modality gap. Despite the impressive zero-shot performance of CLIP models across various tasks, the modality gap continues to limit downstream performance. This gap refers to the separation between image and text feature clusters within the multi-modal feature space.
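
For intuition, the modality gap is commonly quantified as the distance between the centroids of the normalized image and text embeddings. Below is a minimal sketch of that measurement, assuming precomputed CLIP features; the function name, tensor shapes, and toy data are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_feats: torch.Tensor, text_feats: torch.Tensor) -> float:
    """Distance between the centroids of L2-normalized image and text
    embeddings: one common way to quantify the modality gap."""
    img_center = F.normalize(image_feats, dim=-1).mean(dim=0)
    txt_center = F.normalize(text_feats, dim=-1).mean(dim=0)
    return (img_center - txt_center).norm().item()

# Toy usage with random tensors standing in for precomputed CLIP features.
img_feats = torch.randn(1024, 512)
txt_feats = torch.randn(1024, 512) + 0.5  # offset mimics a gap between clusters
print(f"modality gap: {modality_gap(img_feats, txt_feats):.3f}")
```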

Key Contributions

The paper introduces CLIP-Refine, a novel post-pre-training method aimed at mitigating the modality gap without the need for extensive additional training or compromising zero-shot capabilities. The method leverages small-scale image-text datasets for one epoch of training, making it computationally viable. CLIP-Refine incorporates two innovative techniques:

  1. Random Feature Alignment (RaFA): This technique aligns image and text features with a shared distribution by minimizing the Euclidean distance to random reference vectors. These vectors are sampled from a prior distribution, ensuring the features of each modality conform to a common statistical framework. RaFA seeks to improve both the feature alignment and the inherent uniformity of these features on the hypersphere.
  2. Hybrid Contrastive-Distillation (HyCD): HyCD employs a modified self-distillation approach that blends ground-truth image-text pairing labels with predictions from the pre-trained CLIP model. This allows the model to preserve prior knowledge while assimilating new information, fostering improved feature alignment. Concretely, it uses KL-divergence-based distillation of the pre-trained model's outputs, balanced against the ground-truth labels (a minimal sketch of both losses follows this list).
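
To make the two objectives concrete, here is a minimal PyTorch sketch, assuming L2-normalized features from the model being refined (img_feats, txt_feats) and from the frozen pre-trained CLIP (img_feats_pre, txt_feats_pre). The Gaussian prior, the mixing weight alpha, and the temperature are illustrative assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def rafa_loss(img_feats, txt_feats):
    """Random Feature Alignment (sketch): pull both modalities toward randomly
    drawn reference vectors from a shared prior. A standard Gaussian prior and
    a reference shared within each pair are assumptions of this sketch."""
    ref = torch.randn_like(img_feats)  # random reference vectors from the prior
    return ((img_feats - ref).pow(2).sum(-1).mean()
            + (txt_feats - ref).pow(2).sum(-1).mean())

def hycd_loss(img_feats, txt_feats, img_feats_pre, txt_feats_pre,
              alpha=0.5, temperature=0.01):
    """Hybrid Contrastive-Distillation (sketch): the target distribution mixes
    the one-hot ground-truth pairing with the frozen pre-trained CLIP's
    similarity distribution; alpha and temperature are assumed values."""
    logits = img_feats @ txt_feats.t() / temperature  # student image-to-text logits
    with torch.no_grad():
        teacher = F.softmax(img_feats_pre @ txt_feats_pre.t() / temperature, dim=-1)
    gt = torch.eye(img_feats.size(0), device=img_feats.device)  # ground-truth pairs
    target = alpha * gt + (1.0 - alpha) * teacher                # hybrid soft labels
    # Image-to-text direction only; a symmetric text-to-image term is typically added.
    return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")
```

In a post-pre-training step, the two terms would be combined into a single loss and optimized for one epoch over a small image-text dataset, as described in the abstract.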

Experimental Results

Through extensive experiments across multiple classification and retrieval tasks, the paper demonstrates that CLIP-Refine succeeds in narrowing the modality gap, thereby enhancing zero-shot performance. Notably, CLIP-Refine consistently outperformed both naive contrastive approaches and previous multi-modal mixup techniques, highlighting its efficacy in improving the quality of the feature distribution while retaining prior knowledge.

Implications and Future Directions

Practically, CLIP-Refine offers a robust and computationally efficient approach for refining pre-trained CLIP models. It broadens the scope of modality alignment without compromising zero-shot capabilities, making it a valuable technique for implementing CLIP models in resource-constrained environments.

Theoretically, the findings suggest that aligning feature distributions rather than individual feature pairs can strike a crucial balance between alignment and uniformity on the hypersphere. This insight may inform the development of future methods to optimize representation learning. Additionally, given the paper's results regarding uniform prior distributions, methodological refinements could explore the impact of alternative priors.
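
As a reference point, alignment and uniformity on the hypersphere are often measured with the standard metrics from the contrastive representation learning literature (Wang and Isola, 2020). The sketch below uses those common definitions, not formulas from this paper, and assumes L2-normalized features.

```python
import torch

def alignment(img_feats, txt_feats, alpha=2):
    # Mean distance between matched image-text pairs; lower means tighter alignment.
    return (img_feats - txt_feats).norm(dim=-1).pow(alpha).mean()

def uniformity(feats, t=2):
    # Log of the mean Gaussian potential over all pairs; lower means more uniform.
    return torch.pdist(feats, p=2).pow(2).mul(-t).exp().mean().log()
```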

Looking ahead, research could further explore scalability, particularly how CLIP-Refine interacts with larger datasets or more complex models, such as those designed for multi-task learning. Moreover, since modality alignment benefits specific downstream tasks, further investigations may reveal domain-specific enhancements that build on the foundational work presented in this paper.