
Diffusion Feedback Helps CLIP See Better (2407.20171v4)

Published 29 Jul 2024 in cs.CV

Abstract: Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings, such as the inability to distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal LLMs (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to the lack of distinctiveness in the text and diversity in the images. In this work, we present a simple post-training approach for CLIP models that largely overcomes these visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA improves CLIP's performance to a large extent (e.g., 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code is available at https://github.com/baaivision/DIVA.

Enhancing CLIP's Fine-Grained Visual Perception via Diffusion Feedback

In recent years, Contrastive Language-Image Pre-training (CLIP) models have emerged as a robust foundation for multimodal tasks, including image classification, retrieval, visual grounding, and text-to-image generation. However, it is well documented that CLIP models exhibit notable deficiencies in perceiving fine-grained visual details such as orientation, quantity, color, and structural information. This paper addresses these visual shortcomings with a novel self-supervised post-training approach, DIVA (DIffusion model as a Visual Assistant), which leverages generative feedback from text-to-image diffusion models to optimize CLIP's representations.

Methodology

The core idea behind DIVA is to use a pre-trained diffusion model to provide generative guidance that enhances the visual features encoded by CLIP. The framework consists of two main components: the CLIP model and a conditional generative diffusion model. The CLIP model processes input images to extract visual features, which are incorporated into the condition provided to the diffusion model. The diffusion model is then tasked with reconstructing (denoising) a noised version of the input image, and optimization minimizes this reconstruction loss while the diffusion model's parameters remain fixed, so only the CLIP encoder is updated. This strategy enables the CLIP model to learn richer visual representations that encapsulate finer details.
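The training loop below is a minimal sketch of this setup under stated assumptions, not the authors' released implementation: the CLIPVisualEncoder and FrozenDiffusionModel classes are illustrative stand-ins, and the noising step is schematic rather than a full diffusion schedule. The point it illustrates is that the reconstruction (noise-prediction) loss backpropagates through the frozen diffusion model into the trainable CLIP encoder.

```python
# Hedged sketch of DIVA-style optimization: both modules below are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPVisualEncoder(nn.Module):
    """Illustrative stand-in for a CLIP image encoder returning [CLS] + patch tokens."""
    def __init__(self, dim=768):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, images):                                       # (B, 3, 224, 224)
        patches = self.patchify(images).flatten(2).transpose(1, 2)   # (B, 196, D)
        cls = self.cls_token.expand(images.size(0), -1, -1)          # (B, 1, D)
        return torch.cat([cls, patches], dim=1)                      # (B, 197, D)

class FrozenDiffusionModel(nn.Module):
    """Illustrative stand-in for a pre-trained text-to-image diffusion model whose
    condition slot receives CLIP visual tokens instead of text embeddings."""
    def __init__(self, dim=768):
        super().__init__()
        self.denoiser = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(dim, 3)

    def forward(self, noisy_images, timesteps, condition):
        # Predict the noise that was added to the images, given the visual condition.
        cond_bias = self.cond_proj(condition.mean(dim=1))[:, :, None, None]
        return self.denoiser(noisy_images) + cond_bias

clip_encoder = CLIPVisualEncoder()
diffusion = FrozenDiffusionModel()
for p in diffusion.parameters():                 # generative model stays frozen
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(clip_encoder.parameters(), lr=1e-5)

images = torch.randn(4, 3, 224, 224)             # a batch of images, no text needed
timesteps = torch.randint(0, 1000, (4,))
noise = torch.randn_like(images)
noisy_images = images + 0.1 * noise              # schematic noising, not a real schedule

condition = clip_encoder(images)                 # CLIP tokens become the condition
pred_noise = diffusion(noisy_images, timesteps, condition)
loss = F.mse_loss(pred_noise, noise)             # diffusion reconstruction loss

loss.backward()                                  # gradients flow only into CLIP
optimizer.step()
print(f"reconstruction loss: {loss.item():.4f}")
```

Because only the CLIP parameters receive gradient updates, the frozen generative model effectively acts as a fixed critic of how much visual detail the condition carries.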

A crucial aspect of DIVA's design is the incorporation of a visual dense recap scheme into the diffusion condition. By embedding both the global visual class token and a subset of localized patch features, the diffusion model can leverage dense visual information, thereby refining the CLIP model's features more effectively. The optimal balance of local patch tokens is empirically determined, ensuring sufficient detail without overwhelming the optimization process.
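A minimal sketch of such a condition builder is shown below; the random token selection and the 25% patch fraction are illustrative assumptions rather than the paper's exact recap scheme, and build_condition is a hypothetical helper.

```python
import torch

def build_condition(clip_tokens: torch.Tensor, patch_fraction: float = 0.25) -> torch.Tensor:
    """clip_tokens: (B, 1 + N, D) with the global class token first, then N patch tokens.
    Returns the class token concatenated with a subset of the patch tokens."""
    cls_tok, patch_toks = clip_tokens[:, :1], clip_tokens[:, 1:]
    num_keep = max(1, int(patch_toks.size(1) * patch_fraction))
    keep_idx = torch.randperm(patch_toks.size(1))[:num_keep]     # same subset for the batch
    return torch.cat([cls_tok, patch_toks[:, keep_idx]], dim=1)  # (B, 1 + num_keep, D)

# Example: 196 patch tokens, keep roughly a quarter of them alongside the class token.
tokens = torch.randn(4, 197, 768)
condition = build_condition(tokens)
print(condition.shape)  # torch.Size([4, 50, 768])
```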

Experimental Results

DIVA demonstrates a significant improvement in the fine-grained visual perception capabilities of various CLIP models. Extensive evaluations on the MMVP-VLM benchmark, which measures the visual abilities of vision-language models, show that DIVA consistently boosts performance by approximately 3-7%. The results indicate enhanced accuracy in recognizing detailed visual patterns such as orientation, specific features, state, quantity, relational context, color, structure, text, and perspective.

Moreover, CLIP backbones enhanced by DIVA also improve the performance of multimodal LLMs (MLLMs) on multimodal understanding tasks and of vision models on segmentation. Notably, the generalization capability of the CLIP models remains largely unaffected by the fine-tuning process: experiments across 29 image classification and retrieval benchmarks confirm that zero-shot performance is preserved, underscoring DIVA's effectiveness in maintaining the broad semantic understanding intrinsic to CLIP models.

Implications and Future Directions

The implications of this research are multi-faceted. The DIVA framework not only addresses specific visual deficiencies in CLIP models but also opens new avenues for enhancing foundational vision-language models without relying on additional image-text pairs. This self-supervised approach, leveraging generative diffusion feedback, establishes a versatile paradigm that can be scaled and adapted to other architectures and data modalities.

Future work could explore the scalability of DIVA by incorporating larger datasets and more complex model architectures to further push the boundaries of visual detail perception. Additionally, integrating DIVA with finer-grained supervision schemes could bolster the discriminative capabilities of CLIP models across a broader spectrum of tasks. Extending the framework to video and audio modalities is another promising direction, with the aim of developing a more comprehensive multimodal understanding system.

In conclusion, the DIVA framework exemplifies an effective strategy for enhancing the visual perception capabilities of CLIP models, demonstrating both practical benefits and theoretical advancements in self-supervised learning paradigms.

Authors (6)
  1. Wenxuan Wang
  2. Quan Sun
  3. Fan Zhang
  4. Yepeng Tang
  5. Jing Liu
  6. Xinlong Wang