Jina CLIP: Your CLIP Model Is Also Your Text Retriever (2405.20204v2)

Published 30 May 2024 in cs.CL, cs.AI, cs.CV, and cs.IR

Abstract: Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.

JINA CLIP: Your CLIP Model Is Also Your Text Retriever

In this paper, the authors present a novel approach to contrastive training on large-scale image-caption pairs and text pairs, addressing the limitations of traditional CLIP models on text-only tasks. The proposed method jointly optimizes text-image and text-text representation alignment and is used to train the jina-clip-v1 model.

Traditional CLIP models, such as the one introduced by OpenAI, align images and their corresponding captions within a shared embedding space. However, these models often struggle with text-only tasks due to their limited context length and training on short image captions. This deficiency creates inefficiencies in information retrieval systems that need separate embeddings and models for text-only and multimodal tasks.

Methodology

The authors propose a multi-task contrastive training method to address these issues. The text encoder uses the JinaBERT architecture, augmented with ALiBi to support longer input sequences, while the image encoder is based on the EVA02 architecture. The text encoder is pre-trained with BERT's masked language modeling objective, which the authors find yields better performance than other initialization methods.
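To make the joint objective concrete, below is a minimal PyTorch sketch of a multi-task contrastive loss that sums a symmetric InfoNCE term over text-image pairs with one over text-text pairs. The temperature, weights, and function names are illustrative assumptions, not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over in-batch negatives for two aligned batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_task_loss(text_emb, image_emb, query_emb, passage_emb, w_img=1.0, w_txt=1.0):
    """Joint objective: align captions with images AND queries with passages.

    text_emb/image_emb come from image-caption pairs; query_emb/passage_emb from
    text-text pairs. Both text batches pass through the same text encoder, so a
    single backward step optimizes both alignments. The weights w_img and w_txt
    are illustrative hyperparameters, not values taken from the paper."""
    return w_img * info_nce(text_emb, image_emb) + w_txt * info_nce(query_emb, passage_emb)
```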

The training is executed in three stages:

  1. Stage 1: Aligns image and text representations using short, human-written captions together with text-text pairs of up to 77 tokens.
  2. Stage 2: Introduces longer synthetic image captions while continuing text-text training with an extended context length of up to 512 tokens.
  3. Stage 3: Incorporates hard negatives into the text-text training to sharpen the text encoder's relevance separation (a sketch of such a loss follows this list).
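The Stage 3 modification can be sketched as an InfoNCE loss whose candidate set is extended with mined hard negatives. This is a hedged illustration of the general technique; the shapes, temperature, and mining procedure are assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(query_emb, positive_emb, hard_neg_emb, temperature: float = 0.05):
    """InfoNCE in which each query scores against its own positive, the positives
    of the other in-batch queries, and K explicitly mined hard negatives.

    Illustrative shapes: query_emb (B, D), positive_emb (B, D), hard_neg_emb (B, K, D)."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    in_batch = q @ p.t() / temperature                      # (B, B): diagonal holds the positives
    hard = torch.einsum("bd,bkd->bk", q, n) / temperature   # (B, K): mined hard negatives
    logits = torch.cat([in_batch, hard], dim=1)             # correct class for row i stays at column i
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```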

The training data comprises diverse datasets for text-pair and text-image matching, including LAION-400M, ShareGPT4V, and curated text triplets from MSMarco, NQ, HotpotQA, and NLI.

Results

The model's performance is evaluated on the CLIP and MTEB benchmarks. Notable results include:

  • CLIP Benchmark: jina-clip-v1 achieves an average Recall@5 of 85.8% across zero-shot retrieval tasks, outperforming OpenAI’s CLIP and matching EVA-CLIP's performance with significantly less training data.
  • MTEB Benchmark: The text encoder achieves an average score of 60.12%, closely competing with top-tier text-only models and surpassing other CLIP models by approximately 15% in overall performance and 22% in retrieval tasks.

The specific performance metrics on the CLIP Benchmark reveal substantial gains in both zero-shot image and text retrieval, and competitive classification accuracy across diverse datasets. Notably, for zero-shot text-to-image Recall@5, jina-clip-v1 attains 80.31%, surpassing OpenAI CLIP (75.62%) and nearing LongCLIP (81.72%).

Additionally, the model shows strong performance across the MTEB benchmark's task categories, which include classification, clustering, reranking, retrieval, sentence similarity (STS), and summarization. For instance, it achieves an average accuracy of 72.05% on classification tasks, comparable to dedicated text models.

Implications and Future Work

The introduction of jina-clip-v1 has significant implications for multimodal information retrieval systems: a single unified model can handle both text-image and text-only tasks efficiently. This removes the need to maintain separate models, yielding resource and computational savings in practical applications.
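As a usage illustration of this point, the sketch below assumes the publicly released jina-clip-v1 checkpoint on Hugging Face and the encode_text / encode_image convenience methods described on its model card; the helper names, return types, and example inputs are assumptions, not details verified against the paper.

```python
import numpy as np
from transformers import AutoModel

# One checkpoint serves both retrieval modes. trust_remote_code pulls in the
# custom encode_text / encode_image helpers published with the checkpoint;
# exact helper names and return types may differ between releases.
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

queries = ["What biases attention so models generalize to longer inputs?"]
passages = ["ALiBi adds a linear distance penalty to attention scores."]
images = ["https://example.com/figure1.png"]  # hypothetical URL, for illustration only

q_emb = np.asarray(model.encode_text(queries))    # query embeddings (text encoder)
p_emb = np.asarray(model.encode_text(passages))   # passage embeddings -> text-text retrieval
i_emb = np.asarray(model.encode_image(images))    # image embeddings  -> text-image retrieval

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("text-text similarity: ", cosine(q_emb[0], p_emb[0]))
print("text-image similarity:", cosine(q_emb[0], i_emb[0]))
```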

Future work should extend the model's capabilities to multilingual contexts, since the current model is limited to English-language texts, which would broaden its range of applications. Further research could also explore more diverse and extensive training datasets for continued performance gains, and adapt similar methodologies to other multimodal scenarios.

Overall, the proposed methodology and resulting model, jina-clip-v1, represent a meaningful advancement in the field of contrastive learning, offering a robust solution for multimodal and text-only information retrieval tasks.

Authors (14)
  1. Andreas Koukounas
  2. Georgios Mastrapas
  3. Michael Günther
  4. Bo Wang
  5. Scott Martens
  6. Isabelle Mohr
  7. Saba Sturua
  8. Mohammad Kalim Akram
  9. Joan Fontanals Martínez
  10. Saahil Ognawala
  11. Susana Guzman
  12. Maximilian Werk
  13. Nan Wang
  14. Han Xiao