An Analysis of Long-CLIP: Enhancing CLIP's Long-Text Capabilities
The paper "Long-CLIP: Unlocking the Long-Text Capability of CLIP" presents an advanced framework designed to extend the capabilities of Contrastive Language-Image Pre-training (CLIP) by accommodating longer text inputs. CLIP, a foundational multimodal model, has been instrumental in tasks such as zero-shot classification, text-image retrieval, and text-to-image generation. However, its limitation in processing longer text descriptions has constrained its usability. This paper addresses these limitations by proposing Long-CLIP, a significant modification enhancing CLIP's ability to process detailed textual data.
Key Contributions
The original CLIP model caps text input at 77 tokens, and its effective length is even shorter, under 20 tokens, because the later positional embeddings are poorly trained. This hampers CLIP's ability to handle the detailed descriptions needed for nuanced image retrieval and generation. Long-CLIP offers a plug-and-play replacement for CLIP that supports long text inputs while preserving, and in some cases improving, its zero-shot generalizability.
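To make the constraint concrete, the sketch below tokenizes a long caption with the Hugging Face CLIP tokenizer (an assumption made purely for illustration; the paper itself builds on the original CLIP codebase) and shows that everything past position 77 is simply discarded.

```python
# Minimal sketch of CLIP's hard 77-token cap (Hugging Face tokenizer assumed
# for illustration only; the paper works with the original CLIP code).
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

long_caption = (
    "A cluttered wooden desk near a rain-streaked window, holding a brass lamp, "
    "three dog-eared paperbacks, a chipped blue mug of cold coffee, and a sleeping "
    "tabby cat whose tail drapes over a stack of handwritten letters"
)

# CLIP's text encoder accepts at most 77 token positions (including special
# tokens), so longer descriptions are silently truncated.
tokens = tokenizer(long_caption, truncation=True, max_length=77)
print(len(tokens["input_ids"]))  # never exceeds 77
```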
The research introduces two innovative strategies:
- Knowledge-Preserved Stretching of Positional Embedding: The first 20 well-trained positional embeddings are kept unchanged, while the remaining, under-trained positions are interpolated to extend the context well beyond 77 tokens without disrupting the representations CLIP has already learned (see the first sketch after this list).
- Primary Component Matching of CLIP Features: Fine-grained image features are aligned with long, detailed captions, while coarse-grained image features, reconstructed from the principal components of the fine-grained ones, are aligned with short summary captions. This dual alignment lets the model capture comprehensive detail while still prioritizing the most important attributes (a second sketch follows below).
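Under simple assumptions, the stretching step can be sketched as follows: the first 20 positional embeddings are copied verbatim, and the remaining 57 are linearly interpolated to fill an extended table of 248 positions (the length reported in the paper). The `stretch_positional_embedding` helper and the tensor shapes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep: int = 20,
                                 target_len: int = 248) -> torch.Tensor:
    """Knowledge-preserved stretching (illustrative sketch).

    pos_emb: original positional embeddings of shape (77, dim).
    The first `keep` positions are preserved verbatim; the remaining
    positions are linearly interpolated to fill target_len - keep slots.
    """
    kept = pos_emb[:keep]                       # (20, dim): well-trained positions
    rest = pos_emb[keep:]                       # (57, dim): sparsely trained positions

    # Interpolate the tail along the position axis: (1, dim, 57) -> (1, dim, 228).
    stretched = F.interpolate(
        rest.T.unsqueeze(0),
        size=target_len - keep,
        mode="linear",
        align_corners=True,
    ).squeeze(0).T                              # back to (228, dim)

    return torch.cat([kept, stretched], dim=0)  # (248, dim)

# Example with a dummy CLIP-sized embedding table (77 positions, 512 dims).
original = torch.randn(77, 512)
extended = stretch_positional_embedding(original)
print(extended.shape)                                # torch.Size([248, 512])
print(torch.allclose(extended[:20], original[:20]))  # True: first 20 untouched
```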
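Primary component matching can be sketched in a similar spirit: a coarse-grained image feature is reconstructed from the top principal components of the fine-grained features and aligned with short summary captions, while the full features are aligned with long captions. The batch-level PCA and the loss weighting below are assumptions for illustration rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def coarse_from_fine(img_feats: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Coarse-grained features via the top-k principal components of a batch of
    fine-grained features (batch-level PCA is an illustrative assumption)."""
    mean = img_feats.mean(dim=0, keepdim=True)
    centered = img_feats - mean
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)  # principal directions
    top = vh[:k]                                                # (k, dim)
    # Project onto the top-k components and reconstruct in feature space.
    return centered @ top.T @ top + mean

def long_clip_style_loss(img_fine, txt_long, txt_short, temperature=0.01, alpha=1.0):
    """Dual contrastive alignment (sketch): fine features vs. long captions,
    coarse features vs. short captions. alpha is an assumed weighting."""
    img_coarse = coarse_from_fine(img_fine)

    def clip_loss(img, txt):
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        logits = img @ txt.T / temperature
        targets = torch.arange(img.size(0), device=img.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    return clip_loss(img_fine, txt_long) + alpha * clip_loss(img_coarse, txt_short)

# Toy usage with random features standing in for encoder outputs.
B, D = 64, 512
loss = long_clip_style_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```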
Experimental Insights
The authors validate Long-CLIP through extensive experiments. Fine-tuning on an additional one million long text-image pairs yields marked gains in retrieval: roughly 20% on long-caption image retrieval and about 6% on traditional benchmarks such as COCO and Flickr30k, while zero-shot classification performance remains essentially unaffected despite the longer input texts.
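Results of this kind are typically reported as recall@K over cosine similarities between image and text embeddings; the sketch below shows that evaluation pattern with placeholder feature tensors standing in for real CLIP or Long-CLIP outputs.

```python
import torch
import torch.nn.functional as F

def recall_at_k(img_feats: torch.Tensor, txt_feats: torch.Tensor, k: int = 1) -> float:
    """Text-to-image recall@K for paired features (row i of each tensor is a pair)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sims = txt @ img.T                         # (num_texts, num_images)
    topk = sims.topk(k, dim=-1).indices        # top-k candidate images per caption
    targets = torch.arange(txt.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Placeholder embeddings, not the paper's data.
imgs, txts = torch.randn(1000, 512), torch.randn(1000, 512)
print(f"R@1: {recall_at_k(imgs, txts, k=1):.3f}")
print(f"R@5: {recall_at_k(imgs, txts, k=5):.3f}")
```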
Notably, Long-CLIP preserves the latent space alignment of the original CLIP, so it can be integrated into existing frameworks without further modification. This is demonstrated in text-to-image generation with Stable Diffusion, where Long-CLIP outperforms its predecessor by allowing far more detailed prompts to reach the generation process.
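Because the latent space stays aligned, the plug-and-play claim amounts to swapping the text stack inside an existing pipeline. The sketch below assumes a Long-CLIP checkpoint exported in Hugging Face `CLIPTextModel`/`CLIPTokenizer` format at a hypothetical local path (the official release ships its own loading code) and uses a Stable Diffusion 1.5 pipeline purely as an example host, not necessarily the setup used in the paper.

```python
# Conceptual sketch: dropping a Long-CLIP text encoder into a Stable Diffusion
# pipeline. "path/to/longclip-text" is a hypothetical local checkpoint exported
# in Hugging Face format; the official Long-CLIP release provides its own loader.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Replace the 77-token CLIP text stack with the long-context variant. The
# pipeline tokenizes up to pipe.tokenizer.model_max_length, so the longer
# context is used automatically (assuming the exported tokenizer sets it to 248).
pipe.tokenizer = CLIPTokenizer.from_pretrained("path/to/longclip-text")
pipe.text_encoder = CLIPTextModel.from_pretrained(
    "path/to/longclip-text", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# A detailed, multi-clause prompt that the original encoder would truncate.
prompt = ("A sunlit greenhouse filled with hanging ferns, a copper watering can on a "
          "mossy stone bench, rays of dust-filled light crossing rows of terracotta "
          "pots, and a small orange cat curled beside a stack of seed packets")
image = pipe(prompt).images[0]
```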
Implications and Future Directions
The implications of Long-CLIP are twofold. Practically, it expands the scope of CLIP-enabled applications by handling long and intricate text descriptions, enhancing its utility in detailed image annotation, retrieval, and generation. Theoretically, the techniques of positional embedding stretching and primary component matching offer a template for similar enhancements in other multimodal models that face context-length limitations.
Future developments could explore further scalability and potential applications of Long-CLIP in more complex multimodal settings. Given the burgeoning demand for sophisticated AI models capable of processing comprehensive text inputs, Long-CLIP represents a substantial step forward, demonstrating the continued evolution of CLIP-based architectures in addressing the intricate challenges posed by multimodal learning.
In conclusion, the introduction of Long-CLIP marks a substantial advancement in the landscape of multimodal models, providing a robust solution to the prevalent challenge of limited text input capabilities within CLIP. The methodological innovations and empirical accomplishments of this paper lay the groundwork for subsequent explorations into enhanced text-image alignment techniques.