TULIP: Towards Unified Language-Image Pretraining
The TULIP model aims to bridge the gap between the semantic alignment capabilities of image-text contrastive models such as CLIP and SigLIP and the fine-grained visual understanding that these approaches often lack but vision-centric models provide ("TULIP: Towards Unified Language-Image Pretraining", Tang et al., 19 Mar 2025). Standard contrastive pretraining prioritizes aligning global image and text representations, which can inadvertently suppress the learning of detailed visual features crucial for tasks such as counting, depth estimation, or fine-grained object recognition. TULIP introduces several modifications to the pretraining objective that learn richer, more detailed visual representations while maintaining strong language alignment, positioning it as an enhanced, open-source drop-in replacement for existing CLIP-style models.
Methodology
TULIP builds upon the foundation of contrastive image-text pretraining but incorporates several key enhancements designed to improve fine-grained visual feature learning. The core components include:
- Generative Data Augmentation: Unlike standard geometric or photometric augmentations, TULIP employs generative methods to create more diverse and potentially challenging positive examples during training. This likely involves using generative models (e.g., diffusion models, GANs) conditioned on either the image or text to synthesize new views or related concepts, pushing the encoders to learn more robust and detailed features that go beyond simple invariances.
- Enhanced Contrastive Learning: The standard image-text contrastive loss ($\mathcal{L}_{\text{I-T}}$) aligns representations across modalities. TULIP supplements this with intra-modal contrastive objectives:
- Image-Image Contrastive Loss ($\mathcal{L}_{\text{I-I}}$): This loss encourages different augmented views of the same image to have similar representations while being dissimilar from views of other images. This objective, often used in self-supervised visual learning (e.g., SimCLR, MoCo), explicitly forces the vision encoder to learn invariances useful for visual understanding, potentially capturing finer details missed by the cross-modal objective alone. The combined contrastive loss might look like $\mathcal{L}_{\text{I-T}} + \lambda_{\text{I-I}}\,\mathcal{L}_{\text{I-I}} + \lambda_{\text{T-T}}\,\mathcal{L}_{\text{T-T}}$, where $\lambda_{\text{I-I}}$ and $\lambda_{\text{T-T}}$ are weighting coefficients.
- Text-Text Contrastive Loss ($\mathcal{L}_{\text{T-T}}$): Similarly, this loss operates on augmented versions of the text captions (e.g., back-translation, synonym replacement, paraphrasing). It encourages the text encoder to produce consistent representations for semantically similar texts, potentially improving the nuance and robustness of the learned text embeddings.
- Image/Text Reconstruction Regularization: Drawing inspiration from masked autoencoders (MAE) or similar reconstruction-based methods, TULIP incorporates regularization terms that force the model to reconstruct parts of the input.
- Image Reconstruction ($\mathcal{L}_{\text{rec-I}}$): A portion of the image (e.g., patches) might be masked, and the vision encoder's output features are used to predict the masked content, possibly via a lightweight decoder head during pretraining. This encourages the encoder to capture low-level pixel information and spatial relationships.
- Text Reconstruction ($\mathcal{L}_{\text{rec-T}}$): Analogously, tokens within the text input might be masked, and the text encoder's features are used to predict the masked tokens. This acts similarly to masked language modeling (MLM) in BERT, promoting a deeper understanding of language structure and semantics.
The overall objective function combines these components, likely with hyperparameters to balance their contributions:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{I-T}} + \lambda_{\text{I-I}}\,\mathcal{L}_{\text{I-I}} + \lambda_{\text{T-T}}\,\mathcal{L}_{\text{T-T}} + \lambda_{\text{rec-I}}\,\mathcal{L}_{\text{rec-I}} + \lambda_{\text{rec-T}}\,\mathcal{L}_{\text{rec-T}}$$
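The paper's exact loss formulation and weights are not reproduced here; the following is a minimal PyTorch-style sketch of how such a combined objective could be assembled, assuming precomputed (projected) image/text embeddings, embeddings of augmented views, and decoder outputs for the reconstruction terms. All function names and weight values are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i].
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def combined_loss(img_emb, txt_emb, img_emb_aug, txt_emb_aug,
                  img_recon, img_target, txt_logits, txt_target,
                  lam_ii=1.0, lam_tt=1.0, lam_ri=1.0, lam_rt=1.0):
    """Hypothetical combination of the losses described above (weights are placeholders)."""
    l_it = contrastive_loss(img_emb, txt_emb)       # image-text alignment
    l_ii = contrastive_loss(img_emb, img_emb_aug)   # image-image (augmented/generated views)
    l_tt = contrastive_loss(txt_emb, txt_emb_aug)   # text-text (paraphrased captions)
    l_ri = F.mse_loss(img_recon, img_target)        # masked-patch reconstruction
    # masked-token prediction; unmasked positions carry the ignore label
    l_rt = F.cross_entropy(txt_logits.transpose(1, 2), txt_target, ignore_index=-100)
    return l_it + lam_ii * l_ii + lam_tt * l_tt + lam_ri * l_ri + lam_rt * l_rt
```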
The architecture typically involves separate image and text encoders (e.g., ViT for images, Transformer for text) projected into a shared embedding space for contrastive learning. The reconstruction losses might operate on the encoder outputs before projection or require separate decoders during pretraining, which are discarded during inference. The model has been scaled up to over 1 billion parameters, suggesting the use of large Vision Transformer (ViT) backbones (e.g., ViT-L or ViT-H variants) and corresponding text transformers.
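As a schematic of this layout (not TULIP's actual code), a dual encoder with projection heads for the contrastive space and lightweight reconstruction heads used only during pretraining could look like the following; module names, sizes, and the backbone interfaces are assumptions.

```python
import torch.nn as nn

class DualEncoder(nn.Module):
    """Schematic CLIP-style dual encoder with pretraining-only reconstruction heads."""
    def __init__(self, vision_backbone, text_backbone, vis_dim, txt_dim,
                 embed_dim=768, patch_pixels=16 * 16 * 3, vocab_size=32000):
        super().__init__()
        self.visual = vision_backbone            # e.g., a ViT returning pooled + patch features
        self.text = text_backbone                # e.g., a Transformer returning pooled + token features
        self.visual_proj = nn.Linear(vis_dim, embed_dim)   # projection into shared contrastive space
        self.text_proj = nn.Linear(txt_dim, embed_dim)
        # Lightweight heads for the reconstruction losses, applied to per-patch / per-token
        # features during pretraining and discarded afterwards.
        self.image_decoder = nn.Linear(vis_dim, patch_pixels)
        self.text_decoder = nn.Linear(txt_dim, vocab_size)

    def encode_image(self, pixels):
        pooled, _patch_feats = self.visual(pixels)   # assumed backbone interface
        return self.visual_proj(pooled)

    def encode_text(self, tokens):
        pooled, _token_feats = self.text(tokens)     # assumed backbone interface
        return self.text_proj(pooled)
```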
Training and Evaluation
TULIP models are pretrained on large-scale datasets of image-text pairs, typical for CLIP-like models, likely encompassing billions of pairs scraped from the web. Training such large models necessitates significant computational resources, involving distributed training across numerous GPUs or TPUs for extended periods.
Evaluation focuses on demonstrating the improved fine-grained visual understanding while retaining strong semantic alignment. Key benchmarks and results reported include:
- Zero-Shot Classification: Achieved state-of-the-art (SOTA) performance on ImageNet-1K zero-shot classification, indicating strong generalization and robust image-text alignment comparable to or exceeding top models like SigLIP.
- Few-Shot/Linear Probing: On RxRx1, a dataset focused on identifying cellular perturbations (a fine-grained biological imaging task), TULIP showed up to a 2x improvement over SigLIP in linear probing performance. This result strongly supports the claim that the enhanced pretraining objectives successfully improve the learning of fine-grained visual features applicable to specialized domains.
- Vision-Language Understanding: Evaluated on MMVP (Multimodal Visual Patterns), a benchmark designed to assess detailed visual understanding through targeted prompting (e.g., "How many objects of type X are in the image?"). TULIP achieved scores over 3x higher than SigLIP, highlighting significantly improved capabilities on tasks requiring precise visual analysis guided by language, such as counting or spatial relationship understanding.
These results suggest that the combination of enhanced contrastive losses and reconstruction regularization effectively addresses the trade-off between high-level semantics and low-level visual detail inherent in standard contrastive learning.
Practical Implementation and Applications
TULIP is designed as a drop-in replacement for existing CLIP or SigLIP models. Practitioners can leverage the pretrained TULIP checkpoints for various downstream tasks without significant architectural changes to their existing pipelines.
- Feature Extraction: The image and text encoders can be used independently to extract rich embeddings for tasks like image retrieval, clustering, or text classification.
```python
import torch
from PIL import Image
import tulip  # Assuming a library similar to 'clip' or 'open_clip'

# Load the TULIP model and preprocessing transform
model, preprocess = tulip.load("TULIP-B-16", device="cuda")  # Example model name

# Process image
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to("cuda")
with torch.no_grad():
    image_features = model.encode_image(image)  # Shape: [1, embedding_dim]

# Process text
text = tulip.tokenize(["a photo of a cat", "a photo of a dog"]).to("cuda")
with torch.no_grad():
    text_features = model.encode_text(text)  # Shape: [2, embedding_dim]

# Normalize features for similarity calculation
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Calculate similarity
similarity = image_features @ text_features.T  # Cosine similarity
```
- Zero-Shot Classification: Use the model directly for classification by comparing image features against text features generated from class prompts (e.g., "a photo of a {class_name}"). TULIP's SOTA ImageNet performance suggests strong out-of-the-box capabilities.
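Continuing with the hypothetical tulip API from the snippet above, a zero-shot classifier can be built by encoding one prompt per class and taking the argmax over image-text similarities; a minimal sketch:

```python
import torch
from PIL import Image
import tulip  # hypothetical loader, as in the snippet above

model, preprocess = tulip.load("TULIP-B-16", device="cuda")
class_names = ["cat", "dog", "bird"]
prompts = [f"a photo of a {c}" for c in class_names]

with torch.no_grad():
    # Encode class prompts once and reuse them for every image.
    text_features = model.encode_text(tulip.tokenize(prompts).to("cuda"))
    text_features /= text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to("cuda")
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarities -> class probabilities
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(class_names[probs.argmax().item()], probs.max().item())
```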
- Linear Probing / Fine-tuning: The pretrained backbone provides a powerful starting point for transfer learning. Linear probes (training a linear classifier on frozen features) or full fine-tuning can be applied for specific downstream tasks, especially those requiring fine-grained distinctions where TULIP is expected to excel (e.g., medical imaging, species identification, defect detection). The RxRx1 results demonstrate its effectiveness in few-shot scenarios using linear probing.
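A minimal linear-probe setup, reusing the model and preprocess objects from the snippets above and assuming scikit-learn for the classifier (dataset loading omitted); the helper name and variables are illustrative:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, preprocess, images, device="cuda"):
    # images: a list of PIL images; returns an [N, embedding_dim] numpy array of frozen features.
    feats = []
    for img in images:
        x = preprocess(img).unsqueeze(0).to(device)
        f = model.encode_image(x)
        feats.append((f / f.norm(dim=-1, keepdim=True)).cpu().numpy())
    return np.concatenate(feats, axis=0)

# train_images / train_labels / test_images / test_labels come from the downstream dataset.
train_feats = extract_features(model, preprocess, train_images)
test_feats = extract_features(model, preprocess, test_images)

probe = LogisticRegression(max_iter=1000)   # linear classifier on frozen features
probe.fit(train_feats, train_labels)
print("linear-probe accuracy:", probe.score(test_feats, test_labels))
```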
- Vision-Language Models: TULIP's encoders can serve as the visual front end for more complex multimodal models (e.g., VQA systems, image captioning models). Its improved performance on MMVP suggests it provides better-grounded visual information for language-conditioned tasks.
- Deployment Considerations: Models scaled to 1B parameters require substantial memory and compute for inference. Quantization, pruning, or distillation techniques might be necessary for deployment in resource-constrained environments. The availability of multiple model sizes (if provided, akin to CLIP's ViT-B, ViT-L variants) would offer trade-offs between performance and efficiency.
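As one illustration of these trade-offs, the encoders can be run in half precision on GPU, or their linear layers dynamically quantized to int8 for CPU inference with standard PyTorch utilities; whether this preserves TULIP's fine-grained gains would need to be validated per task. A sketch, assuming the model object loaded earlier:

```python
import copy
import torch

# GPU: half-precision inference (roughly halves memory for a ~1B-parameter encoder).
model_fp16 = copy.deepcopy(model).half().eval()

# CPU: dynamic int8 quantization of the linear layers.
model_int8 = torch.quantization.quantize_dynamic(
    copy.deepcopy(model).float().cpu().eval(), {torch.nn.Linear}, dtype=torch.qint8
)
```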
The open-source release of code and checkpoints is crucial for adoption, allowing researchers and engineers to directly integrate TULIP into their applications and build upon its improved representational capabilities.
Conclusion
TULIP presents a refined approach to image-text pretraining that explicitly targets the limitations of prior contrastive methods in capturing fine-grained visual detail. By integrating generative augmentation, enhanced intra-modal contrastive learning, and reconstruction regularization, it successfully learns richer visual features without compromising the strong semantic alignment crucial for zero-shot generalization. The reported SOTA results on ImageNet zero-shot and significant improvements on fine-grained (RxRx1) and detailed vision-language tasks (MMVP) validate its effectiveness. As an open-source, drop-in replacement, TULIP offers a potentially superior foundation model for a wide range of vision and vision-language applications demanding both semantic understanding and detailed visual acuity.