Language-Image Alignment with Fixed Text Encoders (2506.04209v1)

Published 4 Jun 2025 in cs.CV

Abstract: Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly joint training is necessary. In particular, we investigate if a pre-trained fixed LLM offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this much simplified framework LIFT is highly effective and it outperforms CLIP in most scenarios that involve compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step towards systematically exploring how text embeddings from LLMs can guide visual learning and suggests an alternative design choice for learning language-aligned visual representations.

Summary

  • The paper introduces LIFT, a method that freezes the text encoder and trains only the image encoder to achieve efficient multimodal alignment.
  • It demonstrates significant compositional reasoning gains, including a 7.4% average accuracy increase on the SugarCrepe benchmark.
  • LIFT offers improved computational efficiency by reducing per-sample FLOPs by up to 35.7% and lowering GPU memory usage, simplifying training pipelines.

Language-Image Alignment with Fixed Text Encoders: An Expert Overview

This work proposes a paradigm shift in language-image alignment by introducing LIFT (Language-Image alignment with a Fixed Text encoder), which leverages LLMs as fixed, pre-trained text encoders while training only the image encoder for alignment. The central question addressed is whether the prevalent, resource-intensive approach of joint, end-to-end contrastive training for both text and image encoders, as in CLIP, is strictly necessary. By freezing the text encoder, LIFT decouples the image and text branches, with significant implications for efficiency, scalability, and representation quality.

Methodology

LIFT retains the dual-encoder structure typical of CLIP-like models, comprising separate image and text encoders that map images and text into a shared embedding space. However, the LIFT pipeline diverges in the following key aspects:

  • Text Encoder: Instead of training the text encoder from scratch, LIFT employs a frozen, contrastively fine-tuned LLM-based encoder (e.g., NV-Embed-V2). All textual embeddings are pre-computed offline, making the textual branch entirely static during image encoder training.
  • Image Encoder: Only the image encoder (a ViT) and a small projection head are trained, with alignment losses computed against the fixed text embeddings.
  • Loss Functions: In addition to the standard contrastive loss, the paper evaluates a pure cosine similarity loss (without negative samples), possible due to the absence of mode collapse risk when the text encoder is fixed.
  • Efficiency: The computational advantage is realized as all per-sample language FLOPs and memory are eliminated from the main training loop; only the image encoder consumes resources, and the computational cost of pre-computing text embeddings is amortized and parallelizable.

Algorithmic Structure

Outlined below is the practical training flow for LIFT:

import torch
import torch.nn.functional as F

# One-off offline step: embed every caption with the frozen LLM-based text encoder
with torch.no_grad():
    text_embeddings = {caption: LLM_text_encoder(caption) for caption in dataset.captions}

# Only the image encoder and a small projection head are trained
image_encoder = VisionTransformer(...)
proj_head = MLP(...)
optimizer = torch.optim.AdamW(list(image_encoder.parameters()) + list(proj_head.parameters()))

for images, captions in dataloader:
    # Look up the corresponding precomputed text embeddings (no text forward pass)
    batch_text_embeds = torch.stack([text_embeddings[cap] for cap in captions])

    # Obtain image embeddings and project them into the shared embedding space
    img_embeds = proj_head(image_encoder(images))
    img_embeds = F.normalize(img_embeds, dim=-1)
    text_embeds = F.normalize(batch_text_embeds, dim=-1)

    # Loss: either contrastive or pure cosine similarity
    # Cosine loss example (positive pairs only): per-pair term has shape [batch_size], the mean is a scalar
    loss = (1 - (img_embeds * text_embeds).sum(dim=-1)).mean()

    # Backpropagation through the image branch only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

This decoupled process leads to several implementation and deployment advantages, such as straightforward scaling to longer captions and larger batch sizes, reduced GPU memory footprint, and simplified data pipelines.

Empirical Results and Analysis

Compositional Reasoning

LIFT demonstrates substantial gains in compositional understanding tasks compared to CLIP. On the SugarCrepe benchmark, which focuses on compositional manipulations (object, attribute, relation add/replace/swap), LIFT achieves average accuracy gains of 7.4%. The improvements are particularly strong on object-attribute and object-relation association tasks, indicating better semantic structuring in the embedding space.
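
SugarCrepe and similar benchmarks score a model by whether it assigns higher image-text similarity to the true caption than to a minimally perturbed hard negative (e.g., a swapped attribute or relation). The following is a minimal sketch of that evaluation protocol; encode_image and encode_text are hypothetical stand-ins for LIFT's trained image branch (with projection head) and the frozen LLM text encoder.

import torch
import torch.nn.functional as F

def compositional_accuracy(samples, encode_image, encode_text):
    """samples: iterable of (image, positive_caption, hard_negative_caption) triples."""
    correct, total = 0, 0
    with torch.no_grad():
        for image, pos_caption, neg_caption in samples:
            img = F.normalize(encode_image(image), dim=-1)
            pos = F.normalize(encode_text(pos_caption), dim=-1)
            neg = F.normalize(encode_text(neg_caption), dim=-1)
            # Correct when the true caption is closer to the image than the perturbed one
            correct += int((img @ pos) > (img @ neg))
            total += 1
    return correct / max(total, 1)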

Zero-Shot Classification and Retrieval

LIFT matches or exceeds CLIP in zero-shot classification and retrieval tasks when trained on datasets with long, synthetically detailed captions. When trained on short, web-scraped captions, CLIP retains a slight edge in some retrieval setups, but this advantage shrinks, and often reverses in LIFT's favor, once rich synthetic captions are used. The robustness to caption length and syntactic homogeneity is attributed to the LLM text encoder's semantic richness and resistance to shortcut learning based on syntactic similarity.
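
Zero-shot classification under LIFT follows the usual CLIP-style protocol: class names are wrapped in a prompt, embedded once by the frozen text encoder, and each image is assigned the class whose prompt embedding it is most similar to. A minimal sketch, again with hypothetical encode_image / encode_text helpers and an illustrative prompt template:

import torch
import torch.nn.functional as F

def zero_shot_classify(images, class_names, encode_image, encode_text,
                       template="a photo of a {}"):
    with torch.no_grad():
        # Embed each class prompt once with the frozen text encoder
        class_embeds = torch.stack([encode_text(template.format(n)) for n in class_names])
        class_embeds = F.normalize(class_embeds, dim=-1)           # [num_classes, dim]

        # Embed images with the trained image branch and pick the nearest class prompt
        img_embeds = F.normalize(encode_image(images), dim=-1)     # [batch, dim]
        logits = img_embeds @ class_embeds.T                       # cosine similarities
        return logits.argmax(dim=-1)                               # predicted class indices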

Impact of Training Data and Text Encoder Choice

  • Synthetic vs. Web-Scraped Captions: LIFT is resilient to the “inverse scaling effect” seen in CLIP, where longer, more homogeneous captions harm retrieval performance due to shortcut exploitation.
  • Encoder Selection: Ablation studies reveal that vanilla LLMs are inadequate without specialization: strong performance depends on contrastive fine-tuning of the text encoder for embedding tasks, while advanced embedding extraction mechanisms yield only marginal gains over simple pooling (e.g., the <eos> token representation) for most evaluation protocols; a minimal sketch of this pooling appears below.
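
The simple pooling referenced above amounts to reading off the hidden state at the final (<eos>) position of a frozen language model and using it as the caption embedding. A minimal sketch, assuming a Hugging Face-style causal LM and tokenizer purely for illustration (dedicated embedding models such as NV-Embed-V2 expose their own encoding interface):

import torch

def eos_pool_embedding(caption, tokenizer, language_model):
    """Use the last-position hidden state of a frozen LM as the text embedding
    (assumes the tokenizer appends an end-of-sequence token)."""
    inputs = tokenizer(caption, return_tensors="pt")
    with torch.no_grad():
        outputs = language_model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1]    # [1, seq_len, hidden_dim]
    return last_hidden[0, -1]                  # embedding at the final token position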

Computational Efficiency

LIFT reduces per-sample FLOPs by up to 35.7% and memory usage by up to 12.6% (depending on caption length and batch size) compared to CLIP. The key driver is shifting all text processing offline, coupled with the amortization of expensive LLM forward passes.

Loss Function Simplification

With the text encoder fixed, a simple cosine similarity loss over positive pairs is viable. For certain compositional understanding and instruction-following tasks, this loss performs on par with the contrastive loss, obviating the need for large batch sizes and making distributed training easier. However, for classic zero-shot retrieval, the contrastive loss remains superior, presumably due to its negative sampling—critical for discriminative representation.
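
Both objectives can be written compactly. Below is a minimal sketch, assuming L2-normalized image and text embedding matrices of shape [batch, dim]: the CLIP-style contrastive loss treats every other caption in the batch as a negative, while the cosine loss only pulls matched pairs together.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives (CLIP-style)."""
    logits = img_embeds @ text_embeds.T / temperature            # [batch, batch]
    targets = torch.arange(img_embeds.size(0), device=img_embeds.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def cosine_alignment_loss(img_embeds, text_embeds):
    """Positive-pairs-only objective, viable because the fixed text encoder rules out collapse."""
    return (1 - (img_embeds * text_embeds).sum(dim=-1)).mean()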

Implications and Future Directions

Practical

  1. Resource-Aware Training: LIFT is highly attractive for resource-constrained training scenarios, where the bulk of training cost can be shifted to a one-off, highly parallel pre-processing phase (text embedding), enabling high-throughput vision branch optimization.
  2. Rapid Prototyping & Task Adaptation: Because the branches are decoupled, swapping improved LLM-based text encoders into existing pipelines or re-aligning the vision branch to new captioning domains becomes trivial.
  3. Scenario-Dependent Modular Design: LIFT’s framework supports task-specific trade-offs between retrieval performance and compositional reasoning by choosing loss functions and encoder variants.

Theoretical

LIFT challenges the longstanding assumption that optimal multimodal alignment necessitates joint end-to-end learning; instead, the language branch can be effectively “anchored” by powerful off-the-shelf semantic embeddings. This shifts the focus to:

  • Tailoring vision encoders to align with semantic, rather than merely syntactic, text representations.
  • Exploring deeper information-theoretic objectives beyond alignment of lower-order statistics for structure-sensitive tasks.

Limitations and Open Problems

  • Incomplete Compositionality: Despite improvements, LIFT's compositional accuracy remains well below ceiling; swap-based tasks reveal persistent limitations, potentially because the loss functions focus on pairwise similarities.
  • Extensibility and Scaling: The scalability of freezing the text encoder for ultra-large data regimes remains untested; potential degradation in scaling laws relative to joint training requires further investigation.
  • Adaptation to New Domains: As representation learning shifts toward the semantic abstraction space of LLMs, ensuring coverage of and adaptability to highly specialized or non-generic language representations becomes more challenging.

Conclusion

LIFT concretely demonstrates that powerful, fixed LLM-based text encoders can robustly guide visual representation learning, bypassing significant computational burden without sacrificing—and often improving—task performance on compositional and instruction-based VLM benchmarks. This outcome supports a modular, resource-efficient view of multimodal integration, encouraging further exploration of decoupled training strategies, improved semantic alignment objectives, and practical, efficient deployment for large-scale real-world applications.
