Locked-Image Tuning (LiT)

Updated 18 February 2026

The LiT paper (Zhai et al., 2021) reports state-of-the-art zero-shot classification, including 85.2% top-1 accuracy on ImageNet, obtained by training only the text encoder with a contrastive loss.
Locked-Image Tuning (LiT) is a transfer-learning approach that freezes a high-quality, pretrained image encoder while aligning it with a trainable text encoder, enabling efficient adaptation to new tasks.
LiT reduces computational demands by precomputing image embeddings and avoiding gradients through the image tower, resulting in significant memory savings and improved training efficiency.

Locked-Image Tuning (LiT) is a transfer-learning paradigm for vision–language models in which a high-quality, pretrained image encoder is "locked" (i.e., its weights are frozen), and only a paired text encoder is trained to align with the frozen vision features using a contrastive objective. LiT enables efficient zero-shot transfer for image classification and retrieval by leveraging rich visual representations acquired during large-scale image pretraining. The method contrasts with from-scratch vision–language contrastive learning approaches, such as CLIP, by letting the text interface adapt to new tasks while preserving the generality and structure of the image embedding space (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).

1. Core Methodology and Architecture

LiT adopts a two-tower architecture in which the image tower, $f_\theta(\cdot)$, is initialized from a powerful, large-scale pretrained vision model (such as a ViT or ResNet trained on JFT-3B or ImageNet-21k) and remains completely frozen throughout tuning. The text tower, $g_\phi(\cdot)$, is commonly a Transformer initialized from scratch (or partially pretrained), with its parameters as the only trainable components. Each image-caption pair $(I, T)$ yields embeddings that are L2-normalized and compared using a symmetric contrastive loss.

No modifications are made to the pretrained image tower aside from L2-normalization; projections (typically a single linear head) are appended to both image and text towers to map features to a common embedding space of dimension $D$ (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).
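The two-tower layout can be illustrated with a minimal NumPy sketch. The tower outputs below are random stand-ins (hypothetical), not real ViT or Transformer features, and the names `embed`, `image_head`, and `text_head` are illustrative only. The sketch shows just the projection into a shared $D$-dimensional space followed by L2-normalization.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def embed(features, projection):
    """Apply a single linear head, then L2-normalize into the shared space."""
    return l2_normalize(features @ projection)

rng = np.random.default_rng(0)
N, image_dim, text_dim, D = 4, 32, 24, 8

# Stand-ins for tower outputs (assumption): in practice these would come from
# a frozen pretrained image model and a trainable text Transformer.
image_features = rng.normal(size=(N, image_dim))
text_features = rng.normal(size=(N, text_dim))

# One linear head per tower, mapping to the common dimension D.
image_head = rng.normal(size=(image_dim, D))
text_head = rng.normal(size=(text_dim, D))

image_emb = embed(image_features, image_head)
text_emb = embed(text_features, text_head)

# Because both sides are unit length, dot products are cosine similarities.
similarity = image_emb @ text_emb.T  # shape (N, N)
```

Entry `similarity[i, j]` compares image `i` with caption `j`; the matched pairs sit on the diagonal.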

2. Contrastive Learning Objective

The LiT training regime minimizes a bi-directional contrastive loss adapted from the InfoNCE objective (as in CLIP/ALIGN). For a minibatch of $N$ paired examples, let $f_\theta(I_i)$ and $g_\phi(T_i)$ denote the image and caption embeddings (normalized to unit length). With a learned temperature $\tau > 0$, the losses are:

$\mathcal{L}_{f \to g} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(f_\theta(I_i)^\top g_\phi(T_i) / \tau)}{\sum_{j=1}^N \exp(f_\theta(I_i)^\top g_\phi(T_j) / \tau)}$

$\mathcal{L}_{g \to f} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(f_\theta(I_i)^\top g_\phi(T_i) / \tau)}{\sum_{j=1}^N \exp(f_\theta(I_j)^\top g_\phi(T_i) / \tau)}$

The final training loss is the average:

$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{f \to g} + \mathcal{L}_{g \to f}\right)$

Gradients update only the text tower and projection layers; the image encoder is not modified (Zhai et al., 2021, Kossen et al., 2023, Nakkab et al., 2023).
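The two directional losses and their average can be written compactly in NumPy. This is a minimal sketch that assumes the embeddings are already unit length; the function names are illustrative and do not come from any LiT codebase.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def lit_contrastive_loss(image_emb, text_emb, tau):
    """Average of the image-to-text and text-to-image contrastive losses.

    image_emb, text_emb: (N, D) arrays of unit-length embeddings, where
    row i of each array belongs to the same image-caption pair.
    """
    logits = image_emb @ text_emb.T / tau  # logits[i, j] = f(I_i) . g(T_j) / tau
    # Image -> text: normalize over captions j for each image i (rows).
    loss_f_to_g = -np.mean(np.diag(log_softmax(logits, axis=1)))
    # Text -> image: normalize over images j for each caption i (columns).
    loss_g_to_f = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (loss_f_to_g + loss_g_to_f)
```

Two sanity checks follow from the formulas: perfectly separated pairs with a small temperature drive the loss toward zero, while identical embeddings for every example give $\log N$.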

3. Training Protocol and Implementation

LiT training is typically conducted over paired image–caption datasets (e.g., CC12M, YFCC100M, web-scale image–alt-text data). When image-only datasets (e.g., iNaturalist-2021) are used, synthetic captions are generated by concatenating available metadata such as class names or taxonomy.
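Building synthetic captions from metadata might look like the following sketch. The field names (`common_name`, `genus`, and so on) and the space-joined format are assumptions for illustration; the exact caption template used in the cited work is not specified here.

```python
def synthetic_caption(record, fields=("common_name", "genus", "family", "order")):
    """Join the available metadata fields into one caption string.

    `record` is a dict of metadata for a single image; the field names are
    hypothetical. Missing or empty fields are skipped.
    """
    parts = [str(record[name]).strip() for name in fields if record.get(name)]
    return " ".join(parts)

example = {"common_name": "Monarch", "genus": "Danaus", "family": None}
caption = synthetic_caption(example)
```

Each image-only example can then be paired with its generated caption and fed to the same contrastive pipeline as natural image-text data.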

Large-batch distributed training protocols are standard, with global negatives computed across all devices to maximize effective contrastive context. The temperature parameter $\tau$ is optimized jointly with the text encoder. The image tower’s embeddings can be precomputed, leading to substantial reduction in GPU memory and training time. A typical LiT setup employs no vision data augmentations during tuning to preserve alignment between pre-computed embeddings and input data, further increasing efficiency (Zhai et al., 2021, Nakkab et al., 2023).
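Because the image tower never changes and no augmentation is applied, its outputs can be computed once and cached. The sketch below assumes a callable `frozen_encoder` that maps a batch of images to feature vectors; the toy encoder used here is a hypothetical stand-in (a fixed random linear map), not a real pretrained model.

```python
import numpy as np

def precompute_image_embeddings(images, frozen_encoder, batch_size=256):
    """Run the frozen image tower once over the dataset and cache unit-length embeddings."""
    chunks = []
    for start in range(0, len(images), batch_size):
        feats = frozen_encoder(images[start:start + batch_size])
        norms = np.linalg.norm(feats, axis=1, keepdims=True)
        chunks.append(feats / np.maximum(norms, 1e-12))
    return np.concatenate(chunks, axis=0)

# Hypothetical stand-in for a pretrained encoder: a fixed linear map on pixels.
rng = np.random.default_rng(0)
frozen_weights = rng.normal(size=(3 * 8 * 8, 16))

def toy_frozen_encoder(batch):
    return batch.reshape(len(batch), -1) @ frozen_weights

images = rng.normal(size=(10, 3, 8, 8))
cache = precompute_image_embeddings(images, toy_frozen_encoder, batch_size=4)
```

During tuning, each step only needs to index into `cache`, so the image model does not have to be held in accelerator memory.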

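Nakkab et al. (2023) give canonical training pseudocode for this procedure. In outline, following the description above, each step reads a batch of precomputed image embeddings, encodes the paired captions with the text tower, L2-normalizes the embeddings, forms the $N \times N$ similarity matrix scaled by $1/\tau$, averages the two directional contrastive losses, and updates only the text tower, the projection layers, and $\tau$.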
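As an illustration of the locked-image training loop, here is a minimal NumPy sketch; it is not the pseudocode from Nakkab et al. Several simplifications are assumptions: the text tower is reduced to one linear head `W` acting on fixed text features, the temperature is held constant rather than learned, and plain gradient descent with hand-derived gradients replaces the optimizer.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def loss_and_grad(W, text_feats, image_emb, tau):
    """Symmetric contrastive loss and its gradient with respect to the text head W.

    image_emb holds precomputed, unit-length embeddings from the frozen image
    tower; it is treated as a constant, so no gradient is computed for it.
    """
    n = len(text_feats)
    U = text_feats @ W                              # (N, D) unnormalized text embeddings
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    T = U / norms                                   # unit-length text embeddings
    logits = image_emb @ T.T / tau                  # rows: images, columns: captions
    p_rows = softmax(logits, axis=1)                # image -> text direction
    p_cols = softmax(logits, axis=0)                # text -> image direction
    idx = np.arange(n)
    loss = -0.5 * np.mean(np.log(p_rows[idx, idx]) + np.log(p_cols[idx, idx]))
    # Backward pass: cross-entropy, then the L2 normalization, then the linear head.
    d_logits = (0.5 * (p_rows + p_cols) - np.eye(n)) / n
    d_T = d_logits.T @ image_emb / tau
    d_U = (d_T - T * np.sum(d_T * T, axis=1, keepdims=True)) / norms
    return loss, text_feats.T @ d_U

def train_text_head(text_feats, image_emb, steps=50, lr=0.01, tau=1.0, seed=0):
    """Gradient descent on the text head only; the image side stays locked."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(text_feats.shape[1], image_emb.shape[1]))
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, text_feats, image_emb, tau)
        losses.append(loss)
        W -= lr * grad
    return W, losses

# Toy data (assumption): random text features and random frozen image embeddings.
rng = np.random.default_rng(1)
text_feats = rng.normal(size=(8, 16))
image_emb = rng.normal(size=(8, 8))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
W, losses = train_text_head(text_feats, image_emb)
```

The only array that is ever updated is `W`. A full implementation would instead backpropagate through a Transformer text tower and also update the temperature.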

4. Performance and Empirical Results

LiT demonstrates state-of-the-art zero-shot classification performance when the pretrained vision encoder is strong and task-relevant. Notably, Zhai et al. report that a ViT-g/14-based LiT model achieves 85.2% zero-shot top-1 accuracy on ImageNet and 82.5% on ObjectNet, outperforming CLIP/ALIGN and nearly matching supervised fine-tuning (Zhai et al., 2021). On the fine-grained, long-tailed iNaturalist-2021 benchmark, LiT tuning of a ViT-Large image encoder, with synthetic captions, attains top-1 accuracy of 63.28%, nearly equaling fully-supervised ResNet50 training (65.4%) (Nakkab et al., 2023).
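Zero-shot classification with a two-tower model is usually done by embedding one text prompt per class and assigning each image to the most similar class embedding. The sketch below assumes unit-length embeddings are already available; the toy values are hypothetical and no real model is involved.

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_emb):
    """Return, for each image, the index of the most similar class embedding."""
    return np.argmax(image_emb @ class_text_emb.T, axis=1)

def top1_accuracy(predictions, labels):
    """Fraction of images whose predicted class equals the true label."""
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))

# Toy example (assumption): three classes with orthogonal text embeddings and
# image embeddings that lie close to their class direction.
class_text_emb = np.eye(3)
labels = np.array([0, 2, 1, 2])
image_emb = class_text_emb[labels] + 0.05
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

predictions = zero_shot_predict(image_emb, class_text_emb)
accuracy = top1_accuracy(predictions, labels)
```

No classifier weights are trained for the target dataset; the class names enter only through the text tower.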

In retrieval, LiT reaches 41.9% (image→text Recall@1) and 59.3% (text→image Recall@1) on MSCOCO using private large-scale data (Zhai et al., 2021). A plausible implication is that, while classification benefits maximally from strong pretrained features, retrieval performance may be relatively more sensitive to the frozen space’s representational granularity.
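Recall@1 in both retrieval directions can be read off the image-caption similarity matrix. This sketch assumes exactly one matching caption per image, with pair `i` on the diagonal; benchmarks that provide several captions per image need a slightly more general matching rule.

```python
import numpy as np

def recall_at_1(similarity):
    """Recall@1 for image->text and text->image retrieval.

    similarity[i, j] scores image i against caption j, and caption i is
    assumed to be the single correct match for image i.
    """
    idx = np.arange(similarity.shape[0])
    image_to_text = float(np.mean(similarity.argmax(axis=1) == idx))
    text_to_image = float(np.mean(similarity.argmax(axis=0) == idx))
    return image_to_text, text_to_image

# Toy similarity matrix: image 1 wrongly prefers caption 0.
toy_similarity = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.7],
])
i2t, t2i = recall_at_1(toy_similarity)
```

The two directions can differ on the same matrix, which is why they are reported separately.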

Key empirical properties:

Zero-shot accuracy scales with image tower capacity more strongly than with text tower scale.
Cross-architecture robustness: LiT applies to ViT, ResNet, and Mixer pretraining, though ViTs yield the strongest alignment.
Model ablations indicate that freezing the image encoder generally outperforms full fine-tuning or simultaneous contrastive tuning, except under extremely large-scale training budgets (Zhai et al., 2021).

5. Analysis, Applications, and Limitations

LiT excels when the pretrained image encoder already provides semantically rich, generalizable visual features. The frozen encoder confers strong few-shot and zero-shot classification performance and robust OOD generalization, together with significant memory and compute savings: no gradients flow through the vision tower, precomputed embeddings can be reused, and large text towers can be trained efficiently (Zhai et al., 2021, Nakkab et al., 2023).

On fine-grained and long-tailed datasets, freezing the image tower avoids overfitting to rare classes, reduces sample complexity, and leverages the generality of the vision backbone. In the iNaturalist-2021 setting, efficient language alignment alone can yield new state-of-the-art VL zero-shot accuracy with an order of magnitude reduction in computational cost relative to full fine-tuning (Nakkab et al., 2023).

However, a noted limitation is reduced flexibility: the frozen image features cannot adapt to novel domains or label sets that deviate substantially from the pretraining distribution. On retrieval tasks, LiT may underperform from-scratch CLIP-style models when those models are given sufficiently large compute. LiT’s performance can collapse when the pretraining labels or data coverage are mismatched to the downstream task (e.g., an image tower pretrained on Places365 and evaluated on ImageNet benchmarks) (Kossen et al., 2023).

6. Comparison to Three-Tower Architectures and Evolution

The Three Towers (3T) method (Kossen et al., 2023) generalizes LiT by introducing a third, frozen image embedding tower and corresponding contrastive distillation losses. Unlike LiT, which "locks" the image tower, 3T enables a learnable image tower to benefit from both scratch contrastive training and knowledge distillation from a frozen pretrained embedding space. This strategy provides greater robustness and can improve retrieval, especially when pretrained features are imperfect or not fully aligned with downstream tasks.
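The idea of combining a main contrastive loss with distillation terms against a frozen third tower can be sketched as follows. This is an illustration under stated assumptions rather than the published 3T objective: the equal default weights, the omission of any extra projection heads, and the function names are all hypothetical.

```python
import numpy as np

def symmetric_contrastive(a, b, tau):
    """Bi-directional contrastive loss between two sets of unit-length embeddings."""
    logits = a @ b.T / tau

    def diagonal_nll(axis):
        shifted = logits - logits.max(axis=axis, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_nll(1) + diagonal_nll(0))

def three_towers_loss(image_emb, text_emb, frozen_emb, tau,
                      weights=(1 / 3, 1 / 3, 1 / 3)):
    """Main image-text loss plus two terms that pull the trainable towers
    toward the frozen pretrained embedding space (weights are an assumption)."""
    main = symmetric_contrastive(image_emb, text_emb, tau)
    image_distill = symmetric_contrastive(image_emb, frozen_emb, tau)
    text_distill = symmetric_contrastive(text_emb, frozen_emb, tau)
    w_main, w_image, w_text = weights
    return w_main * main + w_image * image_distill + w_text * text_distill
```

Setting `weights=(1, 0, 0)` recovers plain image-text contrastive training. LiT corresponds to a different limit, in which the image tower itself is the frozen model and only the text side is trained.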

Empirical results indicate that:

LiT attains superior classification when pretrained embeddings cover downstream labels comprehensively and training scale is large.
3T is more robust to mismatched or narrow-domain pretraining, preventing catastrophic performance drops, and generally delivers higher retrieval accuracy (Kossen et al., 2023).

A plausible implication is that architectural flexibility and continual knowledge distillation, as in 3T, may become more advantageous as tasks and pretraining regimes diversify.

7. Notable Applications and Future Directions

LiT has been applied effectively to challenging, fine-grained, and long-tailed benchmarks such as iNaturalist-2021 for species detection, leveraging synthesized metadata captions and fixed vision backbones (Nakkab et al., 2023). The recipe has proven scalable to large web-scale datasets and across architectures. Preliminary work with multilingual text towers shows promising cross-lingual zero-shot transfer (Zhai et al., 2021). Current research is limited by restricted evaluation domains: LiT has not been extensively assessed on detection, segmentation, VQA, or captioning. Open questions remain regarding prompt engineering, open-vocabulary bias, and hybrid (freeze→unfreeze) training schedules.

A plausible implication is that future extensions may include hierarchical freezing strategies, more flexible loss scheduling, or integration with external knowledge to further boost generalization and applicability across modalities.