
Locked-image Tuning (LiT)

  • Locked-image Tuning (LiT) is a contrastive method that freezes a pretrained image encoder while exclusively updating the text encoder to align modalities.
  • It employs a symmetric InfoNCE loss over a frozen ViT backbone, optionally pairing images with synthetic captions built from metadata, to achieve efficient zero-shot and transfer performance.
  • LiT demonstrates strong zero-shot results, such as 63.28% top-1 accuracy on iNaturalist-2021 and state-of-the-art zero-shot accuracy on ObjectNet, validating its practical scalability.

Locked-image Tuning (LiT) is a contrastive learning regime for vision–language models in which the image encoder, typically a high-capacity vision model pretrained on large-scale image classification data, is frozen throughout training. Only the text encoder is updated to align its outputs with the fixed image embeddings in a shared embedding space. This approach leverages the generalization strength of existing vision backbones while efficiently aligning them to text, resulting in state-of-the-art zero-shot classification and robust transfer capabilities, particularly in domains with limited annotated data, such as ecology and agriculture (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).

1. Theoretical Framework and Training Objective

LiT is instantiated within the general contrastive-tuning paradigm for aligning modality-specific encoders into a shared embedding space. Given a frozen image encoder $f(\cdot;\theta_I)$ and a fully trainable text encoder $g(\cdot;\theta_T)$, the model receives batches of paired images $\{x_i\}_{i=1}^{B}$ and corresponding texts $\{y_i\}_{i=1}^{B}$. These are encoded and $\ell_2$-normalized to produce representations

$$\hat{v}_i = \frac{f(x_i;\theta_I)}{\lVert f(x_i;\theta_I) \rVert}, \qquad \hat{t}_i = \frac{g(y_i;\theta_T)}{\lVert g(y_i;\theta_T) \rVert}.$$

A pairwise cosine similarity matrix $S \in \mathbb{R}^{B \times B}$ is defined with entries

$$S_{ij} = \frac{1}{\tau} \langle \hat{v}_i, \hat{t}_j \rangle,$$

where $\tau$ is a temperature parameter (learned or fixed). The symmetric InfoNCE loss, averaged over both matching directions, is employed:

$$L(\theta_T) = \frac{1}{2B} \left[ \sum_{i=1}^{B} -\log \frac{\exp(S_{ii})}{\sum_{j=1}^{B} \exp(S_{ij})} + \sum_{j=1}^{B} -\log \frac{\exp(S_{jj})}{\sum_{i=1}^{B} \exp(S_{ij})} \right].$$

Because $\theta_I$ is fixed, only $\theta_T$ receives gradient updates; the text encoder learns to map captions into the same space as the pretrained visual features (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).
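
As a concrete illustration, the following PyTorch-style sketch implements the objective above on a single device. It is a minimal sketch, not the authors' implementation: `image_encoder` and `text_encoder` are assumed to return pooled feature vectors, and `texts` are assumed to be already tokenized.

```python
import torch
import torch.nn.functional as F

def lit_loss(image_encoder, text_encoder, images, texts, temperature=0.07):
    """Symmetric InfoNCE over a batch, with the image tower locked."""
    with torch.no_grad():                       # theta_I receives no gradients
        v = image_encoder(images)               # (B, D) pooled image features
    t = text_encoder(texts)                     # (B, D), trainable theta_T
    v = F.normalize(v, dim=-1)                  # l2-normalize both modalities
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / temperature              # S_ij = <v_i, t_j> / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, labels)    # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```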

2. Model Architectures and Data Pipeline

The LiT methodology is agnostic to architecture, but empirical work has focused on Vision Transformer (ViT) models as the image encoder, e.g., ViT-Large or ViT-g/14, and transformer-based text encoders (ViT text tower, BERT, T5, or mT5). The image backbone is always frozen. In practice, features are extracted from the pooled, pre-softmax embedding (class-token output before the final classification head). A linear projection head may be used to adapt the backbone’s feature dimension to the embedding size (e.g., 512–1536).
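
To make the locking explicit, here is a minimal, hypothetical wrapper (the class name `LockedImageTower` and its interface are illustrative, not from the cited papers) that freezes the backbone and attaches a linear projection to the shared embedding size; whether such a projection is trained or also kept frozen is a design choice not prescribed here.

```python
import torch.nn as nn

class LockedImageTower(nn.Module):
    """Frozen backbone plus a linear projection into the shared embedding space."""
    def __init__(self, backbone, feat_dim, embed_dim=768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():    # lock the pretrained image encoder
            p.requires_grad = False
        # Whether this projection is trained or also frozen is a design choice.
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, images):
        feats = self.backbone(images)           # pooled, pre-softmax features
        return self.proj(feats)
```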

Text encoders are initialized either randomly or from language-pretraining checkpoints and then trained to completion on paired image–text data. For domains without aligned captions, synthetic text can be composed using metadata, as exemplified by ā€œA photo of the ⟨CONCAT_METADATA⟩.ā€ in species detection (Nakkab et al., 2023).
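
A minimal sketch of this kind of caption synthesis is shown below; the metadata field names are assumptions for illustration, not the exact schema used by Nakkab et al. (2023).

```python
def caption_from_metadata(record: dict) -> str:
    """Compose a synthetic caption from available metadata fields."""
    fields = ["common_name", "genus", "species", "family"]  # illustrative keys
    parts = [str(record[f]) for f in fields if record.get(f)]
    return f"A photo of the {' '.join(parts)}."

# caption_from_metadata({"common_name": "monarch butterfly",
#                        "genus": "Danaus", "species": "plexippus"})
# -> "A photo of the monarch butterfly Danaus plexippus."
```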

For large-scale training, distributed frameworks such as VLHub (OpenCLIP fork) or TPU clusters orchestrated via SLURM/Singularity are used, allowing batch sizes in the 1k–32k range. Images are preprocessed with standard resize, cropping, and augmentations from ViT/CLIP pipelines; image embeddings can be precomputed to accelerate text-only training, given the frozen image tower (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).
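
The sketch below illustrates precomputing embeddings with the frozen tower so that subsequent epochs only execute the text encoder; the dataloader interface (batches of image/caption pairs) is an assumption.

```python
import torch

@torch.no_grad()
def precompute_image_embeddings(image_encoder, dataloader, device="cuda"):
    """Run the frozen image tower once; reuse the embeddings for every text epoch."""
    image_encoder.eval().to(device)
    chunks = []
    for images, _ in dataloader:                 # captions are not needed here
        feats = image_encoder(images.to(device)) # pooled, pre-softmax features
        chunks.append(feats.cpu())
    return torch.cat(chunks)                     # (N, D) tensor of image embeddings
```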

3. Optimization, Scheduling, and Training Recipes

LiT training relies exclusively on updating the text tower, typically using AdamW or Adafactor for optimization with default CLIP or transformer settings (weight decay ≈ 0.1 to 1e-3). Learning rates follow a linear warmup (500–2000 steps) to a peak value (e.g., $\eta_{\max} \sim 5 \times 10^{-4}$ or 1e-3), then decay to zero with a cosine schedule. Empirically, 1000 warmup steps produced optimal convergence for species detection tasks, with shorter warmups (e.g., 500 steps) leading to instability and failed convergence (Nakkab et al., 2023).
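
For reference, a minimal sketch of such a linear-warmup plus cosine-decay schedule is given below, using the 1000-step warmup and 5e-4 peak mentioned above as illustrative defaults.

```python
import math

def lr_at_step(step, total_steps, warmup_steps=1000, peak_lr=5e-4):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```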

Batch sizes ranging from 1k to 32k are distributed across GPUs/TPUs. Regularization is minimal: standard image augmentations (random crops/flips) and typically no additional text augmentation. Because gradients never reach the image encoder, image representations can be precomputed, which significantly improves efficiency; text-tower updates can be 3–5× faster than joint training (Zhai et al., 2021, Nakkab et al., 2023).

4. Benchmark Results and Empirical Findings

LiT achieves strong zero-shot transfer performance on standard and domain-specific benchmarks. Using a frozen ViT-Large for species detection in iNaturalist-2021 (2.7M images, 10,000 species), LiT attained Top-1/Top-5 zero-shot accuracies of 63.28%/87.48%—approaching or surpassing fully supervised or ImageNet-pretrained ResNet-50 baselines—without updating the vision encoder (Nakkab et al., 2023).

On broader benchmarks, a ViT-g/14 image tower pretrained on JFT-3B, tuned on 4B image–text pairs, achieves 85.2% zero-shot ImageNet top-1 accuracy, 82.5% on ObjectNet (previous state-of-the-art ∼72.3%), and high scores across ImageNet variants and VTAB-Natural tasks. Even on public datasets (CC12M+YFCC), LiT attains a marked improvement over OpenCLIP (75.7% vs. ∼34.8% zero-shot ImageNet). With as few as 1B images seen under the YFCC-CLIP setup, zero-shot ImageNet accuracy reaches ∼63.6% (Zhai et al., 2021, Kossen et al., 2023).

A comparative summary of zero-shot classification results is provided below:

Method | Image Tower | Top-1 (%) | Top-5 (%)
FixMatch | ResNet-50 | 47.9 | —
Fully Supervised | ResNet-50 | 61.6 | 81.8
ImageNet Pretrain | ResNet-50 | 65.4 | 85.1
LiT | ViT-Large (frozen) | 63.28 | 87.48

(Nakkab et al., 2023)

5. Robustness, Applicability, and Limitations

LiT’s hallmark is strong zero-shot classification, particularly for long-tailed, fine-grained, and resource-limited problems. The method requires only that new classes are represented textually; classification for unseen classes reduces to embedding candidate descriptions and selecting the label with maximal cosine alignment to the image embedding. This approach is especially suitable for domains where image–text pairs are unavailable but class metadata can be programmatically synthesized.
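
A minimal sketch of this zero-shot procedure is given below; `image_encoder`, `text_encoder`, and `tokenize` are placeholders for whatever locked and tuned towers and tokenizer are in use, and the prompt template follows the metadata-style captions described earlier.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    """Return the class whose text embedding best aligns with the image embedding."""
    prompts = [f"A photo of the {name}." for name in class_names]
    t = F.normalize(text_encoder(tokenize(prompts)), dim=-1)    # (C, D) class embeddings
    v = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # (1, D) image embedding
    scores = v @ t.T                                            # cosine similarities
    return class_names[scores.argmax().item()]
```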

Robustness to the pretraining source of the image tower is limited; when the frozen encoder is pretrained on a distribution mismatched to downstream data (e.g., Places365 for object recognition tasks), performance collapses in a way not observed in more flexible adapters (e.g., Three Towers/3T). While LiT excels in classification, its retrieval performance is inferior to standard CLIP or 3T methods due to the lack of image encoder adaptation (Kossen et al., 2023). Fine-tuning the image backbone in the contrastive setup, rather than locking, was shown to reduce OOD generality and hurt performance on large-scale, noisy web data (Zhai et al., 2021, Kossen et al., 2023).

6. Extensions, Practical Guidelines, and Research Directions

Subsequent research (e.g., Three Towers) has extended LiT by relaxing the locked constraint and introducing a third, alignment-focused tower, showing improvements for retrieval tasks and robustness to pretraining-source mismatch (Kossen et al., 2023). Nevertheless, LiT remains a baseline for lightweight, zero-shot adaptation with state-of-the-art image models.

For practitioners:

  • Always begin with a strong, supervised or self-supervised pretrained vision encoder.
  • Freeze the image tower; adapt the text encoder with a contrastive objective.
  • Employ synthetic captions or metadata when image–text pairs are unavailable.
  • Compute the contrastive softmax globally over the full (cross-device) batch and use large batch sizes.
  • Precompute image embeddings for efficiency.
  • For multilingual zero-shot, integrate a multilingual text model but still freeze the image encoder.
  • LiT is best suited to zero- and few-shot classification with a well-aligned frozen vision backbone; it is less suitable for retrieval, or for OOD scenarios where the backbone's pretraining distribution is mismatched with the downstream data.

A plausible implication is that LiT offers a highly scalable and compute-efficient path to deploying powerful vision–language models in fields such as agriculture, ecology, and species detection, enabling real-time, device-based recognition of novel taxa through caption engineering alone, without further model retraining (Nakkab et al., 2023, Zhai et al., 2021, Kossen et al., 2023).
