Locked-image Tuning (LiT)
- Locked-image Tuning (LiT) is a contrastive method that freezes a pretrained image encoder while exclusively updating the text encoder to align modalities.
- It employs the symmetric InfoNCE loss with a frozen ViT backbone and synthetic text cues to achieve efficient zero-shot and transfer performance.
- LiT demonstrates state-of-the-art results, such as 63.28% top-1 accuracy on iNaturalist and high zero-shot scores on ImageNet, validating its practical scalability.
Locked-image Tuning (LiT) is a contrastive learning regime for vision-language models in which the image encoder, typically a high-capacity vision model pretrained on large-scale image classification data, is frozen throughout training. Only the text encoder is updated to align its outputs with the fixed image embeddings in a shared embedding space. This approach leverages the generalization strength of existing vision backbones while efficiently aligning them to text, yielding state-of-the-art zero-shot classification and robust transfer capabilities, particularly in domains with limited annotated data, such as ecology and agriculture (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).
1. Theoretical Framework and Training Objective
LiT is instantiated within the general contrastive-tuning paradigm for aligning modality-specific encoders into a shared embedding space. Given a frozen image encoder $f$ and a fully trainable text encoder $g$, the model receives batches of $N$ paired images $x_i$ and corresponding texts $t_i$. These are encoded and $\ell_2$-normalized to produce representations

$$u_i = \frac{f(x_i)}{\lVert f(x_i) \rVert_2}, \qquad v_i = \frac{g(t_i)}{\lVert g(t_i) \rVert_2}.$$

A pairwise cosine similarity matrix is defined with entries

$$S_{ij} = \frac{u_i^\top v_j}{\tau},$$

where $\tau$ is a temperature parameter (learned or fixed). The symmetric InfoNCE loss, averaged over both matching directions, is employed:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(S_{ii})}{\sum_{j} \exp(S_{ij})} + \log \frac{\exp(S_{ii})}{\sum_{j} \exp(S_{ji})} \right].$$

Because $f$ is fixed, only $g$ receives gradient updates; the text encoder learns to map captions into the same space as the pretrained visual features (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).
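A minimal PyTorch sketch of this objective, assuming the image embeddings come from a frozen tower and the text embeddings from the trainable tower (function and argument names here are illustrative, not the reference implementation):

```python
import torch
import torch.nn.functional as F

def lit_contrastive_loss(image_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         log_temperature: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb: [N, D] outputs of the frozen image tower (no gradients required).
    text_emb:  [N, D] outputs of the trainable text tower.
    log_temperature: learnable scalar; the temperature is exp(log_temperature).
    """
    # L2-normalize both modalities so dot products are cosine similarities.
    u = F.normalize(image_emb, dim=-1)
    v = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix S_ij = u_i . v_j / tau.
    logits = (u @ v.t()) / log_temperature.exp()

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(u.size(0), device=u.device)

    # Average the image->text and text->image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Since `image_emb` is produced by the locked tower, backpropagation only reaches the text encoder and the temperature.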
2. Model Architectures and Data Pipeline
The LiT methodology is agnostic to architecture, but empirical work has focused on Vision Transformer (ViT) models as the image encoder, e.g., ViT-Large or ViT-g/14, and transformer-based text encoders (ViT text tower, BERT, T5, or mT5). The image backbone is always frozen. In practice, features are extracted from the pooled, pre-softmax embedding (class-token output before the final classification head). A linear projection head may be used to adapt the backbone's feature dimension to the embedding size (e.g., 512-1536).
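As a concrete illustration of this setup, the frozen backbone can be wrapped with a linear projection into the shared embedding dimension; the interface below is a sketch under the assumption that the backbone returns pooled, pre-softmax features, and whether the projection itself is trained is a configuration choice:

```python
import torch
import torch.nn as nn

class FrozenImageTower(nn.Module):
    """Wraps a pretrained backbone, freezes it, and adds an optional linear
    projection from the backbone's pooled feature size to the shared embed dim.

    `backbone` is any module returning pooled, pre-softmax features, e.g. a ViT
    with its classification head removed (hypothetical interface).
    """

    def __init__(self, backbone: nn.Module, feature_dim: int, embed_dim: int,
                 train_projection: bool = True):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False               # lock the image tower
        self.proj = nn.Linear(feature_dim, embed_dim, bias=False)
        self.proj.requires_grad_(train_projection)

    @torch.no_grad()
    def encode(self, images: torch.Tensor) -> torch.Tensor:
        # Frozen features can be computed (or precomputed) without gradients.
        return self.backbone(images)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encode(images)
        return self.proj(feats)
```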
Text encoders are initialized either randomly or from language-pretraining checkpoints and then trained to completion on paired image-text data. For domains without aligned captions, synthetic text can be composed using metadata, as exemplified by "A photo of the ⟨CONCAT_METADATA⟩." in species detection (Nakkab et al., 2023).
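A sketch of this kind of caption synthesis; the metadata fields used below (common name, genus, species) are hypothetical and depend on what taxonomy information is available:

```python
def synthesize_caption(metadata: dict) -> str:
    """Compose a synthetic caption in the style of
    'A photo of the <CONCAT_METADATA>.' by concatenating available fields."""
    parts = [metadata.get("common_name"),
             metadata.get("genus"),
             metadata.get("species")]
    concat = " ".join(p for p in parts if p)
    return f"A photo of the {concat}."

# Example (illustrative): {'common_name': 'monarch butterfly',
#                          'genus': 'Danaus', 'species': 'plexippus'}
# -> 'A photo of the monarch butterfly Danaus plexippus.'
```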
For large-scale training, distributed frameworks such as VLHub (an OpenCLIP fork) or TPU clusters orchestrated via SLURM/Singularity are used, allowing batch sizes in the 1k-32k range. Images are preprocessed with standard resize, cropping, and augmentations from ViT/CLIP pipelines; image embeddings can be precomputed to accelerate text-only training, given the frozen image tower (Zhai et al., 2021, Nakkab et al., 2023, Kossen et al., 2023).
3. Optimization, Scheduling, and Training Recipes
LiT training relies exclusively on updating the text tower, typically using AdamW or Adafactor with default CLIP or transformer settings (weight decay roughly between 1e-3 and 0.1). Learning rates follow a linear warmup (500-2000 steps) to a peak value (e.g., 1e-3), then decay to zero with a cosine schedule. Empirically, 1000 warmup steps produced optimal convergence for species detection tasks, with shorter warmups (e.g., 500 steps) leading to instability and failed convergence (Nakkab et al., 2023).
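A sketch of the warmup-plus-cosine schedule described above; the step counts and peak rate are placeholders to be tuned per task:

```python
import math

def learning_rate(step: int, peak_lr: float = 1e-3,
                  warmup_steps: int = 1000, total_steps: int = 50_000) -> float:
    """Linear warmup to peak_lr, followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```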
Batch sizes ranging from 1k to 32k are distributed across GPUs/TPUs. Regularization is minimal: only standard data augmentations (random crops/flips) for images, and typically no additional text augmentation, since gradients do not affect the image encoder. Precomputing image representations significantly improves efficiency: text-tower updates can be 3-5x faster than joint training (Zhai et al., 2021, Nakkab et al., 2023).
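Because the image tower never changes, its embeddings can be computed once and reused every epoch. A minimal sketch under the assumption of a standard PyTorch DataLoader yielding (images, captions) pairs:

```python
import torch

@torch.no_grad()
def precompute_image_embeddings(image_tower, loader, device="cuda"):
    """Run the frozen image tower once over the dataset and cache its outputs.

    Returns (embedding, caption) pairs that can be batched for text-only
    contrastive tuning; the image tower is never touched again.
    """
    image_tower.eval().to(device)
    cached = []
    for images, captions in loader:
        emb = image_tower(images.to(device)).cpu()
        cached.extend(zip(emb, captions))
    return cached

# During contrastive tuning, batches are drawn from `cached`, and only the
# text tower (plus temperature) is forwarded and backpropagated; this is the
# source of the reported 3-5x speedup over joint training.
```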
4. Benchmark Results and Empirical Findings
LiT achieves strong zero-shot transfer performance on standard and domain-specific benchmarks. Using a frozen ViT-Large for species detection on iNaturalist-2021 (2.7M images, 10,000 species), LiT attained top-1/top-5 zero-shot accuracies of 63.28%/87.48%, approaching or surpassing fully supervised and ImageNet-pretrained ResNet-50 baselines, without updating the vision encoder (Nakkab et al., 2023).
On broader benchmarks, a ViT-g/14 image tower pretrained on JFT-3B, tuned on 4B image-text pairs, achieves 85.2% zero-shot ImageNet top-1 accuracy, 82.5% on ObjectNet (previous state of the art ~72.3%), and high scores across ImageNet variants and VTAB-Natural tasks. Even on public datasets (CC12M+YFCC), LiT attains a marked improvement over OpenCLIP (75.7% vs. ~34.8% zero-shot ImageNet). With as little as 1B images seen on YFCC-CLIP, zero-shot ImageNet accuracy reaches ~63.6% (Zhai et al., 2021, Kossen et al., 2023).
A comparative summary of classification results on iNaturalist-2021, including LiT's zero-shot accuracy, is provided below (Nakkab et al., 2023):

| Method | Image Tower | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| FixMatch | ResNet-50 | 47.9 | – |
| Fully Supervised | ResNet-50 | 61.6 | 81.8 |
| ImageNet Pretrain | ResNet-50 | 65.4 | 85.1 |
| LiT (zero-shot) | ViT-Large (frozen) | 63.28 | 87.48 |
5. Robustness, Applicability, and Limitations
LiT's hallmark is strong zero-shot classification, particularly for long-tailed, fine-grained, and resource-limited problems. The method requires only that new classes are represented textually; classification for unseen classes reduces to embedding candidate descriptions and selecting the label with maximal cosine alignment to the image embedding. This approach is especially suitable for domains where image-text pairs are unavailable but class metadata can be programmatically synthesized.
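Zero-shot inference therefore reduces to scoring an image embedding against embedded label prompts. A hedged sketch, in which the prompt template and the `text_tower`/`tokenizer` interfaces are assumptions rather than a specific library API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor, class_names: list[str],
                       text_tower, tokenizer) -> int:
    """Return the index of the class whose synthesized caption is closest
    (in cosine similarity) to the given image embedding."""
    prompts = [f"A photo of the {name}." for name in class_names]
    text_emb = text_tower(tokenizer(prompts))      # [C, D], assumed interface
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)     # [D]
    scores = text_emb @ image_emb                  # cosine similarities
    return int(scores.argmax())
```

New classes are added simply by extending `class_names`; no retraining of either tower is required.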
Robustness to the pretraining source of the image tower is limited; when the frozen encoder is pretrained on a distribution mismatched to downstream data (e.g., Places365 for object recognition tasks), performance collapses in a way not observed in more flexible adapters (e.g., Three Towers/3T). While LiT excels in classification, its retrieval performance is inferior to standard CLIP or 3T methods due to the lack of image encoder adaptation (Kossen et al., 2023). Fine-tuning the image backbone in the contrastive setup, rather than locking, was shown to reduce OOD generality and hurt performance on large-scale, noisy web data (Zhai et al., 2021, Kossen et al., 2023).
6. Extensions, Practical Guidelines, and Research Directions
Subsequent research (e.g., Three Towers) has extended LiT by relaxing the locked constraint and introducing a third, alignment-focused tower, showing improvements for retrieval tasks and robustness to pretraining-source mismatch (Kossen et al., 2023). Nevertheless, LiT remains a baseline for lightweight, zero-shot adaptation with state-of-the-art image models.
For practitioners:
- Always begin with a strong, supervised or self-supervised pretrained vision encoder.
- Freeze the image tower; adapt the text encoder with a contrastive objective.
- Employ synthetic captions or metadata when imageātext pairs are unavailable.
- Use a global softmax computed over the full cross-device batch together with large batch sizes (see the sketch after this list).
- Precompute image embeddings for efficiency.
- For multilingual zero-shot, integrate a multilingual text model but still freeze the image encoder.
- LiT is best suited to zero- and few-shot classification with a well-aligned frozen vision backbone; it is less suitable for retrieval or for OOD scenarios with a distribution mismatch between the backbone's pretraining data and the downstream task.
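The "global softmax" item above refers to normalizing the contrastive softmax over the full cross-device batch rather than each device's local shard. A simplified `torch.distributed` sketch, assuming a process group is already initialized; full gradient exchange through the gathered negatives requires a differentiable gather and is omitted here:

```python
import torch
import torch.distributed as dist

def gather_for_global_softmax(local_emb: torch.Tensor) -> torch.Tensor:
    """All-gather embeddings from every worker so the similarity matrix (and
    hence the softmax normalization) spans the global batch, not just the
    local shard. Only the local slice retains a gradient path in this sketch."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_emb)
    gathered[dist.get_rank()] = local_emb   # keep the local grad path
    return torch.cat(gathered, dim=0)
```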
A plausible implication is that LiT offers a highly scalable and compute-efficient path to deploying powerful vision-language models in fields such as agriculture, ecology, and species detection, enabling real-time, on-device recognition of novel taxa through caption engineering alone, without further model retraining (Nakkab et al., 2023, Zhai et al., 2021, Kossen et al., 2023).