TOMCap: Text-Only Image Captioning
- TOMCap is a text-only training methodology for image captioning that uses retrieval-augmented prompts to generate captions without aligned image-text pairs.
- It employs a frozen vision-language encoder (CLIP) combined with a GPT2 decoder and modality gap correction to reconcile differences between image and text embeddings.
- The method achieves state-of-the-art performance in zero-shot settings, validated through extensive evaluations on benchmarks like MSCOCO and NoCaps.
TOMCap is a text-only training methodology for image captioning that eliminates the need for aligned image-caption pairs by leveraging a retrieval-augmented prompting approach integrated with modality gap correction. It builds upon a frozen vision-language encoder (CLIP), an LLM decoder (GPT2), and a large-scale caption retrieval system, and it introduces a modality adaptation mechanism to reconcile the representational discrepancy between image and text embeddings. The system demonstrates state-of-the-art performance among zero-shot and text-only image captioning approaches, notably improving the usability of vast unpaired text corpora for generative vision-language tasks (Fonseca et al., 3 Dec 2025).
1. System Architecture
TOMCap operates with a modular architecture comprising a frozen SigLIP2 L/16 model as the CLIP-style encoder, modality-gap correction preprocessing for all latent embeddings, an indexed retrieval datastore of approximately 16 million caption embeddings (drawn from MSCOCO, CC3M, and CC12M), and a pretrained GPT2 LLM decoder (base or large) with minimal trainable modifications. These modifications comprise one multi-head cross-attention layer per GPT2 block, whose keys and values are supplied by the corrected CLIP embeddings of the query image and the K retrieved captions, plus LoRA adapters in all self-attention blocks for parameter-efficient fine-tuning.
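The decoder-side modifications can be illustrated with a short sketch. The following is a minimal PyTorch sketch, not the released implementation: the class names (`LoRALinear`, `CrossAttnBlock`), the LoRA rank, and the head count are illustrative assumptions; only the cross-attention, projection, and LoRA parameters are trainable, matching the description above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a low-rank trainable update (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class CrossAttnBlock(nn.Module):
    """Cross-attention appended to a frozen GPT2 block: queries come from the
    decoder hidden states, keys/values from corrected CLIP embeddings."""
    def __init__(self, d_model: int = 768, d_clip: int = 1024, n_heads: int = 12):
        super().__init__()
        self.kv_proj = nn.Linear(d_clip, d_model)   # map CLIP dim to GPT2 dim
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden, clip_embeds):
        # hidden: (B, T, d_model); clip_embeds: (B, 1 + K, d_clip)
        kv = self.kv_proj(clip_embeds)
        out, _ = self.attn(query=hidden, key=kv, value=kv)
        return self.norm(hidden + out)              # residual connection
```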
The typical inference flow proceeds as follows (a pipeline sketch appears after the list):
- The input image is encoded into a 1024-dimensional embedding $z_I$ via CLIP.
- The embedding is corrected for the modality gap, yielding $\hat{z}_I$.
- The top-$K$ semantically similar captions are retrieved from the precomputed, corrected-embedding FAISS index.
- A language prompt is formed using those captions.
- The GPT2 decoder, given the prompt and using cross-attention over $\hat{z}_I$, generates the output caption.
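Putting the steps together, a hedged sketch of the inference flow is given below; `clip_encode_image`, `correct_modality`, `generate_with_cross_attn`, and `caption_texts` are hypothetical stand-ins for the components described above, `faiss_index` is assumed to be the corrected-embedding L2 index, and `k=4` is an arbitrary placeholder rather than the paper's setting.

```python
import numpy as np

def caption_image(image, clip_encode_image, correct_modality,
                  faiss_index, caption_texts, generate_with_cross_attn, k=4):
    z_img = clip_encode_image(image)                    # 1024-d CLIP image embedding
    z_hat = correct_modality(z_img, modality="image")   # modality-gap correction
    # Retrieve the k nearest corrected caption embeddings from the L2 FAISS index.
    _, idx = faiss_index.search(z_hat[None, :].astype(np.float32), k)
    retrieved = [caption_texts[i] for i in idx[0]]
    prompt = ("Similar images have the following captions: "
              + " ".join(retrieved)
              + ". Write a caption for this image:")
    # GPT2 decodes the prompt while cross-attending over the corrected embedding.
    return generate_with_cross_attn(prompt, z_hat)
```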
2. Retrieval-Augmentation Approach
Retrieval-augmentation is a central mechanism in TOMCap, providing explicit in-context information to guide the language generation process. Offline, each caption is encoded into an embedding $z_T$ with CLIP's text encoder, corrected for modality, optionally perturbed by Gaussian noise, and stored in an L2-indexed FAISS datastore.
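A minimal sketch of this offline datastore construction, assuming FAISS's exact `IndexFlatL2`; `clip_encode_text` and `correct_modality` are hypothetical helpers standing in for the encoder and correction step, and `noise_std` stands in for the unspecified perturbation magnitude. In practice the roughly 16 million captions would be encoded in batches, and an approximate or quantized index could replace the flat one.

```python
import numpy as np
import faiss

def build_caption_index(captions, clip_encode_text, correct_modality,
                        noise_std=0.0, dim=1024):
    embeds = np.stack([correct_modality(clip_encode_text(c), modality="text")
                       for c in captions]).astype(np.float32)
    if noise_std > 0:                       # optional Gaussian perturbation
        embeds += np.random.normal(0.0, noise_std, embeds.shape).astype(np.float32)
    index = faiss.IndexFlatL2(dim)          # exact L2 index over corrected embeddings
    index.add(embeds)
    return index
```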
At training, for each caption (see the sketch after this list):
- Its CLIP text embedding $z_T$ is corrected to $\hat{z}_T$.
- The index returns the $K$ nearest (in L2 distance) corrected embeddings.
- The single closest caption is used for teacher-forced token prediction.
- The other captions serve as context in the prompt.
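A hedged sketch of how a training example could be assembled from that description; the helper names are assumptions, and whether the query caption is excluded from its own retrieval results is not specified in the source.

```python
import numpy as np

def build_training_example(caption, clip_encode_text, correct_modality,
                           faiss_index, caption_texts, k=5):
    z_txt = correct_modality(clip_encode_text(caption), modality="text")
    _, idx = faiss_index.search(z_txt[None, :].astype(np.float32), k)
    neighbors = [caption_texts[i] for i in idx[0]]
    target = neighbors[0]          # single closest caption -> teacher-forcing target
    context = neighbors[1:]        # remaining captions -> in-context prompt
    prompt = ("Similar images have the following captions: "
              + " ".join(context) + ". Write a caption for this image:")
    return prompt, target, z_txt   # z_txt also feeds the decoder's cross-attention
```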
At inference:
- An image is encoded and corrected as $\hat{z}_I$.
- The $K$ nearest captions are retrieved.
- The following prompt is constructed: “Similar images have the following captions: {cap} … {cap}. Write a caption for this image:”
- Beam search over the decoder outputs determines the caption.
3. Modality Gap Correction
Empirical analysis reveals systematic differences in mean ($\mu$) and standard deviation ($\sigma$) across CLIP's per-dimension image and text latent spaces. Let $Z^I$ and $Z^T$ denote sets of image and text embeddings; for each dimension $d$ the statistics are:
- $\mu^I_d$, $\sigma^I_d$ (images)
- $\mu^T_d$, $\sigma^T_d$ (texts)
Correction proceeds via mean–variance alignment,
$$\hat{z}_d = \frac{z_d - \mu^{\mathrm{orig}}_d}{\sigma^{\mathrm{orig}}_d},$$
where "orig" is the input modality.
For text embeddings during training this reads $\hat{z}^T_d = (z^T_d - \mu^T_d)/\sigma^T_d$.
During both training and inference, Gaussian noise is added (with one magnitude for the embedding used in retrieval and another for the embedding passed to decoder cross-attention), yielding $\tilde{z} = \hat{z} + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_{\mathrm{noise}}^2 I)$. This stabilizes training and helps the system generalize across the residual modality gap.
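A numpy sketch of the correction and noise injection, under the assumption spelled out above that the correction is a per-modality, per-dimension standardization; the statistics would be estimated offline (image statistics from Flickr30k, text statistics from the caption corpus).

```python
import numpy as np

def fit_stats(embeddings):
    """Per-dimension mean and std over a set of embeddings of one modality."""
    return embeddings.mean(axis=0), embeddings.std(axis=0) + 1e-8

def correct(z, mu_orig, sigma_orig):
    """Standardize an embedding with its own modality's statistics."""
    return (z - mu_orig) / sigma_orig

def perturb(z_hat, noise_std, rng=None):
    """Add isotropic Gaussian noise to a corrected embedding."""
    rng = rng or np.random.default_rng()
    return z_hat + rng.normal(0.0, noise_std, size=z_hat.shape)

# Usage (shapes only): estimate stats once, then correct and perturb embeddings.
# mu_I, sigma_I = fit_stats(image_embeds)    # (N_img, 1024) -> (1024,), (1024,)
# mu_T, sigma_T = fit_stats(caption_embeds)  # (N_txt, 1024) -> (1024,), (1024,)
# z_tilde_T = perturb(correct(z_T, mu_T, sigma_T), noise_std)
```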
4. Training and Optimization Strategy
TOMCap exclusively uses text during training, sampling captions from MSCOCO, CC3M, and CC12M (totaling 16 million). CLIP image statistics are estimated using Flickr30k images only. All CLIP and GPT2 weights remain frozen; only cross-attention and LoRA adapter weights are trained.
Training optimizes standard cross-entropy over next-token prediction, with the target caption being the single nearest retrieved neighbor. Initialization and optimization details are as follows (a configuration sketch follows the list):
- AdamW optimizer, learning rate 1e-4, default weight decay.
- Batch size 32, up to 10 epochs, early stopping with patience 3 (evaluated on 5% MSCOCO-val every 2048 steps).
- Training is performed on a single NVIDIA RTX 6000 (∼6 hours, 32 GB memory).
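The reported optimization setup can be summarized in a short PyTorch configuration sketch; the `EarlyStopping` helper and the way trainable parameters are collected are illustrative assumptions rather than the authors' training script.

```python
import torch

def make_optimizer(model, lr=1e-4):
    # Only cross-attention and LoRA parameters have requires_grad=True.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)   # PyTorch default weight decay

class EarlyStopping:
    """Stop after `patience` validation checks without improvement
    (validation on 5% of MSCOCO-val every 2048 steps, patience 3)."""
    def __init__(self, patience=3):
        self.patience, self.best, self.bad = patience, float("-inf"), 0

    def should_stop(self, score):
        if score > self.best:
            self.best, self.bad = score, 0
            return False
        self.bad += 1
        return self.bad >= self.patience
```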
5. Empirical Evaluation
Evaluation is conducted on MSCOCO (Karpathy splits with 113 K train, 5 K val, 5 K test) and NoCaps (4,500 validation images with in/out-of-domain splits). Metrics include BLEU-1, BLEU-4, METEOR, and CIDEr (MSCOCO standard evaluation).
Key comparative results for MSCOCO test (CIDEr and BLEU-4; all methods text-only or zero-shot):
| Method | BLEU-4 (%) | CIDEr |
|---|---|---|
| MAGIC | 12.9 | 49.3 |
| MeACap | 17.7 | 84.8 |
| CapDec | 26.4 | 91.8 |
| EntroCap | 27.6 | 94.3 |
| TipCap | 31.4 | 106.6 |
| TOMCap (GPT2 base) | 28.4 | 103.4 |
| TOMCap (GPT2 large) | 30.2 | 108.3 |
On NoCaps validation:
- TOMCap (base): CIDEr = 76.2
- TOMCap (large): CIDEr = 76.4
- Competing methods: CapDec+RLCF-S (58–68), ViECap+ToCa (70.9), IFCap (74.0)
6. Ablation and Analysis
Ablation studies assess retrieval, embedding, and prompt variants; the number of retrieved captions $K$; noise magnitudes; modality-gap correction; and prompt ordering:
- Captioning with the full TOMCap pipeline gives CIDEr scores noticeably above variants omitting retrieval, cross-attention, or embedding correction.
- Retrieval-only (no training): CIDEr = 15.2
- Retrieval-only (trained, no cross-attention): CIDEr = 101.6
- Embedding-only (no prompt): CIDEr = 76.6
- Full TOMCap: CIDEr = 103.4
- Varying $K$ (retrieved captions): the number of captions used at training and at inference both affect performance, with one particular train/inference pairing producing the largest observed marginal CIDEr gain.
- Gaussian noise: small noise magnitudes improve results, while larger magnitudes degrade them, producing a marked CIDEr drop at the largest values tested.
- Modality-gap correction: Both mean and standard deviation alignment are essential (no correction: CIDEr = 79.3; mean only: 102.2; full: 103.4).
- Re-ranking retrieved captions via MMR (maximal marginal relevance) and varying the order of captions in the prompt show negligible effect (<0.5 CIDEr); an MMR sketch follows this list.
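For reference, MMR re-ranking of retrieved captions can be sketched as below (numpy); `lam` is an assumed trade-off weight between relevance to the query and diversity among already-selected captions, and embeddings are assumed L2-normalized so that dot products approximate cosine similarity.

```python
import numpy as np

def mmr_rerank(query, cand_embeds, k=4, lam=0.7):
    """Select k candidates, trading query relevance against mutual diversity."""
    sim_q = cand_embeds @ query              # relevance of each candidate to the query
    sim_cc = cand_embeds @ cand_embeds.T     # candidate-candidate similarity
    selected, remaining = [], list(range(len(cand_embeds)))
    while remaining and len(selected) < k:
        if not selected:                     # first pick: most relevant candidate
            best = remaining[int(np.argmax(sim_q[remaining]))]
        else:                                # later picks: penalize redundancy
            scores = [lam * sim_q[i] - (1 - lam) * sim_cc[i, selected].max()
                      for i in remaining]
            best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected                          # candidate indices in re-ranked order
```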
7. Observed Impact and Methodological Implications
TOMCap achieves high captioning performance among text-only and zero-shot methods by combining a precomputed retrieval-augmented prompt, robust cross-modal alignment, and minimal finetuning via LoRA. The result demonstrates that with careful modality gap correction and retrieval-augmented prompting, LLMs can learn effective vision-language mappings without direct image-caption supervision. The methodology is characterized by its parameter-efficiency, reliance on frozen foundation models, and scalability with unpaired text datasets.
A plausible implication is that TOMCap's approach—joint modality normalization and large-scale retrieval for context—may generalize to other multi-modal tasks where aligned pairs are scarce but large pre-trained models and unpaired domain corpora are available (Fonseca et al., 3 Dec 2025).