ClipCap: Efficient Image Captioning
- ClipCap is a vision-language model that leverages a frozen CLIP encoder and GPT-2, bridged by a learned prefix mapping, to generate image captions efficiently.
- Its architecture offers two mapping variants—MLP for fine-tuning and Transformer for frozen GPT-2—ensuring parameter efficiency and rapid training.
- The model achieves competitive performance on datasets like COCO, Conceptual Captions, and nocaps while substantially reducing GPU compute time compared to prior methods.
ClipCap (“CLIP Prefix for Image Captioning”) is a vision-language model that addresses image captioning through a simple and parameter-efficient framework. ClipCap leverages a frozen pre-trained CLIP visual encoder and a GPT-2 language model, bridging them by learning a mapping from CLIP image representations to a learned “prefix” compatible with GPT-2’s embedding space. This architectural separation yields a lightweight, rapidly trainable captioning model that requires no additional object detectors or specialized visual annotations, and achieves competitive results on benchmarks such as Conceptual Captions, COCO, and nocaps (Mokady et al., 2021).
1. Model Architecture
ClipCap’s architecture comprises three main components: a frozen CLIP encoder, a mapping network, and a language model (GPT-2). An input image $x$ is first encoded into a visual embedding $v = \mathrm{CLIP}(x) \in \mathbb{R}^{d_v}$ (e.g., $d_v = 512$ for ViT-B/32 or $d_v = 1024$ for RN50). The mapping network $F$ projects $v$ to a sequence of $k$ “prefix” vectors $p_1, \ldots, p_k \in \mathbb{R}^{d}$, where $d$ is GPT-2’s hidden size (e.g., $d = 768$ for GPT-2 Small).
The language-model component—standard GPT-2, either frozen or fine-tuned—receives as input the concatenation of this prefix and the target caption’s token embeddings. The prefix acts as context, “steering” GPT-2 to conditionally generate image-appropriate captions. The forward path is:
$$p_1, \ldots, p_k = F\big(\mathrm{CLIP}(x)\big), \qquad c_j \sim p_\theta\big(\cdot \mid p_1, \ldots, p_k, c_1, \ldots, c_{j-1}\big), \quad j = 1, \ldots, \ell,$$
i.e., given the prefix, the caption tokens $c_1, \ldots, c_\ell$ are predicted autoregressively.
Both mapping and decoding steps are designed for maximum reuse of pre-trained knowledge, minimizing the trainable parameter count and enabling fast convergence (Mokady et al., 2021).
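To make this concrete, the snippet below shows how a prefix can be concatenated with caption token embeddings and fed to GPT-2 through `inputs_embeds`. It is a minimal sketch, not the reference implementation: the random `prefix` stands in for the mapping network’s output, and the caption string is a placeholder.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prefix_len, gpt_dim = 10, 768                       # GPT-2 Small hidden size
# Stand-in for the mapping network's output F(CLIP(x)); see the next section.
prefix = torch.randn(1, prefix_len, gpt_dim)

caption_ids = tokenizer.encode("A dog plays in the park.", return_tensors="pt")
token_embeds = gpt2.transformer.wte(caption_ids)    # caption token embeddings

# Concatenate prefix and caption embeddings and run GPT-2 over the joint sequence.
inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
logits = gpt2(inputs_embeds=inputs_embeds).logits   # (1, prefix_len + caption_len, vocab)
```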
2. Mapping Network Designs
Two mapping network variants are implemented: a multilayer perceptron (MLP) and a Transformer-based mapping.
- MLP Variant: Used when GPT-2 is fine-tuned. A lightweight multilayer perceptron maps the visual embedding $v \in \mathbb{R}^{d_v}$ through a single hidden layer to a $k \cdot d$-dimensional output, which is reshaped into the $k$ prefix vectors:
$$F_{\mathrm{MLP}}(v) = W_2\,\sigma(W_1 v), \qquad W_1 \in \mathbb{R}^{h \times d_v}, \quad W_2 \in \mathbb{R}^{k d \times h},$$
where the nonlinearity $\sigma$ is either ReLU or Tanh.
- Transformer Variant: Used when GPT-2 remains frozen. Incorporates a learned constant matrix $C \in \mathbb{R}^{k \times d}$ and self-attention layers: the visual embedding is first projected into the transformer’s input space and processed jointly with the constant, and the outputs at the constant positions serve as the $k$ prefix tokens in GPT-2’s hidden space $\mathbb{R}^{d}$. Layer computations include pre-norm LayerNorm, multi-head attention, and two-layer MLPs.
Both variants incorporate dropout (rate $0.1$) and pre-norm LayerNorm, and are trained with AdamW with weight decay. The choice between the MLP and Transformer mapping is governed by whether GPT-2 is trainable or frozen: the MLP suffices for the learned context in the fine-tuning regime, whereas the more expressive Transformer mapping is necessary when GPT-2’s parameters are fixed (Mokady et al., 2021). A minimal sketch of both mapping variants follows below.
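The sketch below renders both mapping variants under the shapes described above. It is a simplified illustration: the hidden widths, layer counts, and the use of `nn.TransformerEncoder` are assumptions for exposition, not the authors’ exact modules.

```python
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    """Maps a CLIP embedding to k prefix vectors (used when GPT-2 is fine-tuned)."""
    def __init__(self, clip_dim=512, prefix_len=10, gpt_dim=768):
        super().__init__()
        hidden = (clip_dim + prefix_len * gpt_dim) // 2    # assumed hidden width
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),                                     # ReLU is the other option
            nn.Linear(hidden, prefix_len * gpt_dim),
        )

    def forward(self, clip_embed):                         # (B, clip_dim)
        return self.mlp(clip_embed).view(-1, self.prefix_len, self.gpt_dim)

class TransformerMapper(nn.Module):
    """Maps a CLIP embedding plus a learned constant to k prefix vectors
    (used when GPT-2 stays frozen)."""
    def __init__(self, clip_dim=512, prefix_len=10, gpt_dim=768,
                 num_layers=8, num_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        # Project the visual embedding into prefix_len tokens of width gpt_dim.
        self.input_proj = nn.Linear(clip_dim, prefix_len * gpt_dim)
        # Learned constant queries whose outputs become the prefix.
        self.const = nn.Parameter(torch.randn(prefix_len, gpt_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=gpt_dim, nhead=num_heads, dropout=0.1,
            norm_first=True, batch_first=True)             # pre-norm LayerNorm
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_embed):                         # (B, clip_dim)
        b = clip_embed.size(0)
        visual = self.input_proj(clip_embed).view(b, self.prefix_len, self.gpt_dim)
        const = self.const.unsqueeze(0).expand(b, -1, -1)
        out = self.encoder(torch.cat([visual, const], dim=1))
        return out[:, self.prefix_len:]                    # outputs at constant positions
```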
3. Training Procedure and Data
The model is trained to minimize the cross-entropy loss over caption tokens, using the following objective:
$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{\ell} \log p_\theta\!\big(c^i_j \mid p^i_1, \ldots, p^i_k, c^i_1, \ldots, c^i_{j-1}\big),$$
where $N$ is the number of training samples and $\ell$ is the caption length.
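To connect the objective to an implementation, the helper below computes cross-entropy only on the caption positions, skipping the prefix. The shift-by-one indexing and the function name are illustrative assumptions rather than the authors’ code.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: logits (B, k + L, vocab) from GPT-2 over [prefix ; caption],
# caption_ids (B, L), prefix_len = k.
def caption_loss(logits, caption_ids, prefix_len):
    # The logit at position t predicts the token at position t + 1, so
    # predictions for caption tokens start at index prefix_len - 1.
    pred = logits[:, prefix_len - 1 : -1, :]          # (B, L, vocab)
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),              # (B*L, vocab)
        caption_ids.reshape(-1),                      # (B*L,)
    )
```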
Datasets include:
- Conceptual Captions: 3M image-caption pairs from the web, with a 12.5K validation set.
- COCO Karpathy split: Standard train/validation/test images and five captions per image.
- nocaps: Training uses COCO images only; evaluation covers the in-domain, near-domain, and out-of-domain splits.
Captions are tokenized with GPT-2’s byte-pair encoding tokenizer and padded or truncated to a fixed maximum length. Images are preprocessed per CLIP’s input requirements (e.g., resizing to $224 \times 224$ and normalization with CLIP’s statistics).
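A minimal preprocessing sketch is shown below, assuming the OpenAI `clip` package and Hugging Face `transformers` are installed; the file path and caption string are placeholders.

```python
import clip
import torch
from PIL import Image
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP provides its own preprocessing (resize, crop, normalize to CLIP statistics).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = clip_model.encode_image(image)      # (1, 512)

# Captions are tokenized with GPT-2's BPE tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
caption_ids = torch.tensor(tokenizer.encode("A dog plays in the park."))
```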
Training schedules employ AdamW with weight decay, an initial learning rate of $2 \times 10^{-5}$, linear warm-up ($5$k steps), batch size $40$, and task-appropriate epochs (e.g., $10$ for Conceptual Captions). Model convergence on Conceptual Captions requires roughly $72$–$80$ GPU·h (frozen and fine-tuned GPT-2, respectively) on a GTX 1080 Ti, while COCO/nocaps training converges within roughly $6$ GPU·h (Mokady et al., 2021).
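A corresponding optimizer setup could look like the sketch below, using `torch.optim.AdamW` and the linear warm-up scheduler from `transformers`; the stand-in model, weight-decay value, and total step count are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

# Stand-in for the trainable parameters (the mapping network and,
# optionally, GPT-2 when it is fine-tuned).
model = nn.Linear(512, 10 * 768)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_epochs, steps_per_epoch = 10, 75_000        # illustrative step count (3M images / 40)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5000,                      # linear warm-up
    num_training_steps=num_epochs * steps_per_epoch,
)

# Inside the training loop, after loss.backward():
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```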
4. Evaluation, Ablation, and Results
ClipCap’s performance is evaluated using standard metrics: ROUGE-L, CIDEr, SPICE, BLEU-4, and METEOR, across Conceptual Captions, nocaps, and COCO.
| Dataset | ClipCap Variant | Key Metrics | Trainable Params | GPU·h |
|---|---|---|---|---|
| Conceptual Captions | MLP + GPT-2 fine-tune | ROUGE-L: 26.71, CIDEr: 87.26, SPICE: 18.50 | 156M | 80 |
| Conceptual Captions | Transformer + frozen GPT-2 | ROUGE-L: 25.12, CIDEr: 71.82, SPICE: 16.07 | 43M | 72 |
| nocaps (overall) | Transformer + frozen GPT-2 | CIDEr: 65.83, SPICE: 10.86 | 43M | 6 |
| COCO (test) | Transformer + frozen GPT-2 | B@4: 33.53, METEOR: 27.45, CIDEr: 113.08, SPICE: 21.05 | 43M | 6 |
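For reference, corpus-level caption metrics such as CIDEr can be computed with the `pycocoevalcap` package, as in the sketch below; the candidate and reference captions are made-up examples, and SPICE/METEOR additionally require a Java runtime.

```python
from pycocoevalcap.cider.cider import Cider

# Generated captions and ground-truth references, keyed by image id.
res = {"img1": ["a dog playing in the park"]}
gts = {"img1": ["a dog plays in a grassy park",
                "a brown dog runs through the park"]}

cider_score, per_image_scores = Cider().compute_score(gts, res)
print(f"CIDEr: {cider_score:.2f}")
```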
Compared to prior models (e.g., VLP, Oscar, BUTD), ClipCap comes within a few points of state-of-the-art metrics while decreasing compute cost by one to two orders of magnitude (e.g., VLP on Conceptual Captions: 1,200 GPU·h vs. ClipCap’s 72–80 GPU·h).
Ablation findings indicate:
- Prefix length $k$: For the MLP + fine-tuning configuration, performance saturates at a relatively short prefix, whereas the Transformer + frozen variant continues to improve with longer prefixes; on COCO, increasing the prefix length raises B@4 from 28.7 to 34.24 and CIDEr from 101.0 to 115.07.
- GPT-2 fine-tuning: Boosts results on datasets with diverse captioning styles (e.g., +15.4 CIDEr for Conceptual Captions). On more uniform datasets, frozen GPT-2 is competitive or slightly superior, and is favored for parameter efficiency.
- Mapping architecture: The MLP is sufficient when GPT-2 is fine-tuned but inadequate with frozen GPT-2. The Transformer mapping is essential in the frozen setting.
- Prefix interpretability: Decoding prefix vectors by nearest-neighbor search in GPT-2’s vocabulary embedding space shows semantic alignment (“motorcycle”, “showcase”, etc.) when GPT-2 is fine-tuned; with frozen GPT-2, the decoded tokens are gibberish, suggesting the prefix vectors then encode visual content and “steering” information simultaneously (Mokady et al., 2021). A minimal decoding sketch follows below.
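The interpretability probe described above can be approximated with a cosine nearest-neighbor lookup against GPT-2’s token embedding matrix, as sketched here; the randomly generated prefix stands in for a real mapped prefix and will naturally decode to arbitrary tokens.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

wte = gpt2.transformer.wte.weight.detach()         # (vocab_size, 768) token embeddings
prefix = torch.randn(10, 768)                      # stand-in for a mapped prefix

# Cosine similarity between each prefix vector and every vocabulary embedding.
sims = F.normalize(prefix, dim=-1) @ F.normalize(wte, dim=-1).t()
nearest_ids = sims.argmax(dim=-1)                  # nearest token per prefix vector
print([tokenizer.decode([i]) for i in nearest_ids.tolist()])
```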
5. Advantages and Limitations
ClipCap provides several empirical and architectural advantages:
- Simplicity: Requires no auxiliary object detectors or bounding-box supervision.
- Efficiency: Trains in $6$–$80$ GPU·h versus hundreds or thousands for some prior art.
- Parameter-efficiency: The lightest variant uses only 43M trainable parameters.
- Competitiveness: Results are within 1–3 points of the strongest models on diverse and open-ended captioning datasets.
However, limitations remain:
- Dependence on CLIP’s recognition: Performance is limited by CLIP’s visual encoding; objects not identified by CLIP are omitted in generated captions.
- No sequence-level metric optimization: CIDEr-RL (e.g., self-critical training) is not used and could improve corpus-level metrics.
- Autoregressive decoding: Restricted to sequential (GPT-style) language generation; non-autoregressive or bidirectional decoders such as BART are noted as directions for future work (Mokady et al., 2021).
6. Interpretation and Broader Impact
ClipCap demonstrates that prefix-based “steering” of pre-trained LLMs with visual context is effective for vision-language tasks such as image captioning. The approach is parameter- and compute-efficient, enabling rapid experimentation and large-scale deployment without the need to fine-tune massive encoders or decoders.
A plausible implication is that prefix-based conditioning could generalize to broader multimodal tasks, provided the mapping network can align modality-specific features into the required input space of powerful, frozen decoders. The current design reveals both advantages (modular, interpretable, efficient) and constraints (reliance on the quality of frozen encoders, limited to the expressivity of prefix representations).
7. Comparison with Prior Art
ClipCap is closely related to models that use large-scale pre-trained representations such as CLIP for visual encoding and pre-trained Transformers for text generation. Notably, it dispenses with the need for joint training of both modalities, instead using a learned mapping as a bridge. Compared to VLP and Oscar, which involve training with object detectors or larger multimodal pre-training, ClipCap offers a simpler pipeline while achieving comparable results with substantially fewer resources.
The use of a prefix-mapping approach as a modality adapter contrasts with methods relying on joint fine-tuning or dual-stream fusion, positioning ClipCap as an efficient paradigm for capitalizing on foundation models’ frozen semantics for conditional generation tasks (Mokady et al., 2021).