ClipCap: Efficient Image Captioning
- ClipCap is a vision-language model that leverages a frozen CLIP encoder and GPT-2, bridged by a learned prefix mapping, to generate image captions efficiently.
- Its architecture offers two mapping variants—MLP for fine-tuning and Transformer for frozen GPT-2—ensuring parameter efficiency and rapid training.
- The model achieves competitive performance on datasets like COCO, Conceptual Captions, and nocaps while substantially reducing GPU compute time compared to prior methods.
ClipCap (“CLIP Prefix for Image Captioning”) is a vision-language model that addresses image captioning through a simple and parameter-efficient framework. ClipCap leverages a frozen pre-trained CLIP visual encoder and a GPT-2 language model, bridging them by learning a mapping from CLIP image representations to a learned “prefix” compatible with GPT-2’s embedding space. This architectural separation yields a lightweight, rapidly trainable captioning model that requires no additional object detectors or specialized visual annotations, and achieves competitive results on benchmarks such as Conceptual Captions, COCO, and nocaps (Mokady et al., 2021).
1. Model Architecture
ClipCap’s architecture comprises three main components: a frozen CLIP encoder, a mapping network, and a language model (GPT-2). An input image $x$ is first encoded into a visual embedding $v = \mathrm{CLIP}(x) \in \mathbb{R}^{d_v}$ (e.g., $d_v = 512$ for ViT-B/32 or $d_v = 1024$ for RN50). The mapping network $F$ projects $v$ to a sequence of $k$ “prefix” vectors $p_1, \ldots, p_k \in \mathbb{R}^{d}$, where $d$ is GPT-2’s hidden size (e.g., $d = 768$ for GPT-2 Small).
The language-model component—standard GPT-2, either frozen or fine-tuned—receives as input the concatenation of this prefix and the target caption’s token embeddings. The prefix acts as context, “steering” GPT-2 to conditionally generate image-appropriate captions. The forward path is:
$$p_1, \ldots, p_k = F\big(\mathrm{CLIP}(x)\big), \qquad c_j \sim p_\theta\big(\cdot \mid p_1, \ldots, p_k, c_1, \ldots, c_{j-1}\big), \quad j = 1, \ldots, \ell,$$
i.e., given the prefix, the caption tokens $c_1, \ldots, c_\ell$ are predicted autoregressively.
Both mapping and decoding steps are designed for maximum reuse of pre-trained knowledge, minimizing the trainable parameter count and enabling fast convergence (Mokady et al., 2021).
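To make this concrete, the snippet below shows how a prefix can be concatenated with caption token embeddings and fed to GPT-2 through `inputs_embeds`. It is a minimal sketch, not the reference implementation: the random `prefix` stands in for the mapping network’s output, and the caption string is a placeholder.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prefix_len, gpt_dim = 10, 768                       # GPT-2 Small hidden size
# Stand-in for the mapping network's output F(CLIP(x)); see the next section.
prefix = torch.randn(1, prefix_len, gpt_dim)

caption_ids = tokenizer.encode("A dog plays in the park.", return_tensors="pt")
token_embeds = gpt2.transformer.wte(caption_ids)    # caption token embeddings

# Concatenate prefix and caption embeddings and run GPT-2 over the joint sequence.
inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
logits = gpt2(inputs_embeds=inputs_embeds).logits   # (1, prefix_len + caption_len, vocab)
```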
2. Mapping Network Designs
Two mapping network variants are implemented: a multilayer perceptron (MLP) and a Transformer-based mapping.
- MLP Variant: Used when GPT-2 is fine-tuned. A lightweight multilayer perceptron maps the visual embedding $v \in \mathbb{R}^{d_v}$ through a single hidden layer to a $k \cdot d$-dimensional output, which is reshaped into the $k$ prefix vectors:
$$F_{\mathrm{MLP}}(v) = W_2\,\sigma(W_1 v), \qquad W_1 \in \mathbb{R}^{h \times d_v}, \quad W_2 \in \mathbb{R}^{k d \times h},$$
where the nonlinearity $\sigma$ is either ReLU or Tanh.
- Transformer Variant: Used when GPT-2 remains frozen. Incorporates a learned constant matrix $C \in \mathbb{R}^{k \times d}$ and self-attention layers: the visual embedding is first projected into the transformer’s input space and processed jointly with the constant, and the outputs at the constant positions serve as the $k$ prefix tokens in GPT-2’s hidden space $\mathbb{R}^{d}$. Layer computations include pre-norm LayerNorm, multi-head attention, and two-layer MLPs.
Both variants incorporate dropout (rate $0.1$) and pre-norm LayerNorm, and are trained with AdamW with weight decay. The choice between the MLP and Transformer mapping is governed by whether GPT-2 is trainable or frozen: the MLP suffices for the learned context in the fine-tuning regime, whereas the more expressive Transformer mapping is necessary when GPT-2’s parameters are fixed (Mokady et al., 2021). A minimal sketch of both mapping variants follows below.
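The sketch below renders both mapping variants under the shapes described above. It is a simplified illustration: the hidden widths, layer counts, and the use of `nn.TransformerEncoder` are assumptions for exposition, not the authors’ exact modules.

```python
import torch
import torch.nn as nn

class MLPMapper(nn.Module):
    """Maps a CLIP embedding to k prefix vectors (used when GPT-2 is fine-tuned)."""
    def __init__(self, clip_dim=512, prefix_len=10, gpt_dim=768):
        super().__init__()
        hidden = (clip_dim + prefix_len * gpt_dim) // 2    # assumed hidden width
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),                                     # ReLU is the other option
            nn.Linear(hidden, prefix_len * gpt_dim),
        )

    def forward(self, clip_embed):                         # (B, clip_dim)
        return self.mlp(clip_embed).view(-1, self.prefix_len, self.gpt_dim)

class TransformerMapper(nn.Module):
    """Maps a CLIP embedding plus a learned constant to k prefix vectors
    (used when GPT-2 stays frozen)."""
    def __init__(self, clip_dim=512, prefix_len=10, gpt_dim=768,
                 num_layers=8, num_heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        # Project the visual embedding into prefix_len tokens of width gpt_dim.
        self.input_proj = nn.Linear(clip_dim, prefix_len * gpt_dim)
        # Learned constant queries whose outputs become the prefix.
        self.const = nn.Parameter(torch.randn(prefix_len, gpt_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=gpt_dim, nhead=num_heads, dropout=0.1,
            norm_first=True, batch_first=True)             # pre-norm LayerNorm
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_embed):                         # (B, clip_dim)
        b = clip_embed.size(0)
        visual = self.input_proj(clip_embed).view(b, self.prefix_len, self.gpt_dim)
        const = self.const.unsqueeze(0).expand(b, -1, -1)
        out = self.encoder(torch.cat([visual, const], dim=1))
        return out[:, self.prefix_len:]                    # outputs at constant positions
```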
3. Training Procedure and Data
The model is trained to minimize the cross-entropy loss over caption tokens, using the following objective:
$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{\ell} \log p_\theta\!\big(c^i_j \mid p^i_1, \ldots, p^i_k, c^i_1, \ldots, c^i_{j-1}\big),$$
where $N$ is the number of training samples and $\ell$ is the caption length.
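To connect the objective to an implementation, the helper below computes cross-entropy only on the caption positions, skipping the prefix. The shift-by-one indexing and the function name are illustrative assumptions rather than the authors’ code.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: logits (B, k + L, vocab) from GPT-2 over [prefix ; caption],
# caption_ids (B, L), prefix_len = k.
def caption_loss(logits, caption_ids, prefix_len):
    # The logit at position t predicts the token at position t + 1, so
    # predictions for caption tokens start at index prefix_len - 1.
    pred = logits[:, prefix_len - 1 : -1, :]          # (B, L, vocab)
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),              # (B*L, vocab)
        caption_ids.reshape(-1),                      # (B*L,)
    )
```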
Datasets include:
- Conceptual Captions: 3M image-caption pairs from the web, with a 12.5K validation set.
- COCO Karpathy split: Standard train/validation/test images and five captions per image.
- nocaps: Training uses COCO images only; evaluation covers the in-domain, near-domain, and out-of-domain splits.
Captions are tokenized with GPT-2’s byte-pair encoding tokenizer and padded or truncated to a fixed maximum length. Images are preprocessed per CLIP’s input requirements (e.g., resizing to $224 \times 224$ and normalization with CLIP’s statistics).
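A minimal preprocessing sketch is shown below, assuming the OpenAI `clip` package and Hugging Face `transformers` are installed; the file path and caption string are placeholders.

```python
import clip
import torch
from PIL import Image
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP provides its own preprocessing (resize, crop, normalize to CLIP statistics).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = clip_model.encode_image(image)      # (1, 512)

# Captions are tokenized with GPT-2's BPE tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
caption_ids = torch.tensor(tokenizer.encode("A dog plays in the park."))
```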
Training schedules employ AdamW with weight decay, an initial learning rate of $2 \times 10^{-5}$, linear warm-up ($5$k steps), batch size $40$, and task-appropriate epochs (e.g., $10$ for Conceptual Captions). Model convergence on Conceptual Captions requires roughly $72$–$80$ GPU·h (frozen and fine-tuned GPT-2, respectively) on a GTX 1080 Ti, while COCO/nocaps training converges within roughly $6$ GPU·h (Mokady et al., 2021).
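A corresponding optimizer setup could look like the sketch below, using `torch.optim.AdamW` and the linear warm-up scheduler from `transformers`; the stand-in model, weight-decay value, and total step count are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

# Stand-in for the trainable parameters (the mapping network and,
# optionally, GPT-2 when it is fine-tuned).
model = nn.Linear(512, 10 * 768)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_epochs, steps_per_epoch = 10, 75_000        # illustrative step count (3M images / 40)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5000,                      # linear warm-up
    num_training_steps=num_epochs * steps_per_epoch,
)

# Inside the training loop, after loss.backward():
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```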
4. Evaluation, Ablation, and Results
ClipCap’s performance is evaluated using standard metrics: ROUGE-L, CIDEr, SPICE, BLEU-4, and METEOR, across Conceptual Captions, nocaps, and COCO.
| Dataset | ClipCap Variant | Key Metrics | Trainable Params | GPU·h |
|---|---|---|---|---|
| Conceptual Captions | MLP + GPT-2 fine-tune | ROUGE-L: 26.71, CIDEr: 87.26, SPICE: 18.50 | 156M | 80 |
| Conceptual Captions | Transformer + frozen GPT-2 | ROUGE-L: 25.12, CIDEr: 71.82, SPICE: 16.07 | 43M | 72 |
| nocaps (overall) | Transformer + frozen GPT-2 | CIDEr: 65.83, SPICE: 10.86 | 43M | 6 |
| COCO (test) | Transformer + frozen GPT-2 | B@4: 33.53, METEOR: 27.45, CIDEr: 113.08, SPICE: 21.05 | 43M | 6 |
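For reference, corpus-level caption metrics such as CIDEr can be computed with the `pycocoevalcap` package, as in the sketch below; the candidate and reference captions are made-up examples, and SPICE/METEOR additionally require a Java runtime.

```python
from pycocoevalcap.cider.cider import Cider

# Generated captions and ground-truth references, keyed by image id.
res = {"img1": ["a dog playing in the park"]}
gts = {"img1": ["a dog plays in a grassy park",
                "a brown dog runs through the park"]}

cider_score, per_image_scores = Cider().compute_score(gts, res)
print(f"CIDEr: {cider_score:.2f}")
```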
Compared to prior models (e.g., VLP, Oscar, BUTD), ClipCap comes within a few points of state-of-the-art metrics while decreasing compute cost by one to two orders of magnitude (e.g., VLP on Conceptual Captions: 1,200 GPU·h vs. ClipCap’s 72–80 GPU·h).
Ablation findings indicate:
- Prefix length $k$: For the MLP + fine-tuning configuration, performance saturates at a relatively short prefix, whereas the Transformer + frozen variant continues to improve with longer prefixes; on COCO, increasing the prefix length raises B@4 from 28.7 to 34.24 and CIDEr from 101.0 to 115.07.
- GPT-2 fine-tuning: Boosts results on datasets with diverse captioning styles (e.g., +15.4 CIDEr for Conceptual Captions). On more uniform datasets, frozen GPT-2 is competitive or slightly superior, and is favored for parameter efficiency.
- Mapping architecture: The MLP is sufficient when GPT-2 is fine-tuned but inadequate with frozen GPT-2. The Transformer mapping is essential in the frozen setting.
- Prefix interpretability: Decoding prefix vectors by nearest-neighbor search in GPT-2’s vocabulary embedding space shows semantic alignment (“motorcycle”, “showcase”, etc.) when GPT-2 is fine-tuned; with frozen GPT-2, the decoded tokens are gibberish, suggesting the prefix vectors then encode visual content and “steering” information simultaneously (Mokady et al., 2021). A minimal decoding sketch follows below.
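The interpretability probe described above can be approximated with a cosine nearest-neighbor lookup against GPT-2’s token embedding matrix, as sketched here; the randomly generated prefix stands in for a real mapped prefix and will naturally decode to arbitrary tokens.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

wte = gpt2.transformer.wte.weight.detach()         # (vocab_size, 768) token embeddings
prefix = torch.randn(10, 768)                      # stand-in for a mapped prefix

# Cosine similarity between each prefix vector and every vocabulary embedding.
sims = F.normalize(prefix, dim=-1) @ F.normalize(wte, dim=-1).t()
nearest_ids = sims.argmax(dim=-1)                  # nearest token per prefix vector
print([tokenizer.decode([i]) for i in nearest_ids.tolist()])
```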
5. Advantages and Limitations
ClipCap provides several empirical and architectural advantages:
- Simplicity: Requires no auxiliary object detectors or bounding-box supervision.
- Efficiency: Trains in $6$–$80$ GPU·h versus hundreds or thousands for some prior art.
- Parameter-efficiency: The lightest variant uses only 43M trainable parameters.
- Competitiveness: Results are within 1–3 points of the strongest models on diverse and open-ended captioning datasets.
However, limitations remain:
- Dependence on CLIP’s recognition: Performance is limited by CLIP’s visual encoding; objects not identified by CLIP are omitted in generated captions.
- No sequence-level metric optimization: CIDEr-RL (e.g., self-critical training) is not used and could improve corpus-level metrics.
- Autoregressive decoding: Restricted to sequential (GPT-style) language generation; non-autoregressive or bidirectional decoders such as BART are noted as directions for future work (Mokady et al., 2021).
6. Interpretation and Broader Impact
ClipCap demonstrates that prefix-based “steering” of pre-trained LLMs with visual context is effective for vision-language tasks such as image captioning. The approach is parameter- and compute-efficient, enabling rapid experimentation and large-scale deployment without the need to fine-tune massive encoders or decoders.
A plausible implication is that prefix-based conditioning could generalize to broader multimodal tasks, provided the mapping network can align modality-specific features into the required input space of powerful, frozen decoders. The current design reveals both advantages (modular, interpretable, efficient) and constraints (reliance on the quality of frozen encoders, limited to the expressivity of prefix representations).
7. Comparison with Prior Art
ClipCap is closely related to models that use large-scale pre-trained representations such as CLIP for visual encoding and pre-trained Transformers for text generation. Notably, it dispenses with the need for joint training of both modalities, instead using a learned mapping as a bridge. Compared to VLP and Oscar, which involve training with object detectors or larger multimodal pre-training, ClipCap offers a simpler pipeline while achieving comparable results with substantially fewer resources.
The use of a prefix-mapping approach as a modality adapter contrasts with methods relying on joint fine-tuning or dual-stream fusion, positioning ClipCap as an efficient paradigm for capitalizing on foundation models’ frozen semantics for conditional generation tasks (Mokady et al., 2021).