The paper introduces DeCap, a framework for zero-shot captioning leveraging Contrastive Language-Image Pre-training (CLIP) embeddings and a lightweight, text-only trained decoder. The core idea is to train a decoder to invert the CLIP text encoder, enabling caption generation from CLIP text embeddings. During inference, a training-free projection mechanism maps visual embeddings into the CLIP text embedding space, mitigating the modality gap that arises when directly using visual embeddings for captioning.
Key aspects of DeCap and the associated research include:
- Framework Overview: DeCap comprises a pre-trained CLIP model and a text decoder. The text decoder is trained to reconstruct sentences from CLIP text embeddings. During inference, visual embeddings are projected into the text embedding space via a support memory and then decoded into captions.
- Text-Only Decoder Pre-training: The text decoder is trained from scratch with a prefix language modeling objective. Given a sentence $t = \{t_1, t_2, \ldots, t_N\}$, the prefix language model learns to reconstruct $t$ conditioned on the text embedding $E_{text}(t)$ extracted by a fixed CLIP text encoder, as described by the following equation:

$$\mathcal{L}_{rec} = -\frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}\left(t_i \mid t_{<i},\, E_{text}(t)\right)$$

where:
- $\mathcal{L}_{rec}$ is the reconstruction loss
- $N$ is the length of the sentence
- $t_i$ is the $i$-th word in the sentence
- $p_{\theta}$ is the prefix language model with parameters $\theta$
- $E_{text}$ is the CLIP text encoder
This approach enables control over the style of the generated sentences by adjusting the source of the text-only training data; a minimal training sketch is shown below.
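As a rough illustration of the text-only objective, the sketch below trains a small prefix decoder to reconstruct captions from frozen CLIP text embeddings. The decoder architecture, hyperparameters, and training loop are illustrative assumptions rather than the paper's exact configuration; only the frozen CLIP text encoder and the prefix-conditioned reconstruction loss follow DeCap.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class PrefixDecoder(torch.nn.Module):
    """Small decoder-only LM conditioned on a CLIP text embedding (illustrative)."""
    def __init__(self, embed_dim=512, vocab_size=49408, hidden=512, layers=4, heads=8, max_len=78):
        super().__init__()
        self.prefix_proj = torch.nn.Linear(embed_dim, hidden)   # maps E_text(t) to a prefix token
        self.token_emb = torch.nn.Embedding(vocab_size, hidden)
        self.pos_emb = torch.nn.Parameter(torch.zeros(1, max_len, hidden))
        block = torch.nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.blocks = torch.nn.TransformerEncoder(block, layers)
        self.lm_head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, prefix, tokens):
        # prefix: (B, embed_dim) CLIP text embedding, tokens: (B, T) CLIP token ids
        h = torch.cat([self.prefix_proj(prefix).unsqueeze(1), self.token_emb(tokens)], dim=1)
        h = h + self.pos_emb[:, : h.size(1)]
        causal = torch.triu(torch.full((h.size(1), h.size(1)), float("-inf"), device=h.device), 1)
        h = self.blocks(h, mask=causal)
        return self.lm_head(h[:, :-1])  # position i predicts token i (the prefix predicts token 0)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)  # both CLIP encoders stay frozen
decoder = PrefixDecoder().to(device)
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

def train_step(captions):
    tokens = clip.tokenize(captions, truncate=True).to(device)   # (B, 77)
    with torch.no_grad():
        prefix = clip_model.encode_text(tokens).float()          # E_text(t), frozen
        prefix = prefix / prefix.norm(dim=-1, keepdim=True)
    logits = decoder(prefix, tokens)                             # (B, 77, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),  # L_rec as token-level cross-entropy
                           tokens.reshape(-1), ignore_index=0)   # 0 = CLIP padding id
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the lightweight decoder receives gradients, training requires nothing but a text corpus and a single forward pass through the frozen CLIP text encoder per caption.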
- Projection-based Decoding (PD): A training-free mechanism projects image embeddings into the CLIP text embedding space using a support memory $M = \{m_1, m_2, \ldots, m_N\}$, where $m_i = E_{text}(t_i)$. Given an image embedding $v = E_{image}(x)$, its representation in the text embedding space is obtained via a weighted combination of the memory entries:

$$v_{proj} = \sum_{i=1}^{N} w_i\, m_i, \qquad w_i = \frac{\exp\left(\cos(v, m_i)/\tau\right)}{\sum_{k=1}^{N}\exp\left(\cos(v, m_k)/\tau\right)}$$

where:
- $v_{proj}$ is the projected (combined) vector
- $N$ is the size of the text set
- $w_i$ is the weight of the $i$-th text embedding in the support memory
- $\tau$ is a temperature parameter
- $m_i$ is the CLIP text embedding of the $i$-th sentence

A minimal sketch of this projection step is given below.
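The sketch below implements the projection with the OpenAI `clip` package, using a cosine-similarity softmax over a pre-encoded support memory. The batch size and temperature value are illustrative choices, not the paper's settings.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def build_support_memory(sentences, batch_size=256):
    """Encode the text corpus once: m_i = E_text(t_i), L2-normalized."""
    chunks = []
    for i in range(0, len(sentences), batch_size):
        tokens = clip.tokenize(sentences[i:i + batch_size], truncate=True).to(device)
        e = clip_model.encode_text(tokens).float()
        chunks.append(e / e.norm(dim=-1, keepdim=True))
    return torch.cat(chunks)                          # (N, 512)

@torch.no_grad()
def project_image_embedding(image_path, memory, tau=0.01):
    """Map an image embedding into the CLIP text embedding space (training-free)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    v = clip_model.encode_image(image).float()
    v = v / v.norm(dim=-1, keepdim=True)              # (1, 512)
    weights = torch.softmax((memory @ v.T).squeeze(1) / tau, dim=0)   # w_i over the memory
    v_proj = weights @ memory                         # weighted combination of the m_i
    return v_proj / v_proj.norm()                     # prefix embedding fed to the decoder
```

The projected vector lies in the same text embedding space the decoder was trained on, which is what mitigates the modality gap at inference time.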
- Alternative Inference Strategies: The paper compares Projection-based Decoding (PD) with the following baselines (a compact sketch contrasting them appears after this list):
- CLIPRe (CLIP Retrieval): A retrieval-based approach that retrieves the most relevant text from a set based on CLIP image-text similarity.
- Visual Decoding (VD): Directly using the image embedding as the prefix embedding.
- Nearest-neighbor Decoding (NND): Using the nearest text embedding as the prefix embedding.
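For comparison, the hypothetical helper below shows how the decoding strategies differ only in how the prefix embedding is chosen: VD uses the image embedding directly, NND takes the nearest memory entry, and PD takes the softmax-weighted combination. The function name and temperature are assumptions for illustration; CLIPRe is omitted because it returns a retrieved caption rather than a prefix embedding.

```python
import torch

@torch.no_grad()
def prefix_for_strategy(v, memory, strategy="PD", tau=0.01):
    # v: (D,) L2-normalized CLIP image embedding; memory: (N, D) CLIP text embeddings.
    if strategy == "VD":                         # Visual Decoding: image embedding as-is
        return v
    sims = memory @ v                            # (N,) cosine similarities
    if strategy == "NND":                        # Nearest-neighbor Decoding: closest text embedding
        return memory[sims.argmax()]
    weights = torch.softmax(sims / tau, dim=0)   # Projection-based Decoding (PD)
    v_proj = weights @ memory
    return v_proj / v_proj.norm()
```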
- Experimental Results: DeCap was evaluated on zero-shot image captioning (MSCOCO, NoCaps), unpaired image captioning (MSCOCO, Flickr30K), and video captioning (MSR-VTT, ActivityNet-Captions, VATEX). DeCap outperforms other zero-shot captioning methods on the MSCOCO and NoCaps image captioning benchmarks and achieves state-of-the-art zero-shot results on MSR-VTT and ActivityNet-Captions for video captioning. Ablation studies explore the impact of training data size and support memory size; DeCap benefits from more training data, with the CIDEr score dropping from 91.2% to 81.5% when only 1% of the captions (5.6K) is used for training instead of the full set.
- Datasets: For image captioning, the decoder was trained on text from CC3M, SS1M, and BookCorpus and evaluated on MSCOCO and NoCaps. For video captioning, the method was evaluated on MSR-VTT, ActivityNet Captions, and VATEX.
- Ablation Studies: The impact of the training data size and the support memory size was explored, revealing that DeCap benefits from larger training datasets and support memories.
- Impact of Reconstruction: The paper shows that the reconstruction task also benefits retrieval: using the reconstructed dataset improves CLIPRe on all metrics, most notably raising CIDEr from 53.4% to 63.6%.
- Zero-Shot Performance: On MSCOCO, DeCap pre-trained on CC3M text outperforms ZeroCap by 27.5% in CIDEr, and DeCap pre-trained on SS1M outperforms ZeroCap by 36% in CIDEr.
Overall, the paper presents DeCap as a flexible and efficient framework for zero-shot captioning, demonstrating strong performance across various captioning scenarios and datasets. The method's reliance on text-only training data and a training-free projection mechanism makes it adaptable to new domains.