DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training (2303.03032v1)

Published 6 Mar 2023 in cs.CV, cs.AI, and cs.CL

Abstract: Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior works approach zero-shot captioning by either utilizing existing language models (e.g., GPT-2) or pre-training an encoder-decoder network in an end-to-end manner. In this work, we propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight visual-aware language decoder. This decoder is both data-efficient and computation-efficient: 1) it only requires text data for training, easing the burden of collecting paired data; 2) it does not require end-to-end training. When trained with text-only data, the decoder takes the text embedding extracted from the off-the-shelf CLIP encoder as a prefix embedding. The challenge is that the decoder is trained on a text corpus, but at inference it must generate captions based on visual inputs. The modality gap widely observed in multi-modal contrastive models prevents us from directly taking the visual embedding as the prefix embedding. We propose a training-free mechanism to reduce the modality gap: we project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input. Taking the projected embedding as the prefix embedding, the decoder generates high-quality descriptions that match the visual input. Experiments show that DeCap outperforms other zero-shot captioning methods and unpaired captioning methods on typical image captioning benchmarks, i.e., MSCOCO and NoCaps.

The paper introduces DeCap, a framework for zero-shot captioning leveraging Contrastive Language-Image Pre-training (CLIP) embeddings and a lightweight, text-only trained decoder. The core idea is to train a decoder to invert the CLIP text encoder, enabling caption generation from CLIP text embeddings. During inference, a training-free projection mechanism maps visual embeddings into the CLIP text embedding space, mitigating the modality gap that arises when directly using visual embeddings for captioning.

Key aspects of DeCap and the associated research include:

  • Framework Overview: DeCap comprises a pre-trained CLIP model and a text decoder. The text decoder is trained to reconstruct sentences from CLIP text embeddings. During inference, visual embeddings are projected into the text embedding space via a support memory and then decoded into captions.
  • Text-Only Decoder Pre-training: The text decoder is trained from scratch with a prefix language modeling objective. Given a sentence $t = \{word_1, word_2, ..., word_{|t|}\}$, the prefix language model $P_\theta$ learns to reconstruct $t$ conditioned on the text embedding extracted by the frozen CLIP text encoder, as described by the following equation (a minimal training sketch appears after this list):

    $$\mathcal{L}_{Recons}(\theta) = -\frac{1}{|t|} \sum_{i=1}^{|t|}\log P_\theta(word_i \mid word_{<i}, E_{text}(t))$$

    where:

    • $\mathcal{L}_{Recons}(\theta)$ is the reconstruction loss
    • $|t|$ is the length of the sentence
    • $word_i$ is the $i$-th word in the sentence
    • $P_\theta$ is the prefix language model with parameters $\theta$
    • $E_{text}(\cdot)$ is the CLIP text encoder

    This approach enables control over the style of generated sentences by adjusting the source of the text-only data.

  • Projection-based Decoding (PD): A training-free mechanism projects image embeddings into the CLIP text embedding space using a support memory $M = \{\mathbf{m}_1, \mathbf{m}_2, ..., \mathbf{m}_N\}$, where $\mathbf{m}_i = E_{text}(t_i)$. Given an image embedding $\mathbf{v} = E_{image}(I)$, its representation in the text embedding space, $\mathbf{v}_{proj}$, is obtained via a weighted combination (see the projection sketch after this list):

    $$\mathbf{v}_{proj} = \sum_{i=1}^{N} w_i \mathbf{m}_i = \sum_{i=1}^{N} \frac{\exp((\mathbf{m}_i^{\top} \mathbf{v}) / \tau)}{\sum_{k=1}^{N} \exp((\mathbf{m}_k^{\top} \mathbf{v}) / \tau)} \mathbf{m}_i$$

    where:

    • $\mathbf{v}_{proj}$ is the projected embedding in the CLIP text embedding space
    • $N$ is the size of the text set $T$
    • $w_i$ is the weight of the $i$-th text embedding in the support memory
    • $\tau$ is a temperature parameter
    • $\mathbf{m}_i$ is the CLIP text embedding of the $i$-th sentence
  • Alternative Inference Strategies: The paper compares Projection-based Decoding (PD) with:
    • CLIPRe (CLIP Retrieval): A retrieval-based approach that retrieves the most relevant text from a set based on CLIP image-text similarity.
    • Visual Decoding (VD): Directly using the image embedding as the prefix embedding.
    • Nearest-neighbor Decoding (NND): Using the nearest text embedding as the prefix embedding.
  • Experimental Results: DeCap was evaluated on zero-shot image captioning (MSCOCO, NoCaps), unpaired image captioning (MSCOCO, Flickr30K), and video captioning (MSR-VTT, ActivityNet-Captions, VATEX). It outperforms other zero-shot captioning methods on the MSCOCO and NoCaps image captioning benchmarks and achieves state-of-the-art zero-shot results on MSR-VTT and ActivityNet-Captions for video captioning. DeCap also benefits from more training data: the CIDEr score drops from 91.2% to 81.5% when training on only 1% of the data (5.6K captions) instead of the full set.
  • Datasets: The decoder was trained on text from CC3M, SS1M, and Book Corpus and validated on MSCOCO and NoCaps. For video captioning, the method was evaluated on MSR-VTT, ActivityNet Captions, and VATEX.
  • Ablation Studies: The impact of the training data size and the support memory size was explored, revealing that DeCap benefits from larger training datasets and support memories.
  • Impact of Reconstruction: The reconstruction task also improves CLIPRe: the reconstructed dataset raises CLIPRe's performance on all metrics, most notably CIDEr, which rises from 53.4% to 63.6%.
  • Zero-Shot Performance: On MSCOCO, DeCap pre-trained on CC3M-text outperforms ZeroCap by 27.5% in CIDEr, and DeCap pre-trained on SS1M outperforms ZeroCap by 36% in CIDEr.
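
To make the training objective concrete, below is a minimal PyTorch sketch of the text-only decoder pre-training described above. It is a sketch under stated assumptions, not the authors' implementation: `PrefixDecoder`, its hyperparameters, and the tokenization are illustrative, and `clip_embed` stands for the frozen CLIP text embedding $E_{text}(t)$ of each caption (e.g., computed once with an off-the-shelf CLIP text encoder under `torch.no_grad()`).

```python
# Illustrative sketch only: architecture and hyperparameters are assumptions,
# not the configuration reported in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrefixDecoder(nn.Module):
    """Lightweight causal Transformer that reconstructs a caption from a CLIP text embedding."""

    def __init__(self, vocab_size, clip_dim=512, d_model=256, n_heads=4, n_layers=4, max_len=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len + 1, d_model)      # +1 slot for the prefix embedding
        self.prefix_proj = nn.Linear(clip_dim, d_model)        # maps the CLIP embedding to a prefix token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, clip_embed, tokens):
        # clip_embed: (B, clip_dim) frozen CLIP text embedding E_text(t) of the full sentence
        # tokens:     (B, T) token ids of the same sentence
        T = tokens.size(1)
        prefix = self.prefix_proj(clip_embed).unsqueeze(1)     # (B, 1, d_model)
        x = torch.cat([prefix, self.token_emb(tokens)], dim=1) # prepend the prefix embedding
        x = x + self.pos_emb(torch.arange(T + 1, device=tokens.device))[None]
        causal = nn.Transformer.generate_square_subsequent_mask(T + 1).to(tokens.device)
        h = self.blocks(x, mask=causal)                        # causal self-attention over prefix + tokens
        return self.lm_head(h)                                 # (B, T + 1, vocab_size)


def reconstruction_loss(decoder, clip_embed, tokens):
    """L_Recons: predict word_i from word_<i and E_text(t); the prefix sits at position 0."""
    logits = decoder(clip_embed, tokens[:, :-1])               # feed word_1 .. word_{T-1}
    # Output position 0 (the prefix) predicts word_1; position i predicts word_{i+1}.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
```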
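
The projection-based decoding step is equally compact. The sketch below assumes `text_memory` holds the support memory $M$ as L2-normalized CLIP text embeddings of shape (N, d) and `image_embed` is an L2-normalized CLIP image embedding of shape (d,); the temperature `tau` and the final re-normalization are illustrative choices rather than the paper's reported settings. The nearest-neighbor decoding (NND) baseline is included for contrast, and visual decoding (VD) would simply pass `image_embed` itself to the decoder.

```python
# Illustrative sketch of projection-based decoding (PD) and the NND baseline.
import torch
import torch.nn.functional as F


def project_to_text_space(image_embed, text_memory, tau=0.01):
    """PD: v_proj = sum_i softmax(m_i^T v / tau) * m_i, a point inside the CLIP text embedding space."""
    sims = text_memory @ image_embed            # (N,) similarities m_i^T v (embeddings pre-normalized)
    weights = F.softmax(sims / tau, dim=0)      # (N,) weights w_i over the support memory
    v_proj = weights @ text_memory              # (d,) weighted combination of text embeddings
    return F.normalize(v_proj, dim=0)           # re-normalization is an assumption of this sketch


def nearest_neighbor_decoding(image_embed, text_memory):
    """NND baseline: take the single closest text embedding instead of a weighted combination."""
    idx = torch.argmax(text_memory @ image_embed)
    return text_memory[idx]
```

The returned vector replaces $E_{text}(t)$ as the prefix embedding, from which the text-only trained decoder generates the caption autoregressively.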

Overall, the paper presents DeCap as a flexible and efficient framework for zero-shot captioning, demonstrating strong performance across various captioning scenarios and datasets. The method's reliance on text-only training data and a training-free projection mechanism makes it adaptable to new domains.

Authors (4)
  1. Wei Li (1121 papers)
  2. Linchao Zhu (78 papers)
  3. Longyin Wen (45 papers)
  4. Yi Yang (855 papers)
Citations (68)