Conditional Context Optimization (CoCoOp)

Updated 2 April 2026
  • The paper introduces a dynamic prompt generation method that conditions prompts on each image, significantly improving recognition on unseen classes.
  • CoCoOp integrates static context vectors with a lightweight Meta-Net to generate instance-specific tokens, boosting adaptive performance across datasets.
  • Empirical results demonstrate enhanced cross-dataset accuracy and domain robustness, though at the cost of increased per-image computational overhead.

Conditional Context Optimization (CoCoOp) is a prompt learning framework for adapting pre-trained vision-language models, such as CLIP, to downstream tasks. CoCoOp extends Context Optimization (CoOp) by introducing dynamic prompts that are conditioned on each input image, which addresses the generalization limitations of static context prompts. Using a lightweight neural prompt generator, CoCoOp improves recognition accuracy on classes unseen during training, enhances transferability to new datasets, and achieves better domain robustness, at the cost of additional computation per image (Zhou et al., 2022).

1. Foundations: Vision-Language Models and Prompt Learning

CLIP (Contrastive Language–Image Pre-training) is a prominent vision-language model that aligns images and natural language via contrastive learning. During CLIP training, an image encoder $f(\cdot)$ and a text encoder $g(\cdot)$ are jointly optimized using the InfoNCE loss, mapping paired (image, text) inputs to a shared embedding space. At inference, classification for class $i$ is performed by synthesizing a prompt $t_i$ (e.g., “a photo of a {class}”), encoding it as $w_i = g(t_i)$, and computing class probabilities for an input image $I$ using cosine similarity between $f(I)$ and $w_i$:

$$p(y \mid I) = \frac{\exp(\mathrm{sim}(x, w_y)/\tau)}{\sum_k \exp(\mathrm{sim}(x, w_k)/\tau)}$$

where $x = f(I)$ and $\tau$ is a temperature parameter.
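
The classification rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for the encoder outputs $f(I)$ and $g(t_i)$, the helper names are hypothetical, and the temperature value is arbitrary rather than CLIP's learned one.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def class_probabilities(image_feature, class_weights, tau):
    """Softmax over temperature-scaled cosine similarities."""
    logits = [cosine_similarity(image_feature, w) / tau for w in class_weights]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy features (assumption): x stands in for f(I), each row of w for g(t_i).
x = [0.9, 0.1, 0.0]
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = class_probabilities(x, w, tau=0.1)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

A smaller `tau` sharpens the distribution around the most similar class, while a larger one flattens it.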

CLIP relies on hand-crafted, fixed prompts. While robust for zero-shot transfer, these may not maximize performance for specific downstream datasets. Context Optimization (CoOp) addresses this by learning a set of context vectors to optimize prompts on the target data. However, CoOp's static learned contexts overfit to the base classes, and accuracy degrades substantially on classes unseen during training.

2. Model Architecture and Conditional Prompt Generation

CoCoOp introduces instance-conditional prompts by augmenting CoOp's static context tokens with a dynamic token generated per input image. Its architecture includes:

  • Static Context Vectors: $v_1, \dots, v_M$, shared across all prompts, as in CoOp.
  • Class Name Embeddings: For class $i$, the word embedding $c_i$ of the class name, possibly multi-token.
  • Meta-Net: A lightweight neural network, $h_\theta(\cdot)$, parameterized by $\theta$. $h_\theta$ takes the image feature $x = f(I)$ and produces an instance-conditional token $\pi = h_\theta(x)$. This network is a two-layer MLP (Linear-ReLU-Linear) with a bottleneck hidden layer, which in Zhou et al. (2022) reduces the input dimension by 16×.
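
A minimal sketch of such a Meta-Net, written in plain Python so that it runs without a deep learning framework. The class and helper names, the toy dimensions, and the Gaussian initialization are assumptions for illustration; the 16× bottleneck follows the structure reported by Zhou et al. (2022), and a real implementation would use a tensor library with trainable parameters.

```python
import random

def linear(x, weight, bias):
    """Affine map; weight has shape (out_dim, in_dim)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(in_dim, out_dim, rng, scale=0.02):
    """Small random weights and zero biases (illustrative initialization)."""
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

class MetaNet:
    """Two-layer bottleneck MLP: Linear -> ReLU -> Linear."""

    def __init__(self, feature_dim, token_dim, reduction=16, seed=0):
        rng = random.Random(seed)
        hidden_dim = max(1, feature_dim // reduction)
        self.w1, self.b1 = init_linear(feature_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, token_dim, rng)

    def __call__(self, image_feature):
        hidden = relu(linear(image_feature, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)

# Toy sizes (assumption): 32-dimensional image feature and token embedding.
meta_net = MetaNet(feature_dim=32, token_dim=32)
x = [0.1] * 32
pi = meta_net(x)  # instance-conditional token for this image feature
```

The output has the same dimensionality as a word embedding, so it can be combined with the context vectors directly.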

Conditional prompts are synthesized in two ways:

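  1. Token Appending: Append the conditional token $\pi$ as an additional token:

$$t_i(x) = \{v_1, \dots, v_M, \pi, c_i\}$$

  2. Token Addition (used in the original work): Add $\pi$ to each context vector:

$$v_m(x) = v_m + \pi, \qquad t_i(x) = \{v_1(x), \dots, v_M(x), c_i\}$$

Each prompt token (static, conditional, and class) is fed to the frozen CLIP text encoder $g(\cdot)$. Only the context vectors $\{v_m\}_{m=1}^{M}$ and the Meta-Net parameters $\theta$ are tuned, keeping the encoders $f$ and $g$ frozen.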
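
The two composition strategies can be contrasted with a short sketch. The function names and the toy embeddings below are hypothetical; prompts are represented as plain lists of token vectors, and the start/end tokens a real text encoder expects are omitted.

```python
def append_token(context, pi, class_embedding):
    """Token appending: [v_1, ..., v_M, pi, c_i]."""
    return ([list(v) for v in context]
            + [list(pi)]
            + [list(c) for c in class_embedding])

def add_token(context, pi, class_embedding):
    """Token addition: [v_1 + pi, ..., v_M + pi, c_i]."""
    shifted = [[v_j + p_j for v_j, p_j in zip(v, pi)] for v in context]
    return shifted + [list(c) for c in class_embedding]

# Toy example (assumption): M = 2 context vectors, embedding dimension 3,
# and a class name that occupies a single token.
context = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pi = [0.5, 0.5, 0.5]
class_embedding = [[0.0, 0.0, 1.0]]

appended = append_token(context, pi, class_embedding)  # M + 1 + 1 tokens
added = add_token(context, pi, class_embedding)        # M + 1 tokens
```

Token addition keeps the prompt length unchanged, whereas appending lengthens every prompt by one token.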

3. Training Objective and Optimization

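Prompt learning with CoCoOp minimizes a class-conditional cross-entropy loss that has the same softmax-over-similarities form as InfoNCE:

$$\mathcal{L}(I, y) = -\log p(y \mid I) = -\log \frac{\exp(\mathrm{sim}(x, g(t_y(x)))/\tau)}{\sum_k \exp(\mathrm{sim}(x, g(t_k(x)))/\tau)}$$

where $x = f(I)$ and $t_k(x)$ is the image-conditioned prompt for class $k$. The loss is averaged over a batch of $B$ labeled images:

$$\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \mathcal{L}(I_b, y_b)$$

CLIP backbone parameters are frozen; only the context vectors $\{v_m\}_{m=1}^{M}$ and Meta-Net weights $\theta$ are optimized via gradient descent.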
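
As a rough illustration of the objective, the sketch below computes the per-image cross-entropy and its batch average from precomputed logits. The function names and the logit values are made up; each logit is assumed to already be a similarity divided by the temperature.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of `label` (log-sum-exp trick)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[label]

def batch_loss(batch_logits, labels):
    """Mean cross-entropy over a batch of (logits, label) pairs."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Two toy images, three classes. The first is confidently correct;
# the second is maximally uncertain (uniform logits).
batch_logits = [[5.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
loss = batch_loss(batch_logits, labels)
```

In training, gradients of this scalar would flow only into the context vectors and Meta-Net weights, since the encoders are frozen.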

4. Empirical Results and Generalization Performance

Experiments demonstrate that CoCoOp substantially improves generalization over CoOp, particularly under class shift and in cross-dataset transfer. Key findings include:

Base-to-New Class Generalization (11 datasets, 16-shot setting)

| Method | Base (%) | New (%) | Harmonic Mean (%) |
|---|---|---|---|
| CLIP (manual prompt) | 69.3 | 74.2 | 71.7 |
| CoOp (static) | 82.7 | 63.2 | 71.7 |
| CoCoOp (conditional) | 80.5 | 71.7 | 75.8 |

CoOp achieves high accuracy on base classes but drops by about 19.5 percentage points on new classes (82.7% to 63.2%). CoCoOp narrows this gap to about 8.8 points (80.5% to 71.7%) and improves the harmonic mean over CoOp by about 4 points (75.8% vs. 71.7%). On ImageNet's new split, CoCoOp increases new-class accuracy from 67.9% (CoOp) to 70.4%.
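
The harmonic mean and the base-to-new gaps can be checked directly from the base and new accuracies reported above. Because those inputs are rounded to one decimal place, the recomputed harmonic means may differ from the reported ones by up to about 0.1.

```python
def harmonic_mean(base, new):
    """Harmonic mean of base-class and new-class accuracy."""
    return 2.0 * base * new / (base + new)

# (base %, new %) pairs from the base-to-new generalization results.
results = {
    "CLIP": (69.3, 74.2),
    "CoOp": (82.7, 63.2),
    "CoCoOp": (80.5, 71.7),
}

for method, (base, new) in results.items():
    gap = base - new
    print(f"{method}: gap = {gap:.1f} points, H = {harmonic_mean(base, new):.2f}")
```

The harmonic mean penalizes imbalance between the two accuracies, which is why CoOp's strong base accuracy does not lift it above zero-shot CLIP on this metric.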

Cross-Dataset Transfer (ImageNet-trained prompt evaluated zero-shot on 10 datasets)

Average accuracy increases from 63.9% (CoOp) to 65.7% (CoCoOp), with especially notable gains on Flowers (+3.2 points), Aircraft (+4.5), SUN (+3.2), DTD (+3.8), and UCF (+1.7).

Domain Generalization (ImageNet-trained prompt evaluated on variants)

| Dataset | CoOp (%) | CoCoOp (%) |
|---|---|---|
| ImageNet-Sketch | 47.99 | 48.75 |
| ImageNet-A | 49.71 | 50.63 |
| ImageNet-R | 75.21 | 76.18 |

CoCoOp consistently matches or outperforms CoOp on out-of-distribution sets.

5. Ablation Analyses and Component Contributions

A range of ablations clarify the role of design choices:

  • Context Length ($M$): Increasing $M$ improves new-class accuracy for both CoOp and CoCoOp, though CoCoOp with a small $M$ plus the conditional token achieves comparable or better performance than CoOp with larger $M$.
  • Token Initialization: Using CLIP’s embedding of "a photo of a" to initialize the context vectors outperforms random Gaussian initialization on both base and new classes.
  • Meta-Net vs Parameter Count: Increasing $M$ in static CoOp to match CoCoOp’s parameter count does not close the performance gap, confirming that instance conditioning, not just model size, is critical for generalization.
  • Class-Incremental Evaluation: On test sets containing both base and new classes (no retraining), accuracy is 65.2% (CLIP), 65.6% (CoOp), and 69.1% (CoCoOp), supporting increased robustness to class expansion.

6. Limitations and Directions for Further Research

CoCoOp entails additional computational overhead compared to static prompt methods. Each image requires its own set of prompts and therefore its own forward pass through the text encoder $g(\cdot)$, so prompts cannot be encoded once and shared across a mini-batch as in CoOp. In practice, training uses a batch size of 1, which increases memory and time requirements.
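
A back-of-the-envelope sketch of this overhead counts how many prompts the text encoder must process for one mini-batch. The function name and the batch and class sizes are illustrative assumptions, and the count ignores implementation details such as caching or memory limits.

```python
def prompts_to_encode(batch_size, num_classes, conditional):
    """Number of prompts passed through the text encoder for one mini-batch.

    Static prompts (CoOp-style) are shared by every image in the batch;
    conditional prompts (CoCoOp-style) must be rebuilt for each image.
    """
    if conditional:
        return batch_size * num_classes
    return num_classes

static_cost = prompts_to_encode(batch_size=32, num_classes=100, conditional=False)
conditional_cost = prompts_to_encode(batch_size=32, num_classes=100, conditional=True)
overhead_factor = conditional_cost / static_cost
```

Under this simple count the overhead grows linearly with the batch size, which is consistent with training at a batch size of 1.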

On 7 out of 11 datasets, CoCoOp’s accuracy on new classes still falls short of CLIP's manual prompts, indicating room for closing the gap under strong class shift. Potential future research includes:

  • Developing more efficient conditional prompt architectures (e.g., parallel static and conditional tokens, shared adapter modules).
  • Scaling Meta-Net capacity or pre-training it across diverse datasets for better transferability.
  • Exploring richer conditional signals, such as multi-scale features or textual image descriptions.

In summary, CoCoOp’s innovation lies in dynamic, image-conditioned prompts, effectively addressing class-shift and out-of-distribution robustness by integrating a small, learnable neural network into the prompt generation process. This yields superior generalization to unseen classes and new domains, at the cost of greater training complexity and computational requirements (Zhou et al., 2022).
