This paper introduces a novel approach, Context Optimization with Multi-Knowledge Representation (CoKnow), to enhance the performance of vision-language models (VLMs) such as CLIP on downstream tasks. The primary focus is on addressing a limitation of current prompt learning methods, which often lack diversity in prompt templates and thereby restrict the potential of pre-trained VLMs. The authors posit that a single text prompt may not fully capture the complexity of an image, and propose enriching the prompt context by incorporating knowledge from multiple perspectives and abstraction levels, termed Multi-Knowledge Representation.
The authors define Multi-Knowledge as comprising three types: visual knowledge (VK), non-visual knowledge (NVK), and panoramic knowledge (PK). VK includes captions describing the image or its category. NVK incorporates more abstract knowledge beyond visual aspects. PK combines multi-level descriptions, such as both VK and NVK, into a single comprehensive description. The authors leverage GPT-4 to automatically generate Multi-Knowledge Representations using a set of simple prompt templates.
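The paper does not reproduce its exact generation prompts here, so the following is only a minimal sketch of how such descriptions could be requested per class name with the openai Python client (>=1.0). The template wordings, the `TEMPLATES` dict, and `generate_multi_knowledge` are illustrative assumptions, not the authors' prompts.

```python
# Sketch: generating VK / NVK / PK descriptions for one class name with GPT-4.
# Template wordings below are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TEMPLATES = {
    "VK":  "Describe the visual appearance of a {cls} in one sentence.",
    "NVK": "Describe abstract, non-visual knowledge about a {cls} (function, habitat, usage) in one sentence.",
    "PK":  "Give one sentence combining the visual appearance and abstract knowledge of a {cls}.",
}

def generate_multi_knowledge(cls_name: str) -> dict:
    """Return a {knowledge_type: description} dict for a single class name."""
    outputs = {}
    for kind, template in TEMPLATES.items():
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": template.format(cls=cls_name)}],
        )
        outputs[kind] = response.choices[0].message.content.strip()
    return outputs

# Example: generate_multi_knowledge("golden retriever")
```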
The CoKnow framework consists of two key modules: a prompt optimizer guided by Multi-Knowledge Representation and a lightweight semantic knowledge mapper. The prompt optimizer enables adaptive learning of prompt templates rich in domain knowledge, while the semantic knowledge mapper generates Multi-Knowledge Representations from images without requiring additional input. The framework is designed to be plug-and-play and compatible with VLMs beyond CLIP.
The method feeds three types of templates into the text encoder: learnable context (soft prompt), Multi-Knowledge, and hand-crafted templates. The image encoder's output is passed through the semantic knowledge mappers, and a contrastive loss is computed between the mapped image embeddings and the corresponding target template embeddings. The original image representation and the mapped image representations are then combined using a weighting parameter $\lambda$ before a further contrastive loss is computed against the learnable contexts. The two semantic knowledge mappers are implemented as three-layer fully connected networks with ReLU activations, where the hidden-layer dimension is one-fourth of the input dimension.
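As a concrete illustration, below is a minimal PyTorch sketch of one semantic knowledge mapper and the weighted combination of original and mapped features. The exact layer arrangement and the precise form of the combination are assumptions inferred from the description above (three fully connected layers, hidden width of one-fourth the input dimension, ReLU activations, weighting parameter λ); this is not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticKnowledgeMapper(nn.Module):
    """Three-layer fully connected mapper; hidden width = dim // 4 (layer layout assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = dim // 4
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # map back to the CLIP embedding dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def combine_representations(f, f_m1, f_m2, lam: float = 0.5):
    """Weighted combination of original and mapped image features (combination form assumed)."""
    combined = lam * f + (1.0 - lam) * (f_m1 + f_m2)
    return F.normalize(combined, dim=-1)
```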
During inference, the probability of an image $x$ belonging to category $i$ is calculated using the following equations:

$$\hat{f} = \lambda\, f(x) + (1 - \lambda)\big(f_{m_1}(x) + f_{m_2}(x)\big)$$

$$p(y = i \mid x) = \frac{\exp\big(\mathrm{sim}(\hat{f}, w_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(\hat{f}, w_j)/\tau\big)}$$

- $\hat{f}$: combined image representation
- $\lambda$: weight parameter
- $f(x)$: output of the image encoder for the given image
- $f_{m_1}(x)$, $f_{m_2}(x)$: outputs of the semantic knowledge mappers for the given image
- $p(y = i \mid x)$: probability of image $x$ belonging to category $i$
- $w_i$: output of the text encoder when the category is $i$
- $\mathrm{sim}(\cdot,\cdot)$: cosine similarity between $\hat{f}$ and $w_i$
- $\tau$: temperature parameter
- $N$: total number of categories
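The sketch below shows how this inference step could be computed in PyTorch, following the reconstructed equations above. Symbol choices and the mapper interface mirror the earlier sketch and are assumptions, not the authors' code.

```python
import torch

@torch.no_grad()
def classify(image_features, mapper_m1, mapper_m2, text_features, lam=0.5, tau=0.01):
    """
    image_features: (B, D) image-encoder outputs f(x)
    text_features:  (N, D) text-encoder outputs w_i, one row per category
    Returns a (B, N) tensor of class probabilities p(y=i|x).
    """
    f_m1 = mapper_m1(image_features)
    f_m2 = mapper_m2(image_features)
    combined = lam * image_features + (1 - lam) * (f_m1 + f_m2)

    # Cosine similarity = dot product of L2-normalized vectors.
    combined = combined / combined.norm(dim=-1, keepdim=True)
    text = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = combined @ text.t() / tau   # sim(f_hat, w_i) / tau
    return logits.softmax(dim=-1)        # softmax over the N categories
```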
Experiments were conducted on 11 publicly available datasets: ImageNet, Caltech101, Oxford Pets, Flowers102, Food101, Stanford Cars, FGVC Aircraft, EuroSAT, UCF101, DTD, and SUN397. The authors followed the few-shot evaluation protocol of CoOp, training with 1, 2, 4, 8, and 16 shots. ResNet-50 was used as the backbone of the CLIP image encoder, with ViT-B/16 also evaluated. CoKnow consistently outperformed previous methods, demonstrating its effectiveness for prompt learning in VLMs: it achieves higher average top-1 accuracy than CoOp on every dataset, and with 4 shots its results approach those of CoOp with 8 shots and surpass those of Wise-FT.
Ablation studies were performed to analyze the impact of the different Multi-Knowledge types (VK, NVK, PK) and the weighting parameter $\lambda$. The results indicate that PK generally provides the best performance. The authors also explored the impact of context length and class-name position on the results. Additionally, the paper investigates the effect of using the semantic knowledge mappers to map the original CLIP image representations; the experiments show that these original representations have a significant impact on prompt learning, especially in the 1-shot setting.
The paper also evaluated the robustness of CoKnow under out-of-distribution (OOD) conditions, where prompts were learned on ImageNet and then tested on ImageNetV2. The results suggest that the method generalizes effectively to out-of-distribution data.