- The paper presents the CAKI framework that injects class-specific knowledge into prompt learning for vision-language models.
- It combines class-specific prompt generation with query-key matching to enhance fine-grained discrimination and overcome domain shifts.
- Empirical results on 10 datasets show 1–5% accuracy improvements, confirming CAKI's plug-and-play adaptability with current methods.
Class-aware Knowledge Injection for Prompt Learning in Vision-Language Models
Introduction
The manuscript "Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-LLM" (2605.05910) proposes a prompt learning augmentation strategy for vision-language models (VLMs) such as CLIP. The core innovation is a Class-Aware Knowledge Injection (CAKI) framework that explicitly encodes, retrieves, and injects class-specific knowledge into the inference pipeline for both zero-shot and few-shot recognition regimes. This approach addresses a key limitation of current prompt learning strategies, namely their failure to model class-specific associations at the prompt level, which restricts generalization and fine-grained discrimination, especially for downstream tasks with limited supervision or domain shift.
Methodology
CAKI is architected around two primary modules: Class-specific Prompt Generation (CSPG) and Query-key Prompt Matching (QKPM).
- Class-specific Prompt Generation (CSPG): For each class, class-aware prompt learning is performed using the available few-shot class-labelled samples. Each class is associated with a prompt (learned contextual vectors prepended or appended to class names), which is stored as a value in a key-value memory structure. The key is the text embedding of the class name computed by the (frozen) CLIP text encoder, ensuring compatibility with the pre-trained backbone and avoiding catastrophic forgetting.
- Query-key Prompt Matching (QKPM): At test time, image features extracted from the visual encoder serve as queries. The coarse prediction from a base prompt learning method (e.g., CoOp) is used to generate similarity-based matching scores, which index the key-value cache for the top-K semantically closest class-specific prompts. These top-K prompt ensembles yield a set of class-aware predictions. The final output for a test sample is an ensemble of the coarse prediction and the fine-grained, class-aware predictions, weighted by the matching scores.
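The two modules described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `caki_predict`, the `temperature` parameter, and the layout of the cache `values` (one precomputed text-feature matrix per class-specific prompt) are hypothetical. So are the use of renormalized coarse probabilities as matching scores and the convex fusion controlled by `beta`. The frozen-encoder keys are reduced to class indices here for brevity.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def caki_predict(image_feat, coarse_probs, values, beta=0.3, k=3, temperature=0.01):
    """Fuse a coarse prediction with top-k class-aware predictions.

    image_feat:   (D,) L2-normalized image feature, used as the query.
    coarse_probs: (C,) class probabilities from the base prompt learner.
    values:       (C, C, D) cache (assumed layout); values[c] holds the
                  L2-normalized text features of all C classes, computed
                  with the prompt learned for class c.
    beta:         weight on the coarse prediction (assumed convex fusion).
    """
    # Matching step: keep the k classes the coarse model finds most likely
    # and renormalize their probabilities into matching scores.
    top = np.argsort(-coarse_probs)[:k]
    weights = coarse_probs[top] / coarse_probs[top].sum()

    # Each retrieved class-specific prompt yields its own prediction;
    # the predictions are averaged with the matching scores as weights.
    fine = np.zeros_like(coarse_probs, dtype=float)
    for w, c in zip(weights, top):
        logits = values[c] @ image_feat / temperature
        fine += w * softmax(logits)

    # Final ensemble of coarse and class-aware predictions.
    return beta * coarse_probs + (1.0 - beta) * fine
```

In this sketch, setting `beta=1.0` returns the coarse prediction unchanged, which shows how the class-aware branch sits on top of an unmodified base method.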
Figure 1: Workflow of CAKI, including class-specific prompt generation, cache memory, query-key prompt matching, and prediction ensembling.
This overall design ensures that general domain or coarse contextual knowledge (from base prompts) and fine-grained class-specific cues are aggregated adaptively, improving both in-domain and out-of-domain performance.
Experimental Results
CAKI is comprehensively evaluated across 10 standard image recognition datasets spanning fine-grained and distributional robustness benchmarks, employing 1-shot, 4-shot, and 16-shot base-to-novel and few-shot learning settings. Its compatibility and compositionality with state-of-the-art prompt learning, test-time adaptation, and feature adapter schemes are also demonstrated.
Key empirical findings:
- CAKI achieves consistent absolute improvements (typically 1–5% harmonic mean accuracy) over existing prompt learning methods (CoOp, CoCoOp, MaPLe, PromptSRC, TCP, GalLoP), especially on base-to-novel and cross-domain generalization splits.
- The ablation analysis indicates that both CSPG and QKPM are essential; using either in isolation yields marked accuracy degradation compared to their combination.
- CAKI is agnostic to the choice of backbone prompt learning, functioning as a plug-and-play enhancement.
- CAKI improves not only closed-vocabulary classification but also extends to domain adaptation, semantic segmentation, and object detection tasks, confirming its portability.
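For reference, the harmonic mean accuracy mentioned above is conventionally computed from base-class and novel-class accuracy in base-to-novel evaluation. The helper below is a small sketch of that metric; the function name `harmonic_mean` and the input values in the usage line are illustrative and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    total = base_acc + novel_acc
    if total == 0:
        # Both accuracies are zero; define the harmonic mean as zero.
        return 0.0
    return 2.0 * base_acc * novel_acc / total


# Illustrative inputs only (not taken from the paper).
h = harmonic_mean(80.0, 70.0)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well on it by trading novel-class accuracy for base-class accuracy.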
A per-category analysis on the EuroSAT dataset shows that class-aware prompt matching raises class-specific prediction accuracy above the matching accuracy of coarse models, with particular benefits for ambiguous classes where domain-blind prompts typically underperform.
Figure 2: Comparison between the recognition and the matching accuracy of CAKI per-category on EuroSAT, highlighting CAKI's gains in resolving classes with visual overlap.
Visualizations further confirm the alignment between retrieved class-specific prompts and correct semantic classes, both for success and failure scenarios.
Figure 3: Visualization of matched samples and their class associations for test queries, depicting effective semantic retrieval and discriminative competence.
Parameter Analysis and Failure Cases
Hyperparameter studies on β (the coarse/fine knowledge trade-off) and K (the number of retrieved class-specific prompts) show that recognition is best at moderate fusion (e.g., β=0.3) and small K (K=3), as overly large K introduces semantic noise and overfitting.
Figure 4: Effect of β weighting between coarse- and fine-grained predictions for base class accuracy.
Figure 5: Sensitivity of classification accuracy to the number of retrieved class-specific prompts (K).
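A sensitivity study of this kind amounts to a small grid search over β and K on held-out data. The sketch below shows one way to run it. The helper `grid_search` and the stand-in scoring function `toy_accuracy` are hypothetical. The toy function is synthetic, shaped only so that its maximum mirrors the reported optimum, and it does not reproduce any measured accuracy.

```python
from itertools import product


def grid_search(evaluate, betas, ks):
    """Return (best_score, best_beta, best_k) over the Cartesian grid.

    evaluate: callable (beta, k) -> validation score, higher is better.
    """
    best = None
    for beta, k in product(betas, ks):
        score = evaluate(beta, k)
        if best is None or score > best[0]:
            best = (score, beta, k)
    return best


def toy_accuracy(beta, k):
    # Synthetic stand-in for validation accuracy (an assumption, not data).
    # In practice this would run the fused model on a validation split.
    return 1.0 - (beta - 0.3) ** 2 - 0.01 * (k - 3) ** 2


best_score, best_beta, best_k = grid_search(
    toy_accuracy, betas=[0.1, 0.3, 0.5, 0.7, 0.9], ks=[1, 3, 5, 7]
)
```

Replacing `toy_accuracy` with a function that evaluates the actual model on a validation split would turn the sketch into a real sweep. Selecting β and K on the test set would bias the reported accuracy.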
Qualitative analysis elucidates both the strengths and residual drawbacks of the method. In rare cases characterized by extreme semantic ambiguity (or where test-set classes are insufficiently represented among base-class prompts), both the base and CAKI-augmented models fail, confirming the dependence on the diversity and semantic representativeness of the class-specific prompt bank.
Figure 6: Qualitative failures (top) and successes (bottom) showing the impact of class-level prompt retrieval on fine-grained class discrimination and error correction.
Further, case studies illustrate how CAKI corrects base-model misclassifications in challenging, visually similar scenarios by emphasizing the discriminative potential of matching scores in the class-specific prediction fusion.
Figure 7: Visualization of initial, fine, and ensemble predictions with class-specific prompt weights, showcasing the correction of mispredictions via CAKI.
Implications and Outlook
CAKI provides compelling evidence that prompt learning benefits significantly from explicit class-level knowledge modeling, especially in low-shot and out-of-domain generalization settings. The plug-and-play design lowers barriers to integration, requiring only additional prompt storage and light retrieval operations. The results suggest future directions in making semantic knowledge banks even more robust, e.g., by leveraging LLM-based expansion of class semantics or designing entropy-regularized retrieval to mitigate semantic ambiguity.
This work raises the prospect of extending class-aware knowledge retrieval beyond recognition to structured prediction tasks and continual learning, and of integrating external knowledge sources (such as ontologies or LLM reasoning traces) into the prompt ensemble in a computationally efficient manner.
Conclusion
This paper demonstrates that a class-aware, memory-augmented prompt ensembling strategy, CAKI, consistently enhances vision-language model adaptation by modeling and retrieving class-level cues. Strong empirical gains are reported for few-shot and zero-shot base-to-novel splits, cross-domain transfer, and structured vision tasks. Analytical and qualitative studies indicate that the proposed mechanism achieves fine-grained discrimination that prior class-shared or instance-specific prompt designs do not provide, marking a significant step toward adaptive and scalable VLM adaptation (2605.05910).