Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model

Published 7 May 2026 in cs.CV | (2605.05910v1)

Abstract: Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at https://github.com/yjh576/CAKI.


Authors (7)

Summary

The paper presents the CAKI framework that injects class-specific knowledge into prompt learning for vision-language models.
It combines class-specific prompt generation with query-key prompt matching to sharpen fine-grained discrimination and improve robustness to domain shift.
Empirical results on 10 datasets show gains of typically 1–5% in harmonic mean accuracy over existing prompt learning methods, supporting CAKI’s plug-and-play use with them.

Class-aware Knowledge Injection for Prompt Learning in Vision-Language Models

Introduction

The manuscript "Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model" (2605.05910) proposes a prompt learning augmentation strategy for vision-language models (VLMs) such as CLIP. The core innovation is a Class-Aware Knowledge Injection (CAKI) framework that explicitly encodes, retrieves, and injects class-specific knowledge into the inference pipeline for both zero-shot and few-shot recognition regimes. This approach addresses a key limitation of current prompt learning strategies: they do not model class-specific associations at the prompt level, which restricts generalization and fine-grained discrimination, especially for downstream tasks with limited supervision or domain shift.

Methodology

CAKI is built around two modules: Class-specific Prompt Generation (CSPG) and Query-key Prompt Matching (QKPM).

Class-specific Prompt Generation (CSPG): For each class, class-aware prompt learning is performed using the available few-shot class-labelled samples. Each class is associated with a prompt (learned contextual vectors prepended or appended to class names), which is stored as a value in a key-value memory structure. The key is the text embedding of the class name computed by the (frozen) CLIP text encoder, ensuring compatibility with the pre-trained backbone and avoiding catastrophic forgetting.
Query-key Prompt Matching (QKPM): At test time, image features extracted from the visual encoder serve as queries. The coarse prediction from a base prompt learning method (e.g., CoOp) is used to generate similarity-based matching scores, which index the key-value cache for the top-$K$ semantically closest class-specific prompts. The retrieved top-$K$ prompts yield a set of class-aware predictions. The final output for a test sample is an ensemble of the coarse prediction and the fine-grained, class-aware predictions, weighted by the matching scores.
Figure 1: Workflow of CAKI, including class-specific prompt generation, cache memory, query-key prompt matching, and prediction ensembling.

This overall design ensures that general domain or coarse contextual knowledge (from base prompts) and fine-grained class-specific cues are aggregated adaptively, improving both in-domain and out-of-domain performance.
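The retrieve-then-ensemble flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `caki_predict`, `classify`, and `bank` are hypothetical, 2-D toy vectors stand in for CLIP embeddings, each bank value is represented by the per-class text features its prompt would produce, matching scores are computed as a direct query-key cosine similarity (a simplification of the coarse-prediction-based scores the text describes), and the fusion rule `(1 - beta) * coarse + beta * fine` is assumed.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(image_feat, text_feats, temperature=0.1):
    """Class probabilities from image-text cosine similarities (temperature is an assumption)."""
    return softmax([cosine(image_feat, t) for t in text_feats], temperature)

def caki_predict(image_feat, base_text_feats, bank, k=3, beta=0.3):
    """Hypothetical fusion of a coarse prediction with top-k class-aware predictions."""
    # Coarse prediction from the base prompt learner (stand-in for e.g. CoOp).
    coarse = classify(image_feat, base_text_feats)
    # Matching scores: query (image feature) against each key (class-name embedding).
    match = [cosine(image_feat, entry["key"]) for entry in bank]
    top = sorted(range(len(bank)), key=lambda i: match[i], reverse=True)[:k]
    # Normalize the retrieved scores so the fine-grained weights sum to 1.
    weights = softmax([match[i] for i in top])
    fine = [0.0] * len(coarse)
    for w, i in zip(weights, top):
        pred = classify(image_feat, bank[i]["value"])
        fine = [f + w * p for f, p in zip(fine, pred)]
    # Assumed fusion rule: beta trades off coarse against fine-grained knowledge.
    return [(1.0 - beta) * c + beta * f for c, f in zip(coarse, fine)]

# Toy data: three classes, 2-D features (illustrative values only).
class_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
bank = [
    {"key": class_keys[0], "value": [[1.0, 0.2], [0.0, 1.0], [-1.0, 0.0]]},
    {"key": class_keys[1], "value": [[1.0, 0.0], [0.2, 1.0], [-1.0, 0.0]]},
    {"key": class_keys[2], "value": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]},
]
image = [0.9, 0.5]
probs = caki_predict(image, class_keys, bank, k=2, beta=0.3)
```

Because both the coarse and the weighted fine-grained predictions are probability distributions, their convex combination is one as well, so `probs` can be read directly as class probabilities. Setting `beta=0.0` in this sketch recovers the base method's prediction unchanged.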

Experimental Results

CAKI is comprehensively evaluated across 10 standard image recognition datasets spanning fine-grained and distributional robustness benchmarks, employing 1-shot, 4-shot, and 16-shot base-to-novel and few-shot learning settings. Its compatibility and compositionality with state-of-the-art prompt learning, test-time adaptation, and feature adapter schemes are also demonstrated.

Key empirical findings:

CAKI achieves consistent absolute improvements (typically 1–5% harmonic mean accuracy) over existing prompt learning methods (CoOp, CoCoOp, MaPLe, PromptSRC, TCP, GalLoP), especially on base-to-novel and cross-domain generalization splits.
The ablation analysis indicates that both CSPG and QKPM are essential; using either in isolation yields marked accuracy degradation compared to their combination.
CAKI is agnostic to the choice of base prompt learning method, functioning as a plug-and-play enhancement.
CAKI not only improves closed-vocabulary classification but also extends to domain adaptation, semantic segmentation, and object detection tasks, indicating its portability.
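For reference, the harmonic mean reported in base-to-novel evaluation is conventionally computed from the base-class and novel-class accuracies. The helper below is a small sketch of that standard formula; the function name `harmonic_mean` and the example accuracies are illustrative and are not taken from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean H = 2 * B * N / (B + N) of base and novel accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0  # Avoid division by zero when both accuracies are zero.
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)

# Illustrative accuracies in percent (not results from the paper).
h = harmonic_mean(80.0, 70.0)
```

The harmonic mean never exceeds the arithmetic mean and is pulled toward the lower of the two accuracies, so a method cannot score well by improving base classes at the expense of novel ones.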

A salient analysis on the EuroSAT dataset demonstrates that class-aware prompt matching elevates class-specific prediction accuracy above the matching accuracy of coarse models, with particular benefits for ambiguous classes where domain-blind prompts typically underperform.

Figure 2: Comparison between the recognition and the matching accuracy of CAKI per-category on EuroSAT, highlighting CAKI's gains in resolving classes with visual overlap.

Visualizations further confirm the alignment between retrieved class-specific prompts and correct semantic classes, both for success and failure scenarios.

Figure 3: Visualization of matched samples and their class associations for test queries, depicting effective semantic retrieval and discriminative competence.

Parameter Analysis and Failure Cases

Hyperparameter studies on $\beta$ (coarse/fine knowledge trade-off) and $K$ (number of retrieved class-specific prompts) show that the best recognition accuracy is attained with moderate fusion (e.g., $\beta=0.3$) and small $K$ ($K=3$), as an overly large $K$ introduces semantic noise and overfitting.

Figure 4: Effect of $\beta$ weighting between coarse- and fine-grained predictions for base class accuracy.

Figure 5: Sensitivity of classification accuracy to the number of retrieved class-specific prompts ($K$).
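One intuition for the sensitivity to $K$ can be illustrated with a toy calculation. Assuming, hypothetically, that the weight of each retrieved prompt is its matching score normalized over the retrieved set, the share of weight held by the best-matching prompt shrinks as $K$ grows, leaving more room for weakly related prompts to influence the ensemble. The function `retrieval_weights` and the scores below are made up for illustration and do not reproduce the paper's weighting.

```python
def retrieval_weights(match_scores, k):
    """Normalize the k highest matching scores so they sum to 1 (assumed weighting)."""
    ranked = sorted(match_scores, reverse=True)[:k]
    total = sum(ranked)
    return [s / total for s in ranked]

# Hypothetical matching scores for five class-specific prompts.
scores = [0.60, 0.20, 0.10, 0.05, 0.05]

# Share of the ensemble weight held by the best-matching prompt for each K.
top1_share = {k: retrieval_weights(scores, k)[0] for k in (1, 3, 5)}
```

In this toy setting the top-ranked prompt holds all of the weight at $K=1$ and progressively less as more prompts are retrieved, which is consistent with the observation that a small $K$ works best.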

Qualitative analysis elucidates both the strengths and residual drawbacks of the method. In rare cases characterized by extreme semantic ambiguity (or where test-set classes are insufficiently represented among base-class prompts), both the base and CAKI-augmented models fail, confirming the dependence on the diversity and semantic representativeness of the class-specific prompt bank.

Figure 6: Qualitative failures (top) and successes (bottom) showing the impact of class-level prompt retrieval on fine-grained class discrimination and error correction.

Further, case studies illustrate how CAKI corrects base-model misclassifications in challenging, visually similar scenarios by emphasizing the discriminative potential of matching scores in the class-specific prediction fusion.

Figure 7: Visualization of initial, fine, and ensemble predictions with class-specific prompt weights, showcasing the correction of mispredictions via CAKI.

Implications and Outlook

CAKI provides compelling evidence that prompt learning benefits significantly from explicit class-level knowledge modeling, especially in low-shot and out-of-domain generalization settings. The plug-and-play design lowers barriers to integration, requiring only additional prompt storage and light retrieval operations. The results suggest future directions in making semantic knowledge banks even more robust, e.g., by leveraging LLM-based expansion of class semantics or designing entropy-regularized retrieval to mitigate semantic ambiguity.

This work raises the prospect of extending class-aware knowledge retrieval beyond recognition to structured prediction tasks and continual learning, and of integrating external knowledge sources (such as ontologies or LLM reasoning traces) into the prompt ensemble in a computationally efficient manner.

Conclusion

This paper demonstrates that a class-aware, memory-augmented prompt ensembling strategy, CAKI, consistently enhances vision-language model adaptation across the evaluated base methods by modeling and retrieving class-level cues. Strong empirical gains are reported for few-shot and zero-shot base-to-novel splits, cross-domain transfer, and structured vision tasks. Analytical and qualitative studies indicate that the proposed mechanism achieves fine-grained discrimination that prior class-shared or instance-specific prompt designs did not attain, marking a significant step toward adaptive and scalable VLM adaptation (2605.05910).
