
Class-Aware Prompting Strategy

Updated 28 November 2025
  • Class-aware prompting is a strategy that explicitly integrates class-level semantic information into model prompts to enhance specificity and adaptability.
  • It leverages learned soft prompts, discrepancy-aware mechanisms, and dynamic instance-based assignments to boost performance in vision-language, continual learning, and adversarial settings.
  • Empirical studies demonstrate significant gains, such as improved accuracy and robustness, across diverse applications like few-shot segmentation and data-free quantization.

A class-aware prompting strategy is a principled approach in modern neural architectures that explicitly infuses class- or task-level semantic information into prompts during training, adaptation, or inference. In contrast to universal or mixed prompts, class-aware prompts are parametrized, structured, or dynamically selected based on the specific class label, task, or instance, enabling the model to focus attention, align modalities, or adapt more effectively. Class-aware prompting has played a central role in recent advances across vision-language modeling, continual learning, adversarial robustness, few-shot segmentation, data-free quantization, audio-visual learning, and medical imaging. This article synthesizes methodologies, mathematical formulations, and empirical evidence on class-aware prompting from recent arXiv research.

1. Core Methodologies for Class-aware Prompting

Class-aware prompting manifests through several key design patterns:

  • Learned Soft or Continuous Prompts per Class/Task: Prompts consist of learnable vectors prepended or injected into the model (e.g., at input, across transformer layers). LASP optimizes soft prompts to align with hand-crafted textual descriptions for each class using a dedicated text-to-text cross-entropy loss, thereby regularizing prompt learning and enabling explicit class-conditional generalization (Bulat et al., 2022). Similarly, class-specific phase and amplitude prompts in PAP are learned per class and injected via DFT manipulation for adversarial defense (Xu et al., 6 Feb 2025).
  • Discrepancy-aware and Subcategory-specific Prompting: Fine-grained classification requires prompts that highlight subcategory-level differences. MP-FGVC introduces subcategory-specific vision prompts (SsVP)—parameter-free patch selection mechanisms—and discrepancy-aware text prompts (DaTP) with learnable context tokens that express inter-class differences, both fused in a dedicated vision-language fusion module (Jiang et al., 2023).
  • Dynamic and Instance-aware Prompt Assignment: In MCIL settings, static prompting over- or under-adapts across the sample/task spectrum. The IAP framework introduces IA-GP (per-layer, per-instance gating) and IA-CDDP (two-stage, Gaussian-based confidence scoring) to modulate prompt strength and selection in response to both class and instance distribution (Fu et al., 26 Mar 2025).
  • Prompt Generation via Class Mixup or Conditioning: In PTQ without real data, mixup-class prompting constructs synthetic text prompts by fusing labels at the prompt level (e.g., "a photo of a cat and a dog"), leading to diverse, in-distribution samples and more robust quantization (Park et al., 29 Jul 2025). Class-conditional prompting in audio-visual segmentation leverages learned generative priors (e.g., GMMs over mask embeddings) to sample queries explicitly tied to each class, stabilizing matching and attention (Chen et al., 7 Jul 2024).
  • Task/Ordinal-aware and Incremental Prompting: In continual learning, INCPrompt utilizes task-specific prompts alongside a shared adaptive key-learner, isolating old and new knowledge and mitigating interference (Wang et al., 22 Jan 2024). For ordinal regression, CLIP-DR introduces ranking-aware prompts encoded as learnable context concatenated to class labels, together with chain-based ranking constraints reflecting class order (Yu et al., 4 Jul 2024).
  • Cross-modal Prompt Initialization and Enhancement: Few-shot segmentation frameworks like PAT initialize prompts by encoding the target class name using a frozen text encoder, followed by cross-modal transfer and specialization (via SPT/PMG) to refine prompts toward class-centric visual concepts (Bi et al., 16 Sep 2024).
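The first pattern above (learned soft prompts per class) can be sketched in a few lines. This is a minimal NumPy illustration under assumptions of ours; the class name, shapes, and initialization are hypothetical and not taken from any cited method:

```python
import numpy as np

class ClassPromptBank:
    """Minimal class-aware soft-prompt bank: one learnable prompt
    matrix per class, prepended to the input token sequence.
    Names, shapes, and init scale are illustrative only."""

    def __init__(self, num_classes, prompt_len, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One (prompt_len, dim) learnable matrix per class.
        self.prompts = rng.normal(0.0, 0.02, (num_classes, prompt_len, dim))

    def apply(self, tokens, class_id):
        """Prepend the class-specific prompt to a (seq_len, dim) token array."""
        return np.concatenate([self.prompts[class_id], tokens], axis=0)

bank = ClassPromptBank(num_classes=10, prompt_len=4, dim=16)
x = np.zeros((8, 16))            # dummy token sequence
out = bank.apply(x, class_id=3)
print(out.shape)                 # (12, 16)
```

In a real system these prompt matrices would be optimized by backpropagation; the sketch only shows the class-conditional selection and injection point.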

2. Mathematical Formulations

The mathematical backbone of class-aware prompting spans several modalities and architectures. Key formalizations include:

  • Text-to-Text Cross-Entropy for Prompt Regularization. For class $c$ and template $l$, the soft prompt embedding $t^r$ is forced (via a cross-entropy loss) to match the embedding $t_c^{h,l}$ of its hand-crafted textual template:

$$P_{rh}^{l}(y \mid t^r) = \frac{\exp\!\left(\cos(t_y^{h,l}, t^r)/\tau\right)}{\sum_{c'=1}^{C} \exp\!\left(\cos(t_{c'}^{h,l}, t^r)/\tau\right)}$$

The full objective averages over the $L$ templates and all classes (Bulat et al., 2022).
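For one class and one template, the probability above reduces to a cosine-similarity softmax followed by negative log-likelihood. A minimal numerical sketch (function and variable names are ours, not from the LASP code):

```python
import numpy as np

def lasp_text_loss(t_r, t_hand, y, tau=0.07):
    """-log P(y | t_r) under a cosine-similarity softmax, as in the
    equation above. t_r: (D,) learned soft-prompt embedding;
    t_hand: (C, D) hand-crafted template embeddings, one per class;
    y: index of the true class."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(t_c, t_r) / tau for t_c in t_hand])
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[y])
```

When the learned embedding aligns with its own class template, the loss is near zero; misalignment is penalized, which is the regularization effect described above.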

  • Gaussian-based Confidence Scoring for Instance-aware Prompting. Given stored task- or class-level means and covariances $(\mu_i, \Sigma_i)$, test features are scored and thresholded for prompt weighting:

$$E'_i = \log \varphi\!\left(F_v(\hat{x});\, \mu_i, \Sigma_i\right)$$

The weight $\hat{E}$ is dynamically assigned via sigmoid/log-likelihood thresholds and class-specific averaging (Fu et al., 26 Mar 2025).
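The scoring step is a multivariate-Gaussian log-density evaluated on the extracted feature ($\varphi$ is the Gaussian density, $F_v$ the visual encoder). A sketch, with the sigmoid gating simplified to a single threshold of our choosing:

```python
import numpy as np

def gaussian_confidence(f, mu, sigma):
    """Log-density score E'_i = log N(f; mu_i, Sigma_i) for one stored
    task/class distribution. f: (D,) feature; mu: (D,); sigma: (D, D)."""
    d = f.shape[0]
    diff = f - mu
    inv = np.linalg.inv(sigma)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ inv @ diff)

def prompt_weight(scores, threshold):
    """Illustrative sigmoid gating of prompt strength from the best
    log-likelihood score; the real IA-CDDP weighting is more involved."""
    best = max(scores)
    return 1.0 / (1.0 + np.exp(-(best - threshold)))
```

Features near a stored class distribution score high and receive strong prompting; out-of-distribution features are down-weighted.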

  • DFT-Domain Class-aware Prompting. Each input $x$ receives a phase-level and an amplitude-level prompt vector per class; the prompted image for class $y$ is:

$$x^p = F^{-1}\!\left(\phi_x + p_{\phi_y},\; \xi_x + w \cdot p_{\xi_y}\right)$$

Here $\phi_x, \xi_x$ are the DFT phase and amplitude of $x$; prompts are weighted via batch-wise robust-accuracy statistics (Xu et al., 6 Feb 2025).
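The forward/inverse transform pair can be sketched with NumPy's 2-D FFT; prompt shapes and the scalar weighting here are illustrative assumptions, not the paper's exact parametrization:

```python
import numpy as np

def apply_spectral_prompt(x, phase_prompt, amp_prompt, w=1.0):
    """DFT-domain prompting sketch: shift the phase and amplitude of an
    image (H, W) by class-specific prompts, then invert the transform."""
    X = np.fft.fft2(x)
    amp, phase = np.abs(X), np.angle(X)
    # Recombine the prompted amplitude and phase, then return to pixels.
    X_p = (amp + w * amp_prompt) * np.exp(1j * (phase + phase_prompt))
    return np.real(np.fft.ifft2(X_p))
```

With zero prompts the operation is the identity, which makes the injection easy to sanity-check before training the prompt vectors.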

  • Prompt Selection via Key Similarity. In class-incremental settings, task keys $K_s$ are compared to current token representations to select which prompt group to apply at inference:

$$\hat{s} = \arg\max_{s \le t} \, \mathrm{sim}\!\left(K_s(T_k),\, T_k\right)$$

(Wang et al., 22 Jan 2024).
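With cosine similarity as $\mathrm{sim}$ and the keys treated as fixed vectors (a simplification of the key-learner above), the selection rule is a one-liner:

```python
import numpy as np

def select_prompt_group(keys, query):
    """Pick the task whose key is most similar (cosine) to the current
    token representation. keys: (T, D), one key per seen task;
    query: (D,) pooled token representation. Names are illustrative."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    sims = np.array([cos(k, query) for k in keys])
    return int(np.argmax(sims))
```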

  • Ranking-aware Pairwise Loss for Ordinal Classes. For an image $x_i$ of class $j$, the loss enforces that $\tilde{s}_{i,j}$ ranks above all adjacent classes in both directions; the rightward term is:

$$L^i_{\text{rightward}} = -\sum_{r=j}^{K-1} \log \frac{\exp(\tilde{s}_{i,r}/\tau)}{\exp(\tilde{s}_{i,r}/\tau) + \exp(\tilde{s}_{i,r+1}/\tau)}$$

The total loss is combined with a main CLIP-style matching loss (Yu et al., 4 Jul 2024).
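The rightward term is a chain of pairwise binary softmaxes over neighboring classes. A sketch with 0-indexed classes (so the sum runs while a right neighbor exists); the leftward term would mirror it:

```python
import numpy as np

def rightward_loss(s, j, tau=1.0):
    """Rightward ranking term from the equation above.
    s: (K,) similarity scores of one image against the K ordinal classes;
    j: index of the true class. Each class r >= j must outrank r+1."""
    K = len(s)
    loss = 0.0
    for r in range(j, K - 1):
        a, b = np.exp(s[r] / tau), np.exp(s[r + 1] / tau)
        loss -= np.log(a / (a + b))
    return loss
```

Scores that decrease monotonically to the right of the true class incur low loss; inverted orderings are penalized, which encodes the ordinal structure.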

3. Architectural Integration and Operational Flow

The design and application of class-aware prompts are differentiated by their point of insertion and operational flow:

  • Transformer-based Architectures: Prompts are often appended to the input token stream or injected in attention mechanisms (concatenation with key and value matrices). In the case of zero-parameter prompt selectors (e.g., SsVP in MP-FGVC), subcategory-specific spatial tokens are selected at intermediate transformer blocks.
  • Dynamic Selection and Weighting: Instance-aware frameworks dynamically select which prompts to use (IA-GP) and/or how much to weigh them (IA-CDDP), ensuring adaptability and selective plasticity. Virtual classes—in LASP-V—are incorporated by extending the prompt set and loss to class names without requiring corresponding visual samples (Bulat et al., 2022).
  • Cross-modal Prompt Initialization: Many recent pipelines begin by encoding textual class names (via CLIP or similar) and projecting into the visual token space, ensuring prompts are semantically meaningful upon initialization (Bi et al., 16 Sep 2024).
  • Prompt Enhancement and Specialization: Refinement stages using masked attention, semantic transfer (SPT), or part mask generators (PMG) adapt prompts to the particular appearance and structure of classes and even parts (PAT) (Bi et al., 16 Sep 2024).
  • Contrastive and Cross-modal Learning Objectives: In multimodal and segmentation tasks, cross-modal fusion modules align vision and language/text prompt spaces using attention and contrastive losses, ensuring jointly informative, class-aware representations (Jiang et al., 2023, Chen et al., 7 Jul 2024).
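The first integration point above, prompt tokens concatenated with the key and value matrices of attention, can be sketched in single-head form; the NumPy implementation and all names are illustrative, not any cited architecture:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prompted_attention(q, k, v, prompt_k, prompt_v):
    """Single-head attention with class-aware prompt tokens concatenated
    to keys and values. q: (N, D) queries; k, v: (M, D); prompt_k,
    prompt_v: (P, D) learnable prompt tokens for the selected class."""
    K = np.concatenate([prompt_k, k], axis=0)   # (P + M, D)
    V = np.concatenate([prompt_v, v], axis=0)
    attn = softmax(q @ K.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ V                              # (N, D)
```

Because the prompts enter only through K and V, the sequence length of the output is unchanged, which is why this injection style composes cleanly with frozen backbones.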

4. Empirical Performance and Ablation Studies

Empirical evaluation repeatedly confirms the superiority of class-aware prompting strategies across diverse settings:

| Paper / Task | Core Metric | Class-aware gain | Reference |
|---|---|---|---|
| LASP, few/zero-shot CLIP | Novel-class accuracy | up to +11% (EuroSAT); +2.6% avg; +1.9% with virtual classes | (Bulat et al., 2022) |
| INCPrompt, CIL | CIFAR-100 accuracy | 85.03% (vs. 84.01% prior SOTA) | (Wang et al., 22 Jan 2024) |
| MP-FGVC, fine-grained classification | Top-1 | 91.8% (CUB); +0.7% class-aware vs. single-modal | (Jiang et al., 2023) |
| IAP, MCIL | Transfer | +1.8 pt (69.2 vs. 67.4); +1.1 "Avg" with IA-CDDP | (Fu et al., 26 Mar 2025) |
| PAP, adversarial defense | AA robust accuracy | 37.3% (vs. 0.6% class-agnostic); +5.5% over SOTA | (Xu et al., 6 Feb 2025) |
| PAT, FSS (1-shot) | mIoU | 71.66% (vs. 68.83% previous SOTA) | (Bi et al., 16 Sep 2024) |
| Mixup-class prompting, DFQ | PTQ accuracy (W2A4) | +0.18–0.43% over GenQ; closes the gap to real data | (Park et al., 29 Jul 2025) |
| CLIP-DR, DR grading | Average AUC | 0.827 (+3–4 points over prior SOTA) | (Yu et al., 4 Jul 2024) |
| CPM, AV segmentation | mIoU | +1.9–4.8%; reduces Hungarian-assignment entropy | (Chen et al., 7 Jul 2024) |

Ablation studies consistently show performance drops when class-aware constraints, prompt grouping, instance-adaptive weighting, or the use of virtual/conditional prompts are removed, confirming their non-trivial contribution across modalities and architectures. Prompt selection based on preliminary label predictions achieves near-optimal robustness with negligible overhead compared to “try-all-prompts” baselines (Xu et al., 6 Feb 2025). In diffusion-based PTQ, synthesizing images from mixup-class prompts simultaneously improves generalization error bounds, converges with lower gradient norms, and leads to more stable calibration (Park et al., 29 Jul 2025).

5. Applications Across Modalities and Learning Paradigms

  • Vision-Language Adaptation: Language-aware soft prompting and virtual-class injection improve downstream adaptation and zero-shot generalization for VLMs like CLIP (Bulat et al., 2022).
  • Continual and Incremental Learning: Task- and class-driven prompt separation prevents catastrophic forgetting and stabilizes knowledge transfer without rehearsal (Wang et al., 22 Jan 2024, Fu et al., 26 Mar 2025).
  • Adversarial Robustness: Explicit modeling of spectral semantics at the class level, together with adaptive prompt weighting, enhances robustness of both naturally and adversarially trained models (Xu et al., 6 Feb 2025).
  • Fine-grained Recognition and Segmentation: Subcategory-sensitive vision/text prompts and cross-modal fusion mechanisms enable models to accentuate subtle inter-class differences or precisely localize object parts in segmentation (Jiang et al., 2023, Bi et al., 16 Sep 2024).
  • Data-free Quantization: Prompt engineering via class label fusion produces synthetic datasets with lower polysemy and tighter generalization, directly improving PTQ (Park et al., 29 Jul 2025).
  • Multi-modal Segmentation and Ranking Tasks: Class-conditional query assignment (audio-visual) and ranking-aware prompts (medical grading) address uncertainty in cross-modal alignment and capture ordinal relationships, respectively (Chen et al., 7 Jul 2024, Yu et al., 4 Jul 2024).

6. Limitations and Future Directions

While class-aware prompting substantially advances model specialization and transfer, current strategies may encounter bottlenecks related to prompt scalability as the number of classes grows, especially with generative prior modeling (e.g., GMMs in CPM (Chen et al., 7 Jul 2024)). The robustness of class-aware prompt selection in regimes with heavy semantic overlap (e.g., fine-grained tasks) or long-tailed distributions necessitates further research on prompt regularization, dynamic scheduling, and natural-language prompt distillation. The integration of spatial reasoning, especially in multimodal or audio-visual domains, and the extension to open-vocabulary and open-set transfer tasks represent active directions. Finally, ablation studies suggest that best performance generally results from architectures that combine class-aware prompting with parameter-efficient adaptation, cross-modal initialization, and instance-aware refinement, motivating further synthesis of these principles.


Class-aware prompting has thus emerged as a foundational paradigm for adaptation, robustness, and efficiency in modern neural networks, supported by diverse, rigorous evidence across vision, language, and multi-modal tasks (Bulat et al., 2022, Fu et al., 26 Mar 2025, Xu et al., 6 Feb 2025, Wang et al., 22 Jan 2024, Jiang et al., 2023, Bi et al., 16 Sep 2024, Park et al., 29 Jul 2025, Chen et al., 7 Jul 2024, Yu et al., 4 Jul 2024).
