CLS Token Attention Steering Prompts (CASP)
- CLS Token Attention Steering Prompts (CASP) are techniques that introduce trainable biases and prompt vectors to explicitly control the aggregation role of the [CLS] token in Transformers.
- They employ strategies like steering bias injection, capsule prompt generation, and multi-CLS approaches to refine attention filters, enhancing discrimination and mitigating catastrophic forgetting.
- CASP methods achieve state-of-the-art performance in few-shot class-incremental learning while maintaining high parameter efficiency, robust transfer, and improved task adaptation.
CLS Token Attention Steering Prompts (CASP) are algorithmic mechanisms for explicitly controlling and enhancing the contribution of the global [CLS] token in Transformer-based models, by introducing trainable steering biases or specialized prompt vectors at key positions in the network's attention structure. Originating in recent work on few-shot class-incremental learning (FSCIL), visual prompt tuning, and prompt-efficient learning, CASP techniques extend to image, language, and multimodal models. The aim is to directly modulate the self-attention dynamics surrounding aggregation tokens such as [CLS], thereby improving discrimination, transfer, and robustness while maintaining high parameter efficiency and mitigating catastrophic forgetting (Huang et al., 23 Jan 2026, Liu et al., 19 Oct 2025, Liu et al., 5 May 2025).
1. Architectural Principles of CASP
CASP methods architecturally center on manipulating the processing of the [CLS] or global aggregation token in Transformer layers. In canonical Vision Transformers (ViT), the [CLS] token is responsible for aggregating sequence-wide semantic information through self-attention. CASP augments this process via the following interventions:
- Steering Bias Injection: Introduces trainable D-dimensional prompts p_Q, p_K, p_V, added to the query, key, and value (Q/K/V) projections of the [CLS] token in every Transformer block. For a given block with [CLS] hidden state h_cls and frozen projections W_Q, W_K, W_V:

  Q_cls = W_Q h_cls + p_Q,  K_cls = W_K h_cls + p_K,  V_cls = W_V h_cls + p_V.

  These biases redirect the [CLS] token's attention filter towards task-relevant features, enabling explicit control over what information is emphasized or suppressed (Huang et al., 23 Jan 2026).
- Instance- and Layer-Adaptive Capsule Prompts: Certain implementations substitute the fixed CLS token with a dynamic "capsule prompt" at each layer l, obtained as the sum of a trainable task prompt p^(l) and the instance-adaptive mean of that layer's N token representations h_i^(l):

  c^(l) = p^(l) + (1/N) Σ_i h_i^(l).

  This forms an "attention anchor", concentrating bidirectional attention between the prompt and all sequence tokens (Liu et al., 19 Oct 2025).
- CLS/Image Prompt Disentanglement: Visual prompt coordination frameworks further split prompts into CLS prompts (steering only the [CLS]) and image prompts (dedicated to local patch tokens), running separate brief self-attention passes to specialize updates for each role (Liu et al., 5 May 2025).
- Multi-CLS Token Approaches: For weakly supervised segmentation, multiple class-specific CLS tokens are introduced, one for each possible class, and CASP mechanisms promote one-to-one assignments between tokens and semantic classes via random masking (Hanna et al., 9 Jul 2025).
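The steering-bias intervention above can be sketched in a few lines: trainable bias vectors are added only to the [CLS] row of the Q/K/V projections in an otherwise frozen attention block. This is an illustrative single-head numpy sketch (the function and variable names are my own, not the published implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def steered_cls_attention(X, Wq, Wk, Wv, p_q, p_k, p_v):
    """Single-head self-attention where trainable steering biases
    (p_q, p_k, p_v) are added only to the [CLS] token (row 0).
    X: (N, D) token sequence with [CLS] at position 0.
    Wq/Wk/Wv: frozen (D, D) projections; p_*: trainable (D,) biases."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Q[0] += p_q  # steers what [CLS] attends to
    K[0] += p_k  # steers how other tokens attend to [CLS]
    V[0] += p_v  # steers what [CLS] contributes when attended to
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]))
    return A @ V, A

rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
zeros = np.zeros(D)
out0, A0 = steered_cls_attention(X, Wq, Wk, Wv, zeros, zeros, zeros)
# Biasing only the [CLS] query changes row 0 of the attention map
# (the [CLS] aggregation pattern) while leaving all other rows intact.
out1, A1 = steered_cls_attention(X, Wq, Wk, Wv, rng.normal(size=D), zeros, zeros)
```

Because the backbone weights stay frozen and only the D-dimensional bias vectors are trained, the per-block parameter cost is just 3D, which matches the parameter-efficiency claims in Section 4.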
2. Attention Steering and Perturbation Strategies
CASP systems are augmented by perturbative and regularization mechanisms to ensure robust transfer and generalization:
- Attention Perturbation: During training, steering prompts p are stochastically perturbed with dropout or additive Gaussian noise, yielding perturbed prompts p̃:

  p̃ = Dropout(p)  or  p̃ = p + ε,  ε ~ N(0, σ²I).

  This regularization requires the model to learn attention filters that are stable under perturbation, ensuring that representations do not overfit to the base-session classes (Huang et al., 23 Jan 2026).
- Dynamic Inference-Time Steering: In LLMs, attention steering can be applied at inference time, enforcing a user-specified minimum attention mass α on certain tokens (e.g., instruction spans or [CLS]) by adaptively biasing the raw attention logits: a bias β is added to the logits of the emphasized token set S, with β chosen so that the post-softmax mass Σ_{j∈S} softmax(z + β·1_S)_j reaches at least α. This adaptive scheme enables plug-and-play emphasis of regions or instructions without retraining (Venkateswaran et al., 17 May 2025).
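One way to implement such minimum-mass steering exploits a standard softmax identity: adding a scalar bias β to the logits of a token set with current mass m rescales that mass to m·e^β / (m·e^β + 1 − m), so β can be solved for in closed form. A minimal sketch, with illustrative names (the cited method's exact biasing rule may differ):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def steer_attention(logits, span, alpha):
    """Bias the raw attention logits of `span` so its post-softmax mass
    is at least `alpha`. Adding beta to the span's logits scales its
    mass m to m*e^b / (m*e^b + 1 - m); solving m' = alpha for beta
    gives the closed form below. A no-op if the span already has
    enough mass."""
    p = softmax(logits)
    m = p[span].sum()
    if m >= alpha:
        return p
    beta = np.log(alpha * (1 - m) / (m * (1 - alpha)))
    steered = logits.copy()
    steered[span] += beta
    return softmax(steered)

logits = np.array([2.0, 0.5, -1.0, 0.3, 0.0])
span = [2, 3]                      # e.g. an instruction span to spotlight
att = steer_attention(logits, span, alpha=0.5)
```

Because the bias is applied per attention distribution at inference time, the scheme needs no gradient updates, matching the plug-and-play claim above.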
3. Feature Generalization via Token Mixup and Prompt Matching
CASP solutions integrate explicit feature augmentation strategies to simulate future-task variability:
- Manifold Token Mixup: Synthetic "virtual" classes are generated by convexly interpolating shallow token sets x_i, x_j from the base session:

  x̃ = λ x_i + (1 − λ) x_j,  λ ~ Beta(a, a).

  These mixed features are propagated through the transformer, and a mixed loss is applied to regularize the network toward smoother decision boundaries and reserved latent capacity (Huang et al., 23 Jan 2026).
- Prompt Matching Functions: For image tokens, CASP-linked frameworks can match tokens to prompt pools via cosine similarity, assigning prompt vectors to those most semantically aligned, resulting in more precise, diverse feature extraction (Liu et al., 5 May 2025).
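The token mixup step can be sketched as follows, assuming standard mixup conventions (a Beta-sampled interpolation coefficient and a correspondingly mixed soft label); the cited work's exact layer choice and loss weighting may differ:

```python
import numpy as np

def token_mixup(tokens_a, tokens_b, label_a, label_b, num_classes,
                a=0.2, rng=None):
    """Convexly interpolate two shallow-layer token sets to synthesize
    a 'virtual' class sample. tokens_*: (N, D) token matrices taken
    from the same layer; labels are integer class ids. Returns the
    mixed tokens, a soft label for the mixed loss, and lambda."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(a, a)
    mixed = lam * tokens_a + (1 - lam) * tokens_b
    y = np.zeros(num_classes)
    y[label_a] += lam          # soft label mirrors the interpolation
    y[label_b] += 1 - lam
    return mixed, y, lam

rng = np.random.default_rng(1)
ta, tb = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
mixed, y, lam = token_mixup(ta, tb, 0, 3, num_classes=10, rng=rng)
# `mixed` lies on the segment between ta and tb, and is then propagated
# through the (frozen) transformer as a virtual-class sample.
```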
4. Training Protocols, Efficiency, and Empirical Results
CASP is characterized by high efficiency with respect to trainable parameters and compute overhead:
- Parameter Isolation: Only the steering bias vectors (and any domain adaptation prompt) are updated during base training; all backbone parameters remain frozen, and no fine-tuning is performed during incremental phases (Huang et al., 23 Jan 2026).
- Prototype-Based Incremental Learning: In FSCIL, new class prototypes are constructed as means in CLS-embedding space and appended to an expanding classifier bank. The backbone is fixed for all future sessions, preventing catastrophic forgetting (Huang et al., 23 Jan 2026).
- Efficiency Metrics: CASP (ViT-B/16, CUB200) incurs only 0.1MB additional parameter cost—substantially lower than competing prompt, adapter, or privilege-based PEFT methods. It also avoids the grid search over prompt length required by alternative prompt-tuning approaches (Huang et al., 23 Jan 2026, Liu et al., 19 Oct 2025).
| Method | Trainable Params (MB) | CUB200 A_avg (%) | CIFAR100 A_avg (%) | ImageNet-R A_avg (%) |
|---|---|---|---|---|
| CASP | 0.1 | 86.4 | 90.2 | 77.4 |
| SEC-F | 1.7 | 84.9 | — | 76.7 |
| ASP | 3.0 | — | — | — |
| PriViLege | 16.3 | — | — | — |
CASP exhibits reduced base-to-last session forgetting (A_B – A_L), indicating strong preservation of previously learned representations (Huang et al., 23 Jan 2026).
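The prototype-based incremental protocol described above amounts to a growing bank of class means over frozen-backbone [CLS] embeddings, with cosine-similarity nearest-prototype classification. A minimal sketch (class and method names are illustrative):

```python
import numpy as np

class PrototypeBank:
    """Expanding classifier for FSCIL: each new class contributes the
    mean of its [CLS] embeddings (from a frozen backbone); prediction
    is cosine-similarity nearest prototype. Since neither the backbone
    nor earlier prototypes are updated, old classes are never
    disturbed -- the mechanism behind low A_B - A_L forgetting."""
    def __init__(self):
        self.prototypes = []   # one unit-norm vector per class

    def add_class(self, cls_embeddings):
        proto = cls_embeddings.mean(axis=0)
        self.prototypes.append(proto / np.linalg.norm(proto))

    def predict(self, cls_embedding):
        q = cls_embedding / np.linalg.norm(cls_embedding)
        sims = np.stack(self.prototypes) @ q   # cosine similarities
        return int(np.argmax(sims))

# Toy demo: three incremental "sessions", each adding one class
# whose embeddings cluster around a distinct center.
rng = np.random.default_rng(2)
centers = np.eye(3, 16) * 4.0
bank = PrototypeBank()
for c in centers:
    bank.add_class(c + rng.normal(size=(5, 16)) * 0.5)
pred = bank.predict(centers[1])
```

Adding a class is O(1) in stored parameters (one D-dimensional vector), which is consistent with the parameter-isolation protocol above.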
5. Application Domains and Variants
CASP has been deployed across a diverse range of tasks:
- Few-Shot Class-Incremental Learning (FSCIL): Attains state-of-the-art accuracy in few-shot increments on CUB200, CIFAR100, and ImageNet-R without incremental fine-tuning (Huang et al., 23 Jan 2026).
- Prompt-Efficient Fine-Tuning: Capsule prompt variants show that a single adaptive vector per layer matches or surpasses full fine-tuning on NLP benchmarks such as SuperGLUE, using only 0.004% of model parameters—markedly outperforming other PEFT techniques (Liu et al., 19 Oct 2025).
- Visual Prompt Specialization: Token-Coordinated Prompt Attention demonstrates that separating CLS and patch prompts improves class discrimination and feature diversity for vision tasks (Liu et al., 5 May 2025).
- Weakly Supervised Semantic Segmentation: Multiple CLS token CASP instantiations enable pseudo-mask extraction with competitive mIoU, outperforming multi-stage and CAM-based WSSS baselines (Hanna et al., 9 Jul 2025).
- Instruction-Following in LLMs: Dynamic inference-time CASP enables users to "spotlight" instructions or segments, improving syntactic and multi-turn instruction-following robustness (Venkateswaran et al., 17 May 2025).
6. Broader Context, Extensions, and Limitations
CASP conceptualizes the [CLS] token as a human-like attention filter: its steering prompts allow dynamic, task-aware re-weighting of the attention field in a model. This motivates several directions:
- Multimodal and Task-Adaptive Extensions: The primary results are in unimodal vision models and LLMs; future CASP research may focus on cross-modal transformers, variable-way FSCIL, or patch-level manifold mixup approaches (Huang et al., 23 Jan 2026).
- CLS Token Specialization: Disentangling CLS from local (patch or word) tokens and providing dedicated biasing has empirically been shown to yield +0.7–1.5 pp gains over single-prompt or undifferentiated prompt methods (Huang et al., 23 Jan 2026, Liu et al., 5 May 2025).
- Limitations: All known CASP implementations depend on clear summary (CLS) tokens; performance can degrade if steering is applied too early/late in deep networks (Wu et al., 2023) or if prompts overfit to the base data distribution (mitigated by perturbations and mixup).
In summary, CLS Token Attention Steering Prompts provide a paradigm for principled, parameter-efficient global attention modulation within transformers. By focusing on explicit control of the [CLS] aggregation mechanism—and coupling this with prompt perturbation and mixup—they consistently outperform previous methods in transfer, adaptation, and robust continual learning benchmarks, with high computational efficiency and generalizability across modalities (Huang et al., 23 Jan 2026, Liu et al., 19 Oct 2025, Liu et al., 5 May 2025, Venkateswaran et al., 17 May 2025, Hanna et al., 9 Jul 2025, Wu et al., 2023, Mao et al., 2022).