SToRI: Semantic Token Reweighting in CLIP
- The paper introduces semantic token reweighting in CLIP’s text encoder to assign nonuniform weights to tokens, enhancing discriminative representation.
- It modifies the self-attention mechanism by scaling attention numerators, preserving human interpretability while achieving performance gains in few-shot classification and retrieval.
- Experimental evaluations demonstrate improved accuracy and sharper attribute emphasis with minimal changes to the pretrained CLIP model.
Semantic Token Reweighting in CLIP (SToRI) is a framework designed to enhance the interpretability and controllability of text embeddings within CLIP-based vision-language models (VLMs). By introducing explicit, semantically-motivated token weighting into the self-attention mechanism of CLIP’s text encoder, SToRI enables fine-grained emphasis on discrete textual elements, reflecting either data-driven discrimination or user-driven preference. This approach addresses a significant limitation in CLIP’s default uniform token treatment, providing new avenues for both downstream performance gains and human-centric interpretability in applications such as few-shot image classification and attribute-centric retrieval (Kim et al., 2024).
1. Motivation and Conceptual Foundations
The standard CLIP text encoder, built on a multi-layer Transformer, processes prompts by generating per-token embeddings and collapsing these into a single vector representation. In this process, all tokens—regardless of semantic saliency—are treated uniformly. However, natural language is inherently asymmetric in information content: some tokens (e.g., “yellow” in “a large owl with big yellow eyes”) are more diagnostic than others.
SToRI introduces the principle of semantic token reweighting, which assigns a nonnegative scalar weight to each token in the input sequence. This reweighting is performed within the self-attention mechanism, remaining entirely in “token space” and preserving alignment with human-interpretable linguistic elements. The goals are threefold:
- Improve text embedding representativeness for specific image distributions (data-driven).
- Enable end-user control and transparency over textual emphasis (user-driven).
- Surface which tokens contribute most to downstream decisions (interpretability).
SToRI distinguishes itself from prompt tuning and residual adapters, which act outside the attention mechanism and yield less interpretable, dense parameter spaces.
2. Mathematical Formulation and Architectural Integration
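Given an input sequence of $n$ tokens, SToRI assigns each token $j$ a nonnegative weight $w_j$ (default $w_j = 1$). In each modified Transformer self-attention block, starting from a chosen layer, the attention coefficient between query token $i$ and key token $j$ changes from the standard:
$$a_{ij} = \frac{\exp\left(q_i^\top k_j / \sqrt{d}\right)}{\sum_{j'} \exp\left(q_i^\top k_{j'} / \sqrt{d}\right)}$$
to the SToRI form:
$$\hat{a}_{ij} = \frac{w_j \exp\left(q_i^\top k_j / \sqrt{d}\right)}{\sum_{j'} w_{j'} \exp\left(q_i^\top k_{j'} / \sqrt{d}\right)}$$
Here $q_i$ and $k_j$ are the query and key vectors and $d$ is the per-head dimension. Setting every $w_j = 1$ recovers standard attention.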
This reweighting operates on the values and keys associated with each token, providing a direct and interpretable control knob at the level of individual words or subwords.
Implementation involves scaling only the attention numerators by per-token scalars, followed by the usual normalization. No new attention heads or adapters are introduced, and the rest of the CLIP encoder remains frozen. In practice, reweighting is initiated at an intermediate Transformer block (out of 12 or 24 blocks for typical models), with negligible impact from the precise choice of starting block.
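The reweighted attention described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `reweighted_attention`, the toy score matrix, and the assumption that at least one weight is positive are all hypothetical choices made for the example.

```python
import math

def reweighted_attention(scores, weights):
    """Row-wise attention with per-key token weights (illustrative sketch).

    scores[i][j]: scaled dot product between query i and key j.
    weights[j]:   nonnegative weight of token j; 1.0 leaves it unchanged.
    Assumes at least one weight is positive so the row sum is nonzero.
    """
    attn = []
    for row in scores:
        m = max(row)  # subtract the row maximum for numerical stability
        numer = [w * math.exp(s - m) for s, w in zip(row, weights)]
        total = sum(numer)  # normalization happens after reweighting
        attn.append([x / total for x in numer])
    return attn

# Toy scores for 2 queries over 3 key tokens (made-up values).
scores = [[0.2, 1.0, -0.5],
          [0.0, 0.3, 0.8]]

uniform = reweighted_attention(scores, [1.0, 1.0, 1.0])     # plain softmax
emphasized = reweighted_attention(scores, [1.0, 3.0, 1.0])  # emphasize token 1
muted = reweighted_attention(scores, [1.0, 0.0, 1.0])       # remove token 1
```

With all weights equal to 1 the function reduces to an ordinary softmax, and each row still sums to 1 for any nonnegative weights. For strictly positive weights, multiplying the numerator by a weight is equivalent to adding its logarithm to the attention logit, which is one convenient way to apply the same idea inside an existing softmax call.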
3. Controllability and Interpretability Features
SToRI’s semantic token weights $w_j$ can be governed in two regimes:
- Data-driven control: The weights $w_j$ are optimized (under a positivity constraint) using a cross-entropy loss on a small labeled dataset, keeping all CLIP weights frozen. Only the weights of prompt tokens are learned, enabling few-shot adaptation and surfacing the semantic elements most correlated with discriminative performance.
- User-driven control: Users can manually assign $w_j$ to emphasize or downplay specific tokens, such as increasing the weight of a hair-color token to emphasize hair color, without training.
The ability to manipulate attention in token space ensures that these adjustments remain human-readable, contrasting with the opacity of tuned prompt vectors or adapters. Visualization of learned weights aligns with human intuition, e.g., “striped” for texture identification or class-contrastive attributes in CUB and DTD datasets.
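The data-driven regime can be illustrated with a deliberately tiny stand-in for CLIP. Everything in the sketch below is an assumption made for illustration: the 2-D token embeddings, the two toy "images", the pooling rule (a weight-normalized average, i.e. reweighted attention with equal scores), the `LOGIT_SCALE` value, the `exp` parameterization used to keep weights positive, and the finite-difference gradients. The source states only that the weights are trained with a cross-entropy loss under a positivity constraint while CLIP stays frozen.

```python
import math

# Toy frozen "encoder" outputs (hypothetical values).
PROMPTS = [
    [[1.0, 0.0], [0.7, 0.7]],  # class 0: distinctive token, generic token
    [[0.0, 1.0], [0.7, 0.7]],  # class 1: distinctive token, generic token
]
IMAGES = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]  # (unit-norm feature, label)
LOGIT_SCALE = 5.0

def text_embedding(tokens, theta):
    w = [math.exp(t) for t in theta]  # w = exp(theta) is always positive
    total = sum(w)
    pooled = [sum(wj * e[d] for wj, e in zip(w, tokens)) / total
              for d in range(2)]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled]

def loss(thetas):
    """Mean cross-entropy over the toy images."""
    texts = [text_embedding(tok, th) for tok, th in zip(PROMPTS, thetas)]
    total = 0.0
    for feat, label in IMAGES:
        logits = [LOGIT_SCALE * sum(a * b for a, b in zip(t, feat))
                  for t in texts]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[label]
    return total / len(IMAGES)

def train(steps=50, lr=0.5, h=1e-4):
    """Gradient descent on the log-weights using central differences."""
    thetas = [[0.0, 0.0], [0.0, 0.0]]  # all weights start at 1
    for _ in range(steps):
        grads = [[0.0, 0.0], [0.0, 0.0]]
        for c in range(2):
            for j in range(2):
                orig = thetas[c][j]
                thetas[c][j] = orig + h
                up = loss(thetas)
                thetas[c][j] = orig - h
                down = loss(thetas)
                thetas[c][j] = orig
                grads[c][j] = (up - down) / (2 * h)
        for c in range(2):
            for j in range(2):
                thetas[c][j] -= lr * grads[c][j]
    return thetas

learned = train()
weights = [[math.exp(t) for t in th] for th in learned]
```

In this toy setup the learned weight of each class's distinctive token ends up above that of the shared generic token, which mirrors the interpretability claim: the trained weights can be read off directly as a per-token importance score. A real setup would backpropagate through the frozen text encoder rather than use finite differences.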
4. Experimental Evaluation and Benchmark Results
4.1 Datasets and Tasks
SToRI performance has been evaluated across:
- Few-shot image classification: ImageNet, DTD, SUN397, Flowers102, Caltech101, Food101, with 1–16 shots per class; using CLIP ViT-L/14 and MetaCLIP ViT-L/14 backbones.
- Preference-based image retrieval: CelebA (attributes) and CUB (bird attributes), where prompts include compositional attribute lists and retrieval is based on cosine similarity between weighted text and fixed image embeddings.
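The retrieval setup above reduces to ranking fixed image embeddings by cosine similarity to a text embedding. The sketch below shows only that ranking step; the 2-D vectors and the names `rank_images`, `neutral`, and `emphasized` are hypothetical placeholders, not values produced by CLIP.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_emb, image_embs):
    """Return image indices ordered from most to least similar to the text."""
    sims = [cosine(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: sims[i], reverse=True)

# Fixed image embeddings (made-up): the text side is the only thing that changes.
image_embs = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.6]]
neutral = [0.7, 0.7]      # stand-in for a uniformly weighted prompt embedding
emphasized = [0.95, 0.3]  # stand-in for the embedding after emphasizing one attribute

ranking_neutral = rank_images(neutral, image_embs)
ranking_emphasized = rank_images(emphasized, image_embs)
```

Because the image embeddings never change, any shift in the ranking comes entirely from how the token weights reshape the text embedding.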
4.2 Metrics and Results
- Classification accuracy: SToRI matches or surpasses task residual tuning (TaskRes) in 1–2 shot classification and stays on par at 4–16 shots, with typical gains of 0.3–1.3% across the datasets.
- Retrieval: Emphasizing a token (e.g., raising its weight above the default) increases Average Precision (AP), Precision@k, and AUC for relevant images. AP improves under emphasis on both CelebA and CUB.
Ablation studies confirm that SToRI’s performance gains derive from semantically meaningful token emphasis, not parameter count inflation.
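For reference, the two ranking metrics named above can be computed from a ranked list of binary relevance labels as follows. These are the standard textbook definitions; that the paper computes them in exactly this form is an assumption, and the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    precisions = []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(ranked_relevance[:k]) / k

# 1 = image has the emphasized attribute, 0 = it does not (toy ranking).
ranked = [1, 0, 1, 1]
ap = average_precision(ranked)    # (1/1 + 2/3 + 3/4) / 3
p_at_2 = precision_at_k(ranked, 2)
```

Moving relevant images toward the top of the ranking raises both numbers, which is how an effective emphasis on the matching attribute token would show up in these metrics.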
4.3 Comparative Analysis
SToRI yields a monotonic, interpretable emphasis response owing to the normalization of attention after reweighting, avoiding pitfalls seen in naive prompt-weighting methods that directly scale intermediate token embeddings (such methods saturate or fail under extreme weights).
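A short worked case shows why normalization after reweighting gives a bounded, monotonic response within a single attention row. The setup is a simplifying assumption for illustration, not a derivation taken from the paper: only key token $j$ carries a weight $w \ge 0$, every other token keeps weight 1, and $a_{ij} \in (0, 1)$ denotes the unweighted softmax coefficient from query $i$ to key $j$. Scaling the numerator by $w$ and renormalizing gives:

$$\hat{a}_{ij}(w) = \frac{w\,a_{ij}}{1 + (w - 1)\,a_{ij}}, \qquad \frac{\partial \hat{a}_{ij}}{\partial w} = \frac{a_{ij}\,(1 - a_{ij})}{\bigl(1 + (w - 1)\,a_{ij}\bigr)^{2}} > 0$$

The coefficient therefore rises monotonically from 0 at $w = 0$ toward 1 as $w \to \infty$ and always remains a valid attention weight. This covers one row of one attention layer only; the end-to-end response of the text embedding is what the experiments measure.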
5. Analytical and Interpretability Insights
Visualization of trained token weights on downstream prompts demonstrates alignment with discriminative content: in DTD, tokens like “striped” receive higher weights for striped textures; in CUB, color adjectives are highlighted based on class contrast. Experiments inserting nonsensical tokens (e.g., “sks”, “pll”) show that SToRI training does not inadvertently allocate importance to spurious tokens, underscoring semantic fidelity.
Block position ablation indicates robustness: the choice of which Transformer block to begin reweighting at is not critical, provided reweighting spans a sufficient range of blocks; applying it in only a single block is insufficient.
6. Limitations and Future Directions
SToRI’s limitations are rooted in prompt coverage and foundational model biases. If a text prompt omits key semantic distinctions, reweighting cannot conjure new information. SToRI cannot address or mitigate biases already present in the pretrained CLIP encoder.
Possible directions for extension include adapting the reweighting approach to image-side patch tokens, application to generative vision-language models, and joint, interpretable multimodal fine-tuning for both image and text weights. These directions could further enhance controllability and interpretability across modalities (Kim et al., 2024).
7. Relationship to Token Reweighting in Vision Pathways
In parallel, semantic-spatial reweighting has been adopted in the visual pipeline of CLIP for open-vocabulary semantic segmentation under the LHT-CLIP framework (Zhou et al., 27 Oct 2025). While SToRI targets text-side token importance, LHT-CLIP modifies residual-path contribution and attention in late visual Transformer blocks to restore visual discriminability. This suggests a convergent interest in token-level weighting at both linguistic and visual levels within the CLIP architecture, although technical mechanisms and optimization targets differ. A plausible implication is that jointly optimized or explicitly coupled token reweighting across text and vision streams may enable further advances in interpretable, user-controllable VLMs.