
ATPrompt: Anchored Textual Prompt Learning

Updated 21 February 2026
  • ATPrompt is a paradigm that embeds universal attribute tokens into textual prompts, expanding vision-language models beyond category-only representations.
  • It introduces a differentiable attribute search over a large candidate pool, optimizing attribute selection to improve image–text alignment and zero-shot transfer.
  • Empirical evaluations on 11 datasets demonstrate enhanced harmonic mean performance and domain generalization compared to standard methods like CoOp and CoCoOp.

Attribute-Anchored Textual Prompt Learning (ATPrompt)

ATPrompt refers to the Attribute-anchored Textual Prompt learning paradigm for vision-language models (VLMs), designed to improve the generalization and transferability of prompt-based methods. ATPrompt expands prompt learning beyond rigid category-centric formulations by explicitly embedding universal attribute information and optimizing attribute selection. This approach was introduced to address the core limitation of aligning visual features only with known classes, which constrains zero-shot transfer to novel categories (Li et al., 2024).

1. Motivation and Conceptual Framework

Standard textual prompt-learning methods in VLMs, such as CoOp, concatenate learnable soft tokens $[T_1, \ldots, T_M]$ with a hard class token to compose prompts of the form "$[T_1] \ldots [T_M]$ [CLS]". The training objective compels alignment of image representations strictly with those of known class names, reducing performance on unseen (novel) classes. In contrast, human cognitive processes often rely on attributes (color, shape, texture, etc.) to describe unfamiliar objects.

ATPrompt integrates a compact set of universal attribute tokens (e.g., [color], [shape]) as explicit anchors in the learnable textual prompt. This structurally expands the prompt space from its original one-dimensional, category-only manifold into a higher-dimensional attribute–category hybrid, thus facilitating more robust image–text alignment capable of generalizing beyond base categories (Li et al., 2024).

2. Mathematical Formulation

Let $d$ denote the embedding dimension and $(h_I, h_T)$ the image and text encoders of a frozen VLM. ATPrompt constructs the prompt as follows:

  • Fix a selection of $n$ attributes, each represented by a frozen embedding $a_j \in \mathbb{R}^d$.
  • For each attribute $j$, learn a soft prompt $T^{\text{attr}}_j \in \mathbb{R}^{m \times d}$; additionally learn a class-related soft prompt $T^{\text{cls}} \in \mathbb{R}^{k \times d}$.
  • For class $c$ with embedding $c \in \mathbb{R}^d$ (from the [CLS] token), form

$$P_c = [\, T^{\text{attr}}_1; a_1; \ldots; T^{\text{attr}}_n; a_n; T^{\text{cls}}; \text{[CLS]} \,] \in \mathbb{R}^{(n(m+1) + k + 1) \times d}.$$

  • The encoder produces text features $w_c = h_T(P_c)$, which are used for classification via

$$p(c \mid x) = \frac{\exp(\cos(u, w_c)/\tau)}{\sum_{i} \exp(\cos(u, w_i)/\tau)}$$

with $u = h_I(x)$; $\theta$ collects the learnable soft-token parameters.

The cross-entropy objective is

$$L_{\mathrm{train}}(\theta) = -\sum_{(x,c) \in D} \log p(c \mid x; \theta),$$

where only soft-token parameters are optimized.
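As a concrete illustration, the prompt assembly and cosine-softmax classification above can be sketched in NumPy. All dimensions, the "encoders", and the embeddings below are toy stand-ins chosen for this sketch (the real method uses a frozen CLIP text encoder), not the authors' implementation:

```python
import numpy as np

# Toy sketch of P_c assembly and the cosine-softmax classifier.
# Dimensions and the "encoders" are illustrative stand-ins, not CLIP.
d, n, m, k, C = 8, 2, 4, 4, 3   # embed dim, #attributes, prompt lengths, #classes
rng = np.random.default_rng(0)

attr = rng.normal(size=(n, d))        # frozen attribute embeddings a_j
T_attr = rng.normal(size=(n, m, d))   # learnable per-attribute soft prompts
T_cls = rng.normal(size=(k, d))       # learnable class-related soft prompts

def build_prompt(cls_embed):
    """P_c = [T_1^attr; a_1; ...; T_n^attr; a_n; T^cls; [CLS]]."""
    blocks = []
    for j in range(n):
        blocks += [T_attr[j], attr[j:j + 1]]
    blocks += [T_cls, cls_embed[None, :]]
    return np.concatenate(blocks, axis=0)

P = build_prompt(rng.normal(size=d))
assert P.shape == (n * (m + 1) + k + 1, d)   # (15, 8)

# Stand-in text encoder h_T: mean-pool prompt tokens into one feature.
w = np.stack([build_prompt(rng.normal(size=d)).mean(axis=0) for _ in range(C)])
u = rng.normal(size=d)                # stand-in image feature u = h_I(x)

def cosine_softmax(u, w, tau=0.01):
    u_hat = u / np.linalg.norm(u)
    w_hat = w / np.linalg.norm(w, axis=1, keepdims=True)
    logits = (w_hat @ u_hat) / tau
    logits -= logits.max()            # numerical stability
    e = np.exp(logits)
    return e / e.sum()

p = cosine_softmax(u, w)              # class probabilities p(c|x)
```

Only `T_attr` and `T_cls` would receive gradients under the cross-entropy objective; the attribute embeddings, class embedding, and both encoders stay frozen.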

3. Differentiable Attribute Search

Attribute selection is posed as a differentiable search over a large candidate pool $\mathcal{V}$ (of size $M \gg n$) of possible attribute subsets $v_i$. ATPrompt assigns softmax-relaxed weights $\alpha_i$ to each candidate subset and alternates minimization of the training loss over the soft prompts with minimization of the validation loss over $\alpha$. After this alternation, the attribute subset with the maximum $\alpha_i$ is selected and fixed for prompt learning.

Pseudocode outline:

Initialize θ randomly, α = 0
for epoch in 1..E_search:
    # Update soft prompts on D_train
    θ ← θ − η_θ ∇_θ L_train(θ, α)
    # Update attribute weights on D_val
    α ← α − η_val ∇_α L_val(θ, α)
end
Select v_hat = argmax_i α_i

This approach, unlike static or hand-engineered attribute sets, systematically discovers attributes that maximize discriminability and transfer for the downstream task (Li et al., 2024).
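A minimal NumPy sketch of the alternating scheme: softmax-relaxed weights α over M candidate subsets, with a synthetic per-candidate validation loss standing in for the true objective (the real method backpropagates the validation loss through prompt training; everything here, including `val_quality`, is an assumption made for illustration):

```python
import numpy as np

# Toy alternating optimization for differentiable attribute search.
# val_quality is a synthetic per-candidate validation loss, not real data.
rng = np.random.default_rng(1)
M = 5                                  # candidate subsets in the pool V
alpha = np.zeros(M)                    # softmax-relaxed selection weights
theta = rng.normal(size=3)             # stand-in soft-prompt parameters
val_quality = np.array([0.9, 0.4, 0.7, 0.1, 0.6])   # lower = better subset

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for epoch in range(100):
    # theta step on a stand-in training loss L_train = ||theta||^2
    theta -= 0.1 * (2 * theta)
    # alpha step on the expected validation loss  L_val = softmax(alpha) @ q
    weights = softmax(alpha)
    grad_alpha = weights * (val_quality - weights @ val_quality)
    alpha -= 1.0 * grad_alpha

best = int(np.argmax(alpha))           # subset with lowest synthetic val loss
```

The gradient line is the exact derivative of the softmax-weighted expected loss, so weight mass flows toward candidates whose validation loss is below the current weighted average.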

4. Pipeline Integration and Computational Considerations

ATPrompt does not modify the VLM backbone (e.g., CLIP); additional tokens are inserted into the text input stream. Two variants are supported:

  • Shallow: Attributes injected only at initial input to the text transformer.
  • Deep: Attribute tokens persist through all layers, while class soft tokens are dropped and re-inserted layer-wise.

Training schedules, batch sizes, and augmentation strategies are inherited from the baselines. Compute overhead is minimal: attribute tokens typically add fewer than 1k parameters and require no additional forward/backward passes beyond standard prompt learning. The attribute search is performed once (≈40 minutes on a single V100 GPU). Final training cost is comparable to that of the base prompt method and never exceeds twice it.
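The deep variant can be sketched as follows, with attribute tokens flowing through every layer while a fresh class-prompt slice is swapped in per layer. The transformer block, slice positions, and shapes are all illustrative assumptions, not the frozen VLM's actual layers:

```python
import numpy as np

# Sketch of the "deep" variant: attribute tokens persist across layers,
# class soft prompts are dropped and re-inserted layer-wise ("deep-drop").
# block() is a placeholder, not a real frozen transformer layer.
rng = np.random.default_rng(2)
L, d = 3, 8                                  # layers, embedding dim
n_attr_tokens, n_cls_prompts = 4, 2

def block(x):                                # stand-in frozen transformer block
    return x + 0.01 * np.tanh(x)

attr_tokens = rng.normal(size=(n_attr_tokens, d))
cls_prompts = [rng.normal(size=(n_cls_prompts, d)) for _ in range(L)]  # per-layer T^cls
cls_token = rng.normal(size=(1, d))

x = np.concatenate([attr_tokens, cls_prompts[0], cls_token], axis=0)
for layer in range(L):
    x = block(x)
    if layer + 1 < L:
        # deep-drop: replace only the class-prompt slice, keep attribute tokens
        x = np.concatenate([x[:n_attr_tokens], cls_prompts[layer + 1], x[-1:]], axis=0)
```

In the shallow variant, the loop would simply apply `block` repeatedly with no re-insertion.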

5. Empirical Evaluation

ATPrompt was assessed on 11 classification datasets: ImageNet-1K, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVC-Aircraft, SUN397, DTD, EuroSAT, UCF101. Metrics include accuracy on base and novel classes and their harmonic mean (HM), as well as cross-dataset generalization.

  • Averaged HM improvements vs. five baselines: CoOp (+2.99%), CoCoOp (+2.12%), KgCoOp (+0.57%), MaPLe (+0.55%), PromptSRC (+0.21%).
  • On ImageNet: CoOp with ATPrompt achieves 73.33% HM vs. 71.92% for baseline.
  • On domain-shifted ImageNet variants, gains are observed for all major baselines.
  • Ablations show optimal performance with short soft prompts (small $m$, $k$), end-position class tokens, and the "deep-drop" strategy applied to class tokens only; searched attributes depend only weakly on ordering ($\pm 0.2\%$ effect). Searched attributes consistently outperform random or hand-picked alternatives.
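For reference, the HM metric used above is the harmonic mean of base- and novel-class accuracy; a quick sketch (the numeric values are illustrative only, not results from the paper):

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base- and novel-class accuracy (the HM metric)."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

# Illustrative values only: HM penalizes imbalance between base and novel,
# so it falls below the arithmetic mean whenever the two accuracies differ.
hm = harmonic_mean(80.0, 60.0)   # below the arithmetic mean of 70.0
```

This is why HM is preferred over a simple average for base-to-novel generalization: a method cannot inflate it by excelling on base classes alone.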

6. Generality, Limitations, and Prospects

ATPrompt is a flexible plug-in for any textual prompt learner, providing improved generalization to both known and unknown classes via explicit attribute anchoring. Nevertheless, it assumes explicit, human-interpretable attributes and relies on LLM-curated candidate pools plus a small validation set for attribute search.

Potential future directions include automated attribute extraction using chain-of-thought reasoning or multimodal LLMs, as well as replacing fixed attribute tokens with fully learnable embeddings to enable richer, implicit attribute discovery (Li et al., 2024).


References

  • Advancing Textual Prompt Learning with Anchored Attributes (Li et al., 2024)