LoGoPrompt: Visual Prompt Learning

Updated 1 December 2025
  • LoGoPrompt is a visual prompt learning method that employs synthetic text images as class-specific visual prompts for vision-language models.
  • It reformulates image recognition as a prompt selection task by overlaying class text patches and optimizing a min–max contrastive loss.
  • LoGoPrompt achieves superior few-shot classification and domain generalization without additional trainable parameters in the visual prompt space.

LoGoPrompt is a visual prompt learning methodology for vision-language models that repurposes synthetic text images—image patches containing rendered class names—as class-wise visual prompts. LoGoPrompt reformulates image recognition as a prompt selection task: for each test image, the correct class is determined by selecting the most compatible class-specific synthetic text image, which is overlaid onto the input and scored with a CLIP-style contrastive model. LoGoPrompt uses a min–max contrastive loss to learn this selection mechanism, addressing the chicken-and-egg problem of simultaneously identifying the correct prompt and the class label. This approach outperforms previous visual and text-only prompt learning baselines on few-shot, base-to-new, and domain generalization tasks, with no additional trainable parameters in the visual prompt space (Shi et al., 2023).

1. Problem Formulation and Motivation

LoGoPrompt addresses few-shot image classification and domain generalization in CLIP-style vision-language models. Given input space $\mathcal{X}$, class set $\mathcal{C}$, frozen CLIP image encoder $f\colon \mathcal{X} \to \mathbb{R}^d$, and frozen text encoder $g\colon \{\text{token sequences}\} \to \mathbb{R}^d$, the baseline zero-shot prediction uses a textual prompt $\ell_c$ for each class $c$ and returns

$$p(c \mid x) = \frac{\exp\bigl(\cos(f(x), g(\ell_c)) / \tau\bigr)}{\sum_{i=1}^{C} \exp\bigl(\cos(f(x), g(\ell_i)) / \tau\bigr)}$$

where $\tau$ is the CLIP model's temperature.
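As a concrete point of reference, the zero-shot baseline can be sketched as follows, assuming the OpenAI `clip` package; the prompt template, class names, and image path are illustrative placeholders, not settings from the paper.

```python
# Zero-shot CLIP baseline corresponding to the equation above,
# assuming the OpenAI `clip` package (https://github.com/openai/CLIP).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["cat", "dog", "car"]  # illustrative class set C
text_prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    img_feat = model.encode_image(image)        # f(x)
    txt_feat = model.encode_text(text_prompts)  # g(ell_c) for each class
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # logit_scale is 1/tau; softmax over classes gives p(c | x)
    probs = (model.logit_scale.exp() * img_feat @ txt_feat.t()).softmax(dim=-1)
```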

LoGoPrompt introduces synthetic text images $V_c \in \mathbb{R}^{h \times w \times 3}$ as class-specific visual prompts. During inference, for each candidate class $c$, the synthetic patch $V_c$ is overlaid at a random spatial location on image $x$ to produce $x_c = \mathrm{Overlay}(x, V_c)$. The selection objective is to identify

$$\hat y = \arg\max_{c \in \{\text{top-}K \text{ from } x\}} \max\bigl\{p(c \mid x),\, p(c \mid x_c)\bigr\}$$

where $K$ is the number of top candidate classes identified by text-only CLIP.
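A minimal sketch of this selection rule is given below, assuming two hypothetical helpers: `clip_probs` (the class-probability function above) and `overlay` (pastes a synthetic text patch at a random location). Neither name comes from the paper's released code.

```python
import torch

def predict_by_prompt_selection(image, text_patches, clip_probs, overlay, k=5):
    """Return argmax_c max{p(c|x), p(c|x_c)} over the top-K text-only candidates.

    text_patches: dict class index -> synthetic text image V_c
    clip_probs(img) -> 1-D tensor of p(c | img) over all classes (hypothetical helper)
    overlay(img, patch) -> img with patch pasted at a random location (hypothetical helper)
    """
    base_probs = clip_probs(image)                       # p(c | x) from text-only CLIP
    candidates = torch.topk(base_probs, k).indices.tolist()
    best_class, best_score = None, float("-inf")
    for c in candidates:
        x_c = overlay(image, text_patches[c])            # x_c = Overlay(x, V_c)
        score = max(base_probs[c].item(), clip_probs(x_c)[c].item())
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```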

The key motivation is to directly encode class semantic priors into the visual modality without additional trainable visual prompt parameters, addressing both limited sample regimes (few-shot) and generalization across domains. This is in contrast to previous visual prompt tuning methods, which have limited generalization and require parameterizing visual prompts (Shi et al., 2023).

2. Construction of Synthetic Text Image Prompts

A synthetic text image $V_c$ for class $c$ is constructed as follows:

  1. Draw random RGB values for the background and text colors.
  2. Render the class-name string $[\text{class}]_c$ using a standard font (e.g., sans-serif) at the center of a blank image of size $h \times w$.
  3. Set $h \approx w \approx 1/7$ of the original input image's shorter side (empirically optimal compared to ratios of $1/14$ or $2/7$).
  4. For each training or inference example, select a random spatial location for overlaying $V_c$ onto $x$.

This class-wise synthetic patch renders the label's name directly in the image's pixel space. The approach exploits the multimodal alignment abilities of CLIP, leveraging the fact that class names are already anchor points in CLIP's joint embedding space.
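The construction can be sketched with PIL as below; the patch size (32 px, roughly 1/7 of a 224-px CLIP input), font, and color sampling are illustrative assumptions rather than the paper's exact settings.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def make_text_patch(class_name, patch_size=32):
    """Synthetic text image V_c: random background/text colors, class name centered.
    patch_size=32 corresponds to roughly 1/7 of a 224-pixel CLIP input (assumption)."""
    bg = tuple(random.randint(0, 255) for _ in range(3))
    fg = tuple(random.randint(0, 255) for _ in range(3))
    patch = Image.new("RGB", (patch_size, patch_size), bg)
    draw = ImageDraw.Draw(patch)
    font = ImageFont.load_default()  # any standard font; the exact font is an assumption
    left, top, right, bottom = draw.textbbox((0, 0), class_name, font=font)
    w, h = right - left, bottom - top
    draw.text(((patch_size - w) // 2, (patch_size - h) // 2), class_name, fill=fg, font=font)
    return patch

def overlay(image, patch):
    """Paste V_c at a random spatial location on a copy of the input image."""
    out = image.copy()
    x = random.randint(0, out.width - patch.width)
    y = random.randint(0, out.height - patch.height)
    out.paste(patch, (x, y))
    return out
```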

No additional parameters are trained for these visual prompts; all learning remains in the standard continuous text prompt embedding space (optionally including CLIP's LayerNorm), not in image pixel space (Shi et al., 2023).
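The trainable component is therefore the CoOp-style continuous text context; a minimal sketch of such learnable context vectors is given below, with shapes and initialization chosen for illustration only.

```python
import torch
import torch.nn as nn

class LearnableTextContext(nn.Module):
    """CoOp-style learnable context: M continuous vectors prepended to each
    class name's token embeddings before the frozen CLIP text encoder."""

    def __init__(self, n_ctx=16, ctx_dim=512):
        super().__init__()
        # The only trainable parameters live here, not in pixel space.
        self.ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim).normal_(std=0.02))

    def forward(self, class_name_embeddings):
        """class_name_embeddings: (num_classes, n_name_tokens, ctx_dim) frozen
        token embeddings of the class names."""
        n_cls = class_name_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # (num_classes, M + n_name_tokens, ctx_dim), fed to the frozen text transformer
        return torch.cat([ctx, class_name_embeddings], dim=1)
```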

3. Min–Max Contrastive Learning and Prompt Selection

LoGoPrompt's min–max contrastive loss formalizes learning as prompt selection over visual overlays:

  • For training pair $(x, y)$, form the positive overlay $x_y = \mathrm{Overlay}(x, V_y)$.
  • Select $K$ hard negative classes $c_1, \dots, c_K$ with the highest CLIP text-only predicted probabilities on $x$.
  • Form negative overlays $x_{c_k} = \mathrm{Overlay}(x, V_{c_k})$.
  • Define the positive group $\{(x, y), (x_y, y)\}$ and $K$ negative groups $\{(x, c_k), (x_{c_k}, c_k)\}$.

The min–max contrastive loss is
$$\mathcal{L}_{\mathrm{mm}}(x, y) = -\log \frac{\min\{p(y \mid x),\, p(y \mid x_y)\}}{\min\{p(y \mid x),\, p(y \mid x_y)\} + \sum_{k=1}^{K} \max\{p(c_k \mid x),\, p(c_k \mid x_{c_k})\}}$$

This objective pushes up the weaker of the two positive scores while pushing down the strongest score among the negatives. By scoring both $x$ and $x_c$ for each candidate $c$ and maximizing over them, the model learns to select the best visual prompt even when the class is a priori unknown, effectively solving the chicken-and-egg problem in visual prompt selection. During inference, final classification maximizes over this joint prompt selection (Shi et al., 2023).
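A minimal PyTorch sketch of this loss, assuming the per-class probabilities have already been computed by the frozen CLIP head; tensor layouts and names are illustrative.

```python
import torch

def min_max_contrastive_loss(p_x, p_overlay, y, neg_classes):
    """Min-max contrastive loss for one training pair (x, y).

    p_x:         (C,) tensor, p(c | x) on the plain image
    p_overlay:   (K+1,) tensor, p_overlay[0] = p(y | x_y) and
                 p_overlay[k+1] = p(c_k | x_{c_k}) for the K hard negatives
    y:           ground-truth class index
    neg_classes: list of K hard-negative class indices
    """
    pos = torch.minimum(p_x[y], p_overlay[0])              # weaker positive score
    negs = [torch.maximum(p_x[c], p_overlay[k + 1])        # stronger score per negative group
            for k, c in enumerate(neg_classes)]
    denom = pos + torch.stack(negs).sum()
    return -torch.log(pos / denom)
```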

4. Implementation Details

  • Backbones: CLIP ViT-B/16 or ResNet-50.
  • Visual prompt block size: $h = w = 1/7$ of the input image's shorter side; random placement for each overlay.
  • Prompt length: $M = 16$ context tokens for few-shot, $M = 4$ for base-to-new/domain generalization.
  • Negative mining: $K = 5$ hard negatives (see the sketch after this list).
  • Training schedules: follow CoOp/CoCoOp (16 shots per class, 50 epochs), with matching data augmentations and learning-rate schedules.
  • No trainable visual prompt parameters: Only the text prompt embeddings (and optionally one LayerNorm) are updated.
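A sketch of the hard-negative mining step referenced above, assuming `text_only_probs` holds the precomputed $p(c \mid x)$ from frozen text-only CLIP:

```python
import torch

def hard_negative_classes(text_only_probs, y, k=5):
    """Pick the K classes frozen text-only CLIP is most confused about,
    excluding the ground-truth class y."""
    probs = text_only_probs.clone()
    probs[y] = float("-inf")   # never select the true class as a negative
    return torch.topk(probs, k).indices.tolist()
```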

Empirical evaluation ablates these design settings, confirming that the $1/7$ prompt block ratio and random position placement are optimal. Removing prompt selection, or training only with a paired $(x, V_c)$ contrastive loss, degrades accuracy (Shi et al., 2023).

5. Comparative Evaluation and Results

LoGoPrompt is benchmarked on 16 datasets across three settings:

  • Base-to-New Generalization: 16-shot training on base classes, test on base and novel.
    • On 11 standard datasets (harmonic mean of base and new accuracy): CoOp $H = 71.66$, CoCoOp $H = 75.83$, DPT $H = 74.28$, LoGoPrompt $H = 79.03$.
    • On ImageNet alone: LoGoPrompt achieves 73.66, outperforming CoCoOp (73.10).
  • Few-Shot Classification: LoGoPrompt achieves 76.3% (16-shot, average), CoOp 73.4%, with gains across all shot regimes.
  • Domain Generalization: Trained on ImageNet, evaluated on ImageNetV2/Sketch/A/R: CLIP 59.08, CoOp 61.72, VPT 57.69, LoGoPrompt 63.82.

Ablation studies show that full min–max prompt selection is essential (ablated variants: 63.5–66.3% vs. 66.3% for the full method); simply using synthetic text images as data augmentation, without selection, underperforms.

| Method | Prompt | Adapter | Avg. Acc. (16-shot) |
|---|---|---|---|
| CoOp | TP | – | 73.42 |
| VisPrompt | VP+TP | – | 61.84 |
| Tip-Adapter | TP | Yes | 70.32 |
| LoGoPrompt (ours) | VP+TP | – | 76.31 |
| LoGoPrompt+Adapter | VP+TP | Yes | 77.28 |

(TP = text prompt, VP = visual prompt.)
  • Block size and placement ablation: $1/7$ block ratio and random location yield best results.
  • No additional model complexity: LoGoPrompt requires no trained visual prompt parameters, making it parameter-efficient compared to contemporary visual prompt tuning approaches (Shi et al., 2023).

6. Relation to Prior Visual Prompt Tuning Methods

LoGoPrompt stands in contrast to earlier visual prompt tuning works such as Visual Prompting (Bahng et al., 2022), Visual Prompt Tuning (VPT) (Jia et al., ECCV 2022), and DPT (Xing et al., 2022), which either fine-tune visual tokens or use adapter modules, often incurring additional parameters and lower generalization.

A plausible implication is that LoGoPrompt's use of pixel-space, semantic-rich overlays directly leverages CLIP's pre-trained anchor points, making it robust across few-shot and domain-shift settings. The approach sidesteps additional visual prompt parameterization and complex auxiliary architectures.

Empirical results suggest that effective visual prompt selection and min–max contrastive training are the primary contributors to LoGoPrompt's superior performance, not merely the addition of class-specific synthetic examples.

7. Limitations, Extensions, and Significance

  • Limitation: Direct reliance on rendered English class-name strings may be suboptimal for languages or datasets where class names are visually or semantically ambiguous.
  • Ablation coverage: no results are reported for alternative font styles, multi-word class names, or semantic augmentations.
  • Extensions: Adapting LoGoPrompt's overlay and selection mechanism to segmentation or localization tasks, or combining it with recent advances in local-global feature prompting (cf. GalLoP; Lafon et al., 2024), may yield further improvements.
  • Significance: The work points toward parameter-free, prompt-selection-based visual prompting as a practical direction for robust, sample-efficient adaptation of multimodal networks.

LoGoPrompt provides a methodology for robust visual prompt learning by leveraging class-name text overlays, with significant, empirically validated gains over both text-only and visual parameter-based prompt baselines (Shi et al., 2023).
