Visual Prompt Adaptation (VPA)

Updated 7 March 2026

Visual Prompt Adaptation is a parameter-efficient learning paradigm that injects learnable prompts into fixed vision models, enabling efficient adaptation to new tasks while keeping core weights frozen.
VPA methods employ pixel-level, token-level, and region/instance-adaptive approaches to achieve strong performance across classification, retrieval, and dense prediction tasks.
Recent advances optimize prompt placement, regularization, and dynamic adaptation to robustly handle distribution shifts, reduce computational cost, and enhance overall model efficiency.

Visual Prompt Adaptation (VPA) is a parameter-efficient learning paradigm developed to adapt large, pretrained vision models—particularly Vision Transformers (ViTs) and related architectures—to new downstream tasks or domains by learning a small, dedicated set of parameters called visual prompts. These prompts are injected either at the input or at selected feature layers while the core model weights remain frozen. VPA methods achieve strong adaptation performance across classification, retrieval, and dense prediction tasks, often approaching or even surpassing full fine-tuning at a fraction of the computational and storage cost. Modern VPA research addresses both parameter efficiency and the stability of adaptation under distribution shift, with emergent algorithms targeting various use-cases: offline transfer, test-time adaptation, region-sensitive adaptation, and multi-modal or multi-expert scenarios.

1. Formal Taxonomy and Mathematical Framework

Visual Prompt Adaptation encompasses a family of techniques that modify the pretrain–finetune paradigm by injecting learnable parameters—as prompts—into fixed vision architectures (Xiao et al., 15 Oct 2025). Prompt-based methods fall along two main axes:

Injection granularity:
- Pixel-level (input-space): Additive or concatenative learnable overlays at the pixel level (Bahng et al., 2022, Wu et al., 2022), e.g., masked padding, affine/color transformations (Enomoto, 9 Oct 2025).
- Token-level (feature-space): Learnable vectors prepended/interleaved with patch or intermediate features at network layers (Xiao et al., 15 Oct 2025, Zhang et al., 2024).
Prompt type:
- Fixed: Pre-specified hand-crafted signals (e.g., boxes, patterns).
- Learnable: Direct optimization of prompt values via gradient descent (Wu et al., 2022).
- Generated: Dynamic or per-instance prompt generation through auxiliary networks or memory (Xiao et al., 22 Mar 2025, Kunananthaseelan et al., 2023).

Mathematically, for a pre-trained model $f_\theta$ and downstream input $x$ , a visual prompt parameterized by $P$ yields prediction

$\hat{y} = f_\theta\big(\mathcal{P}_P(x)\big)$

where $\mathcal{P}_P$ denotes the prompt operator (additive/concatenative/feature-level). The prompt parameters $P$ are learned to minimize empirical risk on the downstream task while keeping $\theta$ fixed: $\min_P\ \mathbb{E}_{(x, y)}\, \mathcal{L}\left(f_\theta\big(\mathcal{P}_P(x)\big), y\right)$ Common regularizations include sparsity, low-rank constraints, or auxiliary alignment losses to the original feature space (Xu et al., 2023, Yang et al., 2024, Le et al., 31 Jan 2025).

2. Core Algorithms and Methodological Variants

Several algorithmic families have emerged in VPA, each optimizing different aspects of adaptation efficiency and flexibility.

2.1 Pixel-level Prompting

Classic methods learn a universal additive perturbation or a shrink-and-pad template around the input image (Wu et al., 2022, Bahng et al., 2022). Enhanced approaches introduce expressivity via affine, color, and additive components (ACAVP), regularized by data augmentation (TrivialAugment) to avoid overfitting even as parameter count increases (Enomoto, 9 Oct 2025). LoR-VP introduces a low-rank row-column decomposition, sharing information across pixels and reducing parameter count by $18\times$ compared to standard frame-based approaches while improving accuracy (Jin et al., 2 Feb 2025).

2.2 Token-based Prompting (VPT, E²VPT, VAPT, CVPT)

Token-level adaptation leverages learnable prompt vectors injected at various layers.

Shallow VPT prepends tokens to the input sequence (Xiao et al., 15 Oct 2025).
Deep VPT inserts prompts at every block for increased capacity (Han et al., 2023).
E²VPT injects key–value prompts into self-attention and employs pruning to reduce redundant tokens, using as little as $0.07$– $x$ 0 of total parameters (Han et al., 2023).
Visual Adaptive Prompt Tuning (VAPT) replaces static prompt tokens by input-adaptive functions, achieving minimax-optimal sample efficiency and outperforming even full fine-tuning on VTAB-1K and FGVC (+7.3% and +1.0%, respectively) (Le et al., 31 Jan 2025).
CVPT (Cross-Visual Prompt Tuning) introduces cross-attention to semantically align prompt tokens with patch embeddings, surpassing deep VPT by $x$ 1– $x$ 2 points in average accuracy, and closes the gap to adapter-based PEFT methods (Huang et al., 2024).

2.3 Region-, Instance-, and Expert-Adaptive Prompts

AdaViPro optimizes both “what” to add and “where,” using a Gumbel-Softmax-driven region mask to spatially select prompt application, outperforming standard VP by up to $x$ 3 and maintaining efficiency at large prompt widths (Yang et al., 2024).
V²APT applies a VAE to generate instance-adaptive prompts per input, outperforming static deep VPT by $x$ 4 mean accuracy on transfer benchmarks (Xiao et al., 22 Mar 2025).
pMoE (Prompt Mixture-of-Experts) orchestrates prompt tokens from multiple diverse expert models with a learned dispatcher/gating mechanism, yielding substantial gains (e.g., $x$ 5 on VTAB-1K) while maintaining low parameter overhead and broad cross-domain applicability (Mo et al., 26 Feb 2026).

2.4 Test-Time Visual Prompt Adaptation

Recent methods address test-time adaptation under domain shift and without source data:

OT-VP (Optimal Transport-guided VP) learns as few as four prompt tokens at test time via optimization of OT distance aligning source and target feature-label distributions. This achieves state-of-the-art adaptation across several benchmarks with only $x$ 6 trainable parameters ( $x$ 7 of DePT) (Zhang et al., 2024).
VPA (Test-Time) and DePT execute entropy-minimization or pseudo-labeling to tune prompts online, surpassing batch-norm or feature-statistic-based TTA by $x$ 8 to $x$ 9 on OOD and corruption benchmarks (Sun et al., 2023, Gao et al., 2022). OT-VP improves OOD robustness while retaining low latency and parameter count compared to DePT (Zhang et al., 2024).

3. Quantitative Performance and Efficiency

Across standard vision transfer and robustness benchmarks, VPA approaches achieve strong empirical results:

Method	Params Tuned	Avg. Acc. (12 cls tasks)	VTAB-1K Avg (%)	FGVC Avg (%)	OOD Robustness
Linear Probe	0.04–0.08 M	80.3–86.0	79.9	—	Moderate
VP/EVP	0.07 M	76.5–82.5	76.0	—	Good
VPT-Deep	0.09–2.0 M	77.3–89.1	72.0–74.0	89.1	Good
E²VPT	0.32–0.39 M	—	80.0	89.2	Superior
OT-VP	3\,072	—	—	—	SOTA OOD
ACAVP	0.3–0.4 M	83.2	—	—	SOTA (CIFAR10-C)
LoR-VP	10 K	+3.1pp over AutoVP	—	—	+10.6pp on OOD
pMoE	1.3 M	—	80.3	86.1	Broad SOTA

EVP outperforms both shallow VP and LP, and matches or beats FT on several OOD benchmarks (Wu et al., 2022).
ACAVP achieves SOTA VP performance, $P$ 0pp on CIFAR10-C over prior prompt methods (Enomoto, 9 Oct 2025).
OT-VP achieves $P$ 1pp on ImageNet-C vs ERM baseline and outperforms DePT with $P$ 2 of the trainable parameters (Zhang et al., 2024).
E²VPT surpasses full fine-tuning on $P$ 3 VTAB-1K tasks at $P$ 4 parameter count (Han et al., 2023).

4. Application Domains and Extensions

VPA methods are actively extended to diverse modalities and challenging application domains (Xiao et al., 15 Oct 2025):

Compositional Zero-Shot Learning: VAPS leverages a learned prompt repository with similarity-based retrieval and a visual prompt adapter to achieve state-of-the-art compositional reasoning, outperforming previous CZSL techniques (Stein et al., 27 Feb 2025).
Multimodal/Fusion Scenarios: Vision–LLMs (CLIP, LaViP) utilize both text- and pixel-grounded prompts, with language grounding accelerating adaptation and providing open-vocabulary transfer (Kunananthaseelan et al., 2023, Stein et al., 27 Feb 2025).
Cross-Domain Transfer: Medical imaging, remote sensing, point cloud perception, and egocentric video understanding are adapted via expert- or instance-driven prompt tuning (Xiao et al., 15 Oct 2025, Wu et al., 2024).
Cloud–Device Collaboration: U-VPA enables prompt-based continual adaptation in edge-computing settings, transmitting compact prompt updates under strict bandwidth constraints for device model updating (Gan et al., 2022).

5. Algorithmic and Theoretical Advances

Recent work rigorously investigates the statistical properties and expressiveness of VPA:

Expressivity: VAPT unifies prompt tuning and the MoE (Mixture-of-Experts) view, proving that standard static prompt tokens only add bias, whereas input-adaptive prompts increase functional capacity and achieve rate-optimal convergence ( $P$ 5) in sample complexity (Le et al., 31 Jan 2025).
Prompt Placement: Empirical studies reveal that deep prompts and token-level placement at multiple layers increase effectiveness, especially for dense prediction tasks (Xiao et al., 15 Oct 2025, Han et al., 2023).
Regularization & Stability: Prompt overfitting is mitigated via feature reformation losses (Xu et al., 2023), entropy-based sparsity (Yang et al., 2024), or augmentations such as TrivialAugment (Enomoto, 9 Oct 2025).
Region and Mask Learning: Adaptive spatial masking balances accuracy and efficiency, outperforming fixed prompt location methods (Yang et al., 2024), and may generalize to non-rectangular or semantically-driven regions.

6. Current Challenges and Open Directions

Several fundamental and practical questions remain central to VPA research:

Prompt Overfitting & Generalization: Small data or strong domain shifts expose overfitting in learnable prompts. Hybrid prompts, regularization, and generative/instance-adaptive prompts are active research directions (Xiao et al., 15 Oct 2025, Xu et al., 2023, Xiao et al., 22 Mar 2025).
Prompt Selection and Placement: Theoretical and automated determination of prompt size, location, and layer remains unresolved, with most current practice empirical or search-based (Han et al., 2023, Tsao et al., 2023).
Computational Efficiency: Despite low parameter counts, activation memory and prompt-generator costs can limit scalability, especially for dense or large-scale models. Advances in reversible layers, pruning, and prompt distillation are needed (Han et al., 2023, Jin et al., 2 Feb 2025).
Safety and Robustness: Backdoor attacks via prompts and fairness concerns are emerging security considerations. Prompt-specific defenses and fair prompt optimization require additional research (Xiao et al., 15 Oct 2025).
Extensibility to Dense Tasks: Most VPA benchmarks focus on classification; robust adaptation for object detection, segmentation, and video tasks is an emerging frontier (Xiao et al., 15 Oct 2025, Wu et al., 2024).
Continual and Test-Time Adaptation: Lightweight, non-catastrophic updating under continual distributional drift, without access to source data, is a priority for real-world deployments (Zhang et al., 2024, Sun et al., 2023, Gan et al., 2022).

7. Broader Impact and Emerging Applications

VPA has rapidly accelerated vision adaptation research by enabling scalable, lightweight, and interpretable deployment of foundation models. Automated search systems (AutoVP) facilitate best-practice discovery across model backbones, mapping strategies, and prompt dimensions, consistently yielding superior performance over linear probing and prior VP methods in both standard and label-scarce regimes (Tsao et al., 2023). Multi-expert, multi-modal, and dynamic-instance adaptations open new avenues for generalist vision systems, while efficient tune-once–use-everywhere adaptation mechanisms reduce privacy and compute barriers in real-world, resource-constrained environments.

Ongoing theoretical work aims to unify the field and clarify the functional boundaries between different prompt types, placements, and adaptation algorithms. The future trajectory of VPA centers on principled prompt architecture design, scalable domain adaptation, and the safe reliable transfer of visual foundation models across rapidly evolving visual and multi-modal tasks.