Test-Time Prompt Tuning (TPT)

Updated 24 June 2026

Test-Time Prompt Tuning (TPT) is an adaptation paradigm for vision-language models that optimizes a set of soft prompt tokens at inference using unlabeled test data.
TPT improves performance under distribution shifts and adversarial conditions by minimizing the entropy of aggregated predictions from augmented views with a single gradient update.
Extensions like C-TPT, O-TPT, A-TPT, and SoC introduce geometric regularizers to refine prompt updates, reducing calibration errors and enhancing robustness.

Test-Time Prompt Tuning (TPT) is an adaptation paradigm for vision-language models (VLMs), notably those in the CLIP family, that optimizes prompts on the fly using only unlabeled test data. Unlike prompt-tuning methods trained on labeled downstream tasks, TPT operates solely at inference and modifies a small set of “soft” continuous prompt parameters to improve generalization under distribution shift, class imbalance, or adversarial conditions, without any full-model fine-tuning. This section details TPT’s foundational methodology, key calibration and robustness challenges, recent regularization advances, and representative state-of-the-art solutions.

1. Underlying Principles and Formalization

At the core of CLIP-style VLMs, image classification is performed by mapping the input image $I$ and the text template $t_{c_i}$ of each class $c_i$ (“a photo of a [class]”) into a shared embedding space via frozen encoders $f_I$ and $f_T$, producing normalized features $v=f_I(I)$ and $e_{c_i}=f_T(t_{c_i})$. Classification logits are given by the cosine similarities $s_i = \langle v, e_{c_i} \rangle$, which are passed through a temperature-scaled softmax.
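The scoring rule above can be illustrated with a minimal sketch in plain Python. The toy feature vectors stand in for encoder outputs, and the temperature value of 0.01 is an assumption chosen for illustration rather than a value stated in this section.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_feat, text_feats, temperature=0.01):
    """Cosine similarity between one image feature and each class text
    feature, followed by a temperature-scaled softmax.
    `temperature` is an illustrative assumption."""
    v = l2_normalize(image_feat)
    sims = [sum(a * b for a, b in zip(v, l2_normalize(e))) for e in text_feats]
    return softmax([s / temperature for s in sims])

# Toy stand-ins for f_I(I) and f_T(t_{c_i}); real features come from the encoders.
image_feat = [0.9, 0.1, 0.0]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(image_feat, text_feats)
predicted_class = probs.index(max(probs))
```

A small temperature sharpens the softmax, so even modest differences in cosine similarity translate into confident predictions.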

Test-Time Prompt Tuning (TPT) introduces a set of $L$ learnable prompt tokens $\theta_p\in\mathbb{R}^{L\times D}$, prepended to each class template. At inference, for a single test image $I$, $N$ image augmentations $\{A_n(I)\}_{n=1}^{N}$ are generated. The prompt $\theta_p$ is updated by minimizing the entropy of the model's class posterior averaged over these augmentations: $\theta_p^{*} = \arg\min_{\theta_p} -\sum_{i=1}^{K} \bar{p}_{\theta_p}(c_i \mid I)\,\log \bar{p}_{\theta_p}(c_i \mid I)$, with $\bar{p}_{\theta_p}(c_i \mid I) = \frac{1}{N}\sum_{n=1}^{N} p_{\theta_p}(c_i \mid A_n(I))$, where $p_{\theta_p}(c_i \mid A_n(I))$ is the predicted probability for class $c_i$ on augmentation $A_n(I)$ and $K$ is the number of classes. Typically, a single gradient update (AdamW, lr ~ $5\times10^{-3}$) suffices per image or test batch. The prompt is then reset for the next sample, enabling lightweight per-sample adaptation (Shu et al., 2022).
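The quantity being minimized, the entropy of the class posterior averaged over augmented views, can be sketched as follows. This is only the loss evaluation under toy probabilities; the gradient step on the prompt is omitted, and the helper names are hypothetical.

```python
import math

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)

def marginal_entropy(view_probs):
    """Average the per-view class probabilities, then take the entropy
    of that averaged distribution."""
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    avg = [sum(view[i] for view in view_probs) / n_views for i in range(n_classes)]
    return entropy(avg), avg

# Views that agree on class 0 versus views that each favor a different class.
agreeing = [[0.9, 0.05, 0.05], [0.85, 0.1, 0.05], [0.8, 0.1, 0.1]]
conflicting = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

h_agree, _ = marginal_entropy(agreeing)
h_conflict, avg_conflict = marginal_entropy(conflicting)
```

Because the entropy is taken after averaging, views that are individually confident but mutually inconsistent still yield a high loss, which is what pushes the prompt toward predictions that are stable across augmentations.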

2. Calibration and Overconfidence Challenges

A prominent challenge in vanilla TPT lies in systematic miscalibration introduced by entropy minimization. By design, TPT aggressively reduces predictive entropy, often producing maximum softmax probabilities that substantially overstate the empirical accuracy. Quantitatively, this manifests in elevated Expected Calibration Error (ECE): for instance, on CLIP ViT-B/16, the comparison table in Section 6 lists an ECE of 4.25% for zero-shot inference against 11.42% for TPT (Sharifdeen et al., 15 Mar 2025). Reliability diagrams confirm that this overconfidence is most pronounced for difficult or ambiguous samples, undermining trustworthiness in critical deployment domains.

This miscalibration cannot be mitigated with classical, label-reliant post-hoc techniques (e.g., temperature or Platt scaling), since TPT operates without access to validation labels at inference (Yoon et al., 2024).
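For reference, ECE as an evaluation metric can be sketched with equal-width confidence bins. The bin count and the toy inputs are assumptions for illustration; note that the computation needs ground-truth correctness, which is why it serves for evaluation and not for label-free adaptation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the sample-weighted average gap between mean
    confidence and accuracy inside each confidence bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident toy model: 95% confidence on every sample, only half correct.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [1, 1, 0, 0]
ece = expected_calibration_error(confidences, correct)
```

In this toy case all samples fall into one bin with mean confidence 0.95 and accuracy 0.5, so the ECE equals the gap of 0.45.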

3. Geometric Regularization Approaches for Calibration

Recent research targets the geometric configuration of class-conditioned text features to counteract overconfidence. The premise: improved dispersion or angular separation between class prototypes makes the model less susceptible to logit crowding and calibration error.

3.1. Text Feature Dispersion (C-TPT)

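C-TPT maximizes the average $L_2$ distance of the class text features from their centroid (Average Text Feature Dispersion, ATFD),

$$\mathrm{ATFD} = \frac{1}{K}\sum_{i=1}^{K} \lVert e_{c_i} - \bar{e} \rVert_2, \qquad \bar{e} = \frac{1}{K}\sum_{i=1}^{K} e_{c_i},$$

and regularizes prompt updates by adding the negative ATFD, weighted by $\lambda$, to the TPT entropy objective $\mathcal{L}_{\text{TPT}}$ of Section 1:

$$\mathcal{L}_{\text{C-TPT}} = \mathcal{L}_{\text{TPT}} - \lambda\,\mathrm{ATFD},$$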

yielding substantial ECE reduction on ViT-B/16 (4.97% versus 11.42% for vanilla TPT in the table of Section 6) with accuracy largely preserved (Yoon et al., 2024).
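A minimal sketch of the dispersion term, assuming ATFD is measured as the mean $L_2$ distance of each class text feature from the centroid of all class text features. The weight `lam` and the toy feature sets are hypothetical.

```python
import math

def atfd(text_feats):
    """Mean L2 distance of the class text features from their centroid."""
    k = len(text_feats)
    dim = len(text_feats[0])
    centroid = [sum(f[j] for f in text_feats) / k for j in range(dim)]
    return sum(math.dist(f, centroid) for f in text_feats) / k

def dispersion_regularized_loss(tpt_loss, text_feats, lam=1.0):
    """Entropy loss minus weighted dispersion: minimizing this rewards
    prompts whose class text features are spread out.
    `lam` is an illustrative assumption."""
    return tpt_loss - lam * atfd(text_feats)

spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
crowded = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.99, 0.0, 0.1]]

atfd_spread = atfd(spread)
atfd_crowded = atfd(crowded)
```

For the same entropy loss, the spread-out configuration receives the lower total loss, so the regularizer favors it.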

3.2. Orthogonality and Angular Diversity Constraints

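O-TPT enforces explicit orthogonality among class text embeddings, imposing

$$\mathcal{L}_{\text{orth}} = \lVert E E^{\top} - I_K \rVert_F^2,$$

where $E$ is the normalized $K \times D$ text feature matrix and $I_K$ is the $K \times K$ identity. The full TPT objective is then

$$\mathcal{L} = \mathcal{L}_{\text{TPT}} + \lambda\,\lVert E E^{\top} - I_K \rVert_F^2.$$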

Strongly reducing pairwise cosine similarities decreases ECE (ViT-B/16: 4.78% versus 4.97% for C-TPT in the table of Section 6). O-TPT is reported to outperform prior regularizers, and can be further combined with orthogonality-preserving transforms such as Householder decomposition for incremental gains (Sharifdeen et al., 15 Mar 2025).
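The orthogonality constraint can be sketched as the squared Frobenius distance between the Gram matrix of the text features and the identity. This assumes the rows are already unit-normalized; the function name is hypothetical.

```python
def orthogonality_penalty(text_feats):
    """Squared Frobenius norm of (E E^T - I) for row-wise unit-normalized
    class text features E, given as a list of rows."""
    k = len(text_feats)
    penalty = 0.0
    for i in range(k):
        for j in range(k):
            gram_ij = sum(a * b for a, b in zip(text_feats[i], text_feats[j]))
            target = 1.0 if i == j else 0.0
            penalty += (gram_ij - target) ** 2
    return penalty

orthonormal = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # two classes sharing one direction

penalty_orthonormal = orthogonality_penalty(orthonormal)
penalty_collapsed = orthogonality_penalty(collapsed)
```

With unit-norm rows the diagonal terms vanish, so the penalty reduces to the sum of squared pairwise cosine similarities, which is zero exactly when all class features are mutually orthogonal.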

Angular diversity constraints, as in A-TPT, maximize the minimum inter-class angular separation, $\max_{\theta_p}\ \min_{i \neq j}\ \arccos\big(\langle e_{c_i}, e_{c_j} \rangle\big)$, which encourages a uniform feature spread on the hypersphere and is reported to lower calibration error further (Ahamed et al., 30 Oct 2025).
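The angular-diversity objective can be illustrated by computing the smallest pairwise angle among unit-norm class text features. This is a sketch of the quantity to be maximized, not of the optimization itself, and the helper name is hypothetical.

```python
import math

def min_angular_separation(text_feats):
    """Smallest pairwise angle (radians) between unit-norm class features."""
    smallest = math.pi
    k = len(text_feats)
    for i in range(k):
        for j in range(i + 1, k):
            cos_ij = sum(a * b for a, b in zip(text_feats[i], text_feats[j]))
            cos_ij = max(-1.0, min(1.0, cos_ij))  # guard against rounding drift
            smallest = min(smallest, math.acos(cos_ij))
    return smallest

angle = math.radians(10)
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
crowded = [[1.0, 0.0], [math.cos(angle), math.sin(angle)]]

sep_orthogonal = min_angular_separation(orthogonal)
sep_crowded = min_angular_separation(crowded)
```

Only the closest pair determines the objective, so a single pair of nearly collinear class features dominates the value regardless of how well separated the remaining classes are.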

3.3. Semantic Orthogonal Calibration (SoC)

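SoC introduces a Huber-style pairwise regularizer, capping the repulsive force between class prototypes to preserve semantic proximity: a penalty of the form $\mathcal{L}_{\text{SoC}} = \sum_{i \neq j} h_\delta(s_{ij})$, where $s_{ij} = \langle e_{c_i}, e_{c_j} \rangle$ and $h_\delta$ is a Huber function, quadratic for $|s_{ij}| \le \delta$ and linear beyond. This mitigates the over-repulsion problem of O-TPT and is reported to yield best-in-class calibration with no loss of discriminative performance (Fillioux et al., 13 Jan 2026).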
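How a Huber-style penalty caps the repulsive force can be seen in a minimal sketch that applies the standard Huber function to pairwise cosine similarities. The exact SoC formulation may differ; the threshold `delta` and the function names are assumptions for illustration.

```python
def huber(x, delta):
    """Standard Huber function: quadratic near zero, linear in the tails,
    so its slope never exceeds `delta` in magnitude."""
    a = abs(x)
    return 0.5 * x * x if a <= delta else delta * (a - 0.5 * delta)

def huber_pairwise_penalty(text_feats, delta=0.3):
    """Sum of Huber penalties over pairwise cosine similarities of
    unit-norm class features. `delta` is an illustrative assumption."""
    total = 0.0
    k = len(text_feats)
    for i in range(k):
        for j in range(i + 1, k):
            cos_ij = sum(a * b for a, b in zip(text_feats[i], text_feats[j]))
            total += huber(cos_ij, delta)
    return total

# Slope in the linear regime stays at delta, however similar two prototypes are.
slope_far = (huber(1.0, 0.3) - huber(0.9, 0.3)) / 0.1
# A purely quadratic penalty would have slope close to the similarity itself.
slope_quadratic = (0.5 * 1.0 ** 2 - 0.5 * 0.9 ** 2) / 0.1
```

The bounded slope means that semantically close classes, which have high similarity, are pushed apart with a fixed rather than growing force, in contrast to a squared penalty.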

4. Robust and Efficient TPT Extensions

Addressing adversarial robustness, TPT is reformulated to utilize pointwise rather than marginal (batch-averaged) entropy minimization. R-TPT demonstrates that in the presence of adversarial samples, KL consistency regularization across views is counterproductive, and instead advocates optimizing only per-view entropy. This is coupled with a reliability-based ensembling strategy, where each view is scored by local feature density (cosine-similarity-based), and predictions are aggregated via a reliability-weighted ensemble to downweight corrupted or outlier augmentations (Sheng et al., 15 Apr 2025).
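The reliability-weighted ensembling idea can be sketched as follows, assuming each view's reliability is its mean cosine similarity to the other views and that weights come from a softmax over these scores. The scoring rule, the temperature `tau`, and the toy inputs are assumptions, not the exact R-TPT procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def reliability_weights(view_feats, tau=0.1):
    """Score each view by its mean cosine similarity to the other views
    (a simple density proxy), then convert scores to softmax weights."""
    n = len(view_feats)
    density = []
    for i in range(n):
        sims = [cosine(view_feats[i], view_feats[j]) for j in range(n) if j != i]
        density.append(sum(sims) / len(sims))
    m = max(density)
    exps = [math.exp((d - m) / tau) for d in density]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ensemble(view_probs, weights):
    """Reliability-weighted average of the per-view class probabilities."""
    n_classes = len(view_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, view_probs)) for i in range(n_classes)]

# Two consistent views and one outlier view (e.g., a corrupted augmentation).
view_feats = [[1.0, 0.0], [0.98, 0.2], [-0.2, 1.0]]
view_probs = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]

weights = reliability_weights(view_feats)
ensemble = weighted_ensemble(view_probs, weights)
```

The outlier view lies in a sparse region of feature space, so it receives a small weight and its dissenting prediction barely affects the ensemble.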

Further, SS-TPT introduces per-view Stability and Suitability (prediction invariance under weak augmentations and feature-space density) to guide both adaptation and inference through a softmax-weighted selection and consistency loss. SS-TPT achieves superior robustness-throughput trade-offs, enabling the use of very few views at minimal accuracy loss (Kim et al., 5 Jun 2026).

To address computational efficiency, Self-TPT reduces adaptation cost by shifting to class-level prompt adaptation with a contrastive prompt learning loss; Test-Time Loss Landscape Adaptation (TLLA) avoids any prompt parameter updates at inference by selecting test augmentations whose loss landscape aligns with that of the tuned prompt "flat minimum" (Zhu et al., 2024, Li et al., 31 Jan 2025).

5. Algorithmic and Implementation Considerations

Key steps in TPT pipelines include:

Prompt initialization: TPT is highly sensitive to prompt initialization; stronger calibration is achieved with attribute-aware initializations (e.g., via LLM-derived visual attributes) or with flatness-aware prompt pretraining (Hebbalaguppe et al., 28 Jun 2025, Jang et al., 30 Apr 2026). Flatness-aware pretraining optimizes text prompts toward flat loss regions before adaptation, which makes regularization more effective once integrated into any TPT pipeline.
Augmentation and view selection: Augmentation diversity (e.g., via diffusion-based or dynamically parameterized augmentations), entropy- or anchor-guided filtering, and softmax-based selection are central for both calibration and robustness (Feng et al., 2023, Lei et al., 13 Dec 2025, Choi et al., 14 Apr 2026).
Optimization loop: In practice, a single AdamW update on selected views and, if used, respective regularization terms suffices for robust performance.
Composability: Regularizers can often be "plugged in" to vanilla TPT, DynaPrompt-style online prompt buffers, or open-set prompt-fusion frameworks with minimal adaptation (Xiao et al., 27 Jan 2025, Gao et al., 2024).
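The entropy-guided filtering mentioned in the list above can be sketched as keeping only the lowest-entropy views before the prompt update. The `keep_fraction` parameter and the helper names are hypothetical.

```python
import math

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)

def select_confident_views(view_probs, keep_fraction=0.5):
    """Return indices of the lowest-entropy views, most confident first.
    `keep_fraction` is an illustrative assumption; at least one view is kept."""
    ranked = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    n_keep = max(1, int(len(view_probs) * keep_fraction))
    return ranked[:n_keep]

view_probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [0.6, 0.4]]
kept = select_confident_views(view_probs, keep_fraction=0.5)
```

Only the retained views would then enter the averaged-entropy objective and any regularization terms, so uninformative crops do not dilute the gradient.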

6. Experimental Evidence and Comparative Evaluation

Across extensive ImageNet, fine-grained, and OOD benchmarks, geometric TPT regularizers consistently lower ECE relative to vanilla TPT, by more than half in the table below (from 11.42% to roughly 4 to 5%), at comparable accuracy. For example:

Method	Acc. (ViT-B/16)	ECE (%)
Zero-shot	63.84	4.25
TPT	65.09	11.42
C-TPT	64.46	4.97
O-TPT	63.98	4.78
D-TPT	64.72	4.18
FPP-TPT	65.37	4.13
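The relative ECE reductions implied by the table can be worked out directly. The sketch below only re-derives ratios from the ECE column as listed; it introduces no new measurements.

```python
# ECE values (%) copied from the table above.
ece = {"TPT": 11.42, "C-TPT": 4.97, "O-TPT": 4.78, "D-TPT": 4.18, "FPP-TPT": 4.13}

def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

reductions = {
    name: round(relative_reduction(ece["TPT"], value), 1)
    for name, value in ece.items()
    if name != "TPT"
}
```

Every regularized variant in the table cuts ECE by more than half relative to vanilla TPT, with the listed methods falling within a few percentage points of one another.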

O-TPT and SoC dominate in ECE reduction, while maintaining high accuracy under severe OOD shift and adversarial scenarios (Sharifdeen et al., 15 Mar 2025, Fillioux et al., 13 Jan 2026, Han et al., 10 Oct 2025, Jang et al., 30 Apr 2026).

O-TPT also shows competitive results with robust and open-set extensions. For example, when the orthogonality regularizer is added to established prompt-learning frameworks (CoOp, MaPLe), ECE is reported to drop substantially (Sharifdeen et al., 15 Mar 2025).

7. Limitations and Future Research Directions

TPT and its extensions are subject to several limitations:

Over-regularization: Excessive prototype repulsion (e.g. in O-TPT) can destroy meaningful semantic proximity, and Huber-style semantically-aware constraints (SoC) or angular diversity objectives (A-TPT) are necessary to balance class separation and calibration.
Single-step adaptation: Most studies employ a one-step update at inference; multi-step or adaptive-regularization schedules may enhance adaptation in heterogeneous or rapidly drifting domains (Sharifdeen et al., 15 Mar 2025).
Memory and compute: While regularization-based TPT is lightweight, online-buffered (DynaPrompt) or knowledge-bank TPT (HisTPT) introduce memory costs; loss-landscape or self-supervised approaches mitigate gradient computation at test time (Xiao et al., 27 Jan 2025, Zhang et al., 2024).
Extensibility: Multi-modal and dense-prediction TPT, as well as extensions to vision-language tasks (VQA, object detection, segmentation), remain open areas (Yan et al., 1 Feb 2025, Zhang et al., 2024).

Directions under current exploration include adaptive or data-informed regularization strengths, integration with Bayesian/post-hoc calibration, exploration of structured prompt spaces, and theoretical convergence analyses for dynamic prompt-optimization objectives.

In summary, Test-Time Prompt Tuning represents a rapidly evolving paradigm for unsupervised adaptation of VLMs, with geometric, regularization-based extensions (orthogonality, angular diversity, semantic-aware calibration) providing systematic improvements in reliability, calibration, and robustness across benchmarks (Sharifdeen et al., 15 Mar 2025, Fillioux et al., 13 Jan 2026, Ahamed et al., 30 Oct 2025, Han et al., 10 Oct 2025, Yoon et al., 2024). The emerging consensus suggests that prompt geometry—specifically, controlled dispersion and angular separation—is essential to realizing the practical potential of adaptive, label-free test-time vision-language systems.