Prompt-based Tuning: Modular Adaptation

Updated 20 May 2026
  • Prompt-based tuning is a parameter-efficient adaptation paradigm where small trainable prompt vectors steer a frozen model for new tasks.
  • The method employs modular prompt design, allowing decentralized and privacy-preserving training with flexible prompt composition at inference.
  • Empirical results show that À-la-carte Prompt Tuning (APT) achieves near-monolithic performance with minimal overhead, supporting continual and incremental learning.

Prompt-based tuning is a parameter-efficient adaptation paradigm in which a (typically large, pretrained, and frozen) model is steered towards new downstream tasks by prepending one or more small, trainable prompt vectors, rather than updating the entire model’s parameters. Prompts may be learned individually for specific data sources and then composed at inference, enabling modularity, continual learning, privacy-preserving customization, and efficient “à-la-carte” construction of tailored models. In APT (À-la-carte Prompt Tuning), each prompt is trained in isolation on its respective data and can be flexibly combined post hoc without retraining, achieving performance within a few percent of monolithic fine-tuning and competitive results in continual learning (Bowman et al., 2023).

1. Mathematical and Architectural Foundations

Let $f_0:\mathbb{R}^{H\times W\times C}\rightarrow\mathbb{R}^d$ denote a frozen transformer backbone (e.g., ViT-B/16) of $L$ layers. For an image $x$, tokenization yields patch embeddings $z^{(1)},\dots,z^{(N)}$ and a class token $z^{(0)}$: $z_0 = [z^{(0)}, z^{(1)},\dots,z^{(N)}] \in \mathbb{R}^{(N+1)\times d}$.

Each data source or task $D_i$ is associated with a “soft” prompt $p^{(i)}$:

  • Shallow prompt: Learnable vector prepended at input.
  • Deep prompt: Auxiliary tokens $m_\ell^{(i)}\in \mathbb{R}^{d_{mem}\times d}$ at each layer $\ell$ of the $L$-layer backbone.

At each transformer block, a structured attention mask ensures that standard tokens do not interact with prompt tokens from other sources, and prompt tokens only attend to specified context within their scope.
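The masking rule above can be illustrated with a small sketch. It assumes one specific reading of the text: standard (class and patch) tokens attend only to standard tokens, while each prompt's tokens attend to the standard tokens and to their own prompt. The function name `build_attention_mask` and the token layout (standard tokens first, then each prompt's tokens in order) are hypothetical choices for illustration, not taken from the paper.

```python
from typing import List


def build_attention_mask(n_standard: int, prompt_sizes: List[int]) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout: n_standard class/patch tokens first, followed by the
    tokens of each prompt in order.
    """
    # owner[t] is -1 for a standard token, otherwise the index of its prompt.
    owner = [-1] * n_standard
    for prompt_idx, size in enumerate(prompt_sizes):
        owner.extend([prompt_idx] * size)

    total = len(owner)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if owner[k] == -1:
                # Every token may read the standard tokens.
                mask[q][k] = True
            else:
                # A prompt token is visible only to tokens of the same prompt,
                # so standard tokens and other prompts never read it.
                mask[q][k] = owner[q] == owner[k]
    return mask


mask = build_attention_mask(n_standard=3, prompt_sizes=[2, 2])
```

Under this reading, the rows for standard tokens are the same no matter which prompts are attached, which is what lets prompts be added or removed without affecting one another.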

The output at each prompt’s tokens is passed to a dedicated linear head, producing logits for classification. Training optimizes the cross-entropy loss on $D_i$, updating only the prompt $p^{(i)}$ and its head; backbone weights remain frozen.

2. Modular Training and Decentralization

APT supports fully decentralized training:

  • Each prompt is optimized in isolation on its data source, potentially on different hardware, timeframes, or organizational domains.
  • No synchrony or overlap is required; data privacy is preserved as each prompt “remembers” only the information it was trained on.

This paradigm is especially suited for federated learning, regulatory scenarios where data cannot be pooled, or settings necessitating effective data deletion (privacy by design) (Bowman et al., 2023).

3. Composition at Inference and À-la-carte Learning

At inference, users specify a set $I$ of source indices. The corresponding prompts $p^{(i)}$, $i \in I$, are concatenated; the backbone processes the input tokens plus all selected prompts in one pass. At each layer, the attention pattern ensures that prompts from different sources remain independent. The per-prompt outputs $\hat{y}^{(i)}$ are combined, typically by averaging their logits:

$$\hat{y} = \frac{1}{|I|}\sum_{i\in I}\hat{y}^{(i)}$$
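As a minimal sketch of the logit averaging described above, assuming each selected prompt has already produced a logit vector over the same classes, the combination step might look like this. The function name `average_logits` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Sequence


def average_logits(
    logits_by_source: Dict[int, Sequence[float]],
    selected: Sequence[int],
) -> List[float]:
    """Average the logit vectors of the selected sources, class by class."""
    if not selected:
        raise ValueError("at least one source must be selected")
    n_classes = len(logits_by_source[selected[0]])
    totals = [0.0] * n_classes
    for source in selected:
        for c, value in enumerate(logits_by_source[source]):
            totals[c] += value
    return [t / len(selected) for t in totals]


logits = {0: [2.0, 0.0], 1: [0.0, 4.0], 2: [10.0, 10.0]}
combined = average_logits(logits, selected=[0, 1])  # source 2 is left out
```

Leaving a source out of `selected` removes its contribution entirely, which mirrors the à-la-carte selection of prompts at inference.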

APT-Weight (APT-W) further re-weights each prompt’s contribution by the proximity, in embedding space, of the input to K-means prototypes of that prompt’s source distribution. This is especially useful for class-incremental or domain-incremental learning.
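The exact weighting formula is not recoverable from the text here, so the following is only an illustrative sketch of proximity-based re-weighting. It assumes, without support from the source, that each prompt is scored by the Euclidean distance from the input embedding to its nearest prototype and that the scores are normalized with a softmax over negative distances. The function names and the `temperature` parameter are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def nearest_prototype_distance(
    embedding: Sequence[float], prototypes: Sequence[Sequence[float]]
) -> float:
    """Euclidean distance from the embedding to its closest prototype."""
    return min(math.dist(embedding, p) for p in prototypes)


def proximity_weights(
    embedding: Sequence[float],
    prototypes_by_source: Dict[int, Sequence[Sequence[float]]],
    temperature: float = 1.0,
) -> Dict[int, float]:
    """ASSUMPTION: softmax over negative nearest-prototype distances."""
    dists = {
        i: nearest_prototype_distance(embedding, protos)
        for i, protos in prototypes_by_source.items()
    }
    closest = min(dists.values())  # subtracted for numerical stability
    unnorm = {i: math.exp(-(d - closest) / temperature) for i, d in dists.items()}
    total = sum(unnorm.values())
    return {i: u / total for i, u in unnorm.items()}


def weighted_logits(
    logits_by_source: Dict[int, Sequence[float]], weights: Dict[int, float]
) -> List[float]:
    """Combine per-source logits using the given weights."""
    n_classes = len(next(iter(logits_by_source.values())))
    return [
        sum(weights[i] * logits_by_source[i][c] for i in weights)
        for c in range(n_classes)
    ]
```

With uniform weights this reduces to plain logit averaging; sources whose prototypes lie far from the input contribute less.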

APT's composability lets a model be customized to user access rights or data-license constraints by including or excluding prompts at runtime, without retraining (Bowman et al., 2023).

4. Continual and Incremental Learning

Prompt modularity naturally supports continual learning. As new data sources or tasks arrive, new prompts are trained on these increments and appended to the prompt set. Each prompt is memory- and compute-light. Forgetting is effected by simply removing a prompt: there is no residual data in the backbone or other prompts.
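The add-and-remove workflow can be sketched as a simple registry. This is a bookkeeping illustration only: `PromptRegistry`, `PromptEntry`, and their methods are hypothetical names, and the prompt and head are stored as plain nested lists rather than trained tensors.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class PromptEntry:
    prompt: List[List[float]]  # prompt tokens for one data source
    head: List[List[float]]    # weights of that source's linear head


class PromptRegistry:
    """Keeps one independently trained prompt per data source."""

    def __init__(self) -> None:
        self._entries: Dict[str, PromptEntry] = {}

    def add(self, source_id: str, entry: PromptEntry) -> None:
        """Register a prompt trained on a newly arrived source."""
        if source_id in self._entries:
            raise KeyError(f"source {source_id!r} already registered")
        self._entries[source_id] = entry

    def forget(self, source_id: str) -> None:
        """Forget a source by deleting its prompt; nothing else changes."""
        del self._entries[source_id]

    def select(self, allowed: Iterable[str]) -> List[str]:
        """Return registered sources that the caller is allowed to use."""
        allowed_set = set(allowed)
        return [s for s in self._entries if s in allowed_set]


registry = PromptRegistry()
registry.add("task_1", PromptEntry(prompt=[[0.0]], head=[[0.0]]))
registry.add("task_2", PromptEntry(prompt=[[0.0]], head=[[0.0]]))
registry.forget("task_1")  # task_2's prompt is untouched
```

Because each entry is self-contained, deleting one leaves the backbone and every other prompt exactly as they were.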

On continual learning benchmarks such as Split CIFAR-100 and CORe50, APT attains state-of-the-art performance among methods that do not use replay buffers or exemplar storage. For instance, APT achieves 83.63% on Split CIFAR-100 and 90.89% on CORe50, with APT-W improving further to 85.21% and 91.14%, respectively (Bowman et al., 2023).

5. Empirical Results and Computational Efficiency

APT achieves accuracy within 5% of a monolithic model trained on the union of all data sources, at similar training and inference cost. Even with a large number of data shards (e.g., 20), error increases by less than 5% relative to the paragon (jointly trained) model. On out-of-domain tasks, prompt tuning outperforms head-only ensembling and approaches full fine-tuning, indicating that prompts encode richer features than classification heads alone (Bowman et al., 2023).

Empirical evaluations demonstrate:

  • Each prompt adds only ≈0.06% of backbone parameters. For 20 prompts: ≈1% storage overhead.
  • Inference with 20 prompts increases latency by less than 5% compared to no prompts; traditional ensembling would require 20 full forward passes.
  • Forgetting is robust: sequential removal of up to half the prompts reduces accuracy by only ≈5%.
  • When individual prompts are trained on small or highly heterogeneous data splits and score below 50% alone, their composition via APT can achieve >80% accuracy.
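The overhead figures in the list above follow from simple arithmetic, sketched here. The 0.06% per-prompt figure and the count of 20 come from the text; the function names are hypothetical.

```python
def storage_overhead_percent(n_prompts: int, per_prompt_percent: float = 0.06) -> float:
    """Total extra storage, as a percentage of backbone parameters."""
    return n_prompts * per_prompt_percent


def forward_passes(n_sources: int, composed: bool) -> int:
    """Backbone passes per input: one when prompts are composed in a single
    pass, one per source for a traditional ensemble of separate models."""
    return 1 if composed else n_sources


overhead = storage_overhead_percent(20)            # about 1.2, i.e. roughly 1%
apt_passes = forward_passes(20, composed=True)
ensemble_passes = forward_passes(20, composed=False)
```

Twenty prompts at about 0.06% each come to about 1.2% of the backbone, in line with the ≈1% storage figure, while inference stays at a single backbone pass instead of twenty.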

6. Implementation Details and Hyperparameters

APT uses a ViT-B/16 backbone, with input resolution 384, patch size 16, and deep prompting (prompt tokens at each layer). Each prompt is trained for 80 epochs (paragon for 150) with AdamW, weight decay 0.02, cosine decay schedule, and batch size 8. Standard augmentations and RandAugment are applied (Bowman et al., 2023).

Each prompt requires storing only its tokens (roughly 61 × 768 ≈ 47k parameters). Training can be distributed arbitrarily.
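The sizes quoted in this section can be checked with a short calculation. The resolution (384), patch size (16), token count (about 61), and width (768) come from the text; the function names are hypothetical.

```python
def num_input_tokens(resolution: int, patch_size: int) -> int:
    """Patch tokens for a square image, plus one class token."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side + 1


def prompt_parameters(n_prompt_tokens: int, width: int) -> int:
    """Parameters needed to store one prompt's tokens."""
    return n_prompt_tokens * width


tokens = num_input_tokens(resolution=384, patch_size=16)    # 24 * 24 + 1
params = prompt_parameters(n_prompt_tokens=61, width=768)   # 61 * 768
```

At this resolution the backbone sees 577 input tokens per image, and one prompt of 61 tokens occupies 46,848 parameters.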

7. Applications and Implications

APT supports privacy-preserving model construction, tenant- or license-specific customization, efficient data removal, and continual task or domain expansion. It is tailored for deployment contexts where retraining backbone models is expensive or prohibited due to regulatory or privacy constraints.

A plausible implication is that the APT paradigm and related modular prompt-based approaches can generalize to multi-modal, federated, and hierarchical tasks, provided an underlying structured attention mechanism and prompt isolation are maintained.


References:

  • "À-la-carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting" (Bowman et al., 2023)