Stacking Prompt: A Modular A-la-carte Approach
- Stacking Prompt is a modular prompt-based framework that assembles independent, task-specific modules while staying within about 5% of the accuracy of joint training.
- It employs structured attention with isolated prompt and memory tokens, ensuring each prompt module only accesses its own features to prevent cross-talk.
- The approach enables efficient extension and authorized data aggregation without retraining the backbone, supporting continual and flexible deployment.
A-la-carte Prompt Tuning (APT) is a compositional prompt-based learning framework for vision transformers that enables the modular assembly of bespoke models via “prompt stacking.” In contrast to monolithic fine-tuning or naïve prompt concatenation, APT constructs models from a collection of independently trained prompt modules—each adapted to a distinct data source—so that subsets can be flexibly combined at inference depending on user access or deployment needs. This “a-la-carte learning” paradigm facilitates authorized data aggregation, seamless extension, and efficient large-scale deployment, attaining accuracy within 5% of union-trained models and state-of-the-art results on class- and domain-incremental learning benchmarks (Bowman et al., 2023).
1. Modular Architecture: Backbone and Prompt Modules
APT uses a fixed transformer backbone, typically ViT-B/16 pre-trained on large-scale data such as ImageNet21k. Each image is patch-embedded, yielding tokens $z^0 = [z^0_{\text{cls}}, z^0_1, \dots, z^0_N]$, where $z^0_{\text{cls}}$ is the class token and $z^0_1, \dots, z^0_N$ are patch tokens.
For each data source $D_i$, APT learns:
- A prompt token sequence $p^{(i)}$, prepended to the input.
- A per-prompt classifier head $\mathrm{head}_i$.
A structured-attention mask is applied throughout the ViT backbone:
- Prompts do not attend to each other (no cross-talk).
- Each prompt attends only to backbone tokens and its own small set of “memory” tokens per layer (representing <0.06% of the backbone parameters).
This modularization ensures isolation between prompt modules and supports independent training and arbitrary recombination.
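The masking rules above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name `build_structured_mask`, the token ordering, the choice to let each prompt attend to itself, and the choice to keep backbone tokens from attending to prompts are all assumptions made for the example.

```python
def build_structured_mask(num_backbone, num_prompts, mem_per_prompt):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token order (hypothetical): backbone tokens first, then one
    prompt token per source, then each source's memory tokens.
    """
    total = num_backbone + num_prompts + num_prompts * mem_per_prompt
    mask = [[False] * total for _ in range(total)]

    # Assumption: backbone tokens (class + patches) attend only to each
    # other, so their features do not depend on which prompts are stacked.
    for q in range(num_backbone):
        for k in range(num_backbone):
            mask[q][k] = True

    for i in range(num_prompts):
        q = num_backbone + i
        # Prompt i reads every backbone token ...
        for k in range(num_backbone):
            mask[q][k] = True
        # ... itself (an assumption of this sketch) ...
        mask[q][q] = True
        # ... and only its own memory tokens, never another prompt's.
        mem_start = num_backbone + num_prompts + i * mem_per_prompt
        for k in range(mem_start, mem_start + mem_per_prompt):
            mask[q][k] = True

    # Memory tokens act as keys only here, so their query rows stay False.
    return mask


mask = build_structured_mask(num_backbone=3, num_prompts=2, mem_per_prompt=2)
```

Because no row of one prompt touches another prompt's column or memory block, removing a prompt from the stack leaves every remaining row unchanged, which is the property that makes recombination possible.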
2. Mathematical Formalism: Stacking Prompts
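Given a user’s authorized subset of sources $I \subseteq \{1, \dots, n\}$, APT stacks their initial prompts $p^{(i)}$ by simple concatenation: $p^{(I)} = [p^{(i)}]_{i \in I}$. The full sequence $[z^0, p^{(I)}]$ is propagated through the backbone $f_\theta$ and structured-attention layers,
$$[z^L, p_L^{(I)}] = f_\theta([z^0, p^{(I)}]).$$
At output, each prompt’s final state $p_L^{(i)}$ (for $i \in I$) is scored through its own head: $\hat{y}^{(i)} = \mathrm{head}_i(p_L^{(i)})$. Inference is performed by ensembling per-prompt predictions, usually via unweighted average,
$$\hat{y}_I = \frac{1}{|I|} \sum_{i \in I} \hat{y}^{(i)},$$
or via APT-Weight (APT-W), a data-dependent softmax gating: $\hat{y}_I = \sum_{i \in I} w_i(x)\, \hat{y}^{(i)}$ with $w_i(x) = \mathrm{softmax}_{i \in I}(-d_i(x))$, where $d_i(x) = \min_k \lVert z^L_{\text{cls}}(x) - c^{(i)}_k \rVert$ and $c^{(i)}_k$ are cluster centroids for prompt $i$.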
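The two aggregation rules can be illustrated with a small standard-library sketch. It assumes every head outputs probabilities over one shared label space, and it assumes the APT-W gate scores each prompt by the negative distance from the input embedding to that prompt's nearest centroid; the exact score used by the authors is not recoverable from this text, and all function names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(per_prompt_probs, weights=None):
    """Combine per-prompt class probabilities; uniform average by default."""
    n = len(per_prompt_probs)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(per_prompt_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_prompt_probs))
        for c in range(num_classes)
    ]


def gating_weights(embedding, centroids_per_prompt):
    """Assumed gate: softmax over negative nearest-centroid distances."""
    distances = [
        min(math.dist(embedding, c) for c in centroids)
        for centroids in centroids_per_prompt
    ]
    return softmax([-d for d in distances])


# Two prompts, two classes; the input sits on prompt 0's centroid.
probs = [[0.9, 0.1], [0.2, 0.8]]
uniform = ensemble(probs)
weights = gating_weights([0.0, 0.0], [[[0.0, 0.0]], [[3.0, 4.0]]])
gated = ensemble(probs, weights)
```

With these toy inputs the gate shifts almost all of the weight to the first prompt, so `gated` stays close to that prompt's own prediction while `uniform` splits the difference.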
3. Training Protocol and Isolation
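Each prompt module $(p^{(i)}, \mathrm{head}_i)$ is trained in isolation on its respective dataset $D_i$ by minimizing
$$\mathcal{L}_i = \frac{1}{|D_i|} \sum_{(x, y) \in D_i} \ell\big(\mathrm{head}_i(p_L^{(i)}(x)),\, y\big)$$
with the backbone $f_\theta$ frozen, where $\ell$ is the classification loss. Structured attention ensures that each prompt is exposed only to its corresponding memory tokens and backbone features. Training is parallelizable and supports asynchronous updates on disjoint data and compute resources.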
By design, stacking these prompts (along with their corresponding heads) at inference does not introduce destructive interference. Each prompt’s functional mapping, together with its learned head, remains compartmentalized, so the ensembling of outputs reconstructs the performance of a jointly trained prompt with minimal degradation (empirically within 5% of “paragon” joint training).
4. Computational and Storage Efficiency
APT achieves favorable scaling for both computational cost and memory:
- Training: One prompt-training pass per data shard. Total work $O(\sum_i |D_i|)$; parallelizable and identical in aggregate to monolithic joint training if datasets partition the union.
- Inference: APT-structured attention incurs cost $O(N^2 + |I| N)$ per forward pass, where $N$ is the number of backbone tokens and $|I|$ the number of stacked prompts, compared to $O(|I| N^2)$ for an $|I|$-ensemble of full backbones. Adding prompts increases cost linearly, not quadratically.
- Storage: Each prompt + memory block adds only a small number of parameters, typically <0.06% of the backbone per source; a full ensemble would require $|I|\times$ the backbone’s storage.
- A-la-carte property: Arbitrary addition or revocation of data-source prompts does not require retraining or model reassembly—a unique capability among prompt-based continual learners.
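The inference-cost comparison in the list above can be made concrete by counting query-key pairs per attention layer. The sketch below is a back-of-the-envelope model under stated assumptions: one prompt token per source, each prompt attending to the backbone tokens, itself, and its own memory tokens, and an ensemble in which every member runs a full backbone with a single prompt token appended. The function names are hypothetical. The 197-token figure is the standard ViT-B/16 sequence length at 224x224 resolution (196 patches plus the class token).

```python
def apt_attention_pairs(num_backbone, num_prompts, mem_per_prompt=0):
    """Query-key pairs per layer when prompts are stacked on one backbone."""
    backbone_pairs = num_backbone * num_backbone
    # Each prompt reads the backbone tokens, itself, and its own memory.
    per_prompt_pairs = num_backbone + 1 + mem_per_prompt
    return backbone_pairs + num_prompts * per_prompt_pairs


def ensemble_attention_pairs(num_backbone, num_models):
    """Query-key pairs per layer for an ensemble of separately run models."""
    # Assumption: each member carries one prompt token and uses full attention.
    return num_models * (num_backbone + 1) ** 2


tokens = 197  # ViT-B/16 at 224x224: 196 patches + 1 class token
for sources in (1, 5, 10, 20):
    stacked = apt_attention_pairs(tokens, sources)
    separate = ensemble_attention_pairs(tokens, sources)
    print(sources, stacked, separate, round(separate / stacked, 2))
```

Under this model the stacked cost is dominated by the single backbone pass, so each extra source adds only about one row of attention, while the ensemble cost grows by a whole backbone pass per source.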
5. Theoretical Motivation and Empirical Validation
APT is designed to avoid destructive prompt interference and maximize compositionality:
- Each prompt learns from only its data; the structured mask enforces independence.
- Memory tokens act as a restricted “side channel” to adapt backbone features, preventing mutual query contamination.
While no explicit error bounds are derived, empirical results show <5% accuracy loss versus paragon prompts trained on the data union. Experimental results include:
- Two-shard splits: APT matches or slightly exceeds joint (“paragon”) accuracy, outperforming naïve concatenation or average ensembling.
- Sharded splits (up to 20 random partitions): Average in-domain accuracy drop stays below 5%.
- Continual learning: APT-W achieves 85.21% on Split CIFAR-100 and 91.14% on CORe50, outperforming contemporaries like L2P on class-/domain-incremental benchmarks.
6. Practical Implications and Applications
A-la-carte learning enables user- and scenario-specific model construction:
- Policy-conscious deployment: Models can be constructed for users reflecting permissible data access, ensuring data isolation and privacy.
- Efficient extension: New data-source prompts can be added without global retraining.
- Resource flexibility: Distributed and asynchronous prompt training permits diverse compute and administrative environments.
Unlike conventional ensembling or re-training, a-la-carte models are assembled via simple prompt stacking and output aggregation, with negligible computational or engineering overhead compared to monolithic fine-tuning.
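The assembly step can be pictured as simple bookkeeping over independently trained modules. The sketch below is a hypothetical illustration, not an API from the paper: `PromptLibrary` and its methods are invented names, and the prompt and head objects are placeholder strings standing in for trained parameters.

```python
class PromptLibrary:
    """Hypothetical registry mapping a data-source id to its (prompt, head)."""

    def __init__(self):
        self._modules = {}

    def add(self, source_id, prompt, head):
        # Adding a source never touches modules that are already registered.
        self._modules[source_id] = (prompt, head)

    def revoke(self, source_id):
        # Revoking access is a deletion; nothing else is retrained.
        self._modules.pop(source_id, None)

    def assemble(self, authorized):
        """Return stacked prompts and heads for one user's authorized sources."""
        missing = [s for s in authorized if s not in self._modules]
        if missing:
            raise KeyError(f"no prompt registered for sources: {missing}")
        prompts = [self._modules[s][0] for s in authorized]
        heads = [self._modules[s][1] for s in authorized]
        return prompts, heads


library = PromptLibrary()
library.add("source_a", prompt="prompt_a", head="head_a")
library.add("source_b", prompt="prompt_b", head="head_b")
library.add("source_c", prompt="prompt_c", head="head_c")

# A user cleared for sources a and c gets a model built from just those two.
prompts, heads = library.assemble(["source_a", "source_c"])

# Revoking source a later only removes its entry from the library.
library.revoke("source_a")
```

In a real system the stacked prompts would be fed through the frozen backbone and the heads' outputs aggregated; the point of the sketch is only that composition and revocation reduce to list operations.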
7. Comparison to Traditional and Advanced Prompting
Compared to naïve prompt concatenation (which often degrades in multi-source settings due to prompt interference), or full backbone ensembling (computationally intractable at scale), APT occupies a unique design space. By leveraging prompt compartmentalization, structured attention, and linear head aggregation, it couples composability and scalability with high accuracy.
Continual learners benefit especially from the ability to “stack” prompts to represent newly authorized data or revoke access by prompt removal—yielding models that remain within 5% accuracy of union-trained monoliths at similar cost. This a-la-carte paradigm underpins state-of-the-art performance on several class- and domain-incremental learning tasks, and represents a practical, theoretically motivated, and highly modular approach to prompt-driven transfer and adaptation in transformer vision architectures (Bowman et al., 2023).