SCAMP: Spatio-Temporal Context-Aware Prompting
- SCAMP is a framework that integrates explicit spatio-temporal encoding, aspect-specific prompt banks, and inter-prompt coordination to reprogram frozen models for complex tasks.
- It achieves enhanced accuracy and efficiency in applications such as vision forecasting, urban attribute prediction, graph-based transfer, and traffic scene understanding.
- By training only low-cost prompt and adapter modules, SCAMP maintains high transferability while substantially reducing retraining costs compared to full fine-tuning.
The Spatio-Temporal Context-Aware Multi-aspect Prompt (SCAMP) is a family of mechanisms for integrating adaptive, context-aware prompts into large neural network models performing spatio-temporal reasoning. SCAMP methods enable frozen foundation model backbones to be reprogrammed for complex, multi-attribute, spatio-temporal tasks in vision, graph, and multimodal settings. By combining explicit contextual encoding, aspect-specific prompt banks, and inter-prompt coordination modules, SCAMP architectures achieve high empirical performance and parameter efficiency across forecasting, transfer, and scene understanding benchmarks. SCAMP is instantiated in diverse contexts, including the reprogramming of vision transformers (Chen et al., 14 Jul 2025), spatio-temporal attribute prediction (Zhang et al., 2023), prompt-based graph transfer (Hu et al., 21 May 2024), and vision-language traffic scene understanding (Ma et al., 12 Nov 2025).
1. Core Principles of SCAMP
SCAMP mechanisms share foundational principles:
- Spatio-temporal context encoding: Explicit representation of both spatial and temporal relationships, including map-based, graph-based, or sequence-based features.
- Multiaspect prompt banks: Separate sets of learnable prompt vectors correspond to distinct task aspects (e.g., traffic attributes, temporal flows, scene semantics).
- Prompt-based adaptation: Instead of fine-tuning full model parameter sets, SCAMP adapts small prompt banks, injecting context-aware signals into input or intermediate layers.
- Inter-prompt coordination: Attention-based modules aggregate and exchange information between prompts, supporting dynamic fusion of multimodal or multi-branch representations.
- Frozen backbones: Underlying foundation models (ViTs, vision-language encoders, transformers, GNNs) remain fixed, with adaptation isolated to prompt banks and auxiliary heads.
This design yields scalable, general-purpose frameworks for spatio-temporal machine learning, minimizing retraining cost while maximizing cross-task transferability and context sensitivity.
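Stripped to its essentials, the frozen-backbone-plus-prompt pattern amounts to prepending a small learnable prompt bank to the token sequence before a fixed model. The sketch below illustrates this with a single fixed self-attention layer standing in for the backbone; all names, shapes, and the one-layer "backbone" are illustrative, not any published SCAMP configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_prompts = 16, 8, 4

# Frozen "backbone": a single self-attention layer with fixed weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frozen_attention(tokens):
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Trainable prompt bank: the ONLY parameters a SCAMP-style method updates.
prompts = rng.standard_normal((n_prompts, d)) * 0.02

tokens = rng.standard_normal((n_tokens, d))       # patchified input tokens
augmented = np.concatenate([prompts, tokens], 0)  # prepend prompts
out = frozen_attention(augmented)
print(out.shape)  # (12, 16)
```

Gradients would flow only into `prompts` (and any adapter/head parameters), so adaptation cost scales with the prompt bank, not the backbone.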
2. Technical Implementations
2.1 Vision Transformer Reprogramming for Forecasting
In ST-VFM (Chen et al., 14 Jul 2025), SCAMP is realized as follows:
- Dual-branch architecture: Input is partitioned into a raw spatio-temporal data branch and an auxiliary flow branch produced by a frozen flow predictor.
- Temporal-Aware Token Adapter: A two-layer MLP projects each patchified input branch to a shared token dimension and augments it with positional and temporal embeddings, $Z = \mathrm{MLP}(X) + E_{\mathrm{pos}} + E_{\mathrm{time}}$, with an analogous formulation for the flow branch.
- Bilateral Cross-Prompt Coordination: In each ViT layer, both branches inject short prompt pools into the attention tokens. Cross-prompt attention modules update each pool via gated fusion, letting each branch's prompts absorb information from the other branch's prompts.
Only prompt and adapter parameters are trainable; the VFM backbone is frozen throughout.
Empirical ablation shows that removing the temporal adapters or the bilateral coordination increases RMSE by 0.15 and 0.10, respectively, and SCAMP-equipped ST-VFM surpasses UniST by 17–23% in RMSE/MAE on crowd forecasting.
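A minimal sketch of the bilateral update, assuming a simple sigmoid-gated residual fusion (the exact gating in ST-VFM may differ; pool names, shapes, and the gate parameterization below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 16, 4  # token dimension, prompts per pool

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attend(queries, keys_values):
    # Each prompt in `queries` attends over the other branch's prompt pool.
    scores = softmax(queries @ keys_values.T / np.sqrt(d))
    return scores @ keys_values

# Prompt pools for the raw-data branch and the flow branch (trainable).
P_raw  = rng.standard_normal((p, d)) * 0.02
P_flow = rng.standard_normal((p, d)) * 0.02
g_raw, g_flow = np.zeros(d), np.zeros(d)  # learnable fusion gates

# Bilateral update: each pool absorbs gated information from the other.
P_raw_new  = P_raw  + sigmoid(g_raw)  * cross_attend(P_raw,  P_flow)
P_flow_new = P_flow + sigmoid(g_flow) * cross_attend(P_flow, P_raw)
```

The residual form keeps each pool close to its own representation while the gate controls how much cross-branch signal is mixed in.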
2.2 Multi-Attribute Spatio-Temporal Transformer Prompting
PromptST (Zhang et al., 2023) formalizes SCAMP in transformer-based urban attribute prediction:
- Spatio-temporal backbone: The input is processed with alternating temporal and spatial self-attention, with additive position embeddings.
- Parameter-sharing regime: Model weights are shared across attributes; only output heads and prompt vectors are attribute-specific.
- Prompt Tuning: Attribute-specific prompt banks are prepended to input and fine-tuned while all other parameters are frozen.
- Multiaspect extension: For multiple context sources, aspect-specific prompt banks are concatenated, providing dynamic context integration; all prompt and input tokens attend jointly, supporting aspect interaction.
Under this regime, PromptST and SCAMP tuning incur a trainable-parameter cost of only 1–2% per attribute, outperforming full fine-tuning and exhibiting high transferability (e.g., prompt-tuned models match full fine-tuning performance on unseen attributes with only few-shot data).
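The parameter-sharing regime can be made concrete with a toy parameter count; the attribute names, prompt sizes, and backbone estimate below are invented for illustration and do not reproduce PromptST's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 6
n_prompts_per_attr = 8
attributes = ["inflow", "outflow", "demand"]  # hypothetical attributes

# Shared, frozen backbone parameter count (rough transformer estimate:
# QKV/output projections plus a 4x feed-forward block per layer).
backbone_params = n_layers * (4 * d * d + 2 * 4 * d * d)

# Attribute-specific, trainable pieces: prompt bank + small output head.
prompt_banks = {a: rng.standard_normal((n_prompts_per_attr, d)) * 0.02
                for a in attributes}
head_params = d  # one linear readout per attribute (illustrative)
trainable_per_attr = n_prompts_per_attr * d + head_params

fraction = trainable_per_attr / backbone_params
print(f"trainable fraction per attribute: {fraction:.2%}")
```

Because the backbone is shared and frozen, adding an attribute adds only a prompt bank and a head, so total cost grows slowly with the number of attributes.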
2.3 Graph-based Transfer via Spatio-Temporal Prompting
STGP (Hu et al., 21 May 2024) generalizes SCAMP to masked spatio-temporal graph learning:
- Unified masked-reconstruction template: Downstream tasks (forecasting, kriging, extrapolation) all require the reconstruction of masked node/time patch signals.
- Task-agnostic encoder-decoder: Parallel spatial (Graphormer) and temporal (TSFormer) transformers feed into a learnable gating module; the decoder applies GCN and TCN blocks fused by similar gating.
- Two-stage prompt mechanism:
- Domain Prompting: Spatial and temporal prompt banks are inserted at encoder input, tuned to align target domain data.
- Task Prompting: Masked and unmasked position prompt banks are inserted in decoder, tuned for specific tasks.
Ablations indicate that prompt sharing or prompt removal degrades performance by 5–8% (MAE) and up to 10.7% (RMSE), while SCAMP-style prompting requires roughly 6K trainable parameters, far fewer than conventional transfer learning.
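The unified masked-reconstruction template can be sketched as follows; the mean-imputation "decoder" is a deliberately trivial stand-in for STGP's gated GCN/TCN decoder, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_steps = 10, 24
signal = rng.standard_normal((n_nodes, n_steps))

# Unified template: forecasting, kriging, and extrapolation all become
# "reconstruct the masked node/time entries".
mask = np.zeros_like(signal, dtype=bool)
mask[:, -6:] = True     # forecasting: mask the last 6 time steps
mask[[2, 5], :] = True  # kriging: mask entire unobserved nodes

observed = np.where(mask, 0.0, signal)

# Trivial stand-in decoder: predict each masked entry as the mean of the
# observed values at that node (not STGP's actual decoder).
counts = np.maximum((~mask).sum(1, keepdims=True), 1)
node_mean = observed.sum(1, keepdims=True) / counts
recon = np.where(mask, node_mean, signal)

mae = np.abs(recon - signal)[mask].mean()  # scored on masked entries only
```

Casting every downstream task as masked reconstruction is what lets a single encoder-decoder, plus small task-specific prompt banks, cover forecasting, kriging, and extrapolation.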
2.4 Vision-Language Scene Understanding with Traffic Data
ST-CLIP (Ma et al., 12 Nov 2025) applies SCAMP for traffic scene understanding:
- Dynamic spatio-temporal context representation: GPS trajectories are map-matched and encoded as static and dynamic road segment properties, aggregated with a transformer into a spatio-temporal context vector.
- Bi-level multi-aspect prompt learning: For each aspect (Scene, Surface, Width, Accessibility), a bank of learnable prompt vectors is maintained and fused with the context vector.
- Patch-wise cross-modal attention: Prompt tokens attend to CLIP image patch features via multi-head attention (MHA), improving image-text alignment.
- Image-wise cross-aspect attention: Relations among aspect-specific prompts are integrated via cross-attentive aggregation.
- Final prediction: The composed prompt is fed to CLIP’s text encoder; cosine similarity to image features is scored via a temperature-scaled softmax and per-aspect cross-entropy.
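The final scoring step is standard CLIP-style classification. A minimal sketch, with hypothetical class labels and random unit vectors standing in for real CLIP image and composed-prompt embeddings:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 32
classes = ["highway", "residential", "rural", "tunnel"]  # hypothetical labels

image_feat = l2norm(rng.standard_normal(d))                  # image embedding
text_feats = l2norm(rng.standard_normal((len(classes), d)))  # composed prompts
tau = 0.07  # CLIP-style temperature

logits = text_feats @ image_feat / tau  # cosine similarity / temperature
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # temperature-scaled softmax

label = 0
loss = -np.log(probs[label])            # per-aspect cross-entropy
```

In ST-CLIP this loss is computed per aspect, so each aspect's prompt bank is supervised by its own classification target.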
On few-shot traffic scene datasets, SCAMP-enhanced ST-CLIP gains up to 13.1% accuracy (Scene aspect) over strong baselines and produces well-separated feature clusters in t-SNE; each attention module yields meaningful accuracy improvement.
3. Empirical Performance and Efficiency
SCAMP mechanisms are consistently associated with substantial performance gains and efficiency:
- ST-VFM (Chen et al., 14 Jul 2025): RMSE reduction of 0.49 over UniST for crowd forecasting; only adapter and prompt parameters require training.
- PromptST (Zhang et al., 2023): RMSE gains of 0.5–0.7 across forecasting attributes with only 1–2% additional training cost; strong transferability on unseen attributes/cities.
- STGP (Hu et al., 21 May 2024): Reduces MAE/RMSE by 5–10% on METR-LA, PEMS-BAY, and others; prompt banks are 1% size of full model.
- ST-CLIP (Ma et al., 12 Nov 2025): Outperforms all prompt and adapter baselines on Beijing/Chengdu traffic scene understanding by 0.9–14.4% per aspect; inference is practical (∼37 ms/image), and few-shot efficiency is validated.
Efficiency derives from frozen backbone architectures and the limited parameter count of prompt banks, with adaptation cost scaled only with number of aspects/branches, not input dimensionality or model size.
4. Prompt Coordination and Aspect Interaction
A defining characteristic of SCAMP is inter-prompt coordination:
- Bilateral cross-prompt coordination in ViTs (ST-VFM) strengthens spatio-temporal branch interactions, reducing error.
- Concatenated multiaspect prompts (PromptST) offer flexible aspect-specific context; all tokens attend to each other, promoting rich contextual fusion.
- Spatial, temporal, and positional prompt separation (STGP) aligns model priors with domain and task-specific attributes, supporting interpretable adaptation.
- Cross-modal and cross-aspect attention modules (ST-CLIP) are applied after context fusion, jointly refining embeddings for vision-language correspondence.
This architecture is effective for handling the inherent interdependencies among spatial, temporal, and attribute dimensions in real-world data.
5. Limitations, Challenges, and Future Directions
Several limitations and open questions are acknowledged within SCAMP works:
- Static graph assumption (Zhang et al., 2023): Spatial transformers often treat regions as fully connected; dynamic graph extensions are undemonstrated.
- Scalability: Quadratic attention cost impedes large grids; sparse or linearized alternatives are suggested as future work.
- Prompt overspecialization: Without explicit regularizers (contrastive, orthogonal), aspects risk collapse to redundant representations. Investigating mutual-information constraints is advised.
- Uncertainty quantification: Current SCAMP models yield point forecasts lacking confidence intervals.
- Efficiency in high-shot regimes: Prompt banks’ parameter efficiency is pronounced in few-shot or transfer learning but less distinctive for high-data settings.
- Integration with Bayesian methods: Probabilistic forecasting and calibration remain unaddressed areas.
- Real-world deployment: Latency and robustness to noise, especially for scene understanding applications, are not fully benchmarked.
A plausible implication is that continued advances in prompt modularity, attention efficiency, and regularization methods will broaden SCAMP’s applicability in scalable, interpretable, and uncertainty-aware spatio-temporal learning.
6. Relation to Contemporary Research and Context
SCAMP unifies concepts from prompt learning (Zhang et al., 2023, Hu et al., 21 May 2024), transformer-based spatio-temporal modeling (Chen et al., 14 Jul 2025), and multimodality in urban-centric vision-LLMs (Ma et al., 12 Nov 2025). It contrasts with traditional fine-tuning and attribute-specific network training by isolating adaptation to low-rank or aspect-specific prompt banks. SCAMP mechanisms have demonstrated transferability across cities, tasks, and modalities, and trainable-parameter footprints on the order of 1–2% of the full model are validated empirically.
A widespread feature is the retention of shared, task-agnostic model weights, with adaptation orchestrated at the granularity of context encoding and aspect-specific prompting. Benchmarks point to SCAMP’s ability to leverage environmental and situational priors—such as GPS tracks, flow data, or traffic attributes—to achieve enhanced generalization and accuracy in both few-shot and transfer learning regimes.
7. Summary Table: SCAMP Instantiations
| Model | Context Encoding | Prompt Coordination | Backbone Frozen? | Tested Task(s) |
|---|---|---|---|---|
| ST-VFM (Chen et al., 14 Jul 2025) | Patch/flow signals, adapters | Bilateral cross-attn | Yes | ST forecasting |
| PromptST (Zhang et al., 2023) | Positional, aspect features | Concatenated multiaspect | Yes | Multi-attribute forecasting |
| STGP (Hu et al., 21 May 2024) | Spatial, temporal, masked | Two-stage: domain & task | Yes | Forecast, kriging, extrapol. |
| ST-CLIP (Ma et al., 12 Nov 2025) | GPS, segment, traffic stats | Cross-modal & cross-aspect | Yes | Traffic scene understanding |
All methods leverage compact prompt modules to adapt frozen backbones for complex spatio-temporal inference, exhibiting strong accuracy, transferability, and parameter efficiency.