Continuous Prompts in NLP
- Continuous Prompts are soft, trainable vectors in embedding space that guide frozen models without modifying core parameters.
- They employ methods such as prefix-tuning, layerwise deep prompts, and linear combinations of discrete prompt embeddings to achieve parameter-efficient fine-tuning and task transfer.
- Their design supports continual learning, interpretability strategies, and real-time analytics in varied NLP applications.
A continuous prompt (CP), also called a soft prompt, is an array or sequence of trainable vectors injected into a frozen pretrained model to steer downstream behavior, replacing or augmenting traditional discrete (textual) prompts. CPs enable parameter-efficient fine-tuning, instance-level control, and persistent adaptation across a broad spectrum of NLP and LLM architectures. Unlike discrete prompts, which are constrained to the model’s vocabulary, CPs are optimized directly in embedding space and can be composed, interpolated, or controlled in ways unattainable through natural language alone. The rapid development of CPs has motivated innovations in task transfer, interpretability, continual learning, stream analytics, and fine-grained prompt engineering.
1. Mathematical Foundations and Model Architectures
Let $d$ denote the dimension of the embedding space of a pretrained Transformer or sequence model. A continuous prompt of length $m$ is a parameter matrix $P = [p_1, \dots, p_m] \in \mathbb{R}^{m \times d}$. These vectors are prepended (or inserted as layerwise key/value prefixes) into the input token stream, yielding a prompted input
$$\tilde{X} = [p_1, \dots, p_m, e_1, \dots, e_n],$$
where $e_1, \dots, e_n \in \mathbb{R}^{d}$ are the token embeddings $E(x)$ of the user input sequence $x$. This mechanism is universal: CPs appear as input-level “virtual tokens” (Li et al., 2021), layerwise K/V prefixes (prefix-tuning), or deep prompts appended before each block (Lee, 2023).
During training, only $P$ (and possibly small adapter layers) is updated via gradient descent on the downstream task loss (classification, generation, retrieval, etc.), with the LM parameters $\theta$ frozen. For classification, the cross-entropy loss is typically
$$\mathcal{L}(P) = -\sum_{(x,\, y)} \log p_{\theta}\!\left(y \mid [P;\, E(x)]\right),$$
with gradients flowing only to $P$.
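The following minimal PyTorch sketch illustrates this setup for a Hugging Face-style encoder backbone: a trainable prompt matrix is prepended to frozen token embeddings, and only the prompt and a small head receive gradients. Class and variable names (e.g., `SoftPromptClassifier`) are illustrative and not taken from the cited papers.

```python
import torch
import torch.nn as nn

class SoftPromptClassifier(nn.Module):
    """Prepend m trainable prompt vectors to frozen token embeddings (illustrative sketch)."""
    def __init__(self, backbone, num_labels, prompt_len=20):
        super().__init__()
        d = backbone.config.hidden_size
        self.backbone = backbone                         # frozen pretrained LM
        for p in self.backbone.parameters():
            p.requires_grad = False
        # P in R^{m x d}; small random init is a common heuristic
        self.prompt = nn.Parameter(torch.randn(prompt_len, d) * 0.02)
        self.head = nn.Linear(d, num_labels)             # small trainable classification head

    def forward(self, input_ids, attention_mask):
        emb = self.backbone.get_input_embeddings()(input_ids)         # (B, n, d)
        B = emb.size(0)
        prompt = self.prompt.unsqueeze(0).expand(B, -1, -1)           # (B, m, d)
        inputs = torch.cat([prompt, emb], dim=1)                      # [P; E(x)]
        mask = torch.cat(
            [torch.ones(B, prompt.size(1), dtype=attention_mask.dtype,
                        device=attention_mask.device), attention_mask], dim=1)
        out = self.backbone(inputs_embeds=inputs, attention_mask=mask)
        pooled = out.last_hidden_state[:, 0]             # pool the first position (simple choice)
        return self.head(pooled)                         # logits for the cross-entropy loss above
```

Gradients therefore flow only into `self.prompt` and `self.head`, matching the loss stated above.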
Variants include:
- Prefix-Tuning: Layerwise soft prompts serving as additional key/value entries for each attention block, with two matrices $P_K^{(\ell)}, P_V^{(\ell)}$ per layer (Li et al., 2021).
- Layerwise Deep Prompts: Per-layer insertion of prompt vectors $P^{(\ell)} \in \mathbb{R}^{m \times d}$, as with D2CSE (Lee, 2023).
- Linear Hybrid CPs: Construction from linear combinations of fixed discrete prompt embeddings $e(p_1), \dots, e(p_K)$ with learned weights $w_1, \dots, w_K$: $P = \sum_{k=1}^{K} w_k\, e(p_k)$ (Passigan et al., 2023); see the sketch after this list.
- ControlPE and CP Weighting: Adapter or LoRA-based CPs with a real-valued strength parameter $\lambda$, yielding $W' = W + \lambda B A$ for low-rank update matrices $A, B$ (Sun et al., 2023).
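As a sketch of the linear-hybrid construction above, the soft prompt can be parameterized as a weighted combination of a frozen bank of discrete prompt embeddings. The simplex (softmax) constraint on the weights is an assumption of this sketch, not necessarily the choice made by Passigan et al. (2023).

```python
import torch
import torch.nn as nn

class LinearHybridPrompt(nn.Module):
    """Soft prompt constrained to a linear combination of fixed discrete prompt embeddings."""
    def __init__(self, basis_embeddings):
        # basis_embeddings: (K, m, d) tensor of K pre-embedded human-written prompts
        super().__init__()
        self.register_buffer("basis", basis_embeddings)                      # fixed e(p_1), ..., e(p_K)
        self.weights = nn.Parameter(torch.zeros(basis_embeddings.size(0)))   # learned w

    def forward(self):
        # P = sum_k w_k * e(p_k); softmax keeps the mixture on the simplex (sketch choice)
        w = torch.softmax(self.weights, dim=0)
        return torch.einsum("k,kmd->md", w, self.basis)                      # (m, d) prompt matrix
```

The learned weight vector doubles as the interpretability signal discussed in Section 3: a large positive weight on a basis prompt indicates that strategy helps the task.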
2. Optimization Protocols and Parameter Efficiency
Training of CPs is highly parameter-efficient. Only the CPs (and sometimes a classification head or encoder module) are updated, freezing the backbone. This enables:
- Storage of many tasks/users: Each task uses 0.1–2% of model parameters (e.g., prompt lengths up to $m = 100$ virtual tokens at embedding dimensions up to $d = 1{,}024$) (Li et al., 2021).
- Catastrophic forgetting mitigation: By never overwriting past prompts and optionally concatenating them (progressive prompts), prior task knowledge is preserved (Razdaibiedina et al., 2023).
- Continual and replay-free learning: Progressive and complementary prompt frameworks maintain one prompt per task (and optionally, one shared "neocortex" prompt) for lifelong adaptation (Razdaibiedina et al., 2023, Zhang et al., 27 May 2025).
- Memory efficiency: For large LMs (e.g., BERT-base, 110M parameters), a CP encoder and head may use ≈2.6M trainable parameters, <1% of the parameter count of full dual-PLM designs (Lee, 2023).
- Batch- and operator-level optimization: In streaming settings, batching multiple inputs into a single prompt or fusing operators offers throughput improvement at bounded accuracy loss (Chen et al., 3 Dec 2025).
Optimization typically employs AdamW with moderate learning rates, prompt-specific tuning schedules (commonly 5–20 epochs), and strategic initialization (e.g., summary or task-word embeddings) (Li et al., 2021, Lee, 2023).
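A minimal training loop consistent with this recipe is sketched below, reusing the `SoftPromptClassifier` sketch from Section 1; the learning rate and epoch count here are illustrative defaults, not values reported in the cited work.

```python
import torch
from torch.optim import AdamW

def train_soft_prompt(model, dataloader, epochs=10, lr=3e-3, device="cuda"):
    """Optimize only parameters with requires_grad=True (prompt and head); the backbone stays frozen."""
    model.to(device)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(trainable, lr=lr, weight_decay=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(batch["input_ids"], batch["attention_mask"])
            loss = loss_fn(logits, batch["labels"])
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```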
3. Interpretability and the Challenge of Waywardness
The non-discrete nature of CPs creates fundamental barriers for interpretation. Empirical and theoretical analyses demonstrate:
- Non-faithful $d$-projection: Nearest-neighbor projection of CP vectors onto vocabulary tokens ($d$-proj; see the sketch after this list) is unreliable. For any discrete prompt $p$, there exists a CP $\tilde{P}$ that (1) projects to $p$ (i.e., $d\text{-proj}(\tilde{P}) = p$) and (2) achieves near-optimal task loss (within a <2% drop), regardless of whether $p$ is semantically relevant (Khashabi et al., 2021). This “prompt waywardness” derives from the large volume of the Voronoi cells of token embeddings in $\mathbb{R}^{d}$ and the expressivity of transformers.
- Linear CPs for interpretability: Restricting the CP to the span of fixed discrete prompt embeddings $e(p_1), \dots, e(p_K)$ (for human-designed prompts $p_k$), and predicting the weight vector $w$ per input, makes the CP interpretable via $w$: positive weights reveal helpful prompt strategies (e.g., pseudocode, analogy), negative weights inhibit maladaptive ones (e.g., poetry for science questions) (Passigan et al., 2023).
- Concept-Bottleneck Decomposition: Any soft prompt $P$ admits a factorization $P \approx \sum_{k} c_k\, b_k$, with $\{b_k\}$ a small bank of interpretable “concept” phrase embeddings (selected from an LLM-generated pool, e.g., by GPT-4o) and $c_k$ learned coefficients. This maintains accuracy while exposing human-interpretable explanations (Chen et al., 2 Dec 2024).
- Inference-time textual elicitation: Using patching-based frameworks like InSPEcT, CPs may be “decoded” into human-readable task descriptions (class lists, high-level rationale) directly from model representations. Fidelity correlates with downstream classification accuracy; as a prompt’s test accuracy improves, so does the faithfulness (ROUGE-1, class rate) of descriptions (Ramati et al., 15 Oct 2024).
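For reference, here is a sketch of the nearest-neighbor projection ($d$-proj) that the waywardness result shows to be unfaithful; the use of cosine similarity and a Hugging Face-style tokenizer are assumptions of this sketch rather than details from the cited paper.

```python
import torch
import torch.nn.functional as F

def d_projection(prompt, embedding_matrix, tokenizer):
    """Map each soft-prompt vector to its nearest vocabulary token.

    prompt:           (m, d) trained continuous prompt
    embedding_matrix: (V, d) token embedding table of the frozen LM
    """
    p = F.normalize(prompt, dim=-1)             # cosine similarity via normalized dot products
    e = F.normalize(embedding_matrix, dim=-1)
    nearest_ids = (p @ e.T).argmax(dim=-1)      # (m,) nearest token id per prompt vector
    return tokenizer.convert_ids_to_tokens(nearest_ids.tolist())
```

The waywardness result implies that the string returned here can be made nearly arbitrary without hurting task performance, which is why $d$-proj alone cannot serve as an audit.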
4. Continual and Lifelong Learning with CPs
CP-based frameworks have proved highly effective for continual learning:
- Task-specific prompt banks: Each new task $t$ is assigned a fresh CP $P_t$; all prior prompts $P_1, \dots, P_{t-1}$ are kept frozen. At inference, the active prompt (or concatenated bank) steers the frozen base model (Razdaibiedina et al., 2023).
- Progressive Prompt Concatenation: Concatenating all past prompts $[P_1; \dots; P_T]$ enables both catastrophic-forgetting avoidance and positive forward transfer (Razdaibiedina et al., 2023). If $m$ is the prompt length and $d$ the embedding dimension, the total prompt parameter count after $T$ tasks is $T \cdot m \cdot d$, still orders of magnitude less than full fine-tuning; see the sketch after this list.
- Complementary Memory Systems: Dual-prompt systems like InfoComp introduce both a private prompt (task-specific, hippocampal) and a shared prompt (task-invariant, neocortical), updated with mutual-information-based losses to maximize knowledge retention and transfer. Empirical results on the 5-task and 15-task CTC benchmarks show InfoComp gaining up to +2.7 pp in average final accuracy over previous best prompt-based methods (Zhang et al., 27 May 2025).
- Memory and replay: Even in dialog state tracking, continual prompt tuning with replay from a small exemplar memory per task and memory-guided backward transfer achieves high joint goal accuracy on a 15-task DST benchmark, with positive backward and forward transfer (Zhu et al., 2022).
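A minimal sketch of the progressive-concatenation pattern follows; it omits details of the original method (such as prompt reparameterization through an MLP), and all names are illustrative.

```python
import torch
import torch.nn as nn

class ProgressivePrompts(nn.Module):
    """Keep one soft prompt per task; freeze old prompts and concatenate them with the new one."""
    def __init__(self, d, prompt_len=10):
        super().__init__()
        self.d, self.m = d, prompt_len
        self.prompts = nn.ParameterList()                # grows by one prompt per task

    def add_task(self):
        for p in self.prompts:                           # previously learned prompts stay fixed
            p.requires_grad = False
        self.prompts.append(nn.Parameter(torch.randn(self.m, self.d) * 0.02))

    def forward(self):
        # Concatenated prompt after T tasks: (T*m, d), i.e., T*m*d prompt parameters in total.
        return torch.cat(list(self.prompts), dim=0)
```

Calling `add_task()` before training each new task preserves prior prompts verbatim, which is what removes catastrophic forgetting at the prompt level.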
5. Applied Variants: Continuous Control, Streaming, and Sentence Representations
Continuous prompt engineering extends to numerous practical scenarios:
- Continuously Controllable Prompt Engineering (ControlPE): CPs realized as LoRA adapters with a strength parameter $\lambda$ enable smooth, instance-level control. For example, $\lambda = 0$ disables a prompt effect, while $\lambda = 1$ realizes its full action. This allows, e.g., incremental tuning of response length, refusal behavior, or chain-of-thought strength. In tasks such as DocQA, increasing $\lambda$ linearly increases the refusal rate; for math QA, the optimal $\lambda$ is often submaximal ($\lambda < 1$), reflecting nonlinear interpolation (Sun et al., 2023); see the sketch after this list.
- Streaming and Persistent Analytics: CP frameworks like VectraFlow define stateful, composable semantic operators (filter, map, aggregate, group-by, continuous RAG) in LLM-augmented streaming pipelines. Core optimizations include tuple batching (prompting with multiple tuples in a single call) and operator fusion. MOBO (multi-objective Bayesian optimization) dynamically configures operator variants, batch size, and fusion to achieve Pareto optimality in throughput–accuracy trade-offs under real-time constraints (Chen et al., 3 Dec 2025).
- Sentence Embeddings: D2CSE attaches layerwise CPs to a frozen BERT or RoBERTa, optimized jointly with contrastive learning and conditional replaced-token detection, achieving state-of-the-art Spearman correlation (e.g., $79.02$ on seven STS benchmarks) with about 1% of the trainable parameters of dual-PLM designs. Memory, uniformity, and recall metrics are improved via careful choices of prompt length and depth and via [CLS] replacement (Lee, 2023).
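A sketch of the $\lambda$-scaled low-rank update behind the ControlPE bullet above, assuming a standard LoRA-style wrapper around a frozen linear layer; the class is illustrative and not the released ControlPE implementation.

```python
import torch
import torch.nn as nn

class LambdaScaledLoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update scaled by a run-time strength: W' = W + lam * B A."""
    def __init__(self, base_linear, rank=8):
        super().__init__()
        self.base = base_linear                           # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base_linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))   # zero init => no effect before training
        self.lam = 1.0                                    # strength; can be reset per request

    def forward(self, x):
        # lam = 0 disables the distilled prompt effect, lam = 1 applies it fully;
        # intermediate values interpolate, though the behavioral effect need not be linear.
        return self.base(x) + self.lam * ((x @ self.A.T) @ self.B.T)
```

Setting `self.lam` per request is what turns a single trained adapter into a continuously controllable prompt effect.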
6. Limitations, Risks, and Directions for Research
Key limitations and challenges include:
- Waywardness and security: Due to the high-dimensional geometry of embedding space, CPs cannot reliably be interpreted or audited via naive token projection. Arbitrary or contradictory discrete projections are possible with negligible accuracy loss. Malicious actors could conceal harmful behaviors (e.g., biased classification) under benign-looking textual projections (Khashabi et al., 2021).
- Prompt basis and span: In linearly constrained CPs, interpretability and performance depend on the span and mutual orthogonality of the discrete prompt basis; optimal construction remains an open question (Passigan et al., 2023).
- Quantitative measures of interpretability: Most current CP interpretability relies on case studies, concept correlation (Pearson $r$), or coverage measures; development of robust, task-agnostic metrics is a priority (Chen et al., 2 Dec 2024, Passigan et al., 2023).
- Scalability and operator interactions: In streaming, operator fusion and batching can degrade accuracy nonlinearly; dynamic, cost-aware optimization is essential. Nonlinearities in CP control via multi-adapter blending require further study (Chen et al., 3 Dec 2025, Sun et al., 2023).
- Generalization and transfer: CPs do not reliably transfer across models or architectures; initialization and hyperparameters are critical (Khashabi et al., 2021).
Active research is directed at tight human–concept alignment, per-instance or per-user CP design, seamless combination with discrete prompts, and extending interpretable decompositions to sequence-to-sequence and multilingual tasks (Chen et al., 2 Dec 2024, Passigan et al., 2023). Robust, performance-driven, and debiasing CP development hinges on integrating interpretability techniques such as InSPEcT and automated concept selection into the CP training cycle (Ramati et al., 15 Oct 2024).
7. Summary Table: Core Continuous Prompt Frameworks and Their Properties
| Framework / Variant | Parameterization | Interpretability |
|---|---|---|
| Prefix-Tuning (Li et al., 2021) | Layerwise K,V soft tokens | Unconstrained, no human tracing |
| Linear Basis (Passigan et al., 2023) | Weighted sum of prompt embeddings | Weight vector on human prompts |
| Concept-Decomp. (Chen et al., 2 Dec 2024) | Coefficients over concept bank | Concept coefficients, human phrases |
| Prog. Prompts (Razdaibiedina et al., 2023) | Per-task concatenation | Blockwise, per-task modularity |
| InfoComp (Zhang et al., 27 May 2025) | Private / shared prompts | Modularity, MI-based transfer |
| ControlPE (Sun et al., 2023) | LoRA adapter + $\lambda$-weight | Scalar-controllable effect |
| D2CSE (Lee, 2023) | Layerwise CP + [CLS] prompt | Indirect (contrastive/CRTD) |
| VectraFlow CPs (Chen et al., 3 Dec 2025) | Streaming operator CPs | Pipeline blockwise |
Each framework addresses distinct trade-offs in parameter efficiency, knowledge transfer, interpretability, and operational deployment. The convergence of CP techniques across generative modeling, continual learning, interpretability, and real-time analytics signals a sustained focus on both the epistemic foundations and application frontiers of prompt-based steering for large models.