Discrete Prompt Tuning Framework

Updated 16 March 2026

The framework defines discrete prompts as sequences of vocabulary tokens and formalizes their optimization as a combinatorial problem using methods like Bayesian optimization and gradient-based projection.
It supports applications in modular, federated, and composable tuning across diverse domains, achieving near fine-tuning accuracy with enhanced interpretability and transferability.
Empirical studies show that discrete prompt tuning can rival continuous methods by offering improved query efficiency and clear, actionable insights on model behavior.

A discrete prompt tuning framework is a formalized methodology for optimizing sequences of discrete tokens—“hard” prompts—prepended to the input of a frozen or partially tunable pre-trained model in order to steer model behavior on specific downstream tasks. In contrast to continuous (“soft”) prompt tuning, discrete prompt tuning operates in the combinatorial space of actual vocabulary tokens, enabling interpretability, transferability, and compatibility with models accessible only via black-box APIs. Frameworks in this category have emerged across vision and language domains, with objectives ranging from maximizing task accuracy to supporting modularity, federated optimization, and joint prompt-parameter adaptation.

1. Formal Definitions and Problem Scope

A discrete prompt is a length-$L$ sequence $x = (t_1,\ldots,t_L)$ with each $t_i$ drawn from a vocabulary $V$ of size $|V|$ (Sabbatella et al., 2023, Wen et al., 2023). Suppose a pre-trained model $M$ implements a scoring function $f(x) = \mathrm{score}_M([x;q])$, defined by prepending prompt $x$ to a query $q$ and evaluating the task-specific objective (e.g., classification accuracy). The discrete prompt tuning problem is then cast as the combinatorial optimization

$x^\ast = \arg\max_{x \in V^L} f(x).$

The search space $V^L$, of size $|V|^L$, is typically infeasible for exhaustive search due to the large vocabulary and prompt length, necessitating efficient optimization strategies such as Bayesian optimization (Sabbatella et al., 2023), gradient-based projection (Wen et al., 2023), or policy-gradient reinforcement learning (Li et al., 2023).
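To make the size of the search space concrete, the following minimal Python sketch enumerates $V^L$ for a four-token vocabulary and counts the candidates for a realistic one. The scoring function `toy_score` and the `USEFUL` token set are hypothetical stand-ins for $f$, not part of any cited framework.

```python
from itertools import product

def search_space_size(vocab_size: int, prompt_len: int) -> int:
    """Number of candidate prompts in V^L."""
    return vocab_size ** prompt_len

def exhaustive_search(vocab, prompt_len, score):
    """Return the argmax of score over every prompt in V^L (toy sizes only)."""
    return max(product(vocab, repeat=prompt_len), key=score)

# Hypothetical stand-in for f(x): number of distinct "useful" tokens in the prompt.
USEFUL = {"classify", "sentiment"}

def toy_score(prompt):
    return len(USEFUL.intersection(prompt))

vocab = ["the", "classify", "sentiment", "cat"]
best = exhaustive_search(vocab, 2, toy_score)  # ('classify', 'sentiment')

small = search_space_size(len(vocab), 2)       # 16 candidates
large = search_space_size(50_000, 10)          # roughly 9.8e46 candidates
```

With 50,000 tokens and a prompt of length 10 the count is already about $10^{47}$, which is why the methods below rely on surrogates, relaxations, or learned policies rather than enumeration.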

The class of discrete prompt tuning frameworks further encompasses distributed and modular settings: in federated scenarios, each client $k$ with private data $D_k$ seeks to jointly optimize a global prompt $x$ maximizing a weighted sum over local accuracies, subject to privacy and query efficiency constraints (Wu et al., 2024, Wang et al., 17 Jun 2025). Modular frameworks, such as à-la-carte learning, train independent prompt modules in isolation on disjoint data sources and compose them at inference time (Bowman et al., 2023).
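One way to write the federated objective explicitly, offered as an illustrative formalization rather than the exact notation of the cited papers: with $K$ clients, let $f_k(x)$ denote the accuracy of prompt $x$ on the private data $D_k$ of client $k$, and let $w_k \ge 0$ be aggregation weights (the symbols $K$, $f_k$, and $w_k$ are assumptions introduced here). The global problem then reads

$x^\ast = \arg\max_{x \in V^L} \sum_{k=1}^{K} w_k\, f_k(x), \qquad \sum_{k=1}^{K} w_k = 1.$

Choosing $w_k$ proportional to $|D_k|$ gives a FedAvg-style weighting, and with $K = 1$ the expression reduces to the single-model problem stated earlier.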

2. Optimization Methods for Discrete Prompts

Optimization in this discrete, high-cardinality space is the central methodological challenge. Prominent approaches include:

Gradient-Based Discrete Relaxation: Methods such as “PEZ” maintain a continuous proxy $P \in \mathbb{R}^{L \times d}$, projecting it to the nearest-neighbor embeddings $P' = \mathrm{Proj}_E(P)$ in the token embedding matrix $E$ at each iteration, with gradients computed with respect to $P'$ but applied to $P$ (Wen et al., 2023). Optional Gumbel-softmax relaxations parametrized by logits $\alpha \in \mathbb{R}^{L \times |V|}$ enable stochastic, nearly discrete optimization.
Bayesian and Black-Box Optimization: Discrete Prompt BO embeds each prompt $x \in V^L$ into a continuous space via an embedding map, builds a Gaussian process surrogate of $f$ in that space, and uses Expected Improvement as the acquisition function; candidates are decoded back to $V^L$ by nearest-neighbor mapping (Sabbatella et al., 2023).
Reinforcement Learning and Policy Gradient: In frameworks such as DP$_2$O, a small policy network $\pi_\theta$ (typically a two-layer MLP) selects prompts from a human-readable set $\mathcal{P}$ for each state $s$, with training via REINFORCE based on the SUE (Supervised + Unsupervised Entropy) reward (Li et al., 2023).
Meta-Learning and LLM-as-Optimizer: Frameworks such as OPRO and EvoPromptGA use LLMs not only to score but also to propose new prompt candidates, performing evolutionary operations or differential evolution in prompt space (Zehle et al., 2 Dec 2025).
Federated/Distributed Optimization: In FedDTPT and FedOne, clients perform local discrete optimization (greedy or Gumbel-softmax-based) based on in-silo feedback, with centralized prompt aggregation strategies exploiting clustering and semantic attention (Wu et al., 2024, Wang et al., 17 Jun 2025).
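The gradient-projection idea from the first item can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the PEZ implementation: the loss is a hypothetical quadratic distance between each projected embedding and a target vector, the embedding table `EMB` and `TARGETS` are invented, and returning the best projected prompt seen so far is a design choice of this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def project(vec, embeddings):
    """Index of the vocabulary embedding nearest to vec."""
    return min(range(len(embeddings)), key=lambda i: sq_dist(vec, embeddings[i]))

def projected_descent(embeddings, targets, steps=20, lr=0.3):
    """Update a continuous proxy using gradients taken at its discrete projection."""
    # Every position starts at the first vocabulary embedding (arbitrary choice).
    soft = [list(embeddings[0]) for _ in targets]
    best_tokens, best_loss = None, float("inf")
    for _ in range(steps):
        tokens = [project(p, embeddings) for p in soft]
        hard = [embeddings[t] for t in tokens]
        # Toy loss (assumption): sum of squared distances to the targets.
        loss = sum(sq_dist(h, t) for h, t in zip(hard, targets))
        if loss < best_loss:
            best_tokens, best_loss = tokens, loss
        for pos, (h, t) in enumerate(zip(hard, targets)):
            # Gradient is evaluated at the projected point h ...
            grad = [2.0 * (hi - ti) for hi, ti in zip(h, t)]
            # ... but the step is applied to the continuous proxy.
            soft[pos] = [s - lr * g for s, g in zip(soft[pos], grad)]
    return best_tokens, best_loss

EMB = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 4-token vocabulary
TARGETS = [[0.9, 0.1], [0.1, 0.8]]                      # one target per position
tokens, loss = projected_descent(EMB, TARGETS)          # tokens == [1, 2]
```

Because the loss is always evaluated on real token embeddings, the returned prompt is a valid discrete sequence at every step, which is the property that distinguishes this scheme from plain soft-prompt tuning.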

The table below summarizes core optimization approaches in recent frameworks:

Framework	Optimization Method	Reference
PEZ	Gradient projection	(Wen et al., 2023)
DiscretePromptBO	Bayesian optimization	(Sabbatella et al., 2023)
DP$_2$O	Policy Gradient RL	(Li et al., 2023)
FedDTPT/FedOne	Zeroth-order, black-box FL	(Wu et al., 2024, Wang et al., 17 Jun 2025)
promptolution (CAPO)	Meta-LLM, evolutionary	(Zehle et al., 2 Dec 2025)
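The policy-gradient entry in the table can be illustrated with a deliberately simplified REINFORCE loop. As assumptions of this sketch, the policy is a state-free softmax over a small prompt set rather than a two-layer MLP, the reward function `toy_reward` is hypothetical and does not implement the SUE reward, and a running-mean baseline is used for variance reduction.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, baseline, lr):
    """One update: logits += lr * (reward - baseline) * grad log pi(action)."""
    probs = softmax(logits)
    advantage = reward - baseline
    # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi_i.
    return [z + lr * advantage * ((1.0 if i == action else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def train(prompts, reward_fn, steps=500, lr=0.5, seed=0):
    """Train selection probabilities over a fixed, human-readable prompt set."""
    rng = random.Random(seed)
    logits = [0.0] * len(prompts)
    baseline = 0.0
    for t in range(1, steps + 1):
        action = sample(softmax(logits), rng)
        reward = reward_fn(prompts[action])
        logits = reinforce_step(logits, action, reward, baseline, lr)
        baseline += (reward - baseline) / t  # running mean of observed rewards
    return softmax(logits)

PROMPTS = ["Review:", "Is this review positive or negative?", "Text:"]

def toy_reward(prompt):
    # Hypothetical reward: pretend longer, more explicit prompts score higher.
    return min(len(prompt) / 40.0, 1.0)

probs = train(PROMPTS, toy_reward)
```

In a real setting `reward_fn` would query the frozen model on a few labeled examples, so each call costs one or more model evaluations.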

3. Modular, Federated, and Composable Frameworks

Discrete prompt tuning has been actively extended to settings beyond monolithic, single-prompt optimization:

À-la-carte Prompt Tuning (APT): For each distinct data source $D_i$, a dedicated prompt module $p_i$ is trained in isolation; inference concatenates any subset of prompts via a masking/structured attention scheme, yielding accuracy within 2–5% of a jointly-trained prompt on the data union, even for up to 20 shards (Bowman et al., 2023). This approach achieves state-of-the-art continual learning performance on Split CIFAR-100 and CORe50.
Federated Discrete Prompt Tuning: Clients optimize local discrete tokens by gradient-free methods (e.g., MLM API-driven token mutation) and exchange only prompt summaries to maintain privacy. Aggregation is performed by clustering prompt-token semantic embeddings and selecting representatives (Wu et al., 2024). FedOne reports that activating only one client per round achieves optimal query efficiency, converging with fewer queries than FedAvg-type baselines (Wang et al., 17 Jun 2025).
Composable Privacy and Unlearning: The compartmentalization property in modular frameworks (APT) allows for perfect machine unlearning: deleting a prompt $p_i$ removes all influence of its data source $D_i$, and arbitrary prompt combinations can reflect individual access rights and preferences without retraining (Bowman et al., 2023).
Transferability: Discrete prompts created via federated methods are shown to transfer between different base LLMs with only minor loss in accuracy, owing to their representation as valid tokens that retain semantic meaning (Wu et al., 2024).
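The compartmentalization argument above can be made concrete with a small bookkeeping sketch. The class `PromptPool` and its methods are hypothetical names introduced for illustration; plain token lists stand in for trained prompt modules, and the structured attention masking used at inference is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class PromptPool:
    """Maps each data source to the prompt module trained on it alone."""
    modules: dict = field(default_factory=dict)

    def register(self, source_id, tokens):
        """Store the prompt trained in isolation on one data source."""
        self.modules[source_id] = tuple(tokens)

    def compose(self, allowed_sources):
        """Concatenate the prompts of the permitted sources, in sorted order."""
        return [tok
                for sid in sorted(allowed_sources) if sid in self.modules
                for tok in self.modules[sid]]

    def forget(self, source_id):
        """Unlearn a source by deleting its module; other modules are untouched."""
        self.modules.pop(source_id, None)

pool = PromptPool()
pool.register("shard_a", ["classify", "birds"])
pool.register("shard_b", ["classify", "vehicles"])

everything = pool.compose({"shard_a", "shard_b"})
restricted = pool.compose({"shard_b"})        # a user without access to shard_a

pool.forget("shard_a")
after_forget = pool.compose({"shard_a", "shard_b"})
```

Since no module was ever trained on another source's data, removing the entry for `shard_a` leaves nothing derived from it in the pool, and access control reduces to choosing which keys to pass to `compose`.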

4. Empirical Performance and Analyses

Empirical results across frameworks consistently show that well-tuned discrete prompts can match or closely approach the performance of continuous ("soft") prompts or full fine-tuning baselines, often with considerable benefits in interpretability, modularity, and resource efficiency.

Classification and Reasoning Tasks:
- On GLUE tasks, DiscretePromptBO outperforms average black-box baselines by wide margins, e.g., achieving 78.4 F1 on MRPC vs. 71.2 F1, and does so with higher sample efficiency (Sabbatella et al., 2023).
- DP$_2$O, with a policy network comprising only a small fraction of the model's parameters, surpasses the previous SOTA RLPrompt by 1.52 points in few-shot accuracy across sentiment datasets (Li et al., 2023).
Vision Benchmarks:
- In APT, composed prompts yield classification accuracy within 2% in-domain (5% out-of-domain) of union-trained benchmarks, and on continual learning benchmarks (Split CIFAR-100, CORe50) set SOTA performance (e.g., APT-Weight achieves 85.21% on Split CIFAR-100 vs. around 88% for the joint paragon) (Bowman et al., 2023).
Sample and Compute Efficiency:
- PEZ demonstrates robust convergence and outperforms hand-crafted prompt baselines in both language and text-to-image settings; projection-based discrete optimization avoids the combinatorial expense of exhaustive search (Wen et al., 2023).
- FedOne enables federated discrete tuning with only a fraction of the queries required by multi-client schemes, while preserving accuracy benefits (Wang et al., 17 Jun 2025).
Benchmarks and Comparisons:

| Method | Task/Setting | Metric | Score |
|---|---|---|---|
| APT | CIFAR-100 (CL) | accuracy | 85.21% (APT-Weight) |
| DP$_2$O | Sentiment, few-shot | avg. accuracy gain | +1.52pp over RLPrompt (SOTA) |
| FedDTPT | GLUE (Black-box FL) | accuracy (DeepSeek) | 95.33% vs. 53.82% (baseline) |
| PEZ | SST-2 (GPT-2 L) | accuracy | 88.05% (with fluency) |

5. Framework Implementations and Software Tools

Recent frameworks facilitate plug-and-play discrete prompt tuning via unified interfaces:

promptolution: Provides modular abstractions—LLM Wrapper, Predictor, Task, and Optimizer (with OPRO, EvoPromptGA, EvoPromptDE, and CAPO built-in)—enabling direct optimization in the discrete space with strong empirical performance. CAPO, a cost-aware optimizer, achieves 93.7% on GSM8K and 56.3% on SST-5 under constrained evaluation budgets (Zehle et al., 2 Dec 2025).
Extensibility: All major components can be extended through inheritance, and optimization steps, evaluation callbacks, and LLM adapters can be augmented to support custom search or evaluation strategies. All empirical runs are capped by a token budget to prevent runaway cost.
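The separation between a task, an optimizer, and a cost cap can be sketched generically as follows. The class names `BudgetedTask` and `MutationOptimizer` are hypothetical and are not the promptolution API; the cap counts evaluations as a simple stand-in for a token budget, and the search is plain hill climbing by single-token mutation rather than any of the optimizers named above.

```python
import random

class BudgetExceeded(Exception):
    """Raised when the task's evaluation budget is used up."""

class BudgetedTask:
    """Wraps a scoring function and counts evaluations against a hard cap."""
    def __init__(self, score_fn, max_evals):
        self.score_fn = score_fn
        self.max_evals = max_evals
        self.evals = 0

    def evaluate(self, prompt):
        if self.evals >= self.max_evals:
            raise BudgetExceeded
        self.evals += 1
        return self.score_fn(prompt)

class MutationOptimizer:
    """Hill climbing over token sequences; override propose() to customize."""
    def __init__(self, vocab, seed=0):
        self.vocab = list(vocab)
        self.rng = random.Random(seed)

    def propose(self, prompt):
        """Replace one randomly chosen position with a random vocabulary token."""
        child = list(prompt)
        child[self.rng.randrange(len(child))] = self.rng.choice(self.vocab)
        return child

    def optimize(self, task, init_prompt):
        """Search until the task's budget runs out, then return the best prompt."""
        best, best_score = list(init_prompt), task.evaluate(init_prompt)
        try:
            while True:
                cand = self.propose(best)
                score = task.evaluate(cand)
                if score > best_score:
                    best, best_score = cand, score
        except BudgetExceeded:
            return best, best_score

# Hypothetical scorer: counts occurrences of the token "good".
task = BudgetedTask(lambda p: p.count("good"), max_evals=30)
optimizer = MutationOptimizer(["good", "bad", "ok"])
best, best_score = optimizer.optimize(task, ["bad", "bad", "bad"])
```

Putting the budget inside the task object means every optimizer, including subclasses that override `propose`, is stopped by the same cap without having to implement its own accounting.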

This suggests that the field is converging toward unified, modular frameworks supporting diverse search methods, multi-task configuration, and compatibility with practical deployment constraints.

6. Limitations and Prospective Directions

Reported limitations include difficulty scaling to extremely large vocabularies and long prompts, owing to combinatorial explosion and failures of the embedding relaxation (Sabbatella et al., 2023). Policy-gradient methods require careful normalization and may struggle as the prompt-set cardinality increases (Li et al., 2023). Bayesian approaches face efficiency bottlenecks in high-dimensional embedding spaces, motivating integration of more advanced surrogate models or discrete search techniques. Further, dependence on external LLM APIs for prompt generation (e.g., GPT-4) entails cost and access considerations.

Future avenues include integrating beam-search or RL-based optimizers into unified toolkits, designing fairness- or robustness-aware objective functions, and extending frameworks to fully modular, multilingual, or multi-task deployments (Zehle et al., 2 Dec 2025, Sabbatella et al., 2023, Li et al., 2023). Hybrid discrete–continuous schemes, joint prompt-parameter learning (e.g., MetaTuner (Bo et al., 29 Sep 2025)), and federated continual learning remain active research areas.

7. Summary and Outlook

Discrete prompt tuning frameworks have established themselves as effective, interpretable, and resource-efficient alternatives to full model tuning and continuous prompt learning. By formalizing discrete prompt optimization as combinatorial or policy-based search problems, and leveraging modular, federated, and evolutionary techniques, they enable state-of-the-art performance in a range of NLP and vision tasks, with the added benefits of composability, privacy, query efficiency, and transferability (Bowman et al., 2023, Wu et al., 2024, Sabbatella et al., 2023, Zehle et al., 2 Dec 2025, Bo et al., 29 Sep 2025, Wang et al., 17 Jun 2025, Li et al., 2023, Wen et al., 2023). These advances have laid the foundation for scalable, modular, and privacy-preserving deployment of prompt-based interfaces in pre-trained models.