Sample-Specific Prompt Optimization
- SSPO is a family of methods that dynamically adapts prompts on a per-sample basis to address the unique semantic and reasoning needs of each input.
- It leverages mechanisms like execution-free evaluators, gradient-based updates, and test-time adaptations to refine prompt quality iteratively.
- Empirical studies show significant accuracy gains across domains, though challenges include balancing overfitting, computational overhead, and generalization.
Sample-Specific Prompt Optimization (SSPO) denotes prompt adaptation that is conditioned on an individual input rather than a single globally optimized template. In the recent literature, this sample-specific principle appears in several forms: per-query prompt rewriting guided by an execution-free evaluator, test-time prompt refinement for text-to-video generation, regularized textual-gradient updates that suppress sample-specific rule accumulation, temporary per-sample parameter vectors added at inference time, prompt-template evolution for hard samples in RLVR, and per-sample ensembling of source prompt-conditioned models (Chen et al., 25 Nov 2025, Gao et al., 23 Oct 2025, Fu et al., 20 May 2026, Hu et al., 18 May 2025, Lu et al., 23 Mar 2026, Peng et al., 2022). This suggests that SSPO is best understood as a family of per-instance control mechanisms whose common premise is that prompt quality is conditional on the sample, the backbone, and the optimization signal.
1. Conceptual scope and distinguishing features
SSPO is explicitly contrasted with static-template optimization in "A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization" (Chen et al., 25 Nov 2025). There, static methods such as APE and TextGrad are described as searching for a single template that minimizes average loss across a dataset, whereas the sample-specific alternative “localizes” optimization to each query instance and adapts the prompt to the individual semantic and reasoning structure of the query. The same paper further distinguishes its approach from prior query-dependent methods such as Self-Refine, ProRefine, QPO, and Prompt-OIRL by replacing unstable textual feedback and black-box reward models with a multi-metric evaluator trained to predict downstream performance without executing the prompt.
The same sample-specific principle is instantiated differently in other domains. In RAPO++, Stage 2 SSPO is a closed-loop, gradient-free test-time procedure that iteratively refines a text-to-video prompt using semantic alignment, spatial fidelity, temporal coherence, and optional optical-flow-based signals (Gao et al., 23 Oct 2025). In SLOT, the prompt tokens are not edited at all; instead, a per-sample vector is optimized on the prompt-only next-token loss and added to the final hidden layer before the LM head (Hu et al., 18 May 2025). In P$^2$O, SSPO targets “hard samples” in RLVR by evolving prompt templates that increase the probability of discovering successful trajectories, followed by context distillation so that the policy learns under the original input rather than only under the augmented prompt (Lu et al., 23 Mar 2026). In SESoM, the same idea appears at the model-output level: sample-specific weights are learned over source prompt-conditioned models, rather than over prompt embeddings themselves (Peng et al., 2022).
TextReg adds a critical qualification to the SSPO agenda. It studies the tendency of iterative prompt rewriting to drift into sample-specific rule accumulation, formalizing this as prompt distributional overfitting caused by coupled growth in capacity cost and scope narrowness (Fu et al., 20 May 2026). This suggests that SSPO is not only about increasing local fit to a sample, but also about controlling the representational side effects of that localization.
2. Formal objectives and optimization signals
A central divide within SSPO concerns where the optimization signal comes from. In the evaluation-instructed framework, prompt quality is made performance-grounded through four selected dimensions (nll_score, stability_score, mi_score, and query_entropy) defined over a triple consisting of a task-descriptive prefix, the query, and the prompt (Chen et al., 25 Nov 2025). The evaluator outputs a success probability and a quality vector over these metrics, and is trained with a bi-level multi-objective that balances classification and regression, with dynamic weights adapted from the sensitivity of the classification objective to each predicted metric. At inference time, the predicted success probability determines whether optimization is needed, and the gradient sensitivities with respect to the predicted metrics determine which metric-specific rewrites are applied. A prompt’s binary quality label is set to $1$ if average accuracy over 10 stochastic runs exceeds a fixed threshold, and $0$ otherwise.
RAPO++ uses execution-based, multi-source feedback computed from the generated video (Gao et al., 23 Oct 2025). Its composite reward combines positive terms that score semantic alignment, spatial fidelity, temporal coherence, and motion realism with penalties that preserve training-distribution alignment and user intent. Candidate prompts are selected with an “Average Ranking” rule across metrics, and SSPO stops when reward increases plateau or the iteration budget is exhausted.
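The selection rule above can be illustrated with a minimal sketch. It assumes that "Average Ranking" means ranking candidates separately on each metric and keeping the candidate with the lowest mean rank; the exact rule in RAPO++ may differ, and the function name, metric names, and scores below are hypothetical.

```python
from typing import Dict, List

def average_rank_select(scores: Dict[str, List[float]]) -> int:
    """Return the index of the candidate with the lowest mean rank.

    scores[metric][i] is candidate i's score on that metric (higher is better).
    Assumption: this is one plausible reading of an "Average Ranking" rule.
    """
    n = len(next(iter(scores.values())))
    rank_sums = [0.0] * n
    for values in scores.values():
        # Sort candidate indices from best to worst on this metric.
        order = sorted(range(n), key=lambda i: -values[i])
        pos = 0
        while pos < n:
            # Tied candidates share the average of the positions they occupy.
            end = pos
            while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
                end += 1
            shared_rank = (pos + end) / 2 + 1
            for k in range(pos, end + 1):
                rank_sums[order[k]] += shared_rank
            pos = end + 1
    mean_ranks = [s / len(scores) for s in rank_sums]
    return min(range(n), key=lambda i: mean_ranks[i])

# Hypothetical verifier scores for three candidate prompts.
example_scores = {
    "semantic": [0.70, 0.82, 0.75],
    "spatial": [0.60, 0.58, 0.66],
    "temporal": [0.91, 0.88, 0.90],
}
best_index = average_rank_select(example_scores)
print(best_index)
```

Rank aggregation of this kind avoids having to put heterogeneous metrics on a common scale, which is one plausible reason to prefer it over a weighted sum when verifiers differ in range.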
TextReg formalizes the opposite pressure: not how to maximize sample-specific fit alone, but how to do so without collapsing generalization (Fu et al., 20 May 2026). It defines representational inefficiency in terms of two quantities: the prompt's token length and its scope narrowness, the latter implemented through a proxy based on rule recurrence in a RuleBank. Its regularized objective incorporates this inefficiency term and is operationalized through regularized textual gradients rather than direct numeric optimization.
SLOT and P$^2$O extend SSPO beyond textual prompt rewriting. In SLOT, the intervention adds a per-sample vector to the final hidden states before the LM head, which is equivalent to adding a vocabulary-wide logit shift learned from the prompt-only next-token loss (Hu et al., 18 May 2025). In P$^2$O, the prompt template is treated as a latent discrete variable inside a joint objective, and prompt optimization is valuable precisely because it converts near-zero success rates on hard samples into non-zero group-relative advantages (Lu et al., 23 Mar 2026).
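The point about group-relative advantages can be made concrete with a small sketch. It assumes the common GRPO-style normalization, in which each rollout's reward is centered and scaled by the mean and standard deviation of its group; the exact estimator used in the paper may differ, and the function name and `eps` value are illustrative.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one rollout group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard sample under the original prompt: every rollout fails, so no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Same sample after a better template yields one verified success.
one_success = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

print(all_fail)
print(one_success)
```

When every rollout in a group receives the same reward, all advantages are zero and the policy gradient vanishes for that sample. A single success is enough to give the successful rollout a positive advantage and the failures a negative one.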
3. Principal methodological families
The surveyed literature divides SSPO into several recurring methodological families.
| Family | Representative mechanism | Representative paper |
|---|---|---|
| Execution-free evaluator-guided rewriting | Predict success probability and metric scores from text, then rewrite prompts by diagnosed failure mode | (Chen et al., 25 Nov 2025) |
| Execution-based closed-loop refinement | Generate, verify, rewrite, and select by multi-source feedback and Average Ranking | (Gao et al., 23 Oct 2025) |
| Regularized textual-gradient optimization | Purify task gradients, diagnose prompt drift, and select regularization-compatible rewrites | (Fu et al., 20 May 2026) |
| Test-time latent-state adaptation | Learn a per-sample additive vector at the last hidden layer | (Hu et al., 18 May 2025) |
| RLVR hard-sample prompt evolution | Evolve prompt templates for hard samples and distill gains into the policy | (Lu et al., 23 Mar 2026) |
| Output-level sample-specific routing | Learn per-sample ensemble weights over source prompt-conditioned models | (Peng et al., 2022) |
Evaluator-guided rewriting is the most explicit formulation of SSPO as interpretable prompt editing. In (Chen et al., 25 Nov 2025), high NLL suggests weak semantic guidance or instruction conflict; low semantic stability indicates random trajectories or format ambiguity; low mutual information implies hollow templates or missing schemas; high query entropy reveals ambiguity or missing assumptions intrinsic to the query. Rewrites are correspondingly aligned to labeled error categories such as “instruction conflict,” “output format ambiguity,” “missing schema,” and “query ambiguity.”
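The metric-to-diagnosis mapping described above can be sketched as a small rule table. This is an illustration only: the thresholds, the assumption that metrics are normalized to [0, 1], and the one-to-one pairing of each metric with a single error category are hypothetical simplifications, not the paper's procedure.

```python
from typing import Dict, List, Tuple

# (metric name, direction that signals a problem, diagnosed error category)
DIAGNOSIS_RULES: List[Tuple[str, str, str]] = [
    ("nll_score", "high", "instruction conflict"),
    ("stability_score", "low", "output format ambiguity"),
    ("mi_score", "low", "missing schema"),
    ("query_entropy", "high", "query ambiguity"),
]

def diagnose(metrics: Dict[str, float], high: float = 0.7, low: float = 0.3) -> List[str]:
    """Return the error categories triggered by the predicted metric values.

    `high` and `low` are hypothetical cutoffs on metrics assumed to lie in [0, 1].
    """
    findings = []
    for name, direction, category in DIAGNOSIS_RULES:
        value = metrics[name]
        if direction == "high" and value > high:
            findings.append(category)
        elif direction == "low" and value < low:
            findings.append(category)
    return findings

predicted = {"nll_score": 0.9, "stability_score": 0.8, "mi_score": 0.1, "query_entropy": 0.2}
print(diagnose(predicted))
```

A rewriter would then receive only the triggered categories, which keeps each edit targeted at a diagnosed failure mode rather than rewriting the whole prompt.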
Execution-based closed loops dominate non-text domains. RAPO++ treats SSPO as test-time scaling: the rewriter LLM generates candidate prompts, the T2V generator renders videos, verifiers score them, and the best candidate becomes the next prompt (Gao et al., 23 Oct 2025). The optimizer is gradient-free and memory-guided, and can be augmented with physics-aware feedback through optical flow on PhyGenBench and VideoPhy.
TextReg addresses a pathology that becomes especially acute in sample-specific optimization. Dual-Evidence Gradient Purification rejects CASE_PATCH and STYLE_ONLY updates; Semantic Edit Regularization detects whether a prompt transition increases capacity cost or narrows scope; Regularization-Guided Prompt Update chooses among task-faithful candidates using compatibility with the regularization signal (Fu et al., 20 May 2026). The method is therefore not an alternative to SSPO so much as a control layer over it.
SLOT and SESoM show that the sample-specific principle does not require editing discrete prompt text. SLOT performs per-sample modulation in representation space, adapting only the per-sample vector and caching last-layer features for efficiency (Hu et al., 18 May 2025). SESoM performs per-sample routing across source prompt-conditioned models by computing logits from each source model and combining them with attention-style weights predicted from the input representation and the source logits themselves (Peng et al., 2022).
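A SLOT-style update can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the cached features `H`, the LM head `W`, and the targets are random stand-ins, the symbol `delta` for the per-sample vector is a naming choice, plain gradient descent replaces the AdamW optimizer named in the source, and the step count and learning rate are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prompt_loss(H, W, targets, delta):
    """Mean next-token cross-entropy when delta is added to every cached hidden state."""
    probs = softmax((H + delta) @ W.T)
    return float(-np.log(probs[np.arange(len(targets)), targets]).mean())

def fit_delta(H, W, targets, steps=3, lr=1.0):
    """Optimize only delta; the cached features H and the LM head W stay frozen."""
    delta = np.zeros(H.shape[1])
    onehot = np.eye(W.shape[0])[targets]
    for _ in range(steps):
        probs = softmax((H + delta) @ W.T)
        # Gradient of the mean cross-entropy with respect to delta.
        grad = ((probs - onehot) @ W).mean(axis=0)
        delta = delta - lr * grad
    return delta

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))             # cached last-layer features, 6 prompt positions
W = 0.1 * rng.normal(size=(20, 8))      # toy LM head over a 20-token vocabulary
targets = rng.integers(0, 20, size=6)   # next-token ids inside the prompt

delta = fit_delta(H, W, targets)
loss_before = prompt_loss(H, W, targets, np.zeros(8))
loss_after = prompt_loss(H, W, targets, delta)
print(loss_before, loss_after)
```

Because the head is linear, `(H + delta) @ W.T` equals `H @ W.T + W @ delta`, so the same shift `W @ delta` is added to the logits at every position. Caching `H` means each optimization step costs only one matrix product with the head rather than a full forward pass.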
4. Representative systems and empirical results
The evaluation-instructed framework reports 83.7% validation accuracy for predicting whether a prompt will succeed, compared with 69% for embedding+XGBoost even with ground-truth metric scores (Chen et al., 25 Nov 2025). The selected metric weights are query_entropy 32.7%, nll_score 26.4%, stability_score 22.3%, and mi_score 18.6%. Its optimization pipeline consistently improves downstream performance across eight datasets and three backbone models. On LegalBench, the reported gains are approximately +10% over the LLM-only baseline across all three backbones, including LLaMA-3 from 0.55 to 0.70 and GPT-4o from 0.83 to 0.90. On MedQA, the gains are +5–6%, and on BBH sports_understanding, LLaMA-3 improves from 0.68 to 0.75 while GPT-4o improves from 0.78 to 0.83.
RAPO++ reports that SSPO improves both general text-to-video quality and physics-aware generation (Gao et al., 23 Oct 2025). On VBench with LaVie, RAPO++ reaches 82.65% total, including imaging quality 73.48%, human action 99.20%, multiple objects 71.89%, and spatial relationship 64.76%. On T2V-CompBench, the full system reaches 0.742 on Consistent Attribute Binding, 0.294 on Dynamic Attribute Binding, 0.632 on Action Binding, and 0.849 on Object Interactions. Physics-aware SSPO raises HunyuanVideo Physical Consistency on PhyGenBench from 0.38 to 0.57 and Semantic Alignment from 0.24 to 0.42 over four SSPO rounds. The T2V-CompBench ablation further isolates the contribution of SSPO: without fine-tuning LLM and without SSPO, scores are 0.620, 0.232, 0.483, and 0.760 across the four submetrics; with SSPO but without fine-tuning LLM, they become 0.629, 0.236, 0.542, and 0.778; the best setting combines both.
TextReg evaluates SSPO under out-of-distribution stress rather than only in-distribution gain (Fu et al., 20 May 2026). Across Logical Deduction, Tracking Shuffled Objects, GSM8K, SVAMP, and MultiArith, it reports OOD accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE. Specific examples include +10.0% and +9.9% over TextGrad on Tracking Shuffled Objects with Llama-3.1-8B-Instruct at 5 and 7 objects, and 57.9% versus 46.1% on Logical Deduction with Phi-3.5-Mini-Instruct.
SLOT demonstrates that sample-specific adaptation can be parameter-efficient and fast (Hu et al., 18 May 2025). On GSM8K with Qwen2.5-7B, accuracy rises from 57.54% to 66.19% (+8.6% absolute). On GPQA with DeepSeek-R1-Distill-Llama-70B, performance improves from 65.66% to 68.69%, reported as SOTA among open-source ~70B models. On AIME24 with the same model, accuracy rises from 63.33% to 73.33%. The paper reports that increasing optimization iterations from 0 to 5 on 30 GSM8K prompts with Qwen2.5-7B on a single NVIDIA V100 adds only +12.83 s, corresponding to +7.9% total inference time.
P$^2$O reports substantial gains on the hardest reasoning benchmarks inside RLVR (Lu et al., 23 Mar 2026). On DeepScaler-5K, P$^2$O with teacher reflection reaches 65.2% average, surpassing the GRPO baseline by +4.7% absolute; AIME24 improves by +12.9% and AIME25 by +11.7%. Removing context distillation drops average accuracy to 55.6%, compared with 60.5% for GRPO and 65.2% for P$^2$O-teacher. Removing group prompt diversity lowers the average to 64.2%.
SESoM provides an earlier, few-shot transfer-oriented realization of sample-specific adaptation (Peng et al., 2022). In the 32-shot regime averaged over 20 seeds, T5-base reaches 67.54 average score with SESoM, compared with 61.22 for Uniform, 64.66 for Majority, 62.28 for Fixed-weight, and 54.61 for ATTEMPT. For T5-large, SESoM reaches 74.69, and for T5-XL, 76.22. The gains are especially large over prompt fusion: +13.0 over ATTEMPT on T5-base, +12.77 on T5-large, and +14.25 on T5-XL.
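The contrast between the Uniform baseline and sample-specific weighting can be sketched as follows. This is a simplified illustration, not SESoM's architecture: the projection matrices `Wx` and `Wl` stand in for a learned gating network and are random here, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def uniform_ensemble(source_logits):
    """Average the class distributions of all source models with equal weight."""
    return softmax(source_logits).mean(axis=0)

def sample_specific_ensemble(x, source_logits, Wx, Wl):
    """Weight each source by attention between the input and that source's logits."""
    query = Wx @ x                      # project the input representation
    keys = source_logits @ Wl.T         # project each source model's logits
    weights = softmax(keys @ query)     # one weight per source, for this sample only
    return weights @ softmax(source_logits), weights

rng = np.random.default_rng(0)
x = rng.normal(size=16)                   # input representation (hypothetical size)
source_logits = rng.normal(size=(4, 3))   # 4 source models, 3 classes
Wx = rng.normal(size=(5, 16))             # hypothetical gating projections
Wl = rng.normal(size=(5, 3))

probs, weights = sample_specific_ensemble(x, source_logits, Wx, Wl)
print(weights, probs, uniform_ensemble(source_logits))
```

The weights depend on both the input and the source predictions, so two samples from the same task can route to different source models, which a fixed-weight or uniform ensemble cannot do.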
5. Limitations, failure modes, and recurring misconceptions
A recurring misconception is that SSPO can compensate for backbone capability limits. The evaluation-instructed framework states explicitly that prompt optimization cannot overcome backbone capability limits and reports that on MATH500, meaningful improvements appear only for GPT-4o, with mid-sized LLaMA models omitted due to capability limits (Chen et al., 25 Nov 2025). This is consistent with SLOT’s framing of per-sample adaptation as improved instruction following rather than guaranteed factual calibration, and with its warning that optimization on the prompt text alone does not directly improve factual calibration (Hu et al., 18 May 2025).
Another recurrent issue is overfitting at the representation level. TextReg identifies prompt distributional overfitting as the coupled growth of prompt length and scope narrowness, which can reduce training loss while increasing the generalization gap under shifted or harder inputs (Fu et al., 20 May 2026). RAPO++ identifies related failure modes in text-to-video generation: numeracy failures such as “five parrots” or “three giraffes,” overfitting to one metric at the expense of others, prompt verbosity, and verifier domain shift (Gao et al., 23 Oct 2025). SESoM describes negative transfer, bias toward generally strong sources, and instability under drastic target–source domain shifts (Peng et al., 2022).
SSPO also introduces systems-level constraints. RAPO++ reports roughly 3× inference time in its reported configuration and a memory overhead of about 2 GB from LLaVA-OneVision relative to the T2V model (Gao et al., 23 Oct 2025). SLOT requires direct access to the final hidden states and LM head, which black-box APIs typically do not expose (Hu et al., 18 May 2025). P$^2$O adds the cost of mutation, evaluation, and greedy template assignment, while also depending on deterministic or at least verifiable reward functions; its use of augmented-input sampling and original-input learning is explicitly described as a one-step off-policy update, which the paper notes as an area for further stabilization (Lu et al., 23 Mar 2026).
6. Implementation patterns and open directions
Across the surveyed work, a common SSPO pattern emerges: initialize a sample-conditioned prompt or control state, obtain a diagnostic signal, perform a constrained update, and stop under a budget or plateau criterion. In the evaluation-instructed framework, the loop is execution-free evaluation, metric-aware diagnosis, query-dependent rewriting, and reevaluation, gated by a trigger threshold and capped at a fixed number of iterations (Chen et al., 25 Nov 2025). In RAPO++, general SSPO runs 2–4 iterations depending on budget, while physics-aware experiments typically use 3–4 rounds (Gao et al., 23 Oct 2025). SLOT uses a small number of optimization steps with AdamW, configured by a step count, a learning rate, and a weight decay (Hu et al., 18 May 2025). TextReg’s practical recipe is parameterized by thresholds, a mini-batch size, an iteration budget, and a candidate count (Fu et al., 20 May 2026). P$^2$O is configured by its rollout count, sampling temperature, a dev set of about 300 hard examples, and the beam/width used for GEPA (Lu et al., 23 Mar 2026). SESoM trains a small gating network for 20 epochs with AdamW, while keeping the PLM and source prompts frozen (Peng et al., 2022).
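The shared pattern (initialize, diagnose, update, stop on budget or plateau) can be sketched as a generic loop. This is a schematic under stated assumptions, not any one paper's algorithm: `score` and `rewrite` are caller-supplied stand-ins for an evaluator and a rewriter, and `max_iters` and `min_gain` are illustrative defaults.

```python
from typing import Callable, Tuple

def sspo_loop(prompt: str,
              score: Callable[[str], float],
              rewrite: Callable[[str, float], str],
              max_iters: int = 4,
              min_gain: float = 1e-3) -> Tuple[str, float]:
    """Refine one sample's prompt until the budget is spent or gains plateau."""
    best_prompt, best_score = prompt, score(prompt)
    for _ in range(max_iters):
        candidate = rewrite(best_prompt, best_score)   # constrained update
        candidate_score = score(candidate)             # diagnostic signal
        if candidate_score - best_score < min_gain:    # plateau: stop early
            break
        best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score

# Toy evaluator and rewriter (hypothetical): reward prompts that mention each keyword.
REQUIRED = ["format", "units", "steps"]

def toy_score(p: str) -> float:
    return sum(k in p for k in REQUIRED) / len(REQUIRED)

def toy_rewrite(p: str, current_score: float) -> str:
    missing = [k for k in REQUIRED if k not in p]
    return p + " " + missing[0] if missing else p

final_prompt, final_score = sspo_loop("Solve the problem.", toy_score, toy_rewrite)
print(final_prompt, final_score)
```

Keeping only strictly improving candidates makes the loop monotone in the chosen score, which is also why the choice of score matters: a loop like this will overfit to whatever signal it is given.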
Open directions are similarly diverse but structurally aligned. The evaluation-instructed framework proposes expanding beyond performance metrics to include safety, efficiency, readability, and controllability (Chen et al., 25 Nov 2025). RAPO++ identifies numeracy-aware verifiers, adaptive test-time scaling, identity tracking, and causal consistency as natural extensions (Gao et al., 23 Oct 2025). TextReg points toward stronger representation control for discrete text-space optimization, particularly where SSPO is prone to narrow rule accumulation (Fu et al., 20 May 2026). SLOT suggests multi-handle SSPO variants, late-layer extensions, and trust-region constraints on the per-sample adaptation (Hu et al., 18 May 2025). P$^2$O identifies uncertainty-aware Pareto selection, adaptive clustering for sample groups, and meta-learning to warm-start templates for new domains (Lu et al., 23 Mar 2026). SESoM points toward sparse routing, calibration-aware gating, and direct per-sample prompt adjustment guided by routing weights (Peng et al., 2022).
Taken together, these directions indicate that SSPO has evolved from a narrow question of whether prompts should vary per input into a broader research program about per-instance control, optimization signal design, interpretability of prompt edits, and the boundary between local adaptation and generalizable competence.