PS+ Prompting Strategies

Updated 30 December 2025
  • PS+ Prompting is a family of advanced prompt engineering methods that integrate explicit planning, structured in-context search, and dynamic scaling to enhance model reasoning and inference.
  • Its methodologies, such as a two-stage plan-and-solve process and counterfactual templates, lead to significant improvements in accuracy and error reduction across LLM, vision, and geometric tasks.
  • Empirical results demonstrate performance gains (up to 92.2% accuracy on benchmarks) and robustness in complex tasks compared to traditional chain-of-thought approaches.

Prompting with Enhanced Structure ("PS+ prompting") refers to a family of advanced prompt engineering strategies that synergistically combine explicit planning, structured search, and scaling mechanisms to elicit superior reasoning, inference, or editability from large models in text, vision, and geometry tasks. The term appears in multiple domains with context-dependent technical meanings, including: (1) zero-shot "plan-and-solve plus" prompts for LLMs in reasoning tasks, (2) Algorithm-of-Thought and test-time scaling for maximizing LLM computational expressivity, (3) counterfactual prompt formats for causal estimation, (4) "promptable segmentation" for interactively guiding vision models, and (5) geometric "prompt selection" for CAD model reconstruction. The unifying theme is the systematic exploitation of explicit structuring—either in prompt format, search procedure, or interactive feedback loop—to unlock greater model capacity than standard direct or chain-of-thought (CoT) prompting.

1. Formal Definitions and Scenarios for PS+ Prompting

1.1 Plan-and-Solve Plus for LLM Reasoning

In chain-of-thought (CoT) reasoning, PS+ prompting defines a two-stage, purely zero-shot protocol for multi-step tasks:

  • Planning: Explicit instruction to "extract relevant variables and their corresponding numerals," "devise a plan," and decompose the input into subgoals before proceeding.
  • Solve and Answer Extraction: The model executes the stepwise plan with careful intermediate computations, followed by an automatic answer-extraction prompt consolidating the reasoning trace into the final answer.

This construct yields substantially reduced missing-step and calculation errors compared to vanilla CoT, producing accuracy on par with or exceeding few-shot methods on arithmetic, commonsense, and symbolic tasks (Wang et al., 2023).

1.2 Algorithm-of-Thought + Test-Time Scaling (LLM)

More broadly, PS+ prompting denotes the combination of structured in-context search (Algorithm-of-Thought, AoT) exemplars with internal, test-time scaling (allowing reasoning traces of up to exponential length in input size). Here, the search space is made explicit via trace tokens, and the model's generation length $T$ is determined by the model's internal "thinking mode" rather than a fixed external cap. Empirically, this approach raises the attainable complexity class from P (polynomial time) or NP to EXP ($t(n) = \exp(n)$) and NEXP (nondeterministic exponential time) when using AoT + scaling, as formalized below (Xia et al., 28 May 2025).

1.3 Probability of Sufficiency Plus Prompting

In causal inference evaluation, PS⁺ prompting refers to a natural-language counterfactual template: conditioning on observed variable values $(X = x', Y = y')$, then supposing an intervention $X = x$ and querying the counterfactual outcome $Y = y$. This enables robust measurement of the LLM's internalization of the probability-of-sufficiency ($PS$) relation (González et al., 15 Aug 2024).
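
For reference, the quantity being probed is Pearl's standard probability of sufficiency; in counterfactual notation,

$$PS = P\big(Y_{X=x} = y \mid X = x',\, Y = y'\big),$$

i.e., the probability that intervening to set $X = x$ would produce $Y = y$ in cases where, factually, neither $x$ nor $y$ obtained.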

1.4 Promptable Segmentation in Computer Vision

In segmentation models such as Hi-SAM, PS mode corresponds to "promptable segmentation" where user clicks or points (prompts) are encoded, and the model computes hierarchical masks (word, line, paragraph) in a single forward pass. Here, prompt encoding and multi-headed decoder architectures are optimized for precise, user-directed output (Ye et al., 31 Jan 2024).

1.5 Prompting + Selection in CAD Reconstruction

In geometric learning, PS+ (Prompting + Selection) describes a progressive CAD reconstruction pipeline in which local geometric prompts (planar subclouds) drive autoregressive modeling steps, with a learned selection module choosing among candidate CAD operations for maximal alignment to the ground truth (Yang et al., 24 May 2024).

2. PS+ Prompting: Technical Specifications and Algorithms

2.1 LLM Plan-and-Solve+ Protocol

The PS+ reasoning template consists of:

  • Initial prompt:

    Q: {problem statement}.
    A: Let's first understand the problem, extract relevant variables and their
       corresponding numerals, and devise a plan. Then, let's carry out the plan,
       calculate intermediate variables (pay attention to correct numerical
       calculation and commonsense), solve the problem step by step, and show
       the answer.

  • Answer extraction prompt: appended after the reasoning trace, e.g. "Therefore, the answer (arabic numerals) is".

The corresponding pseudocode:

def PS_plus_solve(llm, x):
    # Stage 1: plan-and-solve prompt -- understand, extract variables, plan, execute.
    prompt1 = (f"Q: {x}\nA: Let's first understand the problem, extract relevant "
               "variables and their corresponding numerals, and devise a plan. Then, "
               "let's carry out the plan, solve the problem step by step, and show the answer.")
    reasoning_trace = llm.generate(prompt1)
    # Stage 2: answer extraction, appended to the full reasoning trace.
    prompt2 = prompt1 + reasoning_trace + "\nTherefore, the answer (arabic numerals) is"
    return llm.generate(prompt2)
Careful variable extraction and explicit planning encourage LLMs to avoid missing steps and calculation errors (Wang et al., 2023).

2.2 PS+ (AoT + Test-Time Scaling) for Hard Reasoning

Key elements:

  • Prompt encodes a worked search trace (CoT or AoT), typically via in-context exemplars containing backtracking or tree-search traces.
  • Internal scaling: an inference-time flag that permits the model to control its own generation length $T$, up to $T = \exp(n)$ for input size $n$.
  • Beam search or best-of-N sampling can be employed as the in-context search mechanism.

Pipeline pseudocode (Xia et al., 28 May 2025):

def PS_plus_search(x, E, B, N, internal_scaling, T_max, T_scaled):
    # Build the AoT prompt: worked search-trace exemplars E, then the target instance.
    prompt = format_exemplars(E) + "\n### Target\n" + x
    beam = [(prompt, 0.0)]  # (context, cumulative log-probability)
    # Internal scaling lets the trace run far beyond the fixed external cap.
    T_limit = T_scaled if internal_scaling else T_max
    for step in range(T_limit):
        new_beam = {}
        for ctx, score in beam:
            # Expand each context with its B most probable continuations under
            # pi_theta (best-of-N sampling would draw N candidates instead).
            for y in decode(pi_theta, ctx, top_k=B):
                ctx_next = ctx + y
                if is_solution(ctx_next):
                    return extract_answer(ctx_next)
                new_beam[ctx_next] = score + log_prob(pi_theta, y, ctx)
        # Keep only the B highest-scoring partial traces.
        beam = sorted(new_beam.items(), key=lambda kv: kv[1], reverse=True)[:B]
    return "No Solution Found"

2.3 Causal Sufficiency Prompt (PS⁺)

Template:

  • Explicit factual condition: "In reality, 10 is not divisible by 3 and not divisible by 6."
  • Hypothetical intervention: "Suppose ... you now assume it is divisible by 3."
  • Counterfactual query: "Under this assumption, would ...?"

Evaluation: Repeated prompting and bootstrapping produce empirical $\widehat{PS}$ estimates, which are compared to ground truth (González et al., 15 Aug 2024).
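
A minimal sketch of this estimation loop, assuming a hypothetical llm.sample interface that returns a yes/no answer string for one counterfactual query:

import random

def estimate_PS(llm, template, n_samples=100, n_boot=1000):
    # Repeatedly pose the PS+ counterfactual template and record binary outcomes;
    # bootstrap the mean to get an empirical PS-hat with a 95% interval.
    hits = [1 if llm.sample(template).strip().lower().startswith("yes") else 0
            for _ in range(n_samples)]
    ps_hat = sum(hits) / n_samples
    boots = sorted(sum(random.choices(hits, k=n_samples)) / n_samples
                   for _ in range(n_boot))
    return ps_hat, (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)])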

2.4 PS in Vision and 3D Geometry

  • Segmentation (Hi-SAM): User clicks are mapped via a positional encoder; hierarchical decoder outputs mask logits for word/line/paragraph directly per-prompt. Token concatenation and attention handle integration of prompt guidance (Ye et al., 31 Jan 2024).
  • CAD (PS-CAD): RANSAC-extracted planar patches provide prompts; per-prompt autoregressive decoders emit sketch+extrude parameters; a selection module scores geometric fit; iterative execution constructs the CAD sequence (Yang et al., 24 May 2024). Illustrative sketches of both components follow below.
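
Two illustrative sketches, neither of which is the papers' actual code. First, a SAM-style click-prompt encoder; the Fourier-feature mapping and the 256-dim token size are assumptions:

import torch
import torch.nn as nn

class ClickPromptEncoder(nn.Module):
    # Minimal sketch: map normalized (x, y) clicks to prompt tokens via
    # random Fourier features plus a learned projection.
    def __init__(self, dim=256, n_freqs=64):
        super().__init__()
        self.register_buffer("freqs", torch.randn(2, n_freqs))  # fixed frequencies
        self.proj = nn.Linear(2 * n_freqs, dim)

    def forward(self, clicks):                       # clicks: (B, N, 2) in [0, 1]
        angles = 2 * torch.pi * clicks @ self.freqs  # (B, N, n_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return self.proj(feats)                      # (B, N, dim) prompt tokens

Second, the PS-CAD selection step reduced to a geometric stand-in: the learned scoring module is replaced here by a naive Chamfer distance between each candidate's executed geometry and the target point cloud.

import numpy as np

def select_operation(candidates, target_points):
    # candidates: list of (operation, executed_point_cloud) pairs;
    # point clouds are (M, 3) numpy arrays.
    def chamfer(a, b):
        # Mean squared distance from each point in a to its nearest point in b.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return d2.min(axis=1).mean()
    return min(candidates, key=lambda c: chamfer(c[1], target_points))[0]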

3. Theoretical Advances from PS+ Prompting

Theoretical results delineate the computational boundaries under PS+ regimes for decoder-only Transformers:

| Prompting Type | Trace Length | Complexity Class |
|----------------|--------------|------------------|
| CoT (greedy)   | poly(n)      | P                |
| AoT (tree)     | poly(n)      | NP               |
| CoT            | exp(n)       | EXP              |
| AoT            | exp(n)       | NEXP             |

Allowing AoT exemplars and internal scaling yields in-principle equivalence to NEXP computations for sufficiently powerful models and sufficient trace length (Xia et al., 28 May 2025). Only the minimal core reasoning tokens (with redundancies pruned) determine this expressivity.
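
Read schematically (a restatement of the table above, not a verbatim theorem from the paper), a trace budget of $t(n)$ tokens makes the corresponding time class attainable:

$$\mathsf{TIME}\big(t(n)\big) \subseteq \mathrm{CoT}\big[t(n)\big], \qquad \mathsf{NTIME}\big(t(n)\big) \subseteq \mathrm{AoT}\big[t(n)\big],$$

so $t(n) = \mathrm{poly}(n)$ yields P and NP, while $t(n) = \exp(n)$ yields EXP and NEXP.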

4. Empirical Results and Comparative Performance

4.1 LLM Reasoning Tasks

On arithmetic and multi-step logical tasks (e.g., GSM8K, SVAMP, MultiArith, AddSub), PS+ achieves accuracy up to 92.2% and consistently outperforms vanilla Zero-Shot-CoT and Zero-Shot-Program-of-Thought (Wang et al., 2023):

  • Calculation errors reduced (from 7% to 5%)
  • Missing-step errors reduced (from 12% to 7%)
  • Self-consistency over multiple PS+ samples further boosts accuracy (e.g., GSM8K: 59.3% → 73.7%); a minimal majority-vote sketch follows this list.
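
The sketch below reuses PS_plus_solve from 2.1 and assumes the llm samples with nonzero temperature, so traces differ across calls:

from collections import Counter

def self_consistent_answer(llm, x, k=10):
    # Draw k independent PS+ traces and return the most frequent final answer.
    answers = [PS_plus_solve(llm, x) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]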

4.2 NP-Hard Benchmarks (PS+ AoT + Scaling)

Combining AoT prompting and internal scaling moves models from sub-5% to 30-40% success rates on Vertex Cover, 3-Dimensional Matching, and complex planning benchmarks. Isolated use of AoT or scaling yields <5% gains; only the PS+ combination produces such large improvements (Xia et al., 28 May 2025).

| Model      | Direct, no scaling | AoT, no scaling | AoT + internal scaling |
|------------|--------------------|-----------------|------------------------|
| Qwen3      | 0%                 | 0%              | 2%                     |
| Claude 3.7 | 1%                 | 1%              | 31%                    |

4.3 Vision and Geometry

  • Hi-SAM (PS mode): Word-level PQ improves by 2–4% with promptable segmentation (oracle click evaluation) vs. automatic mask generation; line- and paragraph-level metrics are similarly boosted (Ye et al., 31 Jan 2024).
  • PS-CAD: Reduces Chamfer Distance by ∼10% and Edge Chamfer Distance by ∼15% relative to previous SOTA on DeepCAD (Yang et al., 24 May 2024).

5. Significance, Benchmarking, and Best Practices

PS+ techniques reveal that standard evaluation paradigms—especially those imposing fixed, short reasoning trace lengths—substantially underestimate the true operational capabilities of large models on complex tasks. Critical best practices include:

  • Explicitly incorporating planning or structured search traces into prompts.
  • Permitting the model to self-regulate reasoning length at inference.
  • Analyzing performance as a function of allowed trace length $T$, since observed ceilings at $T = \mathrm{poly}(n)$ may misrepresent maximal capability at $T = \exp(n)$ (see the sketch after this list).
  • For counterfactual tasks, prompt templates must clearly demarcate factual status and intervention, use uniform language, and collect multiple samples per instance to bootstrap uncertainty (Xia et al., 28 May 2025, González et al., 15 Aug 2024).
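
A minimal sketch of the trace-length analysis above; the task objects, the is_correct checker, and the max_tokens argument are all assumptions rather than a specific harness:

def success_curve(llm, tasks, T_values):
    # Empirically estimate P_success(n, T) by sweeping the allowed trace length T.
    curve = {}
    for T in T_values:
        solved = sum(is_correct(llm.generate(task.prompt, max_tokens=T), task.answer)
                     for task in tasks)
        curve[T] = solved / len(tasks)
    return curve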

In geometric and vision settings, prompt induction (clicks, planar subclouds) fused with selection or hierarchical decoding unlocks state-of-the-art performance with minimal additional supervision.

6. Limitations and Future Directions

Limitations of PS+ approaches include:

  • Dependence on high-quality structured exemplars for AoT (automating their generation remains challenging).
  • Potential for combinatorial explosion in trace length; practical deployment requires controlling computational costs or hybridizing with parallel scaling methods.
  • In vision, the informativeness of prompt locations and the transferability of trained prompt encoders present ongoing concerns.
  • In causal prompting, ambiguous hypothetical phrasing or interventions outside the model’s distribution may inflate inconsistency rates.

Potential future work focuses on automated generation of AoT prompts, hybrid test-time scaling, improved theoretical characterization of $P_{\mathrm{success}}(n, T)$, and further cross-domain transfer of the PS+ paradigm to new model architectures and domains (Xia et al., 28 May 2025, Ye et al., 31 Jan 2024).

