SPEAR: Code-Augmented Agentic Prompt Optimization

Published 25 May 2026 in cs.CL | (2605.26275v1)

Abstract: Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer with four tools -- evaluate, python, set_prompt, finish -- that decides autonomously how and when to use them. The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python on the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per-group metrics) the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor. We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ($κ$ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; $κ$ 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7 SPEAR averages 0.938 accuracy vs GEPA 0.628 and TextGrad 0.484. Ablations show the Python tool is the largest single lever on complex judge tasks ($Δ\approx +0.79κ$ on the 5-class tool-selection judge, $Δ\approx +0.35κ$ on the hardest extraction dimension when removed); its irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame.

Authors (9)

Summary

The paper introduces SPEAR, which dynamically executes analysis code to optimize prompt engineering and address structural error patterns.
It employs a flexible agentic loop with tools like evaluate, python, set_prompt, and finish, safeguarded by auto-rollback and guard-metric floors.
Empirical results show rapid convergence and substantial performance improvements on both industrial judge tasks and public benchmarks.

Authoritative Summary of "SPEAR: Code-Augmented Agentic Prompt Optimization" (2605.26275)

Motivation and Positioning

The paper introduces SPEAR, a code-augmented agentic optimizer for Automatic Prompt Engineering (APE), addressing limitations of prior fixed-pipeline approaches (e.g., OPRO [yang2024opro], EvoPrompt [guo2024evoprompt], TextGrad [yuksekgonul2024textgrad], GEPA [agrawal2025gepa]). In these systems, the optimizer follows a predetermined error-analysis and rewrite pattern. Such static strategies fail to capture structural error clusters in the evaluation data, such as confused class pairs or label-rule contradictions. SPEAR adopts the code-as-action paradigm from CodeAct [wang2024codeact], enabling the optimizer itself to write and execute analysis code over its evaluation DataFrame.

Figure 1: SPEAR architecture—a GPT-5.4 agent brain with four tools (evaluate, python, set_prompt, finish), guarded by auto-rollback and guard-metric floor for monotone improvement.

Algorithmic Design

SPEAR's agentic loop exposes four tools: evaluate (scoring prompts on subsets or full splits), python (arbitrary code execution in an AST-restricted sandbox), set_prompt (prompt modification), and finish (termination). Tool outputs update the agent state, which tracks evaluation history and performance baselines. Two guardrails monitor that state:

Auto-rollback: Reverts prompt rewrites when the primary metric regresses below the best-seen value.
Guard-metric floor: Optionally enforces a secondary metric threshold.
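The two guardrails above can be sketched as a small acceptance check. This is a minimal illustration, not the paper's implementation: the `OptimizerState` class, its field names, and the `consider` method are all hypothetical, and the sketch assumes a candidate is rejected when its primary metric falls below the best-seen value or when a supplied guard metric falls below the configured floor.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OptimizerState:
    """Hypothetical agent state: remembers the best prompt seen so far."""
    best_prompt: str
    best_primary: float
    guard_floor: Optional[float] = None  # optional secondary-metric threshold
    history: list = field(default_factory=list)

    def consider(self, prompt: str, primary: float,
                 guard: Optional[float] = None) -> str:
        """Accept a candidate prompt or roll back; return the active prompt."""
        # Auto-rollback: the primary metric regressed below the best-seen value.
        regressed = primary < self.best_primary
        # Guard-metric floor: only enforced when a floor and a guard score exist.
        below_floor = (
            self.guard_floor is not None
            and guard is not None
            and guard < self.guard_floor
        )
        accepted = not (regressed or below_floor)
        self.history.append((prompt, primary, guard, accepted))
        if accepted:
            self.best_prompt, self.best_primary = prompt, primary
        return self.best_prompt


state = OptimizerState(best_prompt="seed", best_primary=0.20, guard_floor=0.70)
state.consider("v1", primary=0.55, guard=0.80)  # accepted: improves, above floor
state.consider("v2", primary=0.40, guard=0.90)  # rolled back: primary regressed
state.consider("v3", primary=0.90, guard=0.50)  # rolled back: guard below floor
print(state.best_prompt, state.best_primary)    # v1 0.55
```

Because the active prompt only changes on acceptance, the best-seen primary metric never decreases over a run, which is the monotone-improvement property the guardrails are meant to provide.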

The agent is not restricted to any fixed evaluate-analyze-rewrite ordering, granting orchestration autonomy. The Python sandbox supports detailed structural analyses (e.g., confusion matrices, error clustering, per-group metrics) inaccessible through textual-only reflection. The workflow is defined in detail in the agent system prompt, which prescribes thorough error analysis, prompt drafting via LLM calls, and concept-level rule definition.
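An AST-restricted sandbox of the kind mentioned above could look roughly like the following. The summary does not describe the paper's actual policy, so the banned node types, banned names, and the tiny builtins allowlist here are assumptions chosen for illustration; `check_source` and `run_restricted` are hypothetical helper names. A static check like this suits a cooperative setting and is not a hardened security boundary.

```python
import ast

# Assumed policy for illustration only; the paper's actual rules are not given.
BANNED_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)
BANNED_NAMES = {"eval", "exec", "open", "compile", "__import__", "getattr", "setattr"}
SAFE_BUILTINS = {"len": len, "sum": sum, "sorted": sorted, "min": min, "max": max}


def check_source(source: str) -> list[str]:
    """Return the policy violations found in the source (empty if none)."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BANNED_NODES):
            violations.append(f"line {node.lineno}: {type(node).__name__} not allowed")
        elif isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            violations.append(f"line {node.lineno}: name '{node.id}' not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            violations.append(f"line {node.lineno}: dunder attribute '{node.attr}' not allowed")
    return violations


def run_restricted(source: str, env: dict) -> dict:
    """Execute the source only if it passes the static check.

    `env` holds the data exposed to the code (e.g. evaluation rows).
    """
    violations = check_source(source)
    if violations:
        raise PermissionError("; ".join(violations))
    # One shared namespace so generator expressions can see the exposed data.
    scope = {"__builtins__": SAFE_BUILTINS, **env}
    exec(compile(source, "<sandbox>", "exec"), scope)
    return scope


rows = [{"gold": "a", "pred": "a"}, {"gold": "a", "pred": "b"}]
out = run_restricted("n_err = sum(1 for r in rows if r['gold'] != r['pred'])",
                     {"rows": rows})
print(out["n_err"])               # 1
print(check_source("import os"))  # ['line 1: Import not allowed']
```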

Experimental Evaluation

Benchmarks span three industrial LLM-as-judge datasets (Hiring Assistant, CMA, Facet Suggestion—totaling 13 judge tasks) and public datasets (BBH-7, GSM8K). Task models use GPT-4o; optimizers run GPT-5.4 except for TextGrad's reflection engine (GPT-4o due to system constraints).

On industrial tasks, SPEAR achieves the best score on 11 of 12 directly comparable tasks (excluding a train-only row), with absolute $\kappa$ and F1-macro gaps of up to 0.76 over the baselines (GEPA, TextGrad). On BBH-7, SPEAR achieves 0.938 mean accuracy, outperforming GEPA (0.628) and TextGrad (0.484); on GSM8K, all methods tie near 0.96 accuracy.
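For readers less familiar with the two headline metrics, the sketch below computes them from scratch. It assumes $\kappa$ denotes Cohen's kappa between judge predictions and gold labels, which this summary does not state explicitly, and it uses the standard definitions $\kappa = (p_o - p_e)/(1 - p_e)$ and macro-F1 as the unweighted mean of per-class F1. The function names are illustrative.

```python
from collections import Counter


def cohen_kappa(gold, pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    # Chance agreement from the two marginal label distributions.
    p_e = sum(gold_freq[c] * pred_freq[c] for c in gold_freq) / (n * n)
    # Convention chosen here: perfect trivial agreement is reported as 1.0.
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def f1_macro(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["yes", "yes", "no", "no"]
pred = ["yes", "no", "no", "no"]
print(cohen_kappa(gold, pred))         # 0.5
print(round(f1_macro(gold, pred), 4))  # 0.7333
```

In the toy example, raw agreement is 0.75 but chance agreement is 0.5, so $\kappa$ is 0.5. This is why $\kappa$ can sit far below accuracy on imbalanced judge tasks.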

Figure 2: Per-method convergence on three industrial tasks—SPEAR converges rapidly to high native-metric scores, with gaps in optimization-internal scores for GEPA/TextGrad.

Case studies illustrate SPEAR's capability to autonomously discover and address structural errors:

Hiring Assistant (job location): The SPEAR agent uses the python tool to cluster failure modes and uncovers a label-rule contradiction in the dataset, prompting a rule-level rewrite that achieves $\kappa=0.76$ (seed 0.035; ceiling 0.74).
CMA (tool-missing): SPEAR constructs confusion matrices to identify and resolve class-pair misclassifications via targeted rewrites (seed $\kappa=0.20$ to $\kappa=0.95+$).

Figure 3: SPEAR agent trace for Hiring Assistant job location, showing the tool sequence (python, evaluate, set_prompt), the $\kappa$ trajectory, and the label-contradiction-driven prompt rewrite.
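The kind of analysis described in these case studies might look like the snippet below when written inside the sandbox. It is a sketch under stated assumptions: the rows and their field names (`input`, `gold`, `pred`) are invented toy data, plain Python stands in for the evaluation DataFrame to stay dependency-free, and the duplicate-input check is only a simple proxy for label inconsistency. The label-rule contradiction reported in the paper (labels conflicting with the prompt's stated rule) would need a task-specific check.

```python
from collections import Counter, defaultdict

# Toy evaluation rows; field names and values are hypothetical.
rows = [
    {"input": "q1", "gold": "search", "pred": "search"},
    {"input": "q2", "gold": "search", "pred": "filter"},
    {"input": "q3", "gold": "search", "pred": "filter"},
    {"input": "q4", "gold": "filter", "pred": "filter"},
    {"input": "q5", "gold": "none", "pred": "search"},
    {"input": "q2", "gold": "filter", "pred": "filter"},  # same input, different gold
]


def top_confusions(rows, k=3):
    """Count off-diagonal (gold, pred) pairs, most frequent first."""
    pairs = Counter((r["gold"], r["pred"]) for r in rows if r["gold"] != r["pred"])
    return pairs.most_common(k)


def label_contradictions(rows):
    """Find inputs that appear with more than one gold label."""
    labels = defaultdict(set)
    for r in rows:
        labels[r["input"]].add(r["gold"])
    return {x: sorted(ls) for x, ls in labels.items() if len(ls) > 1}


print(top_confusions(rows))
# [(('search', 'filter'), 2), (('none', 'search'), 1)]
print(label_contradictions(rows))
# {'q2': ['filter', 'search']}
```

Aggregating errors by class pair surfaces the dominant confusion (here, `search` predicted as `filter`) in one line of output, which is the signal a targeted rewrite would act on.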

Ablation and Component Analyses

Extensive ablation studies isolate component contributions:

Python tool removal: Largest degradation ($-0.35\kappa$ Hiring Assistant location, $-0.79\kappa$ CMA tool-missing), directly linking the analytic affordance to performance lift.
Orchestration autonomy: Replacing the agent with a rigid loop (same tools, fixed order) incurs substantial metric loss (e.g., $-0.27\kappa$ Hiring Assistant location).
Model quality: Downgrading the agent from GPT-5.4 to GPT-4o results in near-total failure on complex tasks.
Auto-rollback and guard-metric floor: Do not improve mean performance, but reliably prevent runs from terminating below the seed score (across 78 runs).

Cost analysis indicates SPEAR converges in 2–3 full evals (compared to on the order of 500 for GEPA) while maintaining approximate cost parity when all sandbox LLM calls are included.

Architectural Implications

The primary architectural advance is SPEAR's coupling of agentic autonomy, active code execution, and monotone improvement guardrails. SPEAR is the first APE optimizer with active code analysis over structured evaluation output. Prior work in code-as-action (ReAct [yao2023react], Reflexion [shinn2023reflexion], ADAS [hu2024adas]) achieves free-form reasoning, but in non-APE domains. Prior APE systems (e.g., OPTO/Trace [cheng2024opto]) treat execution traces as passive feedback, lacking active analysis.

Theoretical and Practical Implications

By enabling the optimizer to autonomously group and analyze errors, SPEAR achieves rapid and robust improvements on tasks where error structure is non-trivial and text-only critique fails (multi-class confusion, label contradictions). Its architecture scales to industrial judge tasks, and the performance evidence indicates best-in-class optimization on such benchmarks. For tasks where output format mismatches dominate (some BBH-7 dimensions), SPEAR's advantage reflects its ability to rapidly rewrite output contracts, not deeper reasoning.

Practical impact includes reduction in convergence iterations, reliability of improvement, and empirical robustness across held-out splits and transfer to new task models. The Python tool is load-bearing for complex tasks, and SPEAR's flexible workflow makes it a plausible generalization for production APE deployments.

Limitations and Reproducibility

The optimizer model (GPT-5.4) is an internal deployment that is not publicly released; partial reproducibility is provided via the released harness, prompts, and public benchmark data. Prompts for the proprietary industrial tasks cannot be released, which leaves a reproducibility gap. The threat model is cooperative; sandbox security is not adversarial-grade. Small dev sets may induce metric variance.

Speculation on Future Developments

The SPEAR paradigm suggests broader adoption of agentic, code-augmented APE in domains requiring high-fidelity prompt optimization, especially where error structure eludes simple textual feedback. Future work may extend SPEAR's workflow to meta-optimization (optimizing optimizers), integrate adversarial sandboxing, and pursue cross-task transfer learning of prompt engineering strategies.

Conclusion

SPEAR leverages code-as-action in an agentic loop to outperform fixed-pipeline APE optimizers, with substantial performance gains on structurally complex judge tasks and competitive results on format-divergent public benchmarks. The active Python tool is uniquely load-bearing, and orchestration autonomy is empirically critical. The methodology is theoretically significant for automatic prompt engineering, and practically robust for industrial deployment scenarios.