FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Published 17 Jun 2026 in cs.SE and cs.AI | (2606.19605v1)

Abstract: Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

Authors (7)

Summary

The paper introduces FAPO, a framework that optimizes multi-step LLM pipelines by combining prompt refinements with structural changes based on step-level evidence.
The methodology combines workspace isolation, agentic optimization, and iterative validation with variant tracking and scoped change proposals for reproducibility.
Empirical evaluation shows FAPO beating the prompt optimizer GEPA in 15 of 18 model-benchmark comparisons, with a mean gain of +14.1 percentage points.

Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines: A Technical Analysis

Motivation and Problem Formulation

The increasing complexity of multi-step LLM pipelines in domains such as security, analytics, and knowledge work necessitates optimization techniques that surpass single-prompt tuning. Failures in these pipelines occur due to intricate interdependencies among retrieval, reasoning, formatting, and control-flow steps. Prompt-only optimization methods are insufficient for such cases, as bottlenecks may arise from pipeline structure or step-level failures that simple prompt refinements cannot address.

FAPO (Fully Autonomous Prompt Optimization) addresses these challenges by automating the optimization of both prompt text and pipeline structure, driven by step-level failure attribution. It also maintains strict scope contracts and emphasizes reproducibility, variant tracking, and data hygiene throughout the optimization loop.

System Architecture and Methodology

FAPO adopts an agentic optimization paradigm, orchestrated by Claude Code. The workflow comprises several key stages:

Workspace Initialization: Each optimization task is encapsulated within a tenant workspace containing task definitions, prompts, evaluation rules, and change policies.
Evaluation and Recording: The current pipeline is executed over training cases, logging all intermediate and final outputs for granular failure attribution.
Step-level Failure Diagnosis: Attribution agents analyze errors at each pipeline stage, categorizing them as prompt-addressable or structural.
Scoped Change Proposal and Review: The optimizer proposes a single change, starting with prompt variations. Only if evidence indicates prompt edits are inadequate—and the scope allows—does it escalate to pipeline parameter or architectural modifications. All changes are reviewed for compliance, data leakage, and contract adherence before re-evaluation.
Iterative Validation: The process repeats, retaining variants that improve aggregate validation performance, and ensuring all outputs and variant histories are persistently logged.

Three core agents support this loop: the optimization agent (orchestrating changes), the step-attribution agent (identifying error modalities), and the variant-reviewer agent (enforcing constraints and safety).
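The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`optimize`, `evaluate`, `diagnose`, `propose`, `review`), the `patience` threshold for deciding that prompt edits have stalled, and the toy scoring functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Variant = Dict[str, int]  # hypothetical: a pipeline variant reduced to a few settings


@dataclass
class Proposal:
    level: str        # "prompt" or "structure"
    variant: Variant  # candidate obtained by applying one scoped change


def optimize(
    baseline: Variant,
    evaluate: Callable[[Variant], float],         # validation score, higher is better
    diagnose: Callable[[Variant], str],           # attribution: "prompt" or "structure"
    propose: Callable[[Variant, str], Proposal],  # one scoped change at the given level
    review: Callable[[Proposal], bool],           # scope / leakage / contract check
    allow_structure: bool,
    budget: int,
    patience: int = 2,                            # assumed stall threshold
) -> Tuple[Variant, float, List[Tuple[str, Optional[float], bool]]]:
    best, best_score = baseline, evaluate(baseline)
    history: List[Tuple[str, Optional[float], bool]] = [("baseline", best_score, True)]
    stalled = 0
    for _ in range(budget):
        bottleneck = diagnose(best)
        # Prompt-first: escalate only when prompt edits have stalled, attribution
        # points at structure, and the scope contract permits structural changes.
        escalate = allow_structure and bottleneck == "structure" and stalled >= patience
        level = "structure" if escalate else "prompt"
        proposal = propose(best, level)
        if not review(proposal):
            history.append((level, None, False))  # rejected before evaluation
            continue
        score = evaluate(proposal.variant)
        accepted = score > best_score
        history.append((level, score, accepted))
        if accepted:
            best, best_score, stalled = proposal.variant, score, 0
        else:
            stalled += 1
    return best, best_score, history


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
def toy_evaluate(v: Variant) -> float:
    # Prompt edits help up to a cap; a third retrieval hop adds a larger gain.
    return 10 * min(v["prompt_quality"], 2) + (30 if v["hops"] >= 3 else 0)


def toy_diagnose(v: Variant) -> str:
    return "structure" if v["hops"] < 3 else "prompt"


def toy_propose(v: Variant, level: str) -> Proposal:
    key = "hops" if level == "structure" else "prompt_quality"
    return Proposal(level, {**v, key: v[key] + 1})


best, score, history = optimize(
    {"prompt_quality": 0, "hops": 2}, toy_evaluate, toy_diagnose, toy_propose,
    review=lambda p: True, allow_structure=True, budget=6,
)
print(best, score)
print([level for level, _, _ in history])
```

In the toy run, prompt edits are accepted until they stop improving the score, and the structural change (an extra retrieval hop) is tried only after two consecutive stalled prompt proposals.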

Comprehensive guardrails include split access control, review of optimization scope, structured logging of all variants and their justifications, and immutability of accepted or rejected edits.
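One way to picture the logging and immutability guardrails is an append-only log of frozen records. The sketch below is an assumption-laden illustration: the record fields, the `make_record` helper, and the `VariantLog` class are hypothetical and are not taken from the FAPO codebase.

```python
import hashlib
import json
from dataclasses import dataclass, FrozenInstanceError
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class VariantRecord:
    variant_id: str
    parent_id: Optional[str]
    level: str                  # "prompt" or "structure"
    justification: str
    val_score: Optional[float]
    decision: str               # "accepted" or "rejected"


def make_record(parent_id: Optional[str], level: str, change: str,
                justification: str, val_score: Optional[float],
                decision: str) -> VariantRecord:
    # Derive the id from the content so the same change on the same parent
    # always maps to the same identifier.
    payload = json.dumps({"parent": parent_id, "level": level, "change": change},
                         sort_keys=True)
    variant_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return VariantRecord(variant_id, parent_id, level, justification,
                         val_score, decision)


class VariantLog:
    """Append-only log: records can be added but never edited or replaced."""

    def __init__(self) -> None:
        self._records: List[VariantRecord] = []

    def append(self, record: VariantRecord) -> None:
        if any(r.variant_id == record.variant_id for r in self._records):
            raise ValueError("variant already logged; records are immutable")
        self._records.append(record)

    def records(self) -> Tuple[VariantRecord, ...]:
        return tuple(self._records)  # read-only view for callers


log = VariantLog()
rec = make_record(None, "prompt", "tighten answer-format instruction",
                  "attribution: formatting failures dominate", 61.0, "accepted")
log.append(rec)
print(len(log.records()), rec.decision)
```

Keeping rejected variants in the same log, rather than discarding them, preserves the justification trail that the guardrails call for.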

Experimental Evaluation

FAPO is compared with the reflective prompt optimizer GEPA across six benchmarks and three task models (GPT-4.1-mini, GPT-5.4-mini, Gemma 3-12B). The six benchmarks cover multi-hop QA (HotpotQA, HoVer), instruction following and verifiability (IFBench), mathematical reasoning (LiveBench-Math, AIME), and privacy-preserving delegation (Papillon). A security classification task, CTIBench-RCM, is evaluated separately on GPT-5, Foundation-Sec-8B-Instruct, and Foundation-Sec-8B-Reasoning.

Both FAPO and GEPA initiate optimization from identical pipeline and prompt baselines. FAPO’s scope encompasses both prompt and pipeline changes (except for CTIBench-RCM, which is prompt-only per protocol).

Key findings:

Aggregate Performance: FAPO outperforms GEPA in 15 of 18 model-benchmark pairs, with a mean FAPO-GEPA gain of +14.1 percentage points (pp). In 11 pairs, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges; this is a descriptive separation criterion, not a formal significance test.
Structural Escalation Effectiveness: For tasks like HoVer and IFBench, where prompt edits could not resolve bottlenecks, FAPO escalated to architectural modifications—such as extending retrieval chains and adding constraint enforcement nodes—resulting in mean gains of +33.8 pp in the subset where escalation occurred.
Security Benchmarks: On CTIBench-RCM (prompt-only), FAPO achieves gains of +4.0 pp (GPT-5), +7.1 pp (Foundation-Sec-8B-Instruct), and +2.0 pp (Foundation-Sec-8B-Reasoning).
Model Asymmetries: Notable performance gaps between task models were traced to differences in how token budgets are accounted for during inference (for example, whether hidden reasoning tokens count against the output budget), which affects output length and quality.
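The "non-overlapping mean ± std ranges" criterion used in the aggregate results can be made concrete with a short check. The function names are hypothetical and the numbers are illustrative only; they are not results from the paper.

```python
def ranges_overlap(mean_a: float, std_a: float, mean_b: float, std_b: float) -> bool:
    """True if [mean_a - std_a, mean_a + std_a] intersects [mean_b - std_b, mean_b + std_b]."""
    lo_a, hi_a = mean_a - std_a, mean_a + std_a
    lo_b, hi_b = mean_b - std_b, mean_b + std_b
    return lo_a <= hi_b and lo_b <= hi_a


def clear_win(mean_new: float, std_new: float, mean_base: float, std_base: float) -> bool:
    """A win counts as 'clear' when the new mean is higher and the ranges do not overlap."""
    return mean_new > mean_base and not ranges_overlap(mean_new, std_new, mean_base, std_base)


# Illustrative numbers only (not taken from the paper).
print(clear_win(70.0, 3.0, 60.0, 4.0))  # ranges [67, 73] and [56, 64]
print(clear_win(70.0, 6.0, 60.0, 5.0))  # ranges [64, 76] and [55, 65]
```

The first call reports a clear win because the two ranges are disjoint; the second does not, because the ranges share the interval from 64 to 65 even though the mean gain is the same.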

These results suggest that evidence-driven, pipeline-aware optimization can deliver large improvements over prompt-only optimization, especially for tasks where structural choices (retrieval, answer formatting, and constraint enforcement) are critical.

Analysis of Optimization Dynamics

FAPO’s optimization trajectory is path-dependent, with observed trial variability attributable to whether structural interventions are discovered. The framework’s escalation policy ensures architectural changes only occur when clearly justified by attribution evidence, reducing unwarranted modifications and potential overfitting.

The reproducible, isolated tenant workspace model enhances both methodological rigor and practical applicability, especially in corporate or multi-tenant deployment scenarios.

Implications and Future Directions

The FAPO framework represents a significant advancement in the automation of LLM pipeline optimization, specifically by unifying prompt and structural search under robust, evidence-based controls. The results indicate that agentic, closed-loop optimization—driven by step-wise failure attribution and scoped escalation—can systematically improve the reliability and task-alignment of complex LLM pipelines.

Practically, this enables organizations to automatically refine workflows for domain-specialized applications (e.g., threat intelligence, privacy compliance) with improved efficiency and traceability. Theoretically, FAPO’s integration of multi-level search and variant provenance advances the science of open-ended LLM system engineering, laying the groundwork for future developments:

Adaptive, Multitask Optimizers: Bridging prompt, module, and chain-level search within a unified agentic environment.
Data- and Failure-Aware Guardrails: Continued enhancement of variant safety, tenant data isolation, and interpretability.
Generalization to Black-Box Settings: Applying these techniques to scenarios with restricted access or weaker supervision signals.

These directions converge toward the long-term goal of robust, self-improving pipeline construction in both research and production LLM systems.

Conclusion

FAPO offers a reproducible, tenant-based framework for fully autonomous, evidence-driven pipeline optimization in multi-step LLM workflows. Its hybrid approach prioritizes prompt refinement and escalates to structural changes only when attribution evidence and the scope contract justify it. Against the prompt optimizer GEPA, this approach wins 15 of 18 model-benchmark comparisons, 11 of them with non-overlapping mean ± trial-standard-deviation ranges, and it also improves accuracy on the prompt-only security benchmark CTIBench-RCM. FAPO's contributions in agentic loop design, step-wise error attribution, and scope control offer a template for automated LLM pipeline engineering.

Reference: "FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines" (2606.19605)

Explain it Like I'm 14

What is this paper about?

This paper introduces FAPO, which stands for Fully Autonomous Prompt Optimization. Think of it as a smart “coach” for AI workflows that use several steps in a row (like searching, thinking, and formatting an answer). Instead of only tweaking the words in a single prompt, FAPO watches the whole process, figures out where things go wrong, and then fixes the right part—starting with the prompt and, if needed, adjusting how the steps are arranged.

What questions are the authors trying to answer?

They focus on three simple questions:

Can we make multi-step AI workflows better by looking at each step, not just the prompt?
How do we automatically find the real cause of errors (like bad search results, wrong reasoning, or broken formatting)?
If changing the prompt isn’t enough, can safe, small changes to the workflow structure boost accuracy?

How did they do it?

To keep this accessible, imagine an assembly line:

Step 1: Find information (retrieval)
Step 2: Think through it (reasoning)
Step 3: Present the answer correctly (formatting)

FAPO is like a careful supervisor:

It runs the whole assembly line on practice problems.
It keeps notes at each step to see where mistakes start.
It suggests one small, allowed change at a time.
It tests the new version and keeps it only if scores improve.

Here’s the loop in everyday terms:

Run the current workflow on training examples.
Spot what’s going wrong (missing facts, confused reasoning, messy output).
Propose a small fix (first try changing the prompt).
If prompt fixes don’t help enough and the rules allow it, adjust a setting or add a step (for example, add another search round or a format checker).
Check the change is safe and fair (no cheating with hidden data).
Keep the better version and repeat.

Some helpful analogies and terms:

“Prompt” = the instructions you give the AI.
“Pipeline” = the series of steps the AI follows.
“Attribution” = figuring out which step caused the mistake.
“Scope/guardrails” = rules about what you’re allowed to change so you don’t cheat or break things.
“Claude Code” = the coding assistant that edits and tests the workflow inside a standard code workspace.

What did they test, and what did they find?

They compared FAPO to another method called GEPA (which mainly optimizes prompts) across six benchmarks and three different AI models, and also ran a separate cybersecurity test. The tasks included:

Multi-hop questions that need info from several sources (HotpotQA, HoVer)
Following strict instructions exactly (IFBench)
Math problems (LiveBench-Math, AIME)
Protecting privacy while answering (Papillon)
A cybersecurity classification task (CTIBench-RCM)

Main results:

FAPO won in 15 out of 18 model–benchmark comparisons.
On average, FAPO improved accuracy by about +14.1 percentage points compared to GEPA.
The biggest gains were on tasks that needed stronger retrieval or strict formatting:
- HoVer and IFBench saw large jumps (often +30 points or more) when FAPO escalated from prompt edits to structural pipeline changes, like adding extra retrieval rounds or adding a step that checks the answer format.
On the security task (CTIBench-RCM), where only prompt edits were allowed, FAPO still improved accuracy across three different models (up to +7.1 percentage points).
One math benchmark (AIME) didn’t show consistent improvement, likely due to small sample sizes and strict formatting that makes small mistakes costly.

Why this matters:

Many failures weren’t due to weak prompts alone—some were due to not finding enough evidence or formatting answers incorrectly. FAPO’s ability to spot and fix the true bottleneck (not just the prompt) is what drove the gains.

Why is this important?

Real-world AI systems often use several steps. If one early step messes up (like poor search), later steps can’t fix it. FAPO shows how to diagnose and repair the right step.
The approach is careful and reproducible: every change is logged, checked, and tested under rules that prevent cheating.
It works for general tasks (fact-checking, instruction following, math) and specialized ones (cybersecurity).
The big idea: Don’t just polish the words you tell the AI—improve the whole process. That leads to more reliable and accurate systems.

Bottom line

FAPO is a practical way to improve multi-step AI workflows. It starts with simple prompt edits and only changes the pipeline’s structure if the evidence shows it’s necessary. Across many tests, this strategy beat a prompt-only optimizer, especially on tasks that need better evidence gathering or strict output formatting. This could help build AI systems that are more dependable in everyday and high-stakes settings (like security).

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper. Each item is phrased to suggest concrete follow-on research.

Optimization cost and efficiency: quantify end-to-end compute/query costs (tokens, wall-clock time, dollars) for both the orchestrator and task model per optimization round and per percentage-point gain; plot cost–performance frontiers versus variant budget (e.g., 10/25/50 variants).
Attribution validity: empirically validate the step-attribution subagent by creating labeled failure-cause datasets and reporting precision/recall/accuracy for attribution categories (e.g., retrieval insufficiency vs formatting vs reasoning).
Ablations on system components: isolate the contribution of each component (step-attribution, reviewer, scope contract, prompt-first policy) via controlled ablations to measure their impact on final scores and trial variance.
Orchestrator dependence: test alternative code agents (e.g., o-series, GPT-4o, Code Llama, SWE-bench-tuned agents) or a human-in-the-loop baseline to measure how much FAPO’s gains rely on Claude Code specifically.
Search strategy design: compare the prompt-first escalation policy to alternatives (e.g., interleaved multi-level search, bandit-style level selection, population-based search) under fixed budgets and report convergence/stability trade-offs.
Convergence and stop criteria: formalize and evaluate stopping rules (e.g., expected improvement thresholds, repeated-plateau tests) and measure cycling or premature convergence under different seeds and budgets.
Path dependence: characterize and mitigate trajectory path dependence (e.g., via multiple restarts, diversity maintenance, Pareto front tracking) and quantify its contribution to the large trial standard deviations observed.
Overfitting controls: go beyond split-access guardrails with nested cross-validation, multiple random splits, or progressive validation to quantify and reduce validation overfitting across optimization rounds.
Statistical rigor: replace “non-overlapping mean ± sd” with formal significance tests (e.g., stratified bootstrap CIs, permutation tests) and report effect sizes with confidence intervals.
Baseline fairness: include additional strong baselines (DSPy/MIPROv2, OPRO, EvoPrompt/PromptBreeder, TextGrad) under matched budgets and identical chain scaffolds to separate FAPO’s agentic advantages from scope differences.
Provider budget asymmetry: rerun comparisons with harmonized visible-output budgets or provider-normalized reasoning budgets to isolate model-quality effects from accounting differences in hidden reasoning tokens.
Generalization across models: test cross-model transfer (apply a pipeline optimized on model A to models B/C) and report retained gains vs model-specific overfitting.
Cross-tenant transfer: study whether learned fixes (prompts, structural patches, post-processors) transfer across tenants with related failure modes (e.g., retrieval expansion modules for multi-hop tasks).
Robustness to distribution shift: evaluate FAPO-optimized pipelines under out-of-domain or temporally shifted data and measure performance decay and re-optimization cost.
Real-world deployment metrics: quantify latency, throughput, and cost impacts of structural interventions (e.g., extra retrieval hops, deterministic validators) and analyze SLO/SLA trade-offs.
Safety and guardrail robustness: red-team the optimizer to test leakage prevention, scorer integrity, and scope contract enforcement; report breach rates and fixes (e.g., sandboxing, policy provers).
Scorer gaming and brittleness: assess whether FAPO learns to exploit scorer idiosyncrasies (e.g., formatting hacks); add adversarial/scorer-perturbation tests and robust scoring variants.
Attribution-driven structural edits: provide systematic analyses of recall/precision trade-offs in retrieval expansions and deterministic format enforcement, including when such changes hurt performance.
AIME underperformance: diagnose why AIME lags (e.g., reasoning truncation, extraction failures, small-sample overfitting) and test targeted remedies (budget tuning, symbolic post-checkers, math tool use).
Multimodal and tool-use breadth: extend evaluation to multimodal tasks, code-generation with execution feedback, and complex tool-use/agentic planning pipelines to test FAPO’s generality.
Long-context and memory-heavy scenarios: evaluate pipelines requiring 100k+ tokens or memory components; measure how FAPO manages retrieval/routing under long-context constraints.
Scope contract design: study how restrictive vs permissive scope contracts affect outcomes, safety incidents, and costs; propose templates and auto-tuning of scopes for different task families.
Reproducibility under model churn: quantify sensitivity of optimized variants to provider version changes; propose version pinning, shadow evaluations, or robustness checks against model updates.
Tokenization and prompting sensitivity: perform sensitivity analyses over temperature/top-p, system prompts, and tokenization differences to map stable vs brittle regions of the search space.
Variant management at scale: investigate search-space bloat and provenance (thousands of immutable variants), proposing pruning, lineage clustering, and “design rules” learned from histories.
Explainability of changes: generate automatic rationales and diffs for accepted variants that map changes to failure clusters; measure human interpretability and auditability.
Multi-objective optimization: introduce joint objectives (accuracy, latency, cost, privacy leakage) and evaluate Pareto-efficient pipelines, not just single-metric winners.
Data governance in multi-tenant settings: formally verify isolation with synthetic leakage tests; document and test data residency/compliance constraints in tenant workflows.
Theoretical framing: relate FAPO’s loop to black-box optimization (e.g., EAs/BO); analyze sample complexity, expected improvement under noisy scorers, and conditions favoring structural vs prompt-level edits.
Lifecycle and maintenance: study how often re-optimization is required under drift, the amortized cost of keeping pipelines performant, and triggers for automatic re-optimization.
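The "Statistical rigor" item above asks for formal tests in place of non-overlapping mean ± sd ranges. One option is an exact two-sample permutation test on per-trial scores, sketched below. The function name `permutation_test` and the trial scores are hypothetical; the scores are illustrative and not taken from the paper.

```python
from itertools import combinations
from typing import Sequence, Tuple


def permutation_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Exact two-sided permutation test on the difference of means.

    Enumerates every way of splitting the pooled scores into groups of the
    original sizes and counts splits whose absolute mean difference is at
    least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    observed = sum(a) / n_a - sum(b) / n_b
    total = sum(pooled)
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        if abs(diff) >= abs(observed) - 1e-12:  # small tolerance for float rounding
            extreme += 1
        n_splits += 1
    return observed, extreme / n_splits


# Illustrative per-trial scores for two optimizers (not from the paper).
new_scores = [70.0, 72.0, 68.0]
base_scores = [60.0, 61.0, 59.0]
observed, p_value = permutation_test(new_scores, base_scores)
print(observed, p_value)
```

With three trials per optimizer there are only 20 possible splits, so the smallest achievable two-sided p-value is 2/20 = 0.1 even when the groups are perfectly separated, as in this example. That illustrates why a small number of trials limits what any formal test can conclude.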

Practical Applications

Below is a concise mapping from the paper’s findings and tooling to concrete, real-world applications. Each item notes the sector, the actionable use case, and any tools/workflows and assumptions/dependencies that affect feasibility.

Immediate Applications

The following applications can be deployed now using the FAPO framework, its tenant workspaces, LangGraph-based pipelines, and Claude Code-driven prompt/pipeline search.

Software/AI Platforms — Continuous optimization of LLM pipelines in CI/CD
- What: Add FAPO as a “self-healing” step in CI to evaluate current chains, attribute failures (retrieval, reasoning, formatting), and propose reviewed prompt or chain changes, then accept only if validation improves.
- Tools/workflows: LangGraph pipelines; FAPO tenant workspaces; “optimization,” “step-attribution,” and “variant-reviewer” agents; eval-runner; scope contract and guardrails; GitHub PR gating for prompt/chain variants.
- Assumptions/dependencies: Clear evaluation metric; train/validation/test separation; intermediate artifact logging enabled; access to model APIs; reviewer-approved scope contract.
Security — CVE-to-CWE mapping and SOC triage enrichment
- What: Use FAPO’s prompt optimization to increase accuracy for CVE→CWE mapping and enforce output formats for downstream tooling (e.g., ticketing, dashboards).
- Tools/workflows: Prompt-only optimization (as in CTIBench-RCM); deterministic post-processing nodes for ID extraction; tenant isolation for sensitive data; Foundation-Sec models or frontier APIs.
- Assumptions/dependencies: Labeled CVE–CWE data; well-defined exact-match scoring; privacy controls and data handling policies; model/API budget.
Enterprise Analytics/Knowledge Management — Retrieval augmentation fixes for QA/verification
- What: Address multi-hop retrieval failures (e.g., HoVer-like tasks) by extending retrieval hops, adding multi-query BM25, and entity-aware rescue when prompt edits plateau.
- Tools/workflows: Step-attribution to detect retrieval bottlenecks; permitted escalation to add retrieval nodes; caching and BM25 index configuration within tenant workspace.
- Assumptions/dependencies: Permission to modify chain structure; reproducible indices; latency/cost budgets for additional retrieval.
Compliance/Operations — Deterministic constraint enforcement for instruction following
- What: Improve instruction-following reliability (IFBench-like) via post-processing nodes that validate and correct format/constraints (e.g., templates, JSON schemas).
- Tools/workflows: Deterministic validator/corrector nodes; FAPO attribution to pinpoint format failures; variant-reviewer to ensure scorer compatibility and placeholder integrity.
- Assumptions/dependencies: Unambiguous constraints; programmatic validators; scope allowing non-prompt edits.
Privacy/PII Handling — Privacy-conscious routing and redaction
- What: Optimize prompts and routing policies (Papillon-like) to balance utility with PII leakage minimization in multi-model ensembles.
- Tools/workflows: PII-aware prompts; optional redaction/pre-filtering nodes; tenant-scoped privacy rules; audit logs of intermediate artifacts.
- Assumptions/dependencies: PII detection tools; organization-specific privacy policies; governance review of logging and storage.
Financial/Business Reporting — Structure-preserving extraction and templated outputs
- What: Enforce stable structured outputs (e.g., JSON, tabular summaries) from reports and filings by combining prompt tuning with deterministic post-processing.
- Tools/workflows: Format validators; exact-match or schema-conformance scorers; scope-controlled chain edits; regression tests in eval-runner.
- Assumptions/dependencies: Robust scorers; domain schemas; access to representative validation sets to avoid overfitting.
Education/Assessment — Auto-grading and format enforcement for short-answer tasks
- What: Reduce near-miss and format errors on math/short-answer tasks by constraining brevity and answer extraction (e.g., “final answer only,” LaTeX/boxed output).
- Tools/workflows: Brevity and format prompts; parser/validator nodes; attribution for “verbose” vs. “wrong-answer” separation.
- Assumptions/dependencies: Reliable parsers; clear grading rubric; careful budget settings to avoid truncation-induced failures.
Research/Academia — Reproducible prompt and pipeline studies
- What: Use tenant workspaces and variant immutability to run controlled comparative studies, ablations, and prompt evolution with transparent artifacts.
- Tools/workflows: Standardized eval-runner; variant history; split access controls; per-tenant isolation; CLAUDE.md repo guidance.
- Assumptions/dependencies: Benchmarks with trusted scorers; compute budget; agreed-upon scope contracts across collaborators.
Governance/Policy — Prompt change control, audit, and compliance
- What: Treat prompt and pipeline edits like code changes: scoped contracts, independent review, immutable variants, and audit trails for compliance (e.g., SOX-like change control, AI governance).
- Tools/workflows: Scope contract; variant-reviewer gate; iteration logs; split access controls; artifact retention.
- Assumptions/dependencies: Organizational policy alignment; auditable storage; segregation of duties between optimizer and reviewer.
Daily Life/Personal Productivity — Reliable template filling and document cleanup
- What: Add “final-pass” deterministic checks to personal automations (e.g., resumes, invoices, emails) to avoid verbose or malformed outputs after LLM generation.
- Tools/workflows: Lightweight LangGraph chain; format validators; small local datasets for validation.
- Assumptions/dependencies: Willingness to maintain a simple validation set; available local/runtime environment.
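Several of the applications above rely on a deterministic post-processing node that validates and corrects output format. A minimal sketch of such a node follows, assuming the task expects a flat JSON object; the function name `enforce_json`, the schema, and the sample outputs are hypothetical.

```python
import json
from typing import Dict, List, Optional, Tuple


def enforce_json(raw: str, required: Dict[str, type]) -> Tuple[Optional[dict], List[str]]:
    """Extract the outermost JSON object from model output and check a flat schema.

    Returns (cleaned_object, errors). The cleaned object keeps only the
    required keys; it is None whenever any error is reported.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None, ["no JSON object found"]
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    errors: List[str] = []
    for key, typ in required.items():
        if key not in obj:
            errors.append(f"missing key: {key}")
        elif not isinstance(obj[key], typ):
            errors.append(f"wrong type for {key}")
    if errors:
        return None, errors
    # Correction step: drop surrounding chatter and any extra keys.
    return {key: obj[key] for key in required}, []


schema = {"label": str, "score": float}  # hypothetical output schema

chatty = 'Sure, here is the result: {"label": "CWE-79", "score": 0.9, "note": "extra"} Hope that helps.'
print(enforce_json(chatty, schema))
print(enforce_json("I cannot answer that.", schema))
print(enforce_json('{"label": "CWE-79"}', schema))
```

Because the node is deterministic, the same raw output always yields the same cleaned result or the same error list, which keeps scorer behavior reproducible across optimization rounds.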

Long-Term Applications

The following applications are feasible with further research, scaling, or development, building on FAPO’s prompt-first, attribution-driven escalation and tenant-based governance.

Software/AI Platforms — Always-on, cost/safety-aware self-optimizing production systems
- What: Continuous online optimization with drift detection, budget-aware reasoning depth, A/B testing, and automatic rollback, balancing latency, accuracy, and cost.
- Tools/workflows: Live traffic shadow evals; policy engines for cost and safety; dynamic scope escalation policies; model routers.
- Assumptions/dependencies: Robust online metrics; safe rollback; strong guardrails to avoid overfitting to live data; regulatory acceptance of autonomous changes.
Security — Closed-loop red-team/blue-team co-optimization
- What: Integrate adversarial evaluation (red) with FAPO (blue) for rapid “attack-then-fix” loops that harden pipelines against jailbreaks and data exfiltration.
- Tools/workflows: Adversarial benchmarks; structured failure attribution for security modes; approved chain changes (sanitization, policy enforcement).
- Assumptions/dependencies: Safe adversarial testing environments; incident response integration; alignment with disclosure policies.
Healthcare — Verifiable clinical coding and evidence-linked summarization
- What: Optimize pipelines that map clinical text to standardized codes (e.g., ICD, SNOMED) and generate summaries with traceable multi-hop evidence.
- Tools/workflows: Evidence-aware retrieval chains; clinical schema validators; provenance recording in step artifacts.
- Assumptions/dependencies: HIPAA/GDPR compliance; domain datasets; human-in-the-loop review; rigorous scorers to prevent harmful errors.
Finance/Regulatory — Compliance-grade, explainable information extraction
- What: Build explainable, auditable pipelines for KYC/AML, filings analysis, and policy compliance with deterministic constraint enforcement and full artifact trails.
- Tools/workflows: Deterministic post-processing; explainability dashboards for step artifacts; regulator-facing audit packs.
- Assumptions/dependencies: Standardized schemas; regulator-approved evaluation; strict data lineage and retention policies.
Government/Policy — Standards for AI pipeline auditability and reproducibility
- What: Use FAPO’s tenant model as a blueprint for policy around AI change management: scope contracts, split access controls, artifact retention, and variant immutability.
- Tools/workflows: Compliance frameworks codifying tenant patterns; certification checklists; interoperability profiles for artifact logging.
- Assumptions/dependencies: Multi-stakeholder agreement on standards; cross-vendor tooling support; legal frameworks for automated code edits.
Education — Verifiable, personalized tutoring with constraint checks
- What: Tutors that optimize for verifiable instruction following, emit step-by-step reasoning, and enforce format correctness for grading and learner feedback.
- Tools/workflows: Per-learner tenants; formative assessment datasets; structured feedback validators.
- Assumptions/dependencies: Privacy-preserving telemetry; robust evaluation of learning outcomes; bias and fairness assessments.
Robotics/Autonomy — Plan-and-execute pipeline optimization
- What: Attribute failures to perception/reasoning/actuation steps and optimize prompts/structure (e.g., planning depth, safety checks) under simulator-based validation.
- Tools/workflows: Multi-modal step artifacts; simulation-based scorers; safety constraint nodes.
- Assumptions/dependencies: High-fidelity simulators; reliable perception logs; real-world transfer studies.
Energy/Operations Research — Schedule/dispatch assistants with hard constraints
- What: Add deterministic constraint enforcement to LLM-based scheduling/optimization assistants, with step-level attribution guiding structural changes (e.g., solver-in-the-loop).
- Tools/workflows: Hybrid LLM–OR pipelines; constraint-checking nodes; attribution for infeasibility vs. format errors.
- Assumptions/dependencies: Domain-specific solvers and datasets; rigorous constraint definitions; safety certification.
Multi-modal and Cross-model Optimization — Generalized attribution across modalities and routers
- What: Extend step attribution and scope escalation to image/audio/video tasks and model-routing policies (e.g., “when to call reasoning models”).
- Tools/workflows: Multi-modal artifact capture; router policies optimized under guardrails; cost/latency-aware objectives.
- Assumptions/dependencies: Unified artifact schema across modalities; accurate and fast multi-modal scorers; provider limits on hidden reasoning tokens.
Ecosystem/Marketplace — Shareable optimization recipes and attribution libraries
- What: Distribute reusable “recipes” for common failure modes (e.g., retrieval coverage boosters, constraint enforcers) and attribution dashboards across organizations.
- Tools/workflows: Template tenants; plug-in libraries for validators, retrievers, and post-processors; community benchmarks.
- Assumptions/dependencies: Interoperability across stacks (LangGraph/other); licensing for shared artifacts; governance for recipe provenance.

Notes on general feasibility

Key dependencies: Clear, measurable objective functions; representative validation sets; reliable step-level logging; compute and token budgets; data governance and privacy constraints; access to Claude Code or equivalent coding agents; scope contracts that reflect organizational risk tolerance.
Risk/assumption highlights: Potential overfitting without strict split hygiene; path dependence in optimization trajectories; vendor/API token budget asymmetries; regulatory limits on autonomous code changes; need for human oversight on structural edits in safety-critical domains.


Glossary

Abstention: A failure mode where the model declines to answer rather than producing a result. "abstention (model declined to answer, 8 cases)"
AIME: A competition-style math benchmark with short exact answers used to evaluate reasoning. "AIME is the only benchmark where GEPA leads."
AgentBench: A benchmark suite for evaluating LLMs acting as agents across tasks. "Evaluation suites such as HELM [15], BIG-bench [29], and AgentBench [17] measure model capabilities."
Best-of-N: A search strategy that generates up to N candidates and counts as a success if at least one of them achieves the objective. "In red-teaming, the target is often adversarial and Best-of-N: under a fixed query budget, generate or refine candidates until at least one prompt elicits a jailbreak."
BIG-bench: A wide-ranging benchmark suite for assessing LLM capabilities. "Evaluation suites such as HELM [15], BIG-bench [29], and AgentBench [17] measure model capabilities."
BM25: A classic lexical retrieval function used to fetch relevant documents. "Multi-hop QA. We replicate the GEPA HotpotQA [34] pipeline as a six-node LangGraph chain: two BM25 retrieval nodes (k = 7) and four LLM nodes."
Chain architecture: The structural design of a multi-step pipeline (nodes and their connections). "For the non-CTIBench-RCM tasks, the scope contract permits escalation to chain-parameter or chain-architecture changes only when prompt optimization appears insufficient and the attribution report identifies a bottleneck that prompts are unlikely to fix."
Chain parameters: Tunable settings of pipeline components (e.g., k in retrieval, sampling settings). "For the non-CTIBench-RCM tasks, the scope contract permits escalation to chain-parameter or chain-architecture changes only when prompt optimization appears insufficient and the attribution report identifies a bottleneck that prompts are unlikely to fix."
Chain-of-thought: A prompting/program structure that elicits intermediate reasoning steps. "GEPA optimizes the instruction string inside a fixed DSPy chain-of-thought program using MIPROv2-Heavy evolutionary search."
Claude Code: An agentic coding tool used as the orchestrator for autonomous pipeline optimization. "FAPO uses Claude Code [3] as its orchestrator optimization layer"
CTIBench-RCM: A security benchmark mapping vulnerability descriptions to weakness categories. "CTIBench-RCM [2] maps CVE descriptions to CWE IDs."
CVE: Common Vulnerabilities and Exposures, standardized identifiers for security vulnerabilities. "maps CVE descriptions to CWE IDs."
CWE: Common Weakness Enumeration, a taxonomy of software weakness types. "maps CVE descriptions to CWE IDs."
Deterministic constraint enforcement: Hard-coded enforcement of output requirements to ensure instruction following. "The largest improvements occur on HoVer [11] and IFBench [25], where FAPO extends retrieval chains or introduces deterministic constraint enforcement."
Deterministic post-processing nodes: Fixed logic steps added after model outputs to enforce formats or constraints. "FAPO added deterministic post-processing nodes that enforce instruction constraints."
DSPy: A framework for compiling declarative LLM programs into optimized pipelines. "Prompt-programming systems such as DSPy [13] optimize LLM-based modules; GEPA [1] optimizes prompts inside pipelines."
Entity-aware rescue: A retrieval augmentation that targets entities to recover missing evidence. "with multi-query BM25 search and entity-aware rescue."
Eval config: A configuration file specifying a reproducible pipeline setup and selected variants. "An eval config defines a reproducible chain configuration by specifying parameters as well as selecting variants (versions of prompts and chains that are generated during optimization)."
Exact match (EM): A strict scoring metric that requires exact equality between prediction and reference. "The optimization metric is exact match (EM), following GEPA's protocol; F1 is retained only as an auxiliary diagnostic."
Exact-match scoring: Evaluating correctness by requiring exact identifier or string matches. "exact-match scoring on extracted CWE IDs."
Evolutionary search: An optimization method that iteratively mutates and selects candidates based on fitness. "using MIPROv2-Heavy evolutionary search."
FAPO: Fully Autonomous Prompt Optimization, an agentic framework to optimize prompts and pipelines via evidence-driven iterations. "We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase."
GEPA: A prompt optimizer that evolves instructions within fixed multi-step programs. "Prompt-programming systems such as DSPy [13] optimize LLM-based modules; GEPA [1] optimizes prompts inside pipelines."
Guardrails: Constraints and checks that bound automated optimization to prevent overreach or leakage. "scope-constrained guardrails,"
HELM: A holistic evaluation suite for LLMs across many dimensions. "Evaluation suites such as HELM [15], BIG-bench [29], and AgentBench [17] measure model capabilities."
HoVer: A many-hop fact verification benchmark requiring evidence aggregation across documents. "HoVer [11] is a many-hop fact-verification task."
IFBench: A benchmark for verifiable instruction following with explicit constraint checks. "IFBench [25] measures verifiable instruction following."
Jailbreaking: Adversarial prompting to circumvent model safety or constraints. "Prompt-space search and optimization have already been extensively explored in the jailbreaking literature."
LangGraph: A library for building stateful LLM workflows as graphs. "FAPO uses LangGraph [14] to represent the pipeline as a stateful graph."
LiveBench-Math: A contamination-limited benchmark for evaluating mathematical reasoning. "LiveBench-Math [32] evaluates mathematical reasoning on contamination-limited benchmark problems."
MIPROv2-Heavy: A specific evolutionary search configuration used in GEPA’s optimization. "GEPA optimizes the instruction string inside a fixed DSPy chain-of-thought program using MIPROv2-Heavy evolutionary search."
Multi-hop QA: Question answering that requires retrieving and reasoning over multiple pieces of evidence. "Multi-hop QA. We replicate the GEPA HotpotQA [34] pipeline as a six-node LangGraph chain: two BM25 retrieval nodes (k = 7) and four LLM nodes."
NVD: National Vulnerability Database, a standard reference whose conventions affect labeling and abstraction level. "the phrase 'standard NVD abstraction level.'"
Optimization budget: The cap on the number of variants or rounds used during automated search. "The FAPO budget is limited to 50 variants or 10 optimization rounds per trial, whichever comes first."
Papillon: A benchmark evaluating privacy-preserving delegation under utility constraints. "Papillon [28] evaluates privacy-conscious delegation."
Prompt-first policy: A strategy that prioritizes prompt edits before escalating to pipeline changes. "FAPO still followed a prompt-first policy so that structural changes were considered only after prompt-level search exposed a structural bottleneck."
Prompt-space search: Searching over prompt text (and related prompt parameters) to optimize performance. "Prompt-space search and optimization have already been extensively explored in the jailbreaking literature."
Red-teaming: Systematic adversarial testing of models to expose failures. "In red-teaming, the target is often adversarial and Best-of-N: under a fixed query budget, generate or refine candidates until at least one prompt elicits a jailbreak."
Reflector model: A helper model used to critique or evolve prompts in optimization loops. "the reflector model, which we replace with Claude Opus 4.6 through Amazon Bedrock using provider-default settings."
Retrieval chain: A multi-hop sequence of retrieval steps to gather sufficient evidence. "extended the baseline 3-hop retrieval chain to 4-5 hops, with multi-query BM25 search and entity-aware rescue."
Scope contract: A specification of what kinds of changes (prompt, parameters, structure) are allowed. "It then writes a scope contract."
Step-attribution: Classifying errors by pipeline step and fix type to guide optimization. "The step-attribution subagent analyzes failures after each evaluation."
Tenant: An isolated workspace encapsulating a task’s code, data, rules, and optimization history. "FAPO organizes optimization around a tenant, the unit used throughout the paper to represent a task with an evaluation criteria and workflows."
Tenant playbook: The primary policy document defining layout, constraints, and guidance for a tenant. "The tenant playbook describes the tenant on a high-level, describes the layout of the tenant code and data, and specifies the constraints of the optimization."
Variant immutability: The rule that every attempted change is preserved as a new version, avoiding in-place edits. "Variant immutability: Every accepted or rejected attempt gets a new variant file."
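To make the "Exact match (EM)" and "Exact-match scoring" entries concrete, here is a minimal sketch of exact-match accuracy over extracted CWE IDs. The extraction rule (take the last `CWE-<number>` mentioned, ignoring case and leading zeros) and the function names are assumptions for illustration; the paper's actual extraction logic is not shown in this section. An output with no CWE ID, such as an abstention, is counted as incorrect.

```python
import re
from typing import Optional, Sequence

# Assumed pattern: "CWE-" followed by digits, case-insensitive.
CWE_PATTERN = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def extract_cwe_id(text: str) -> Optional[str]:
    """Return the last CWE ID in the text as 'CWE-<n>', or None if absent."""
    matches = CWE_PATTERN.findall(text)
    return f"CWE-{int(matches[-1])}" if matches else None


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Fraction of predictions whose extracted ID equals the reference ID."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        pred_id = extract_cwe_id(pred)
        # A missing ID (e.g. an abstention) never counts as a match.
        if pred_id is not None and pred_id == extract_cwe_id(ref):
            hits += 1
    return hits / len(references)


preds = [
    "The weakness is CWE-79.",
    "I cannot determine this.",
    "Possibly CWE-89, but more likely cwe-022",
]
refs = ["CWE-79", "CWE-20", "CWE-22"]
print(exact_match_accuracy(preds, refs))
```

Because EM gives no partial credit, a near-miss such as a parent or sibling CWE scores the same as an abstention, which is one reason abstraction-level conventions matter for this metric.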
