Genetic-Pareto (GEPA) Optimization
- Genetic-Pareto (GEPA) is a framework that treats prompt evolution as an evolutionary search problem, using natural-language feedback and Pareto retention to optimize compound AI systems.
- It leverages instance-wise evaluation and reflective prompt mutation to maintain a diverse set of high-performing candidates across various tasks.
- Reported applications include scientific reasoning, risk-of-bias assessment, OpenACC pragma generation, and medical note error detection, with the cited studies reporting gains over baseline or hand-written prompts in most settings.
Genetic-Pareto (GEPA) is a prompt optimization framework in which prompt or system-instruction candidates are iteratively improved through natural-language reflection and retained through Pareto-based selection. In its most explicit formulation, GEPA is designed for compound AI systems composed of one or more language-model modules, potentially interleaved with tools and control flow, and it adapts such systems by evolving prompts rather than model weights (Agrawal et al., 25 Jul 2025). Subsequent work has used the name for both the DSPy GEPA module and for custom simplified variants, especially in prompt engineering for scientific reasoning, risk-of-bias assessment, OpenACC pragma generation, and medical note error detection; across these uses, the recurring elements are reflective prompt revision, population-level candidate tracking, and preservation of diverse non-dominated strategies rather than immediate collapse to a single incumbent (Pandey et al., 30 Mar 2026).
1. Definition and scope
GEPA is expanded as Genetic-Pareto, though one application paper writes the acronym as GEnetic-PAreto (Jhaveri et al., 12 Jan 2026). The defining idea is that prompt optimization can be treated as an evolutionary search problem in which candidate prompts are revised in language space and assessed using multiple signals or multiple instances, while a Pareto mechanism preserves diverse high-performing candidates (Agrawal et al., 25 Jul 2025).
In the core prompt-optimization formulation, GEPA targets compound AI systems rather than isolated prompts. A system is formalized as $\Phi = (M, C, \mathcal{X}, \mathcal{Y})$, where $M$ is a set of language modules, $C$ is control flow, $\mathcal{X}$ and $\mathcal{Y}$ are global input/output schemas, and each module $M_i \in M$ has a prompt $\pi_i$, model weights $\theta_i$, and input/output schemas $\mathcal{X}_i, \mathcal{Y}_i$. The optimization target is the collection of prompts $\Pi_\Phi = \langle \pi_1, \ldots, \pi_{|M|} \rangle$, with the goal of maximizing a task metric $\mu$ under a rollout budget $B$ (Agrawal et al., 25 Jul 2025). This formulation places prompt evolution in the same optimization space as modular inference programs, retrieval pipelines, and tool-augmented systems.
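The formalization above can be mirrored as a small data structure. The following is a minimal sketch for illustration only: the class names `Module` and `CompoundSystem`, the field `weights_id`, and the helper `with_prompt` are assumptions made for this example, not names taken from the GEPA paper or from DSPy, and control flow is omitted.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Module:
    """One language module: prompt, frozen weights, input/output schemas."""
    prompt: str                      # the text a prompt optimizer may edit
    weights_id: str                  # hypothetical identifier of the frozen model
    input_schema: Tuple[str, ...]
    output_schema: Tuple[str, ...]


@dataclass(frozen=True)
class CompoundSystem:
    """An ordered collection of modules (control flow is left out of this sketch)."""
    modules: Tuple[Module, ...]

    def prompts(self) -> Tuple[str, ...]:
        """Return the optimization target: the collection of module prompts."""
        return tuple(m.prompt for m in self.modules)

    def with_prompt(self, index: int, new_prompt: str) -> "CompoundSystem":
        """Copy the system, replacing only the prompt of module `index`."""
        updated = replace(self.modules[index], prompt=new_prompt)
        modules = self.modules[:index] + (updated,) + self.modules[index + 1:]
        return CompoundSystem(modules=modules)


base = CompoundSystem(modules=(
    Module("Write a search query.", "frozen-lm", ("question",), ("query",)),
    Module("Answer the question.", "frozen-lm", ("question", "passages"), ("answer",)),
))
child = base.with_prompt(1, "Answer concisely and cite the supporting passage.")
```

Making the classes immutable means every edited prompt yields a new system object, which matches the idea that only prompts change while weights and schemas stay fixed.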
Later papers use GEPA at different levels of abstraction. Some adopt the DSPy GEPA module directly and describe it as a prompt optimization algorithm using natural language reflection and Pareto-guided search (Li et al., 1 Dec 2025). Others instantiate only part of the framework. The scientific-reasoning study "Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners" presents a custom simplified variant tailored to Lean theorem proving and GPQA, with a Pareto archive, LLM-based critique, and prompt rewriting, but without a full formal exposition of standard GEPA (Pandey et al., 30 Mar 2026). The OpenACC work similarly presents GEPA as an existing framework specialized to pragma generation rather than a new algorithm from first principles (Jhaveri et al., 12 Jan 2026).
This variation in usage is important. In the literature, “GEPA” can denote three related but non-identical objects: the general reflective prompt evolution framework, the DSPy implementation used in application pipelines, and a family of custom GEPA-like procedures that preserve the core pattern of reflection plus Pareto retention while omitting parts of the original machinery. This suggests that GEPA is already functioning both as a concrete optimizer and as a broader design pattern for language-mediated evolutionary search.
2. Core optimization loop
The canonical GEPA loop consists of candidate selection, rollout collection, reflective prompt mutation, acceptance testing, and Pareto-set evaluation (Agrawal et al., 25 Jul 2025). The optimizer receives a system $\Phi$, a training dataset $D_{\text{train}}$, an evaluation metric $\mu$, a textual feedback function $\mu_f$, a rollout budget $B$, a minibatch size $b$, and a Pareto set size $n_{\text{pareto}}$. Accessible optimization data are split into a feedback set and a Pareto set. GEPA initializes a candidate pool $\mathcal{P} \leftarrow [\Phi]$, evaluates the initial candidate on Pareto instances, and stores a vector of per-instance scores rather than only an aggregate score.
At each iteration, GEPA selects a candidate system to evolve, chooses a target module $j$, samples a minibatch from the feedback set, and executes the system on that minibatch to collect trajectories and module-specific feedback. It then calls an update step of the form
$$\pi_j' \leftarrow \mathrm{UPDATE\_PROMPT}(\pi_j, \text{feedbacks}, \text{traces}[j]),$$
which rewrites the target module’s prompt in natural language. A child system is formed by copying the parent and replacing only that module prompt. The child is subjected to a local acceptance test: if its minibatch average score improves over the parent’s, it is admitted to the candidate pool and evaluated on the full Pareto set; otherwise it is discarded (Agrawal et al., 25 Jul 2025).
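The mutation and local acceptance test described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the function name `propose_child`, the `Metric` signature, and the stand-ins `toy_metric` and `toy_update` are hypothetical, and no language model is called; in GEPA the feedback would come from execution traces and evaluator diagnostics.

```python
from statistics import mean
from typing import Callable, List, Optional, Sequence

Prompts = List[str]                       # one prompt per module
Metric = Callable[[Prompts, str], float]  # hypothetical: score of a system on one example


def propose_child(
    parent: Prompts,
    module_index: int,
    minibatch: Sequence[str],
    metric: Metric,
    update_prompt: Callable[[str, List[str]], str],
) -> Optional[Prompts]:
    """Mutate one module prompt; keep the child only if its minibatch
    average strictly improves over the parent's."""
    parent_scores = [metric(parent, x) for x in minibatch]
    # Stand-in for textual feedback gathered during the parent's rollouts.
    feedback = [f"example={x!r} score={s:.2f}" for x, s in zip(minibatch, parent_scores)]

    child = list(parent)  # copy the parent, then replace a single prompt
    child[module_index] = update_prompt(parent[module_index], feedback)

    child_scores = [metric(child, x) for x in minibatch]
    if mean(child_scores) > mean(parent_scores):
        return child  # the caller would next score it on the full Pareto set
    return None       # discarded


# Toy stand-ins so the sketch runs without any model.
def toy_metric(prompts: Prompts, example: str) -> float:
    return 1.0 if "step by step" in prompts[0] else 0.5


def toy_update(prompt: str, feedback: List[str]) -> str:
    return prompt + " Think step by step."


accepted = propose_child(["Solve the task."], 0, ["q1", "q2"], toy_metric, toy_update)
rejected = propose_child(["Solve the task."], 0, ["q1", "q2"], toy_metric, lambda p, f: p)
```

The second call returns `None` because an unchanged prompt ties with its parent, and a tie does not count as an improvement in this sketch.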
The same structural logic appears in domain-specific variants. In the scientific-reasoning paper, the complete loop is given procedurally in Algorithm 1. The algorithm starts from Population <- {P0}, Pareto <- ∅, samples a prompt from the Pareto set if non-empty or from the current population otherwise, samples subsets of Lean theorems and GPQA questions, evaluates the prompt using Lean_Verify and Check_Answer, updates the Pareto archive, collects failed items, generates Critique <- LLM_Critic(P, Logs(Errors)), evolves the prompt via P' <- LLM_Evolve(P, Critique), prunes the population to the Pareto archive plus the new child, and returns the Pareto set after a fixed number of iterations (Pandey et al., 30 Mar 2026). That variant omits crossover, objective formalization, archive-size specification, and several standard multi-objective details, but it preserves the GEPA pattern of evaluation → critique → prompt rewrite → Pareto retention.
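Because that variant does not spell out its dominance relation, any concrete archive update has to assume one. The sketch below assumes the textbook definition over a two-component score vector (Lean pass rate, GPQA accuracy); the names `dominates` and `update_archive` and all scores are invented for illustration and do not come from the paper.

```python
from typing import Dict, Tuple

Scores = Tuple[float, float]  # assumed objective vector: (Lean pass rate, GPQA accuracy)


def dominates(a: Scores, b: Scores) -> bool:
    """Textbook Pareto dominance: a is no worse everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive: Dict[str, Scores], prompt: str, scores: Scores) -> Dict[str, Scores]:
    """Add `prompt` unless an archived prompt dominates it; drop prompts it dominates."""
    if any(dominates(old, scores) for old in archive.values()):
        return dict(archive)
    kept = {p: s for p, s in archive.items() if not dominates(scores, s)}
    kept[prompt] = scores
    return kept


archive: Dict[str, Scores] = {}
archive = update_archive(archive, "P0", (0.50, 0.60))
archive = update_archive(archive, "P1", (0.70, 0.55))  # trade-off: kept alongside P0
after_two = dict(archive)
archive = update_archive(archive, "P2", (0.75, 0.65))  # dominates both P0 and P1
archive = update_archive(archive, "P3", (0.60, 0.60))  # dominated by P2: rejected
```

Under a different dominance definition (for example, per-item rather than per-benchmark scores) the same prompts could be retained or pruned differently, which is why the unspecified details matter.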
The OpenACC application adds a domain-specific evaluation loop. A candidate prompt is used with a student model to fill either a <DM_PRAGMA> or <LP_PRAGMA> placeholder in code; the generated pragma is normalized to a canonical map, compared semantically to an expert gold pragma, and converted into a structured mismatch report identifying missing clauses, unnecessary clauses, directive mismatches, and clause-parameter errors. This score-plus-feedback bundle is then used to mutate the prompt, while Pareto-front maintenance preserves diverse prompt candidates (Jhaveri et al., 12 Jan 2026).
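A simplified version of the normalize-then-compare step might look as follows. This is a sketch under stated assumptions, not the paper's implementation: the functions `normalize` and `mismatch_report`, the regular expression, and the report wording are hypothetical, and the parser only handles clauses written as name(args) without nested parentheses.

```python
import re
from typing import Dict, List, Tuple

CLAUSE_RE = re.compile(r"(\w+)\s*\(([^()]*)\)")


def normalize(pragma: str) -> Tuple[str, Dict[str, str]]:
    """Split a pragma into (directive, {clause: parameter}) with whitespace removed."""
    body = pragma.replace("#pragma acc", "", 1).strip()
    clauses = {name: re.sub(r"\s+", "", args) for name, args in CLAUSE_RE.findall(body)}
    directive = " ".join(CLAUSE_RE.sub(" ", body).split())
    return directive, clauses


def mismatch_report(predicted: str, gold: str) -> List[str]:
    """List directive mismatches, missing/unnecessary clauses, and parameter errors."""
    pred_dir, pred_clauses = normalize(predicted)
    gold_dir, gold_clauses = normalize(gold)
    report = []
    if pred_dir != gold_dir:
        report.append(f"directive mismatch: expected '{gold_dir}', got '{pred_dir}'")
    for name in sorted(gold_clauses.keys() - pred_clauses.keys()):
        report.append(f"missing clause: {name}({gold_clauses[name]})")
    for name in sorted(pred_clauses.keys() - gold_clauses.keys()):
        report.append(f"unnecessary clause: {name}({pred_clauses[name]})")
    for name in sorted(gold_clauses.keys() & pred_clauses.keys()):
        if gold_clauses[name] != pred_clauses[name]:
            report.append(
                f"parameter error in {name}: expected ({gold_clauses[name]}), "
                f"got ({pred_clauses[name]})"
            )
    return report


report = mismatch_report(
    "#pragma acc parallel loop copyin(a[0:n]) reduction(+:total)",
    "#pragma acc parallel loop collapse(2) reduction(+ : sum)",
)
```

Each line of the resulting report is already phrased as something a reflection model could turn into a prompt rule, such as a reminder to add `collapse` for nested loops.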
A consistent feature across these implementations is that mutation is semantic prompt rewriting, not token-level perturbation. GEPA evolves complete instructional texts that can expand from short baseline prompts into long, rule-bearing documents containing task decomposition, output constraints, domain heuristics, error handling protocols, and explicit “what not to do” guidance. In that sense, the operative search space is neither a fixed template space nor a few-shot ordering problem, but the open-ended space of natural-language policy descriptions for model behavior.
3. The Pareto mechanism
The “Pareto” in GEPA refers to preserving multiple non-dominated candidates rather than greedily following a single current best prompt. In the core formulation, GEPA tracks per-instance validation performance and computes, for each Pareto instance $i$, the best score any candidate has achieved:
$$s^*[i] = \max_{k} S_{\mathcal{P}[k]}[i]$$
It then forms the set of candidates that attain that best score on at least one instance, unions these instance-wise elites, removes strictly dominated candidates, counts how many instance-wise elite sets each remaining candidate appears in, and samples the next candidate to mutate with probability proportional to that frequency (Agrawal et al., 25 Jul 2025). This is not a simple average-score optimizer. It is a diversity-preserving instance-wise selection rule intended to prevent early search collapse.
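The selection rule just described can be sketched as follows. The sketch assumes that "strictly dominated" means dominated on the full per-instance score vector, which is one reasonable reading rather than a confirmed detail; the function names and the score table are invented for illustration.

```python
import random
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """Assumed relation: a is at least as good on every instance and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def selection_weights(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """For each surviving candidate, count the instances on which it is an elite."""
    n_instances = len(next(iter(scores.values())))
    best = [max(s[i] for s in scores.values()) for i in range(n_instances)]
    elites = [{c for c, s in scores.items() if s[i] == best[i]} for i in range(n_instances)]
    pool = set().union(*elites)  # union of instance-wise elite sets
    survivors = {
        c for c in pool
        if not any(dominates(scores[o], scores[c]) for o in pool if o != c)
    }
    return {c: sum(c in e for e in elites) for c in sorted(survivors)}


def select_candidate(scores: Dict[str, List[float]], rng: random.Random) -> str:
    """Sample a candidate with probability proportional to its elite count."""
    weights = selection_weights(scores)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


scores = {
    "A": [1.0, 0.0, 1.0],
    "B": [0.0, 1.0, 0.0],
    "C": [0.5, 0.5, 0.5],  # never best on any instance, so never in the pool
    "D": [1.0, 0.0, 0.0],  # ties A on instance 0 but is dominated by A
}
weights = selection_weights(scores)
chosen = select_candidate(scores, random.Random(0))
```

In this toy table, A is sampled twice as often as B, while C and D are never selected: C is never an instance-wise elite, and D is removed as dominated.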
The resulting search behavior is explicitly contrasted with a greedy baseline that always mutates the current best candidate. In ablations on Qwen3-8B, the greedy SelectBestCandidate alternative scores 54.89 aggregate versus 61.28 for GEPA, with especially large degradations on IFBench and HoVer (Agrawal et al., 25 Jul 2025). The reported interpretation is that Pareto-based candidate retention maintains multiple “winning” strategies across different validation instances, whereas greedy search often stalls after discovering a strong but narrow early strategy.
Later papers use the Pareto label more loosely or with partial formalization. The scientific-reasoning paper states that prompts are added if “non-dominated” and dominated prompts are removed, but it does not explicitly define the dominance relation, the objective vector, whether objectives are aggregated per item or per benchmark, or how ties are handled (Pandey et al., 30 Mar 2026). The RoB-assessment paper consistently refers to Pareto-guided search and Pareto-front evaluations, and states that prompts are evaluated against predefined criteria such as accuracy, faithfulness, and conciseness, but it does not print an explicit Pareto-front equation or objective-function formalization (Li et al., 1 Dec 2025). The medical-note paper describes frontier maintenance in especially instance-specific terms: a prompt is retained because it demonstrates superior performance on at least one validation instance, thereby preserving a diverse range of successful strategies (Myles et al., 25 Feb 2026).
A compact comparison of representative GEPA uses illustrates the extent of variation.
| Paper | GEPA role | Pareto specification |
|---|---|---|
| "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (Agrawal et al., 25 Jul 2025) | General reflective prompt optimizer for compound AI systems | Explicit instance-wise elite preservation and candidate sampling |
| "Beyond the Answer" (Pandey et al., 30 Mar 2026) | Custom simplified prompt evolution for Lean and GPQA | Pareto archive present, but dominance/objectives only partially specified |
| "Automated Risk-of-Bias Assessment..." (Li et al., 1 Dec 2025) | DSPy GEPA prompt optimization for RoB domains | Pareto-guided search described procedurally, not formally defined |
This diversity of formulations has two consequences. First, Pareto retention is central enough to remain a defining GEPA feature across papers. Second, “Pareto” does not always denote a fully specified multi-objective evolutionary algorithm; in several application papers it functions more as an archive-and-diversity principle than as a mathematically complete MOEA specification. A plausible implication is that the GEPA literature values Pareto behavior primarily for search diversification and auditability of alternatives, rather than for strict adherence to a standard MOEA formalism.
4. Reflection, feedback, and prompt mutation
GEPA is distinguished from conventional prompt search by the role of natural-language reflection. Rather than deriving updates from scalar reward alone, GEPA exposes execution traces, model outputs, tool interactions, and evaluator diagnostics to a reflection model that synthesizes prompt revisions as explicit rules (Agrawal et al., 25 Jul 2025). The core claim is that language itself is an interpretable learning medium: one rollout can expose not only whether a module failed, but also why it failed and what instruction should change.
This reflective machinery is expressed differently across domains. In the RoB pipeline, GEPA is embedded in a DSPy program with paired Signatures, curated train/validation sets, and a two-stage reasoning process consisting of evidence identification followed by evaluative judgment. The optimizer produces intermediate prompt candidates, reasoned model outputs, Pareto-front evaluations, the final compiled prompt, and comprehensive execution traces, and these artifacts are described as inspectable, versionable, and shareable (Li et al., 1 Dec 2025). The process is intentionally auditable: fixed random seed 42, deterministic decoding with temperature = 0.0 and top_p = 1.0, and repeated optimization runs are used to study stability, even though the authors note that heuristic multi-objective search can still yield distinct yet equally valid optima.
In the medical-note setting, the reflector receives richer-than-binary supervision. For each rollout, GEPA provides a natural-language critique, the model prediction, the ground truth, and, when a note contains an error, the actual erroneous sentence and its corrected version (Myles et al., 25 Feb 2026). The authors argue that this allows the reflector to infer the missing clinical reasoning rather than only observing an error flag. The optimized prompts shown in the appendix therefore grow to include conservative decision-threshold guidance, scope restrictions on what counts as a medical error, domain-specific examples, formatting constraints, and a final checklist.
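The difference between a binary flag and this richer supervision can be made concrete with a small record type. The sketch below is illustrative only: the `Rollout` fields and the `render_feedback` helper are hypothetical names, and the example strings are invented placeholders rather than data from the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    """Information handed to the reflector for one evaluated note."""
    critique: str
    prediction: str
    ground_truth: str
    error_sentence: Optional[str] = None      # only present when the note has an error
    corrected_sentence: Optional[str] = None  # only present when the note has an error


def render_feedback(rollout: Rollout) -> str:
    """Format one rollout as text for the reflection prompt."""
    lines = [
        f"Prediction: {rollout.prediction}",
        f"Ground truth: {rollout.ground_truth}",
        f"Critique: {rollout.critique}",
    ]
    if rollout.error_sentence is not None:
        lines.append(f"Erroneous sentence: {rollout.error_sentence}")
        lines.append(f"Corrected sentence: {rollout.corrected_sentence}")
    return "\n".join(lines)


clean = Rollout(critique="Flagged a note that had no error.",
                prediction="error", ground_truth="no error")
faulty = Rollout(critique="Missed the incorrect statement.",
                 prediction="no error", ground_truth="error",
                 error_sentence="<sentence containing the error>",
                 corrected_sentence="<corrected sentence>")
feedback_text = render_feedback(faulty)
```

With the erroneous and corrected sentences included, the reflector sees what the right answer looked like, not just that the prediction was wrong.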
In the OpenACC setting, reflection is tightly coupled to a structured semantic scoring pipeline. Predicted and gold pragmas are normalized to a canonical map, and clause-level and parameter-level precision/recall are computed and combined into a total score. Mismatch categories such as missing collapse, missing reduction, incorrect clause parameters, or unnecessary data-movement clauses are translated into prompt hints and corrective actions, which are then used by the reflection model to produce mutated prompts (Jhaveri et al., 12 Jan 2026). This is a particularly explicit example of semantic deltas becoming natural-language repair advice.
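Clause-level and parameter-level precision/recall can be illustrated on already-normalized clause maps. This is a minimal sketch under assumptions: the helper `precision_recall`, the choice to treat parameters as (clause, parameter) pairs, and the convention of returning 0.0 for an empty set are all choices made for this example. How the paper combines these quantities into a total score is not reproduced here.

```python
from typing import Hashable, Set, Tuple


def precision_recall(predicted: Set[Hashable], gold: Set[Hashable]) -> Tuple[float, float]:
    """Set-based precision and recall; an empty set yields 0.0 by convention here."""
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical normalized clause maps for a predicted and a gold pragma.
pred = {"copyin": "a[0:n]", "reduction": "+:total", "collapse": "2"}
gold = {"collapse": "2", "reduction": "+:sum"}

# Clause level: compare clause names only.
clause_p, clause_r = precision_recall(set(pred), set(gold))

# Parameter level: a clause counts only if its parameter also matches.
param_p, param_r = precision_recall(set(pred.items()), set(gold.items()))
```

Here the prediction recovers both gold clause names (clause recall 1.0) but only one of them with the right parameter (parameter recall 0.5), which is exactly the kind of gap a "clause-parameter error" hint would target.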
Across applications, GEPA’s prompt mutations are therefore not arbitrary rewrites. They are attempts at rule induction from failures and successes: identifying latent task regularities, converting them into declarative instructions, and reincorporating them into the next-generation prompt. This suggests that GEPA is best understood not as a black-box search heuristic but as a procedure for externalizing model-facing procedural knowledge into editable natural language.
5. Applications and empirical behavior
GEPA has been applied across heterogeneous domains, but the reported empirical pattern is strikingly consistent: optimized prompts often become longer, more structured, and more domain-specific, and they can materially improve performance without model fine-tuning.
In the original GEPA paper, test-time gains are reported on HotpotQA, IFBench, HoVer, and PUPA for both Qwen3-8B and GPT-4.1 mini. On Qwen3-8B, baseline aggregate performance is 48.85, MIPROv2 reaches 55.11, GRPO reaches 51.14, and GEPA reaches 61.28; on GPT-4.1 mini, baseline is 52.67, MIPROv2 is 59.71, GEPA is 66.97, and GEPA+Merge is 68.69 (Agrawal et al., 25 Jul 2025). The same paper reports that GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts, and that GEPA/GEPA+Merge prompts can be up to 9.2x shorter than MIPROv2’s prompt configurations.
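As a quick check on the aggregate scores quoted above, the gaps can be worked out directly. The differences below are computed here from the reported numbers and are expressed in aggregate score points; reading the "10% on average" GRPO comparison as roughly these points is an interpretation, not a statement from the paper.

$$
\begin{aligned}
\text{Qwen3-8B:}\quad & 61.28 - 48.85 = 12.43 \ (\text{GEPA vs. baseline}) \\
& 61.28 - 55.11 = 6.17 \ (\text{GEPA vs. MIPROv2}) \\
& 61.28 - 51.14 = 10.14 \ (\text{GEPA vs. GRPO}) \\
\text{GPT-4.1 mini:}\quad & 66.97 - 52.67 = 14.30 \ (\text{GEPA vs. baseline}) \\
& 66.97 - 59.71 = 7.26 \ (\text{GEPA vs. MIPROv2}) \\
& 68.69 - 52.67 = 16.02 \ (\text{GEPA+Merge vs. baseline})
\end{aligned}
$$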
The scientific-reasoning paper uses GEPA not primarily as a benchmark optimizer but as an interpretability probe. Optimization is performed entirely on DeepSeek-V3.2, then prompts are transferred to GPT-5.4-mini, GLM 5, and Claude Sonnet 4.6. On the optimization model, GEPA Optimized Final reaches 100.00% on Algebra versus 97.22% for Hand-Crafted CoT, and 94.44% on GPQA versus 91.67% for Hand-Crafted CoT. Transfer, however, is inconsistent: GPT-5.4-mini improves on Algebra from 50.00% to 61.11% but not on GPQA; GLM 5 remains best with hand-crafted CoT on Algebra (97.22% vs 94.44% for GEPA Final); and Claude Sonnet 4.6 does worse with GEPA than with hand-crafted CoT on both tasks (Pandey et al., 30 Mar 2026). The paper terms this phenomenon “local logic” and interprets GEPA’s transfer failures as evidence that prompt-induced reasoning gains can be model-specific.
The RoB-assessment study reports that GEPA-generated prompts achieve the highest overall accuracy against three manual prompt sets in a Claude-3.5-Sonnet comparison, with median overall accuracy 0.59 versus 0.49, 0.38, and 0.42 for the manual prompts. The clearest gains are in Random Sequence Generation and Selective Reporting, where GEPA improves by approximately 0.30–0.40 over the best manual prompts; its clearest failure domain is Incomplete outcome data, where GEPA mean accuracy is 0.39 against manual-prompt scores of 0.55, 0.68, and 0.63 (Li et al., 1 Dec 2025). The paper emphasizes that GEPA’s strongest advantages are transparency and reproducibility as much as raw accuracy.
The OpenACC study evaluates GEPA-optimized prompts on the PolyBench suite and reports large improvements in compilability, especially for small models. Compilation success rises from 66.7% to 93.3% for GPT-4.1 Nano and from 86.7% to 100.0% for GPT-5 Nano, while overall compilability across 120 model-benchmark evaluations increases from 78.3% to 95.8% with zero regressions. The number of model-benchmark pairs meeting the reported speedup threshold rises from 67 to 81, reported as a 21% increase (Jhaveri et al., 12 Jan 2026). The authors interpret the gains as evidence that prompt specialization can substitute for model capacity in syntax-sensitive systems tasks.
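The reported percentages can be sanity-checked with a few lines of arithmetic. The raw counts derived below are inferred by rounding the reported rates against 120 evaluations; they are an assumption of this sketch and are not figures stated in the study.

```python
# Relative increase in model-benchmark pairs meeting the speedup threshold.
pairs_before, pairs_after = 67, 81
relative_increase = (pairs_after - pairs_before) / pairs_before

# Implied compile counts, assuming the rates are rounded fractions of 120 evaluations.
total_evaluations = 120
compiled_before = round(0.783 * total_evaluations)
compiled_after = round(0.958 * total_evaluations)

print(f"relative increase: {relative_increase:.1%}")
print(f"implied compile counts: {compiled_before} -> {compiled_after} of {total_evaluations}")
```

The relative increase comes out at about 20.9%, consistent with the reported 21%, and the implied counts round back to the quoted 78.3% and 95.8%.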
The medical-note paper reports similarly large prompt-optimization gains on MEDEC. GEPA raises GPT-5 from 0.669 to 0.785 on combined MS+UW accuracy and Qwen3-32B from 0.578 to 0.690. For GPT-5 specifically, MS-test rises from 0.720 to 0.816 and UW-test from 0.576 to 0.729, bringing the optimized system close to doctor-level performance and above one of the two reported doctor baselines on the combined score (Myles et al., 25 Feb 2026). The authors’ interpretation is that GEPA chiefly improves calibration by reducing false positives through more conservative task instructions.
Taken together, these studies indicate that GEPA is especially effective where failure traces are semantically legible and where improved behavior can be expressed as explicit high-level instructions. They also indicate that prompt evolution frequently discovers procedural coaching, error-avoidance heuristics, and domain-specific disambiguation rules, rather than merely elaborating on answer style.
6. Related genetic-Pareto methods, limitations, and controversies
GEPA belongs to a broader family of genetic and Pareto-based optimization methods, but its combination of language-mediated reflection and prompt evolution distinguishes it from classical evolutionary algorithms. Several neighboring literatures clarify both its lineage and its limits.
The feature-selection framework HeFS is described as “very close in spirit” to a GEPA-style method, but it searches conditionally in the residual feature space left by a baseline selector rather than optimizing prompts or solving the entire subset-selection problem from scratch. HeFS uses binary subset encoding, selection, single-point crossover, ratio-guided mutation, and Pareto-based multi-objective optimization over predictive performance and complementarity (Fan et al., 21 Oct 2025). This shows that “Genetic-Pareto” methods in the broader sense need not involve language at all; GEPA’s distinctiveness lies in reflective natural-language mutation rather than in Pareto search alone.
Other work illustrates the variety of Pareto-based evolutionary design choices. In fair feature selection, a modified NSGA-II Pareto GA is compared directly with a lexicographic GA, and the lexicographic method outperforms the Pareto GA in accuracy without degrading fairness (Brookhouse et al., 2023). In crystal structure prediction, ParetoCSP integrates genotypic age as an explicit objective within an NSGA-III-based multi-objective genetic algorithm guided by M3GNet, achieving an overall exact-structure success rate of 74.55% versus 29.09% for default GN-OA with MEGNet on 55 benchmark structures (Omee et al., 2023). In variational quantum eigensolver design, MoG-VQE uses NSGA-II to minimize both optimized energy and CNOT count and reports nearly ten-fold reductions in two-qubit gate counts relative to standard hardware-efficient ansätze (Chivilikhin et al., 2020). These examples show that Pareto evolutionary search is an established methodology across scientific domains, but they also underscore that GEPA’s innovation is not Pareto optimization per se; it is the use of language as both representation and adaptation medium.
Within the GEPA literature itself, several limitations recur. The scientific-reasoning paper explicitly notes the absence of a full formal definition of Pareto dominance, objective aggregation, archive size, or stopping details in its custom variant (Pandey et al., 30 Mar 2026). The RoB paper states that GEPA makes prompting auditable, but also acknowledges that heuristic multi-objective search can yield distinct optima even under fixed seeds and deterministic decoding (Li et al., 1 Dec 2025). The medical-note study reports substantial gains for most models but negative or negligible gains for Qwen3-1.7B and Qwen3-0.6B, suggesting that prompt evolution may over-specialize or simply fail for very small models (Myles et al., 25 Feb 2026). The OpenACC work identifies a Performance-Correctness Gap: optimized prompts can be overly conservative, improving suite-wide robustness while sacrificing peak performance on kernels where aggressive directives happen to work well (Jhaveri et al., 12 Jan 2026).
A more conceptual controversy concerns what exactly the Pareto frontier is preserving. In the original GEPA paper, the frontier is defined over instance-wise elite participation rather than over a classical continuous objective vector (Agrawal et al., 25 Jul 2025). In application papers, the Pareto mechanism is sometimes only partially specified (Pandey et al., 30 Mar 2026). This suggests that “Pareto” in GEPA has become both a technical term and a methodological commitment to non-collapse, diversity preservation, and retention of complementary prompt strategies. Whether future work will standardize these variants into a single formal framework remains open.
The current literature therefore supports two complementary interpretations. GEPA is, in a narrow sense, a specific reflective prompt optimizer introduced for compound AI systems. It is also, in a broader sense, a family of prompt-evolution procedures that use natural-language critique plus Pareto retention to search instruction space. Its scientific significance lies as much in what it reveals about model-facing reasoning structures and prompt brittleness as in the task-performance gains it can deliver.