
LLM-EPS: Evolutionary Program Search

Updated 17 November 2025
  • LLM-EPS is a framework that combines LLMs and evolutionary algorithms to automatically synthesize and optimize programs.
  • It employs a bilevel optimization approach where candidate programs are evolved based on empirical performance in tasks such as feature engineering and combinatorial optimization.
  • Empirical results demonstrate that LLM-EPS outperforms traditional methods, achieving higher accuracy and faster convergence in various benchmark tasks.

Evolutionary Program Search with LLMs (LLM-EPS) refers to a class of frameworks that synthesize, optimize, and select programs or heuristics by combining the deductive reasoning and generative capacity of LLMs with evolutionary algorithms (EAs). The methodology departs from pure LLM sampling or classical genetic programming by tightly integrating LLMs as variation operators within an explicit evolutionary optimization loop, leveraging feedback from prior candidates, diversity preservation, and domain knowledge. This paradigm has been instantiated across feature engineering, algorithm/heuristic discovery, program synthesis, and policy search, consistently outperforming static LLM prompting and traditional EAs on a range of symbolic, combinatorial, and tabular problem classes.

1. Formalization of LLM-EPS and the Bilevel Search Objective

The foundation of LLM-EPS is the formulation of program discovery as a bilevel optimization over a search space of programs $\mathcal{T}$, guided by empirical performance on downstream tasks. In the context of automated feature engineering (Abhyankar et al., 18 Mar 2025), let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ be a dataset split into training $(X_{\mathrm{tr}}, Y_{\mathrm{tr}})$ and validation $(X_{\mathrm{val}}, Y_{\mathrm{val}})$ sets. Candidate programs $\mathcal{T}$, each representing a feature transformation pipeline as Python code or an equivalent abstract syntax tree (AST), are applied to generate augmented data representations.

Given a downstream predictor $f$ (e.g., XGBoost, MLP), with training loss $\mathcal{L}_f$ and validation metric $\mathcal{E}$, bilevel LLM-EPS seeks

$$f^* = \arg\min_f \mathcal{L}_f(f(\mathcal{T}(X_{\mathrm{tr}})), Y_{\mathrm{tr}})$$

$$\max_{\mathcal{T}} \mathcal{E}(f^*(\mathcal{T}(X_{\mathrm{val}})), Y_{\mathrm{val}})$$

Hence, the outer loop searches over the program space $\mathcal{T}$, while the inner loop retrains $f^*$ for each candidate. This structure generalizes to other domains such as combinatorial optimization and program synthesis, where $\mathcal{T}$ represents parameterized policies, algorithms, or synthesis templates, and the evaluation function is problem-specific (Yepes et al., 9 May 2025, Surina et al., 7 Apr 2025).
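A minimal sketch of this bilevel evaluation, assuming the candidate transform is an executable Python function and XGBoost is the downstream predictor; function and variable names here are illustrative, not taken from the cited papers:

```python
import xgboost as xgb
from sklearn.metrics import accuracy_score

def evaluate_candidate(transform, X_tr, y_tr, X_val, y_val):
    """Inner loop: retrain predictor f on T(X_tr); outer signal: score E on T(X_val)."""
    X_tr_aug = transform(X_tr.copy())    # apply candidate program T to training data
    X_val_aug = transform(X_val.copy())  # and to held-out validation data
    model = xgb.XGBClassifier(n_estimators=200)   # f, the downstream predictor
    model.fit(X_tr_aug, y_tr)                     # inner argmin over L_f
    return accuracy_score(y_val, model.predict(X_val_aug))  # validation metric E
```

The returned score is the outer-loop objective that the evolutionary search maximizes over candidate programs.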

Programs in LLM-EPS are typically represented as code snippets or ASTs compatible with standard ML libraries (e.g., pandas DataFrame transforms, DEAP GP trees), maintaining a direct mapping between LLM output and executable individuals in the evolutionary population (Abhyankar et al., 18 Mar 2025, Yepes et al., 9 May 2025).

2. Evolutionary Optimization Protocols and LLM Integration

LLM-EPS instantiates the evolutionary loop by injecting LLM-generated variation within a population-based search, combined with explicit selection and replacement policies. The canonical LLM-EPS cycle comprises:

  • Initialization: Populations are seeded using easily parseable LLM completions or a pool of hand-crafted simple programs (e.g., ratios, logs) (Abhyankar et al., 18 Mar 2025), or via multiple LLM calls until a diverse set of valid programs is obtained (Yepes et al., 9 May 2025).
  • Selection: Previous high-scoring candidates are stored in memory buffers (e.g., distributed into $m$ “islands”), with in-context examples sampled for prompting by Boltzmann-weighted probabilities over validation scores (a sampling sketch follows this list). This mechanism injects evolutionary selection pressure while providing diverse in-context references to the LLM (Abhyankar et al., 18 Mar 2025, Surina et al., 7 Apr 2025).
  • Variation: New candidate programs are produced by stochastic LLM sampling. Variation comprises several forms:
    • Implicit mutation and crossover achieved by prompting the LLM with prior top-performing programs (either a single parent for mutation or multiple for “crossover”-style conceptual recombination).
    • Stochasticity is controlled via decoding parameters such as temperature and nucleus sampling, often with additional domain-specific instructions in the prompt.
  • Evaluation and Replacement: Each candidate is compiled/executed and scored (metric $\mathcal{E}$ or equivalent task reward). Candidates that improve upon the buffer’s worst score are inserted, preserving diversity by clustering or performance parity (Abhyankar et al., 18 Mar 2025, Yepes et al., 9 May 2025).
  • Termination: The search halts after a fixed number of iterations or convergence, outputting the highest-scoring program or ensemble.
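A minimal sketch of the Boltzmann-weighted parent sampling used in the selection step; the temperature value and buffer layout are illustrative assumptions:

```python
import math
import random

def sample_parents(buffer, k=2, temperature=0.5):
    """Sample k in-context parent programs, biased toward high validation scores.

    buffer: list of (program_str, val_score) pairs from one island.
    Higher temperature flattens the distribution and preserves diversity.
    """
    scores = [s for _, s in buffer]
    best = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - best) / temperature) for s in scores]
    return [p for p, _ in random.choices(buffer, weights=weights, k=k)]
```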

High-throughput evaluation is achieved by compiling program representations into efficient C++/CUDA kernels, enabling population sizes of up to 30,000 and yielding a $10\times$ speedup relative to naïve Python execution (Yepes et al., 9 May 2025). Pseudocode for this process is standardized across domains (see Algorithm 1 in (Abhyankar et al., 18 Mar 2025, Yepes et al., 9 May 2025)).
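Putting the steps together, a schematic Python rendering of the canonical cycle, reusing the `sample_parents` helper sketched above; `llm_propose` stands in for any LLM call that returns candidate program text, and all names are illustrative rather than taken from the cited implementations:

```python
def llm_eps_search(llm_propose, evaluate, seed_programs, iterations=100):
    """Evolve a buffer of programs: LLM as variation operator, scores as selection."""
    buffer = [(p, evaluate(p)) for p in seed_programs]        # initialization
    for _ in range(iterations):
        parents = sample_parents(buffer)                      # Boltzmann selection
        candidate = llm_propose(parents)                      # LLM mutation/crossover
        try:
            score = evaluate(candidate)                       # compile/execute + score
        except Exception:
            continue                                          # discard invalid programs
        worst = min(buffer, key=lambda ps: ps[1])
        if score > worst[1]:                                  # replacement policy
            buffer.remove(worst)
            buffer.append((candidate, score))
    return max(buffer, key=lambda ps: ps[1])                  # best (program, score)
```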

3. Prompt Engineering, Domain Knowledge Injection, and Reasoning

LLM-EPS leverages prompt engineering to harness both generic reasoning capabilities and problem-specific knowledge. Standard prompt templates encode the following components (a minimal template sketch follows the list):

  • Instructions: Explicit roles (“You are a feature engineering assistant...”) and directives to utilize domain knowledge.
  • Task Metadata: Column descriptions, target variable specification, and selected example rows serialized in natural language.
  • Evaluation Function: Pseudo-code for the validation procedure to provide grounding signal.
  • In-context Examples: Prior high-scoring programs and succinct rationales giving the LLM both code patterns and reasoning chains (Abhyankar et al., 18 Mar 2025).
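A minimal sketch of such a template, assuming a feature-engineering task; the wording and field names are illustrative, not the exact prompts of the cited papers:

```python
PROMPT_TEMPLATE = """You are a feature engineering assistant. Use your domain knowledge.

Dataset columns: {column_descriptions}
Target variable: {target}
Example rows:
{example_rows}

The candidate program will be scored by this procedure:
{evaluation_pseudocode}

Prior high-scoring programs and their rationales:
{in_context_examples}

Propose a new Python function `transform(df)` that improves the validation metric.
Explain your reasoning, then give only the code."""

def build_prompt(meta, elites):
    """Fill the template with task metadata and sampled elite programs."""
    return PROMPT_TEMPLATE.format(
        column_descriptions=meta["columns"],
        target=meta["target"],
        example_rows=meta["rows"],
        evaluation_pseudocode=meta["eval"],
        in_context_examples="\n\n".join(elites),
    )
```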

Chain-of-thought (CoT) outputs enable the LLM to justify feature or program proposals, e.g., describing skew correction with $\log$ transforms for right-tailed distributions or domain-specific ratios in medical data.

Domain knowledge is operationalized through natural language descriptions, initial seeds designed by experts, and prompts such as “leverage your medical knowledge about cardiovascular risk” (Abhyankar et al., 18 Mar 2025). The prompt’s structure and fidelity to prior evolutionary successes critically influence LLM output relevance and quality.

4. Empirical Results and Comparative Assessment

Across feature engineering, combinatorial optimization, and program synthesis benchmarks, LLM-EPS establishes state-of-the-art performance relative to both traditional and LLM-centric baselines.

  • On 11 tabular classification and 10 regression datasets, LLM-EPS achieves mean ranks of 1.54 and 1.00 respectively, outperforming OpenFE, AutoFeat, CAAFE, and FeatLLM, with classification accuracy gains of 1–3 points and regression N-RMSE reductions of 3–10% (Abhyankar et al., 18 Mar 2025).
  • In synthetic list-transformation tasks (Count, Max/Min, Inverse, Sort), LLM-enhanced seeding and elite-guided mutation yield perfect or near-perfect accuracy with up to $4\times$ faster convergence and 30–50% shorter programs than standard EAs (Yepes et al., 9 May 2025).
  • Case studies in domain-aware feature engineering demonstrate interpretability, e.g., proposing $\log(\text{Cholesterol}+1)$ based on the data distribution, increasing XGBoost accuracy from 0.858 to 0.866 (Abhyankar et al., 18 Mar 2025); a sketch of such a candidate program follows this list.
  • Robustness to noise and scalability to datasets exceeding 100 features and 250k rows are demonstrated empirically.
  • Ablations reveal that both LLM-provided domain knowledge and the evolutionary refinement loop independently contribute over 1% accuracy improvement (Abhyankar et al., 18 Mar 2025).
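As an illustration of what such an evolved candidate looks like, a hypothetical transform in the style of the cholesterol case study above (column names are assumed, not taken verbatim from the dataset):

```python
import numpy as np
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Candidate feature program: correct right-skew and add a domain ratio."""
    out = df.copy()
    # log1p handles zeros safely and compresses a right-tailed distribution
    out["log_cholesterol"] = np.log1p(out["Cholesterol"])
    # hypothetical domain-motivated ratio feature
    out["chol_per_age"] = out["Cholesterol"] / (out["Age"] + 1)
    return out
```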

5. Design Principles, Trade-offs, and Generalization

Key principles emerging from LLM-EPS research include:

  • LLMs as Variation Operators: LLMs serve as black-box mutators/crossovers to propose complex, domain-aligned program edits, guided by explicit in-context feedback and selection (Abhyankar et al., 18 Mar 2025, Yepes et al., 9 May 2025).
  • Buffer-based Selection and Boltzmann Sampling: Maintaining buffers of prior successes, clustering by performance, and Boltzmann-probability sampling inject selection and maintain diversity, addressing the tendency of LLMs to repeat patterns without evolutionary pressure (Abhyankar et al., 18 Mar 2025, Yepes et al., 9 May 2025).
  • Prompt and Evaluation Design: Structured, context-rich prompts paired with clear pseudo-code evaluation targets enable the LLM to reason incrementally over evolving search traces. Chain-of-thought rationales and example-driven reinforcement are essential for LLM-guided improvement.
  • Generalizability: The LLM-EPS paradigm—searching over code/programs via stochastic LLM sampling as variation, evolutionary selection, and validation-driven feedback—generalizes across domains: feature engineering, data cleaning code, model architecture search, code augmentation, and more (Abhyankar et al., 18 Mar 2025, Yepes et al., 9 May 2025).

Trade-offs include increased computational and latency costs per LLM call versus classical EAs, potential for hallucinated/invalid outputs (mitigated by static program validation or execution timeouts), and reliance on high-quality prompt engineering for scalability across domains.
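One common form of such guarding, sketched here under assumed helper names: statically validate LLM output with Python’s `ast` module before execution, and run candidates in a separate process with a hard timeout:

```python
import ast
import multiprocessing as mp

def is_valid_python(code: str) -> bool:
    """Static check: reject syntactically invalid LLM output before execution."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run_with_timeout(target, args=(), timeout_s=10):
    """Execute a candidate evaluation in a separate process; kill it if it hangs."""
    proc = mp.Process(target=target, args=args)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()   # runaway candidate: treat as a failed evaluation
        proc.join()
        return False
    return proc.exitcode == 0
```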

6. Limitations and Future Directions

Limitations identified in current LLM-EPS instantiations include:

  • Toy Task Focus: Most empirical studies are conducted on synthetic or small-scale symbolic benchmarks (e.g., list manipulation, simple tabular data), limiting direct transfer to large codebases or complex policies without further engineering (Yepes et al., 9 May 2025).
  • Output Validity and Hallucination: LLMs can emit syntactically invalid or semantically incoherent code, requiring retry loops, strict acceptance criteria, or guarded execution.
  • Resource Cost: Additional LLM inference calls increase API costs and search latency, though bulk evaluation can be amortized via high-throughput kernels (Yepes et al., 9 May 2025).
  • Limited Crossover and Adaptive Scheduling: Most implementations rely on LLM-driven mutation with limited explicit LLM-guided crossover. Adaptive LLM-call scheduling and attention to population diversity are proposed as future enhancements.

Proposed directions for development include:

  • Deep LLM Involvement in Crossover: Incorporating LLMs in repairing and recombining subtrees or code fragments.
  • Diversity-driven Scheduling: Triggering additional LLM calls when evolutionary progress stalls or population diversity decays.
  • Transfer to Real-world Program Synthesis Benchmarks: Extending LLM-EPS frameworks to MBPP, HumanEval, or large-scale automated code repair domains (Yepes et al., 9 May 2025).
  • Integration with Automated Reasoning Components: Combining LLM-EPS with symbolic verifiers, static analyzers, or interpretable code summarization pipelines for enhanced robustness.

LLM-EPS thus emerges as a general, effective, and extensible framework for automated program discovery—encompassing feature engineering, heuristics, symbolic reasoning, and end-to-end meta-optimization—by synergistically combining evolutionary principles with LLM reasoning and data-driven evaluation.
