LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search

Published 10 May 2026 in cs.NE and cs.AI | (2605.09764v1)

Abstract: LLM-guided evolutionary methods such as AlphaEvolve have proven effective in domains like math, systems research, and algorithmic discovery, but their reliance on frontier models makes each run expensive. We argue this is largely an artifact of how existing frameworks allocate search: archives that fail to preserve solution diversity force compensation through stronger mutation models; blind model use spends frontier dollars on local edits a smaller model could handle; and full-set evaluation wastes rollouts on redundant examples. We introduce LEVI, a harness-first evolutionary framework built on the bet that stronger search architectures can substitute for or even outperform larger LLMs in evolutionary search. LEVI improves on three core components of evolutionary search: a solution database that establishes diversity from the beginning, and then maintains it throughout the run; a smarter mutation router that plays into the strengths of large and small LLMs; and a rank-preserving proxy benchmark for rollout-heavy settings. Across systems-research benchmarks LEVI attains the highest score on a budget 3.3-6.7x smaller than the published frontier-model runs of existing frameworks like ShinkaEvolve, GEPA, and AdaEvolve; on one problem, LEVI matches the existing best at a 35x lower cost. On prompt optimization, LEVI matches or exceeds GEPA at less than half of its rollout budget on four different benchmarks. LEVI is available as an open-source framework at https://github.com/ttanv/levi.


Authors (1)

Temoor Tanveer

Summary

The paper presents LEVI, a harness-first evolutionary search framework that lowers cost by relying on a diversity-preserving archive, role-aware mutation routing, and proxy evaluation rather than on larger LLMs.
LEVI routes about 90% of mutation calls to small, fast models and uses expensive models selectively; the paper reports budgets 3.3 to 6.7x smaller than published frontier-model runs, and a 35x lower cost on one problem, at competitive scores.
The paper evaluates LEVI on seven systems benchmarks and four prompt optimization tasks, and argues that a stronger search architecture can substitute for, or even outperform, larger LLMs in evolutionary search.

LEVI: Harness-First Search Architectures as an Alternative to Large LLMs in Evolutionary Optimization

Motivation and Context

LLM-guided evolutionary search has proven effective in domains where solutions are easy to verify but hard to find, such as combinatorial optimization, systems and database tuning, code and prompt synthesis, and algorithmic discovery. However, the heavy reliance on frontier-scale LLMs (e.g., GPT-5, Claude Opus, Gemini Pro) has become a bottleneck: it inflates the cost of each experimental iteration and puts such methods out of reach for many practitioners. The work titled "LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search" (2605.09764) presents a harness-first evolutionary search framework, LEVI, that targets the cost and efficiency limits of the search architecture itself rather than scaling up the mutation LLM.

Framework Architecture

LEVI departs from prior art by systematically addressing three axes of inefficiency in LLM-guided evolutionary search: archive initialization and diversity preservation, mutation routing and model selection, and evaluation budget allocation in rollout-heavy settings.

1. Solution Archive with Early and Maintained Diversity:

Instead of seeding from a single point, LEVI initializes a structurally diverse candidate set. It then constructs a CVT-MAP-Elites archive, with centroids calibrated based on both input-side (AST features) and output-side (behavioral) descriptors. Online z-score normalization ensures that no descriptor type dominates in high-dimensional search, and diversity is maintained by only admitting candidates to their closest Voronoi cell when they outperform existing cell elites. This approach yields robust coverage of algorithmic families and prevents premature archive collapse.
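As an illustration of the archive rule described above, here is a minimal sketch, not LEVI's actual implementation. The class names `RunningZScore` and `CVTArchive` are hypothetical, the centroids are passed in directly (a real CVT would compute them, for example by k-means over sampled descriptors), and filling an empty cell unconditionally is an assumption borrowed from standard MAP-Elites.

```python
import math


class RunningZScore:
    """Online per-dimension mean and variance (Welford's method)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        for i, v in enumerate(x):
            delta = v - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (v - self.mean[i])

    def normalize(self, x):
        out = []
        for i, v in enumerate(x):
            std = math.sqrt(self.m2[i] / self.n) if self.n > 1 else 0.0
            # A dimension with no observed spread maps to 0.
            out.append((v - self.mean[i]) / std if std > 0 else 0.0)
        return out


class CVTArchive:
    """One elite per Voronoi cell; cells are defined by fixed centroids."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.elites = {}  # cell index -> (score, solution)

    def cell_of(self, descriptor):
        # The closest centroid determines the Voronoi cell.
        return min(
            range(len(self.centroids)),
            key=lambda i: math.dist(self.centroids[i], descriptor),
        )

    def try_insert(self, solution, descriptor, score):
        cell = self.cell_of(descriptor)
        incumbent = self.elites.get(cell)
        # Assumption: an empty cell accepts any candidate; an occupied
        # cell only accepts a strictly better one.
        if incumbent is None or score > incumbent[0]:
            self.elites[cell] = (score, solution)
            return True
        return False
```

In use, a raw descriptor (for example AST features concatenated with behavioral features) would first pass through `RunningZScore.update` and `normalize`, so that descriptor types with large numeric ranges do not dominate the distance computation in `cell_of`.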

2. Role-Aware LLM Mutation Routing:

LEVI distinguishes between local refinement mutations (handled predominantly by fast, inexpensive models such as Qwen3-30B-A3B) and paradigm-shifting structural mutations (delegated to higher-capacity, higher-cost models such as Gemini 3 Flash). Around 90% of mutation calls are routed to smaller models, with large-model interventions only triggered upon stagnation or at fixed intervals. This leverages model capacity where it is algorithmically necessary, suppressing excessive expenditure on trivial edits.
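The routing policy can be sketched as a small state machine. This is a hedged illustration under stated assumptions, not the framework's code: the class `MutationRouter`, the tier labels `"small"` and `"large"`, and the default thresholds are all hypothetical, and the source does not specify how stagnation is measured.

```python
from dataclasses import dataclass


@dataclass
class MutationRouter:
    """Picks a model tier for each mutation call."""

    stagnation_patience: int = 20  # hypothetical: non-improving steps tolerated
    large_every: int = 50          # hypothetical: fixed large-model interval
    steps: int = 0
    steps_since_improvement: int = 0

    def choose(self) -> str:
        self.steps += 1
        stagnated = self.steps_since_improvement >= self.stagnation_patience
        scheduled = self.steps % self.large_every == 0
        if stagnated or scheduled:
            # Reset the counter so the structural edit gets time to pay off
            # before another large-model call is triggered.
            self.steps_since_improvement = 0
            return "large"
        return "small"

    def report(self, improved: bool) -> None:
        """Record whether the last mutation improved the archive."""
        if improved:
            self.steps_since_improvement = 0
        else:
            self.steps_since_improvement += 1
```

A driver loop would call `choose()` before each mutation, dispatch to the corresponding model, and then call `report()` with the outcome. The share of calls that reach the large model follows from the two thresholds and from how often the search stalls, so a figure like 90% small-model calls would be an outcome of tuning rather than a parameter of this sketch.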

3. Proxy-Benchmark Selection for Rollout Efficiency:

To minimize redundant evaluations in prompt optimization and other rollout-heavy domains, LEVI constructs a compact proxy dataset during initialization. A greedy forward selection optimizes both rank-faithfulness and candidate-separation, while penalizing redundancy. Because candidate selection and survival depend on how candidates are ordered rather than on their absolute scores, a proxy that preserves the full-benchmark ranking preserves evolutionary selection pressure while reducing the number of rollout calls, which frees budget for exploration.
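One way to picture the greedy forward selection is the sketch below. It is an assumption-laden illustration and not the paper's objective: the function `select_proxy`, the weights `sep_weight` and `red_weight`, and the specific definitions of separation (range of subset means) and redundancy (largest absolute correlation with an already chosen example) are all hypothetical stand-ins for the three criteria named above.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # A constant vector carries no ordering signal; treat it as 0.
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def subset_means(scores, subset):
    return [sum(row[e] for e in subset) / len(subset) for row in scores]


def select_proxy(scores, k, sep_weight=0.1, red_weight=0.1):
    """Greedily pick k example indices from scores[candidate][example]."""
    n_examples = len(scores[0])
    full = subset_means(scores, range(n_examples))
    columns = [[row[e] for row in scores] for e in range(n_examples)]
    chosen = []
    while len(chosen) < k:
        best, best_val = None, -math.inf
        for e in range(n_examples):
            if e in chosen:
                continue
            means = subset_means(scores, chosen + [e])
            faith = spearman(means, full)         # rank-faithfulness
            separation = max(means) - min(means)  # candidate separation
            redundancy = max(
                (abs(pearson(columns[e], columns[c])) for c in chosen),
                default=0.0,
            )
            val = faith + sep_weight * separation - red_weight * redundancy
            if val > best_val:
                best, best_val = e, val
        chosen.append(best)
    return chosen
```

Here `scores` would come from evaluating the initial candidate set on the full benchmark once. The returned indices define the proxy set used for later generations, and the Spearman correlation between proxy and full rankings is the quantity to monitor when choosing `k`.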

Empirical Evaluation

LEVI demonstrates state-of-the-art cost-efficiency and competitive or superior raw optimization scores on a suite of seven systems research (ADRS) benchmarks and four established prompt-optimization testbeds.

On systems research problems (e.g., transaction scheduling, load balancing, SQL context reordering, model placement), LEVI attains the highest final scores on six out of seven tasks. Its budget is 3.3 to 6.7 times smaller than that of the published frontier-model runs of existing frameworks such as ShinkaEvolve, GEPA, and AdaEvolve. On LLM-SQL, LEVI matches top baselines at a 35x lower cost.
In prompt optimization, LEVI matches or exceeds the performance of GEPA at less than half the rollout budget across HotpotQA, IFBench, Hover, and PUPA.

Ablation studies indicate the following:

Bootstrapped Diversity Initialization is Critical: Removing diverse seeding reduces achievable performance and leads to stagnation in archive exploration, especially for multi-family optimization tasks.
Descriptor Richness Influences Archive Coverage: Input-side AST-based features suffice for single-family tasks but underperform for tasks benefiting from broader behavioral exploration.
The Benefit of Role-Aware Model Routing is Task-Dependent: Tasks dominated by local improvements can do without large-model invocations, but tasks requiring structural jumps (e.g., EPLB) stagnate without paradigm-shift calls. The optimal budget split is task-specific, and the harness design allows large-model calls to be used wherever they are cost-effective.
Proxy-Benchmark Selection Preserves Selection Signal: The CSS+mean subset selection method consistently outperforms k-medoids and random-subset-ridge alternatives in preserving ranking faithfulness under constrained evaluation budgets, validated by high Spearman correlations between proxy-based and full-evaluation candidate rankings.

Theoretical and Practical Implications

The results indicate that advanced search harness designs, especially those prioritizing archive-driven diversity and asynchronous, role-aware orchestration, can more than compensate for a mutation model that is not at frontier capacity. This shifts the focus of evolutionary design from scaling up the LLM to engineering principled and adaptive search infrastructure.

For practitioners, this lowers the barrier to effective LLM-driven optimization by reducing both required compute and financial outlay per run. Systems and prompt design, code synthesis, or other hard search scenarios previously requiring SOTA APIs and large budgets are now tractable with locally-served, open-weight models for the majority of calls.

On a theoretical level, the work exposes the non-trivial interplay between search architecture and model capacity. It demonstrates that for many 'hard search, easy verify' problems, most value arises not from maximal mutation intelligence but from systematic diversity maintenance, efficient mutation allocation, and budget-sensitive evaluation.

Limitations and Future Directions

LEVI’s cost advantages are most pronounced when evaluations are relatively cheap. In settings where the cost per evaluation is high (e.g., model retraining as the fitness function), increased iteration count could erase much of the savings. The wall-clock time cost, although potentially reduced by asynchronous orchestration, is not fully characterized. While LEVI typically uses a small fraction of SOTA LLM calls, future directions include exploring settings reliant only on open-weight, local models and integrating online allocation strategies for paradigm shift triggers.
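The dependence on evaluation cost can be made concrete with a back-of-the-envelope model. This is an illustrative assumption, not a model from the paper: it treats run cost as iterations times the sum of per-call mutation cost and per-candidate evaluation cost, and every number in the usage example is hypothetical.

```python
def run_cost(iterations, mutation_cost, eval_cost):
    """Total cost of a run under a simple additive per-iteration model."""
    return iterations * (mutation_cost + eval_cost)


def savings_factor(base_iters, base_mut, cheap_iters, cheap_mut, eval_cost):
    """How many times cheaper the small-model run is than the baseline.

    Values above 1 mean the small-model run is cheaper; values below 1
    mean the extra iterations have erased the savings.
    """
    baseline = run_cost(base_iters, base_mut, eval_cost)
    cheap = run_cost(cheap_iters, cheap_mut, eval_cost)
    return baseline / cheap


if __name__ == "__main__":
    # Hypothetical numbers: the cheap run uses 3x the iterations at 1/20
    # of the per-call mutation cost.
    for eval_cost in (0.0, 0.1, 1.0, 10.0):
        factor = savings_factor(100, 1.0, 300, 0.05, eval_cost)
        print(f"eval_cost={eval_cost}: savings factor {factor:.2f}")
```

Under this model the savings factor tends toward the ratio of iteration counts as evaluation cost grows, so a run that needs more iterations becomes more expensive than the baseline once evaluation dominates.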

Conclusion

LEVI provides evidence that innovation in search architecture can substitute for, and in some cases outperform, brute-force LLM scaling in evolutionary optimization. It offers an open, extensible framework that treats archive diversity, mutation routing, and evaluation budget as separate efficiency axes, delivers strong sample and cost efficiency, and matches or exceeds SOTA results on diverse tasks while shifting the computational and financial burden from LLM inference to smarter harness design. This suggests a fruitful direction for robust, more accessible optimization in LLM-guided discovery (2605.09764).
