EvoTrace: Replayable Evolutionary Coding Dataset

Updated 4 July 2026

EvoTrace is a replayable dataset and methodology that captures full evolutionary coding traces, detailing lineage, diffs, prompts, and evaluator outputs.
It records comprehensive search trajectories across multiple frameworks and tasks, providing transparent insights into algorithmic innovation mechanisms.
The approach enables controlled replays and causal interventions to distinguish between structural changes, tuning refinements, recombination, and overfitting effects.

EvoTrace is a curated, replayable dataset of evolutionary coding traces paired with a methodology and toolkit, EvoReplay, for analyzing what evolutionary coding agents actually evolve rather than summarizing a run only by its best evaluator score (Pelleriti et al., 19 May 2026). It records the full search trajectory (programs generated at each step, their lineage, parent–child diffs, prompts, generator models, evaluator outputs, and replay environments) across four evolutionary frameworks, reasoning and non-reasoning models, and 16 tasks spanning mathematical discovery and algorithm design. Within this framing, a high score can arise from several qualitatively different mechanisms, including new algorithmic structure, re-tuning of an existing strategy, recombination of ideas already present in the model’s internal knowledge, or overfitting to the evaluator. EvoTrace is intended to make those mechanisms observable and testable rather than leaving them conflated in a single best-of-run number (Pelleriti et al., 19 May 2026).

1. Conceptual scope and motivation

EvoTrace was introduced to answer a specific diagnostic question: what, concretely, do evolutionary coding agents evolve? (Pelleriti et al., 19 May 2026). The motivating observation is that recent systems combining LLMs with evolutionary search can produce strong outcomes in mathematical discovery and algorithm design, yet standard reporting usually emphasizes only the best score reached under a task-specific evaluator. That summary statistic is insufficient because it does not identify the mechanism responsible for the gain.

The framework distinguishes four mechanisms that can be conflated by a single top score. New algorithmic structure denotes replacing core procedures or adding new algorithmic phases. Re-tuning of existing strategies denotes changes to numeric literals, thresholds, learning rates, or budgets. Recombination denotes composing previously present ideas or code paths into new hybrids. Overfitting to the evaluator denotes exploiting quirks of public test sets or judge interfaces without generalizing (Pelleriti et al., 19 May 2026).

This diagnostic perspective is central to EvoTrace’s design. It logs a unified, replayable record of the whole search; annotates every code edit with a nine-category taxonomy of edit types; and provides controlled replays and interventions to measure causal effects on score. A recurring misconception in the surrounding area is that a benchmark improvement directly evidences algorithmic innovation. EvoTrace is explicitly organized to test that claim rather than assume it (Pelleriti et al., 19 May 2026).

2. Dataset composition, trace collection, and schema

EvoTrace spans 16 tasks across two language–domain pairs: 6 Python mathematical-discovery tasks and 10 C++ competitive programming problems from ALE-bench Lite, specifically AtCoder Heuristic Contest problems (Pelleriti et al., 19 May 2026). The task split is itself diagnostic. The math tasks reward discovering new algorithmic structure, whereas the ALE tasks require compiling C++ programs and passing external judging with limited evaluator access, making overfitting a realistic risk.

The dataset covers four evolutionary coding systems—OpenEvolve, GEPA, EvoX, and ShinkaEvolve—and five generator LLMs: deepseek-reasoner, claude-sonnet-4-6, claude-haiku-4-5, gemini-3-flash-preview, and deepseek-chat (Pelleriti et al., 19 May 2026). Both “diff-based generation,” in which models emit unified diffs, and full-file rewriting modes are represented. Trace inclusion is restricted by a replayability criterion: traces that cannot be re-run against the original evaluator are excluded.

At the reported scale, EvoTrace contains 121 evolutionary runs, each with 100 iterations, comprising 10,672 unique programs, 10,479 parent→child edits, and 18,400 LLM calls. Token usage totals 274.7M prompt tokens, 80.3M completion tokens, and 42.8M “reasoning” tokens. Per-run medians are 101 programs, 99 edits, 134 LLM calls, approximately 1.7M prompt tokens, and approximately 0.54M completion tokens (Pelleriti et al., 19 May 2026).

Each run records the propose–evaluate–select cycle in a normalized JSONL schema. The trace includes the specific model, prompt, and any retrieved context used to produce a child; a unified diff between parent and child source plus the full child program’s source as a byte-identical record; the parent–child graph and operator used; evaluator logs, errors, timings, task-specific metrics, and scalar score; and environment metadata such as evaluator command, dependencies, timeouts, hardware assumptions, and seeds (Pelleriti et al., 19 May 2026).

| Object type | Representative fields | Role |
| --- | --- | --- |
| `runs.jsonl` | `run_id`, `task_id`, `framework`, `model_config`, `evaluator_cmd` | Slice analyses and preserve evaluator reproducibility |
| `candidates.jsonl` | `candidate_id`, `iteration`, `language`, `source_code`, `parent_ids`, `score` | Preserve full source for replay and literal extraction |
| `edges.jsonl` | `parent_id`, `child_id`, `op_type`, `prompt_id`, `model_id` | Reconstruct lineages and operator effects |
| `evaluations.jsonl` | `candidate_id`, `score`, `metrics`, `logs`, `runtime_ms`, `status` | Capture raw evaluator behavior beyond scalar score |
| `contexts.jsonl` | `system_prompt`, `user_prompt`, `retrieved_exemplars`, `population_summary`, `lineage_snippet` | Enable same-prompt replay and context substitution |
| `replay_env.jsonl` | `docker_image` or `conda_env`, `dependencies`, `timeouts` | Make candidates re-executable under original conditions |
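As a minimal sketch of how these tables fit together, the following Python parses JSONL records and walks edge records back from a candidate to its seed. The `parent_id` and `child_id` field names come from the table above; the in-memory sample, the `"mutate"` operator value, the helper names `read_jsonl` and `lineage_to_seed`, and the choice to follow only the first recorded parent are illustrative assumptions, not the EvoReplay API.

```python
import io
import json


def read_jsonl(stream):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def lineage_to_seed(candidate_id, edges):
    """Follow parent links from a candidate back to a seed.

    Assumption: when a child has several parents, only the first
    recorded edge is followed; a fuller analysis may want all of them.
    """
    parent_of = {}
    for edge in edges:
        parent_of.setdefault(edge["child_id"], edge["parent_id"])
    chain = [candidate_id]
    seen = {candidate_id}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:  # guard against malformed cyclic data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


# Hypothetical sample standing in for an edges.jsonl file.
sample = io.StringIO(
    '{"parent_id": "c0", "child_id": "c1", "op_type": "mutate"}\n'
    '{"parent_id": "c1", "child_id": "c2", "op_type": "mutate"}\n'
    '{"parent_id": "c0", "child_id": "c3", "op_type": "mutate"}\n'
)
edges = read_jsonl(sample)
chain = lineage_to_seed("c2", edges)
print(chain)           # ['c2', 'c1', 'c0']
print(len(chain) - 1)  # lineage depth: 2
```

The length of the returned chain minus one gives the number of edits separating a candidate from its seed.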

EvoReplay synthesizes a parent–child unified textual diff for every edge, enabling line-level analyses across languages without assuming AST transforms. This design choice supports cross-language comparability while also enabling literal extraction for Bayesian optimization and deterministic cycling detection (Pelleriti et al., 19 May 2026).
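A line-level textual diff of this kind can be sketched with Python's standard `difflib` module. This is an illustration of the idea, not EvoReplay's own diffing code; the function name `diff_lines` and the sample sources are hypothetical.

```python
import difflib


def diff_lines(parent_src, child_src):
    """Return (added, deleted) lines from a parent-to-child unified diff."""
    diff = list(
        difflib.unified_diff(
            parent_src.splitlines(),
            child_src.splitlines(),
            fromfile="parent",
            tofile="child",
            lineterm="",
        )
    )
    added, deleted = [], []
    # The first two entries are the '---' and '+++' file headers; they are
    # skipped by position so that source lines such as '--i;' are not
    # mistaken for headers.
    for line in diff[2:]:
        if line.startswith("@@"):
            continue  # hunk marker
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            deleted.append(line[1:])
    return added, deleted


parent = "x = 1\nstep = 0.1\n"
child = "x = 1\nstep = 0.05\n"
added, deleted = diff_lines(parent, child)
print(added)    # ['step = 0.05']
print(deleted)  # ['step = 0.1']
```

Because the comparison is purely textual, the same function applies unchanged to Python and C++ sources, which is the portability argument made above.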

3. Edit taxonomy and annotation methodology

Every edit in EvoTrace is multi-label annotated with one or more of nine recurring edit types, defined inductively from trace inspection and scaled through an LLM-as-judge pipeline (Pelleriti et al., 19 May 2026). The taxonomy is intended to describe the operative character of a code change rather than only its textual surface.

The nine categories are as follows. Hyperparameter tuning covers changes to numeric constants or configuration values without altering control flow. Local refinement covers targeted adjustment inside an existing routine while preserving role and structure. Architectural change covers replacement of a core algorithmic block with a substantively different approach. Composition covers addition of a new component, operator, or branch alongside existing logic, with the old path retained. Efficiency covers behavior-preserving reductions in asymptotic or constant cost, such as vectorization, batching, partial sorts, or caching. Bug fix covers correction of latent defects such as wrong sign, off-by-one, missing guard, or mishandled sentinel. Pruning covers removal of a code path or phase while leaving the rest of the program intact. Refactor covers behavior-preserving restructuring. External dependency covers addition or removal of an import or external library (Pelleriti et al., 19 May 2026).

The annotation pipeline submits each parent–child diff with a structured prompt and returns a JSON list of labels and the lines judged to drive the score change. The package handles batching, retries, and schema validation; annotations are cached and can be regenerated with a different judge. Validation against blind human re-annotation on a stratified sample of $n=200$ edits yielded exact-match accuracy 74.5%, micro-F1 0.90, and macro Cohen’s $\kappa = 0.77$ (Pelleriti et al., 19 May 2026).
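The two set-based agreement metrics named above can be illustrated as follows. This is a sketch under stated assumptions: the snake_case label spellings, the function names, and the toy gold and predicted annotations are invented for illustration, and Cohen's $\kappa$ is not computed here.

```python
# Assumed spellings for the nine taxonomy labels.
TAXONOMY = {
    "hyperparameter_tuning", "local_refinement", "architectural_change",
    "composition", "efficiency", "bug_fix", "pruning", "refactor",
    "external_dependency",
}


def exact_match_accuracy(gold, pred):
    """Fraction of edits whose predicted label set equals the gold set."""
    matches = sum(1 for g, p in zip(gold, pred) if set(g) == set(p))
    return matches / len(gold)


def micro_f1(gold, pred):
    """Micro-averaged F1: pool label decisions over all edits."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy annotations (invented): human labels vs. judge labels for 4 edits.
gold = [{"hyperparameter_tuning"}, {"efficiency", "refactor"},
        {"bug_fix"}, {"architectural_change", "composition"}]
pred = [{"hyperparameter_tuning"}, {"efficiency"},
        {"bug_fix", "local_refinement"}, {"architectural_change", "composition"}]

# Schema check: every label must belong to the taxonomy.
assert all(labels <= TAXONOMY for labels in gold + pred)

print(exact_match_accuracy(gold, pred))  # 0.5
print(round(micro_f1(gold, pred), 3))    # 0.833
```

The toy case shows why exact-match accuracy can sit well below micro-F1 on multi-label data: a single missing or extra label fails the exact match but costs only one pooled decision.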

A notable characteristic of the dataset is that most edits are multi-label. Specifically, 67.4% have two or more labels, while only 32.4% are single-label (Pelleriti et al., 19 May 2026). This indicates that the categories encode overlapping mechanisms rather than mutually exclusive types. A plausible implication is that many score changes arise from coupled structural and parametric modifications rather than isolated edit primitives.

EvoTrace also introduces a deterministic cycling classifier based on lineage-local line reuse. Added lines are checked against previously deleted lines in the same lineage and classified as literal recycling when the reintroduction is byte-identical, tuning recycling when the same line skeleton reappears after collapsing numeric literals into placeholders, and trivial recycling when the match is comment- or whitespace-only (Pelleriti et al., 19 May 2026).
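The three recycling classes can be sketched as below. The regular expression for numeric literals, the `<NUM>` placeholder, the per-language comment prefixes, the `"novel"` fallback label, and the reading of "trivial" as a matched line that is itself only a comment or whitespace are all assumptions made for illustration; the actual classifier may differ in each of these details.

```python
import re

# Assumed pattern for numeric literals such as 3, 0.95, or 1e-3.
NUMBER = re.compile(r"\b\d+(?:\.\d+)?(?:[eE][-+]?\d+)?\b")


def skeleton(line):
    """Collapse numeric literals into a placeholder."""
    return NUMBER.sub("<NUM>", line)


def is_trivial(line, comment_prefixes):
    """True for blank lines and whole-line comments."""
    stripped = line.strip()
    return stripped == "" or stripped.startswith(comment_prefixes)


def classify_added_line(line, deleted_before, comment_prefixes=("#",)):
    """Classify an added line against lines deleted earlier in the lineage.

    Pass comment_prefixes=("//",) for C++ so that preprocessor lines
    such as '#define' are not treated as comments.
    """
    deleted = set(deleted_before)
    literal = line in deleted
    tuning = skeleton(line) in {skeleton(d) for d in deleted}
    if not (literal or tuning):
        return "novel"
    if is_trivial(line, comment_prefixes):
        return "trivial"
    return "literal" if literal else "tuning"


deleted = ["    temp = 0.95", "    # cool down", "    swap(a, b)"]
print(classify_added_line("    swap(a, b)", deleted))   # literal
print(classify_added_line("    temp = 0.90", deleted))  # tuning
print(classify_added_line("    # cool down", deleted))  # trivial
print(classify_added_line("    reheat()", deleted))     # novel
```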

4. EvoReplay: replayable local search states and interventions

EvoReplay treats a candidate program at edge $t$ as an executable artifact attached to a precise local search state,

$S_t = \{p_t, P_t, \text{op\_type}_t, C_t, m_t, \text{sampling params}, f, E\},$

where $p_t$ is the candidate program, $P_t$ the parent set, $C_t$ the prompt and context, $m_t$ the model, $f$ the evaluator, and $E$ the replay environment (Pelleriti et al., 19 May 2026). Because EvoTrace stores all of these components, the methodology can re-execute, perturb, or substitute one factor while holding others fixed.
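One way to picture "substitute one factor while holding others fixed" is an immutable record with a single field replaced. The sketch below uses a frozen dataclass; the class name, field names, and all example values are hypothetical and do not reflect how EvoReplay represents a search state internally.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class LocalSearchState:
    program: str            # p_t: candidate source
    parents: tuple          # P_t: parent candidate ids
    op_type: str            # operator that produced the candidate
    context: str            # C_t: saved prompt and context
    model: str              # m_t: generator model
    sampling_params: tuple  # e.g. (("temperature", 0.7),)
    evaluator_cmd: str      # f: how the candidate is scored
    replay_env: str         # E: environment identifier


state = LocalSearchState(
    program="def solve(): ...",
    parents=("c41",),
    op_type="mutate",
    context="<saved prompt and context>",
    model="model-a",
    sampling_params=(("temperature", 0.7),),
    evaluator_cmd="python evaluate.py",
    replay_env="example-image:latest",
)

# Model substitution: every other component of the state is unchanged.
swapped = replace(state, model="model-b")
changed = [f.name for f in fields(state)
           if getattr(state, f.name) != getattr(swapped, f.name)]
print(changed)  # ['model']
```

Freezing the record makes accidental mutation of a saved state an error, so any intervention has to be expressed as an explicit, single-field substitution.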

The controlled interventions are designed for causal attribution. Same-prompt replay re-runs the saved context $C_t$ with the same model $m_t$, or with a substituted model, multiple times and reports parse success, evaluation success, score distribution, and exact-program match rate. Hyperparameter retuning decomposes a program as $p = (s, \theta)$, with structure $s$ and exposed hyperparameters $\theta$, and defines the fixed-structure tuning ceiling

$f^{*}(s) = \max_{\theta' \in \Theta} f(s, \theta'),$

where $\Theta$ denotes the search ranges of the exposed hyperparameters. The tuning gap is then

$\Delta_{\text{tune}}(p) = f^{*}(s) - f(s, \theta).$

Implementation uses one LLM pass to identify tunable literals and ranges, rewrites the program to expose a top-level PARAMS block in Python or #define macros in C++, and runs skopt.gp_minimize for 24 evaluator calls, specifically 8 random plus 16 BO acquisitions (Pelleriti et al., 19 May 2026).
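The retuning intervention can be illustrated with a small, dependency-free sketch. It estimates the fixed-structure tuning ceiling (the best score reachable by changing only the exposed hyperparameters) and takes the tuning gap to be that ceiling minus the original program's score, which is an assumption about the exact definition. Plain random search stands in here for `skopt.gp_minimize`, so all 24 evaluator calls are random draws rather than 8 random plus 16 acquisition steps. The toy evaluator, parameter names, and ranges are hypothetical, and higher scores are assumed to be better.

```python
import random


def tuning_ceiling(evaluate, ranges, budget=24, seed=0):
    """Estimate the best score over the hyperparameter ranges.

    Random search is a stand-in for Bayesian optimization; the result
    is a lower bound on the true fixed-structure ceiling.
    """
    rng = random.Random(seed)
    best_theta, best_score = None, float("-inf")
    for _ in range(budget):
        theta = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        score = evaluate(theta)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


def evaluate(params):
    """Toy evaluator for a fixed structure with two exposed literals."""
    return 1.0 - (params["step"] - 0.3) ** 2 - (params["decay"] - 0.8) ** 2


original = {"step": 0.9, "decay": 0.1}              # literals as evolved
ranges = {"step": (0.0, 1.0), "decay": (0.0, 1.0)}  # assumed search ranges

original_score = evaluate(original)
best_theta, searched = tuning_ceiling(evaluate, ranges)
ceiling = max(searched, original_score)  # the original setting is a candidate too
gap = ceiling - original_score
print(f"original={original_score:.3f} ceiling={ceiling:.3f} gap={gap:.3f}")
```

A large gap suggests that the structure was under-tuned when it was scored, so a later "improvement" over it may be parametric rather than structural.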

Further interventions include ablation and pruning tests, which remove or disable newly added components while holding evaluator and seed constant; context and model substitution, which keep parent code and evaluator fixed while changing the generator model or prompt/context; and static analyses such as lineage depth, budget utilization, hyperparameter-literal counts, and cycling detection (Pelleriti et al., 19 May 2026).

The framework is explicit about what remains fixed and what changes. The evaluator $f$, replay environment $E$, and original prompts or contexts $C_t$ can be held constant for exact same-prompt tests, while the single factor under study (such as the hyperparameter values, a code block, the model, or retrieved context) is varied (Pelleriti et al., 19 May 2026). This design makes the methodology closer to controlled experimental replay than to ordinary post hoc log inspection.
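The quantities reported for same-prompt replay (parse success, evaluation success, exact-program match rate, and replayed score relative to the original) can be summarized as in the sketch below. The record layout with `parsed`, `source`, and `score` keys, the function name, and the treatment of "relative score" as a simple ratio are assumptions for illustration; the example data are invented.

```python
from statistics import median


def summarize_replays(original_source, original_score, replays):
    """Summarize resamples of one saved prompt and context.

    Assumes original_score is non-zero and that each replay is a dict
    with 'parsed' (bool), 'source' (str), and 'score' (float or None).
    """
    n = len(replays)
    parsed = [r for r in replays if r["parsed"]]
    evaluated = [r for r in parsed if r["score"] is not None]
    relative = [r["score"] / original_score for r in evaluated]
    return {
        "parse_success": len(parsed) / n,
        "eval_success": len(evaluated) / n,
        "exact_match": sum(r["source"] == original_source for r in replays) / n,
        "median_relative_score": median(relative) if relative else None,
    }


replays = [
    {"parsed": True, "source": "a", "score": 2.0},   # byte-identical redraw
    {"parsed": True, "source": "b", "score": 1.0},   # different program
    {"parsed": True, "source": "c", "score": None},  # evaluator failure
    {"parsed": False, "source": "", "score": None},  # unparseable output
]
summary = summarize_replays("a", 2.0, replays)
print(summary)
# {'parse_success': 0.75, 'eval_success': 0.5, 'exact_match': 0.25,
#  'median_relative_score': 0.75}
```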

Math and ALE tasks are handled differently at the evaluator layer. Math programs are executed under objective-specific evaluators returning scalar scores together with full logs and runtime metadata. ALE solutions compile and are scored by the AtCoder judge harness, with the important caveat that public scores differ from private test-set scores used for final rating; EvoTrace therefore re-scores public best-so-far chains on the private test set to detect overfitting (Pelleriti et al., 19 May 2026).
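The re-scoring check can be sketched as follows: build the best-so-far chain from public scores, then see whether private scores keep pace along that chain. The record keys, the function names, and in particular the alignment criterion (private score never drops along the chain, within a tolerance) are assumptions; the paper's own criterion may differ. Higher scores are assumed to be better, and the data are invented.

```python
def best_so_far_chain(candidates):
    """Candidates that strictly improve the running best public score."""
    chain, best = [], float("-inf")
    for cand in sorted(candidates, key=lambda c: c["iteration"]):
        if cand["public"] > best:
            chain.append(cand)
            best = cand["public"]
    return chain


def is_aligned(chain, tolerance=0.0):
    """True when private scores never drop along the public chain."""
    privates = [c["private"] for c in chain]
    return all(b >= a - tolerance for a, b in zip(privates, privates[1:]))


candidates = [
    {"iteration": 0, "public": 10, "private": 9},
    {"iteration": 1, "public": 8, "private": 12},   # not a public improvement
    {"iteration": 2, "public": 15, "private": 11},
    {"iteration": 3, "public": 20, "private": 7},   # public up, private down
]
chain = best_so_far_chain(candidates)
print([c["iteration"] for c in chain])  # [0, 2, 3]
print(is_aligned(chain))                # False
```

In this toy run the final public best is the worst candidate on the private set, which is the pattern the re-scoring step is meant to expose.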

5. Empirical findings and their significance

The principal findings from EvoTrace and EvoReplay are diagnostic rather than purely benchmark-oriented (Pelleriti et al., 19 May 2026).

Edit-type concentration of gains: Hyperparameter tuning dominates the search distribution, but the most helpful edit types per edit, measured by odds ratio (OR) for a positive normalized score change, are External dependency (OR 3.58), Efficiency (OR 1.61), and Architectural change (OR 1.55). Successful trajectories, including best-so-far updates and final-best lineages, are enriched in Efficiency, External dependency, and Hyperparameter tuning relative to base rates. The stated implication is that search spends most of its budget on common but less helpful edit types, while the categories with the highest per-edit odds of improvement are rarer.
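For readers unfamiliar with the statistic, an odds ratio for an edit type can be computed from a 2×2 contingency table: edits with or without the label, crossed with positive or non-positive score change. The sketch below assumes this plain contingency-table definition, which may not match the paper's exact estimation procedure; the record keys and the six toy edits are invented.

```python
def edit_type_odds_ratio(edits, label):
    """Odds of a positive score change with the label vs. without it."""
    with_pos = with_neg = without_pos = without_neg = 0
    for edit in edits:
        positive = edit["delta"] > 0
        if label in edit["labels"]:
            if positive:
                with_pos += 1
            else:
                with_neg += 1
        else:
            if positive:
                without_pos += 1
            else:
                without_neg += 1
    if with_neg == 0 or without_pos == 0:
        raise ValueError("odds ratio undefined: empty cell in the table")
    return (with_pos * without_neg) / (with_neg * without_pos)


# Toy edits (invented): labels plus normalized score change.
edits = [
    {"labels": {"efficiency"}, "delta": 0.2},
    {"labels": {"efficiency", "refactor"}, "delta": 0.1},
    {"labels": {"efficiency"}, "delta": -0.1},
    {"labels": {"hyperparameter_tuning"}, "delta": 0.05},
    {"labels": {"hyperparameter_tuning"}, "delta": -0.2},
    {"labels": {"local_refinement"}, "delta": 0.0},
]
print(edit_type_odds_ratio(edits, "efficiency"))  # 4.0
```

An odds ratio above 1 means edits carrying the label are more likely than other edits to improve the score, independent of how common the label is.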

Deterministic cycling: Across all 121 runs, the median share of added lines that are byte-identical re-introductions of previously deleted lines is about 30%. The cycling rate increases monotonically over the run in 118 of 121 cases, with a positive median per-iteration slope. The median span between deletion and re-introduction is 5 iterations, and the pattern is reported as stable across frameworks, languages, and all 5 generator models. Under the refined classifier, tuning recycling constitutes a median 8% of code-changing lines and varies by model and prompting mode (Pelleriti et al., 19 May 2026).

Reproducibility of breakthroughs: Same-prompt replay over 36 best-so-far events, with 10 resamples each, yields median parse success 1.00 and median evaluator success 1.00, but median exact byte-identical reproduction 0.00. The median replayed score relative to the original is 0.76. The interpretation given in the paper is that structural gains are carried by the prompt and context, while the exact code is one draw from a broader distribution (Pelleriti et al., 19 May 2026).

Parametric recoverability on math tasks: Bayesian optimization (BO) improved 22 of 36 intermediate programs over their original scores. When comparing BO on intermediate programs to the evolutionary run’s final-best, BO matched or exceeded the final-best on 13 of 15 probed cases, so the median delta is non-negative. The largest cited case is heilbronn_tri_dsr_nodiff, where the evolution final-best is 0.521 and BO on an intermediate program reaches 0.886, or 1.70× the final-best. The reported implication is that, on math tasks, much of the late-run improvement is parametric; early structural changes matter, but tuning can often catch up (Pelleriti et al., 19 May 2026).

Lineage depth and budget use: Median lineage depth from final best to seed is 6 for math and 4 for ALE. Best-so-far is found relatively early on ALE, at median 0.49 of the run, versus 0.75 on math. The paper characterizes the dominant pattern as “jackpot-then-flat,” with most budget spent exploring dead branches (Pelleriti et al., 19 May 2026).
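The "fraction of the run at which best-so-far is found" can be made concrete with a short sketch. It assumes one score per iteration, higher scores being better, and a fraction defined as the 1-based iteration of the first occurrence of the maximum divided by the run length; the function name and the toy score sequence are invented.

```python
def best_found_fraction(scores):
    """Fraction of the run elapsed when the final best score first appears."""
    best = max(scores)
    first_index = scores.index(best)  # first iteration reaching the best
    return (first_index + 1) / len(scores)


# Toy "jackpot-then-flat" run: the best score appears at iteration 4 of 8
# and is never exceeded afterwards.
scores = [1, 3, 2, 5, 5, 4, 5, 5]
print(best_found_fraction(scores))  # 0.5
```

A value near 0.5 means roughly half of the evaluation budget was spent after the eventual best program had already been found.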

Overfitting in ALE: Re-scoring public best-so-far chains on AtCoder’s private test set across 30 run/problem pairs and 10 ALE problems shows that two of the four frameworks overfit on at least 30% of scored problems. The same problem can flip between aligned improvement and severe overfit across frameworks, exemplified by ahc024. Reported per-framework alignment is evox 0/8 aligned, gepa 1/7, openevolve 4/9, and shinka 4/6 (Pelleriti et al., 19 May 2026). This directly qualifies any reading of public-score improvement as evidence of robust algorithmic gain.

6. Availability, limitations, and terminological scope

EvoTrace is released on Hugging Face and contains JSONL tables for runs, candidates, edges, evaluations, contexts, and replay environments, together with full source for each program and evaluator metadata and logs (Pelleriti et al., 19 May 2026). EvoReplay is released on GitHub, built on SkyDiscover, and provides schema normalization, per-edit unified diffs, LLM-as-judge annotation, deterministic cycling classification, same-prompt replay, a Bayesian optimization wrapper, ablation and substitution utilities, and visualization scripts. Reproduction consists of hydrating the replay environment for a chosen run, loading the trace tables, running the provided notebooks or scripts for static analysis, annotation, and replay, and comparing re-executed scores to recorded ones (Pelleriti et al., 19 May 2026).

The paper identifies several limitations. Coverage is restricted to 16 tasks in two domains across four frameworks and five models, so results may differ for other domains such as GPU kernels or large-scale systems code. Evaluator biases remain important: ALE public judges can misrepresent private performance, and math evaluators differ in how many tunable knobs are exposed, complicating cross-framework headline comparisons. The LLM-as-judge pipeline, although validated at $\kappa = 0.77$ and micro-F1 0.90, is not perfect, and the external_dependency label can be difficult to assign. Line-level textual diffs preserve portability but miss AST-level intent, and current replays focus on local search states rather than broader interventions such as selection-policy changes (Pelleriti et al., 19 May 2026).

The proposed directions follow directly from those limitations. They include extending the fixed-structure tuning ceiling to more domains and evaluator types; expanding replays to richer perturbations such as selection rules, novelty filters, and deletion-aware credit assignment; incorporating behavioral-similarity measures to separate genuine algorithmic innovation from superficial syntactic changes; and pairing public leaderboards with private or cross-split re-scoring to detect and discourage overfitting (Pelleriti et al., 19 May 2026).

A terminological qualification is warranted. In the literature cited here, “EvoTrace” is the explicit name of the evolutionary-coding dataset and replay methodology just described (Pelleriti et al., 19 May 2026). The term is also sometimes applied informally, as an “EvoTrace view,” to an Eulerian trace finite element method for PDEs on evolving surfaces, but that work is formally titled “A trace finite element method for PDEs on evolving surfaces” and develops a distinct numerical method for surface transport–diffusion equations on evolving implicit surfaces (Olshanskii et al., 2016). The two usages are unrelated in subject matter: one concerns replayable traces of evolutionary coding agents, while the other concerns trace FEM on evolving surfaces.

References

Pelleriti et al. (2026). What Do Evolutionary Coding Agents Evolve?

Olshanskii et al. (2016). A trace finite element method for PDEs on evolving surfaces.
