
EvoReplay: Evolutionary and MARL Replay Dynamics

Updated 4 July 2026
  • EvoReplay is a replay-based framework that reconstructs evolutionary coding trajectories to differentiate between structural innovations and parameter tuning.
  • In multi-agent reinforcement learning, it implements experience replay with revision protocols to realize innovative dynamics like BNN and Smith dynamics.
  • It also informs ERC-aware on-policy RL by applying stabilization tricks that control distribution shifts and mitigate negative update biases.

EvoReplay denotes a replay-based methodological label that appears in distinct technical contexts in recent arXiv work. In evolutionary coding, it is the experimental layer built on top of EvoTrace to answer the question of what evolutionary coding agents evolve as program scores improve over time; it reconstructs local search states behind score jumps and replays, perturbs, and retunes them under controlled interventions (Pelleriti et al., 19 May 2026). In multi-agent reinforcement learning, EvoReplay refers to the same algorithmic idea as Experience-replay Innovative Dynamics (ERID), an experience replay-based update rule that uses revision protocols to realize innovative evolutionary dynamics such as Brown–von Neumann–Nash and Smith dynamics (Zhang et al., 21 Jan 2025). A related replay-centered line formalizes “experience replayable conditions” and derives stabilization tricks that make experience replay applicable to Advantage Actor-Critic in on-policy settings (Kobayashi, 2024).

1. Terminological scope and shared structure

The term is therefore not tied to a single invariant algorithm. In one usage, EvoReplay is a diagnostic framework for analyzing evolutionary coding traces (Pelleriti et al., 19 May 2026); in another, it is a policy-update mechanism for MARL grounded in evolutionary game theory (Zhang et al., 21 Jan 2025). A related body of work on experience replayable conditions supplies a stricter criterion for when replay can be used without inducing instability or bias in on-policy learning (Kobayashi, 2024).

| Usage | Primary object | Core mechanism |
|---|---|---|
| EvoReplay in evolutionary coding | Search traces over candidate programs | Replay, perturbation, retuning, ablation |
| EvoReplay as ERID in MARL | Policy simplex over actions | Experience replay with revision protocols |
| ERC-aware replay design | Replayed transitions in on-policy RL | Counteraction and mining of indistinguishable experiences |

A shared structural motif is that replay transforms logged trajectories into objects of intervention. In evolutionary coding, replay is attached to the original evaluator harness so that alternative edits, constants, model substitutions, and prompt contexts remain directly comparable to the recorded trajectory. In MARL, replay supplies empirical average rewards that substitute for fitness in revision protocols, allowing discrete-time updates to track continuous-time innovative dynamics. This suggests a family resemblance: replay is used either diagnostically, to attribute gains and failure modes, or algorithmically, to shape the policy dynamics themselves.

2. EvoReplay in evolutionary coding: objectives, data substrate, and workflow

In “What Do Evolutionary Coding Agents Evolve?”, EvoReplay is explicitly diagnostic rather than purely performance-oriented. Its objective is to distinguish structural innovations from parametric tuning, measure recombination and refactoring versus net helpfulness, detect deterministic cycling, and separate evaluator-overfitting from generalization (Pelleriti et al., 19 May 2026). The motivation is that a headline best-in-run score may conflate genuine algorithmic innovation, re-tuning of known strategies, recombination of ideas the model already “knows,” and evaluator-specific overfitting.

The framework operates on EvoTrace, a unified dataset of evolutionary coding runs normalized to a replayable schema. Each candidate program is stored with its full source, its parent(s), an exact unified diff to its parent, the LLM prompt and retrieved context used to generate it, and the complete evaluator outputs, including logs, metrics, and errors. EvoTrace comprises 121 runs across four evolutionary frameworks—OpenEvolve, GEPA, EvoX, and ShinkaEvolve—over 16 tasks spanning 6 Python mathematical discovery problems and 10 C++ ALE-bench competitive programming tasks. Five LLMs generate mutations: deepseek-reasoner, claude-sonnet-4.6, claude-haiku-4.5, gemini-3-flash-preview, and deepseek-chat. Each run consists of 100 iterations, for an aggregate scale of 10,672 unique programs, 18,400 LLM calls, and 274.7M prompt tokens.

EvoReplay’s methodology has four capabilities. Static analysis of traces normalizes parent→child edits into a table of parent, child, prompt, score, and unified diff, enabling measures such as lineage depth, best-so-far timing, program length, numeric-literal counts, and deterministic cycling detection. LLM-as-judge edit annotation applies a nine-type taxonomy to every edit with batching, schema validation, caching, and re-annotation under alternative judge models. Bayesian optimization over a single program’s exposed constants estimates how much of an evolutionary gain is recoverable by tuning alone. Replay stability of breakthroughs re-executes the saved generating prompt for a candidate under the original or a substituted model and summarizes outcomes as parse success, evaluation success, and score conditional on success.
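The static-analysis measures named above can be illustrated with a minimal Python sketch. The `Candidate` record and its field names are hypothetical stand-ins for a trace table row, not the EvoTrace schema itself, and only two of the measures (lineage depth and best-so-far timing) are shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One row of a hypothetical trace table (field names are illustrative)."""
    cid: str
    parent: Optional[str]    # None for the seed program
    iteration: int
    score: Optional[float]   # None when evaluation failed


def lineage_depth(cands: List[Candidate], cid: str) -> int:
    """Number of parent->child edges between the seed program and `cid`."""
    by_id = {c.cid: c for c in cands}
    depth = 0
    node = by_id[cid]
    while node.parent is not None:
        node = by_id[node.parent]
        depth += 1
    return depth


def best_so_far_updates(cands: List[Candidate]) -> List[int]:
    """Iterations at which the running best score strictly improves."""
    best = float("-inf")
    updates = []
    for c in sorted(cands, key=lambda c: c.iteration):
        if c.score is not None and c.score > best:
            best = c.score
            updates.append(c.iteration)
    return updates


# Toy trace: candidate "c" failed evaluation, "d" scored below its parent.
trace = [
    Candidate("a", None, 0, 0.10),
    Candidate("b", "a", 1, 0.30),
    Candidate("c", "a", 2, None),
    Candidate("d", "b", 3, 0.25),
    Candidate("e", "d", 4, 0.50),
]
depth_e = lineage_depth(trace, "e")
updates = best_so_far_updates(trace)
print(depth_e, updates)
```

In this toy trace the final-best candidate sits three edits below the seed, and the running best improves at iterations 0, 1, and 4.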

Algorithmically, the package is built on SkyDiscover, ingests EvoTrace JSONL tables, and assumes replayability under the original evaluator, byte-identical prompts and sources, deterministic diffing, and scalar or comparable evaluator scores. Typical 100-iteration runs produce about 100 programs, about 134 LLM calls, and about 2.2M tokens, while replay interventions are comparatively modest: replay uses 10 resamples per target, Bayesian optimization uses 24 calls per target, and static analysis is negligible relative to run generation.

3. Formalization, notation, and controlled interventions

EvoReplay formalizes programs as executable artifacts $p \in \Pi$ and evaluators as scoring functions $f: \Pi \to \mathbb{R}$, or task-specific metrics convertible to a scalar fitness. A run induces a search graph with nodes for candidates and directed edges for parent→child edits. The local search state at iteration $t$ is

$$S_t = (p_t, C_t, f, E_t, M_t),$$

where $p_t$ is the current program, $C_t$ is the byte-identical prompt/context seen by the LLM, $f$ is the evaluator, $E_t$ encodes the set of edit operations applied in the parent→child diff $D(p_{t-1}, p_t)$, and $M_t$ denotes the generator model (Pelleriti et al., 19 May 2026).

For diff-based analysis, the diff $D(p_{t-1}, p_t)$ returns multisets of added and removed lines. EvoReplay defines a literal recycling indicator, a tuning-recycling indicator that collapses numeric literals via a placeholder NUM, and a trivial-recycling indicator for whitespace or comment-only changes. The per-edit cycling rate is

[equation missing in source]

summarized by its per-run median and its linear-fit slope over the iterations of a run. For score attribution, an edit edge with an associated score change receives a multi-label set from the edit taxonomy, and a run-level contribution is aggregated over these labeled edges [formula missing in source]. Helpfulness is summarized through odds ratios rather than means because score deltas are heavy-tailed and failure modes are bimodal.
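The three recycling indicators can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the package's implementation: the regular expression for numeric literals, the rule that a line counts as trivial when it is blank or comment-only, and the function names are all hypothetical. Each edit is given as a pair of removed and added lines, and the rate is taken as the fraction of added lines that re-introduce something deleted earlier in the run.

```python
import re

# Assumed pattern for numeric literals; identifiers such as "x1" are left alone.
NUM = re.compile(r"\b\d+(?:\.\d+)?(?:[eE][-+]?\d+)?\b")


def skeleton(line: str) -> str:
    """Collapse every numeric literal to the placeholder NUM."""
    return NUM.sub("NUM", line.strip())


def is_trivial(line: str) -> bool:
    """Blank or comment-only line (assumed definition of 'trivial')."""
    s = line.strip()
    return s == "" or s.startswith("#") or s.startswith("//")


def cycling_rates(edits):
    """For each (removed, added) edit, return the fraction of added lines that
    re-introduce earlier deletions, split into literal / tuning / trivial."""
    deleted, deleted_skeletons = set(), set()
    rates = []
    for removed, added in edits:
        counts = {"literal": 0, "tuning": 0, "trivial": 0}
        for line in added:
            if is_trivial(line):
                if line in deleted:
                    counts["trivial"] += 1
            elif line in deleted:
                counts["literal"] += 1           # byte-identical re-introduction
            elif skeleton(line) in deleted_skeletons:
                counts["tuning"] += 1            # same line up to numeric values
        n = len(added)
        rates.append({k: (v / n if n else 0.0) for k, v in counts.items()})
        # Deletions only become visible to later edits.
        for line in removed:
            deleted.add(line)
            if not is_trivial(line):
                deleted_skeletons.add(skeleton(line))
    return rates


edits = [
    (["lr = 0.1", "use_shake = True"], ["lr = 0.2"]),
    (["lr = 0.2"], ["lr = 0.1", "steps = 50"]),
    (["steps = 50"], ["lr = 0.3"]),
]
rates = cycling_rates(edits)
print(rates)
```

In the toy run, the second edit restores `lr = 0.1` verbatim (a literal re-introduction covering half of its added lines), while the third edit adds `lr = 0.3`, which matches a deleted line only after numeric literals are collapsed.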

The structural-versus-parametric decomposition writes a program as a pair consisting of a structure and a set of exposed hyperparameters. A single-program tuning ceiling, the best score reachable by tuning the hyperparameters of a fixed structure, is

[equation missing in source]

implemented with 24 evaluator calls, specifically 8 random starts and 16 Bayesian-optimization acquisitions over a Gaussian-process surrogate. The tuning gap is defined relative to a run’s final best [formula missing in source].

Replay reproducibility is summarized by four quantities: the fraction of replays producing runnable code, the fraction that pass the evaluator, the fraction matching the original program byte-for-byte, and a score measure conditional on success, defined as

[equation missing in source]
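A minimal Python sketch of such a summary follows. The record layout (`source`, `runs`, `score`) and the function name are hypothetical, and the fourth quantity is assumed here to be the median ratio of replay score to original score among replays that pass the evaluator; the source only says that score is reported conditional on success.

```python
from statistics import median


def replay_summary(original_source, original_score, replays):
    """Summarize replay outcomes for one target candidate.

    Each replay is a dict with hypothetical keys:
      'source' : generated program text, or None if nothing parseable came back
      'runs'   : True if the program was runnable
      'score'  : evaluator score, or None if evaluation failed
    `original_score` is assumed to be nonzero.
    """
    n = len(replays)
    runnable = [r for r in replays if r["runs"]]
    passed = [r for r in runnable if r["score"] is not None]
    exact = [r for r in replays if r["source"] == original_source]
    ratios = [r["score"] / original_score for r in passed]
    return {
        "runnable_rate": len(runnable) / n,
        "pass_rate": len(passed) / n,
        "exact_match_rate": len(exact) / n,
        # Assumed form of the score measure: median ratio among passing replays.
        "median_score_ratio": median(ratios) if ratios else None,
    }


replays = [
    {"source": None, "runs": False, "score": None},
    {"source": "def f(): return 2", "runs": True, "score": None},
    {"source": "def f(): return 3", "runs": True, "score": 0.8},
    {"source": "def f(): return 4", "runs": True, "score": 0.9},
]
summary = replay_summary("def f(): return 1", 1.0, replays)
print(summary)
```

The toy input mirrors the qualitative pattern described in the findings: no replay matches the original byte-for-byte, yet the passing replays recover most of the score.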

These abstractions support several interventions. Parametric retuning uses one LLM call to identify tunable numeric literals and rewrites the program to expose them through a PARAMS block or #define macros. Pruning and repair remove phases or guards, then retune or add missing guard checks and sentinel validation. Model or context substitution re-executes the exact saved context $C_t$ under the original model $M_t$ or a substituted model. Label-guided transformations alter external dependencies or efficiency-critical primitives. The reported examples are concrete: on a Heilbronn placement program, BO on an intermediate structure recovered and exceeded the run’s final-best, 0.886 versus 0.521; deleting a final “global shake” phase while retuning annealing constants yielded [value missing in source]; one ALE program gained +56.2 rating points by checking INF_COST and function success; and introducing jax and optax for a numerical optimizer yielded [value missing in source].

4. Empirical findings in evolutionary coding

EvoReplay’s central empirical contribution is to show that score improvement mechanisms are heterogeneous and not reducible to end-of-run best fitness (Pelleriti et al., 19 May 2026). The edit taxonomy contains nine recurring edit types: Hyperparameter tuning, Local refinement, Architectural change, Composition, Efficiency, Bug fix, Pruning, Refactor, and External dependency. Edits are typically multi-label: 67.4% have at least two labels, 52.4% have exactly two, and 32.4% are single-label. The LLM-as-judge pipeline is validated against blind human re-annotation on 200 parent→child edits with macro Cohen’s $\kappa$ [value missing in source], micro-F1 = 0.90, and exact-match accuracy 74.5%.

By frequency, Hyperparameter tuning dominates the search distribution. By per-edit helpfulness, however, External dependency has odds ratio 3.58×, Efficiency has 1.61×, and Architectural change has 1.55×, each reported with an accompanying statistic [values missing in source]. Best-so-far updates and final-best lineages are enriched in Efficiency, External dependency, and Hyperparameter tuning relative to the base rate. The resulting frequency–utility gap indicates that evolutionary systems spend most of their effort on edits that rarely drive large improvements.
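For readers unfamiliar with the statistic, a per-label odds ratio can be computed from a 2×2 table of label presence against helpfulness. The sketch below is illustrative only: it assumes an edit counts as helpful when its score delta is strictly positive, the function name is hypothetical, and the toy data are invented for the example rather than taken from EvoTrace.

```python
def label_odds_ratio(edits, label):
    """Odds of a helpful edit with `label` divided by the odds without it.

    `edits` is a list of (label_set, score_delta) pairs; an edit is treated
    as helpful when score_delta > 0 (an assumption of this sketch).
    """
    with_helpful = with_unhelpful = without_helpful = without_unhelpful = 0
    for labels, delta in edits:
        helpful = delta > 0
        if label in labels:
            with_helpful += helpful
            with_unhelpful += not helpful
        else:
            without_helpful += helpful
            without_unhelpful += not helpful
    if with_unhelpful == 0 or without_helpful == 0:
        raise ValueError("odds ratio undefined: a cell of the 2x2 table is empty")
    return (with_helpful * without_unhelpful) / (with_unhelpful * without_helpful)


# Invented toy data: multi-label edits with their score deltas.
toy_edits = [
    ({"External dependency"}, 0.20),
    ({"External dependency", "Efficiency"}, 0.10),
    ({"External dependency"}, -0.10),
    ({"Hyperparameter tuning"}, 0.05),
    ({"Hyperparameter tuning"}, -0.02),
    ({"Hyperparameter tuning"}, -0.01),
    ({"Local refinement"}, 0.0),
]
ratio = label_odds_ratio(toy_edits, "External dependency")
print(ratio)
```

Because the statistic depends only on the sign of each delta, a handful of very large score jumps cannot dominate it, which is the stated reason for preferring odds ratios over means.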

Deterministic cycling is another major result. About 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. The cycling rate grows monotonically over the run in 118/121 cases, with median per-iteration slope +0.0030 and median 5 iterations between deletion and re-introduction. A three-way classifier separates literal, tuning-skeleton, and trivial forms; the median per-run tuning-skeleton share is 8%, with range 2–44%. The diff-vs-no-diff prompting mode is the strongest predictor: for example, deepseek-reasoner’s tuning share drops from 20% to 2% when diff generation is turned off.

Replay establishes that breakthroughs are structurally rather than lexically reproducible. Across 36 breakthroughs, medians are reported for the four reproducibility quantities [values missing in source]. Replays almost never reproduce the exact program, but typically recover most of the score from different code. The authors interpret this as evidence that score jumps reflect robust structural patterns in the prompt context rather than fragile lexical artifacts.

The tuning-gap analysis shows that, on math tasks, much of late-run improvement can be recovered post hoc by tuning a mid-run structure. With only 24 evaluator calls per target, Bayesian optimization improves in 22/36 cases and matches or exceeds the run’s final-best score on 13/15 intermediate programs, with median delta +0.025. The largest gain is the 1.70× improvement from 0.521 to 0.886 on a Heilbronn problem.

Evaluator overfitting is especially salient on ALE. Public scores are not the held-out judging metric. Re-scoring each run’s public best-so-far chain on AtCoder’s private test set shows that two of four frameworks overfit on at least 30% of the problems they were scored on, and the same problem can flip generalization sign between frameworks; the reported example is ahc024, with OpenEvolve at +1,606 private gain versus ShinkaEvolve at −1,610 despite public gains. Public best-so-far chains are therefore unreliable single-number summaries on ALE.

5. EvoReplay as Experience-replay Innovative Dynamics in MARL

In “Experience-replay Innovative Dynamics,” EvoReplay is identified with ERID, a replay-based MARL algorithm designed to move beyond the limits of replicator dynamics in null-stable games such as zero-sum games (Zhang et al., 21 Jan 2025). The paper begins from the observation that MARL suffers from instability and nonstationarity, and that continuous-time replicator dynamics, although central in evolutionary game theory (EGT) analyses of MARL, produce closed orbits around Nash equilibria in null-stable games. Time-averaging can sometimes recover convergence, but it adapts poorly to environmental changes and can diffuse toward the simplex boundary.

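ERID instead realizes innovative dynamics through replay-averaged rewards. In standard notation, with $x_i$ the probability of action $i$, $u_i(x)$ its expected payoff, and $\bar{u}(x) = \sum_j x_j u_j(x)$ the average payoff, replicator dynamics are written as

$$\dot{x}_i = x_i \left( u_i(x) - \bar{u}(x) \right).$$

Brown–von Neumann–Nash dynamics use positive-part excess payoffs,

$$\dot{x}_i = \left[ u_i(x) - \bar{u}(x) \right]_+ - x_i \sum_j \left[ u_j(x) - \bar{u}(x) \right]_+,$$

and Smith dynamics use pairwise positive-part comparisons,

$$\dot{x}_i = \sum_j x_j \left[ u_i(x) - u_j(x) \right]_+ - x_i \sum_j \left[ u_j(x) - u_i(x) \right]_+,$$

where $[z]_+ = \max(z, 0)$.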

The paper states that both BNN and Smith dynamics are known to converge globally to equilibrium sets in null-stable or zero-sum games.
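The contrast between closed replicator orbits and convergent innovative dynamics can be checked numerically. The sketch below uses the textbook single-population forms of the three dynamics on standard Rock–Paper–Scissors and integrates them with a plain Euler scheme; the payoff matrix, step size, horizon, and starting point are choices made for this illustration, not settings from the paper.

```python
# Row player's payoffs in standard Rock-Paper-Scissors (zero-sum, symmetric).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]


def payoffs(x):
    """Expected payoff of each action against population state x."""
    return [sum(RPS[i][j] * x[j] for j in range(3)) for i in range(3)]


def replicator(x):
    u = payoffs(x)
    avg = sum(xi * ui for xi, ui in zip(x, u))
    return [xi * (ui - avg) for xi, ui in zip(x, u)]


def bnn(x):
    u = payoffs(x)
    avg = sum(xi * ui for xi, ui in zip(x, u))
    excess = [max(ui - avg, 0.0) for ui in u]      # positive-part excess payoffs
    total = sum(excess)
    return [e - xi * total for e, xi in zip(excess, x)]


def smith(x):
    u = payoffs(x)
    n = len(x)
    return [
        sum(x[j] * max(u[i] - u[j], 0.0) for j in range(n))     # inflow to i
        - x[i] * sum(max(u[j] - u[i], 0.0) for j in range(n))   # outflow from i
        for i in range(n)
    ]


def integrate(field, x0, dt=0.005, steps=40_000):
    """Forward-Euler integration of dx/dt = field(x)."""
    x = list(x0)
    for _ in range(steps):
        v = field(x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def distance_to_uniform(x):
    return max(abs(xi - 1.0 / 3.0) for xi in x)


x0 = [0.6, 0.3, 0.1]
d_replicator = distance_to_uniform(integrate(replicator, x0))
d_bnn = distance_to_uniform(integrate(bnn, x0))
d_smith = distance_to_uniform(integrate(smith, x0))
print(d_replicator, d_bnn, d_smith)
```

From the same starting point, the BNN and Smith trajectories end close to the uniform equilibrium, while the replicator trajectory keeps circling at a distance. The Euler scheme slightly inflates the replicator orbit over time, so the replicator result is a qualitative contrast rather than a precise orbit.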

ERID implements these dynamics via the generic replay-based policy update

[equation missing in source]

A replay buffer of fixed size stores the most recent rewards, action index sets partition the buffered timesteps by action, and the replay average for each action is the mean of the buffered rewards in its index set:

[equation missing in source]

Choosing the BNN revision protocol in this update yields ERID-BNN, while choosing the Smith protocol yields ERID-Smith. A constrained Smith–replicator-based pairwise revision protocol introduces bounds on policy coordinates.
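The following Python sketch illustrates the general idea of a replay-averaged BNN-style update; it is not the paper's exact update rule. The per-action replay average is taken as the mean of buffered rewards for that action, unseen actions default to an average of 0.0, and the policy moves one Euler-style step along the BNN direction with those averages standing in for payoffs. The function names, the default for unseen actions, and the step size are all assumptions of this example.

```python
from collections import deque


def replay_averages(buffer, n_actions):
    """Mean buffered reward per action; 0.0 for actions absent from the buffer
    (an assumption of this sketch)."""
    totals = [0.0] * n_actions
    counts = [0] * n_actions
    for action, reward in buffer:
        totals[action] += reward
        counts[action] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def bnn_replay_step(policy, avg_rewards, step):
    """One discrete step along the BNN direction using replay averages.

    Keeps the policy on the simplex as long as step * (sum of positive
    excesses) stays below 1.
    """
    mean = sum(p * r for p, r in zip(policy, avg_rewards))
    excess = [max(r - mean, 0.0) for r in avg_rewards]
    total = sum(excess)
    return [p + step * (e - p * total) for p, e in zip(policy, excess)]


# FIFO buffer of (action, reward) pairs; maxlen=3 evicts the oldest entry.
buffer = deque(maxlen=3)
for action, reward in [(1, 5.0), (0, 1.0), (0, 3.0), (1, -1.0)]:
    buffer.append((action, reward))

averages = replay_averages(buffer, n_actions=3)
policy = [1.0 / 3.0] * 3
new_policy = bnn_replay_step(policy, averages, step=0.1)
print(averages, new_policy)
```

The bounded buffer is what lets the averages forget stale rewards: the early reward of 5.0 for action 1 has already been evicted, so the update favors action 0.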

The theoretical statements are explicit. Theorem 1, Theorem 2, and Theorem 3 show that, for two-player normal-form games and under two stated conditions [conditions missing in source], ERID trajectories converge to the trajectories of the corresponding BNN, Smith, and constrained Smith–replicator-based pairwise dynamics. The proof intuition is that replay averages approximate expected payoffs under the current policies, and the stochastic discrete-time updates then track the deterministic ODEs.

The empirical demonstrations use NashConv and relative NashConv. In stationary validation, ERID-BNN and ERID-Smith match the corresponding ODE trajectories in Matching Pennies and Biased Rock–Paper–Scissors. In nonstationary RPS, where payoff scaling changes every 3000 iterations, ERID-BNN quickly re-centers around the new equilibrium while replicator-based learners slow dramatically due to cumulative averaging bias. Under smoothly varying nonstationarity, ERID-Smith adapts faster but oscillates more, whereas ERID-BNN converges more slowly but tends to stay closer to equilibrium in the long run. The practical recommendation follows directly: use BNN for smoother convergence and better long-run proximity to equilibrium in zero-sum or null-stable settings, Smith when rapid adaptation is critical, and constrained Smith–replicator when action proportions must satisfy bounds.
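NashConv, the first of the two metrics used above, is commonly defined as the sum over players of the gain available from switching to a best response. A minimal Python sketch for two-player matrix games follows; the function names are illustrative, and relative NashConv is not shown because its normalization is not given here.

```python
def expected_payoff(payoff, x, y):
    """Expected payoff under mixed strategies x (row player) and y (column player)."""
    return sum(x[i] * payoff[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def nash_conv(A, B, x, y):
    """Sum of both players' best-response gains; zero exactly at a Nash equilibrium.

    A[i][j] and B[i][j] are the row and column player's payoffs when the row
    player picks i and the column player picks j.
    """
    best_row = max(sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(x)))
    best_col = max(sum(B[i][j] * x[i] for i in range(len(x))) for j in range(len(y)))
    gain_row = best_row - expected_payoff(A, x, y)
    gain_col = best_col - expected_payoff(B, x, y)
    return gain_row + gain_col


# Matching Pennies: the row player wants to match, the column player to mismatch.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]

at_equilibrium = nash_conv(A, B, [0.5, 0.5], [0.5, 0.5])
at_pure_profile = nash_conv(A, B, [1.0, 0.0], [1.0, 0.0])
print(at_equilibrium, at_pure_profile)
```

At the mixed equilibrium neither player can gain by deviating, so the value is zero; at the pure profile the column player can flip its action and swing its payoff from −1 to +1, which gives a NashConv of 2.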

6. Experience replayable conditions and replay stabilization in on-policy RL

A separate but closely related replay-centered line re-examines the assumption that experience replay is applicable only to off-policy algorithms. “Revisiting Experience Replayable Conditions” argues for stricter experience replayable conditions, or ERC, and identifies instability of policy improvements as a pivotal factor (Kobayashi, 2024). Off-policy algorithms such as SAC naturally satisfy ERC because their acceptable set coincides with all empirical data. For on-policy algorithms such as A2C, ERC requires controlling instability and mitigating distribution shift so that replayed data behaves as if on-policy.

The paper’s control-as-inference analysis defines optimality likelihoods through value functions and derives positive and negative policies around the behavior policy. Policy improvement is expressed as a triplet-like objective,

[equation missing in source]

Two instability factors then emerge: repulsive forces from negative samples and replays of inappropriate experiences. The corresponding stabilization tricks are counteraction of deviations from non-optimal policies and mining of indistinguishable experiences.

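On-policyness is measured by the density-ratio-based quantity

[equation missing in source]

and a discriminator is trained with

[equation missing in source]

Counteraction uses

[equation missing in source]

with PI-controlled gain

[equation missing in source]

Mining applies the stochastic dropout rule

[equation missing in source]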

These mechanisms are used to modify A2C. The value loss is left unchanged, the policy loss is replaced by a modified objective, and replayed transitions store behavior likelihoods at collection time [loss expressions missing in source]. The implementation uses target networks for both of its learned functions, CAT-soft updates, AdaTerm, a FIFO replay buffer, a fixed batch size, and a fixed replay ratio per episode [hyperparameter values missing in source]. In the reported ablations over 12 seeds on Reacher, Hopper, and HalfCheetah, only the condition with both tricks enabled achieved stable learning on all three MuJoCo tasks. In dm_control, SAC solved QuadrupedWalk at a high level of performance while ERC-enabled A2C performed less consistently, whereas on Swimmer15D the ERC-enabled variant achieved high performance and SAC did not consistently solve the task.

Taken together, these replay-centered lines show that “EvoReplay” names a broader methodological orientation rather than a single canonical system. In evolutionary coding, replay is a diagnostic instrument for decomposing score gains into structural, parametric, and evaluator-specific mechanisms. In MARL, replay is the substrate by which empirical rewards instantiate innovative evolutionary dynamics. In ERC-aware on-policy RL, replay becomes viable only after explicit control of hard-negative repulsion and distribution shift. The common lesson is that replay is not merely a storage convenience: it determines what trajectories can be reconstructed, what dynamics can be realized, and what forms of stability or overfitting can be measured.

References (3)
