EvoReplay: Evolutionary and MARL Replay Dynamics
- EvoReplay is a replay-based framework that reconstructs evolutionary coding trajectories to differentiate between structural innovations and parameter tuning.
- In multi-agent reinforcement learning, it implements experience replay with revision protocols to realize innovative dynamics like BNN and Smith dynamics.
- It also informs ERC-aware on-policy RL by applying stabilization tricks that control distribution shifts and mitigate negative update biases.
EvoReplay denotes a replay-based methodological label that appears in distinct technical contexts in recent arXiv work. In evolutionary coding, it is the experimental layer built on top of EvoTrace to answer the question of what evolutionary coding agents evolve as program scores improve over time; it reconstructs local search states behind score jumps and replays, perturbs, and retunes them under controlled interventions (Pelleriti et al., 19 May 2026). In multi-agent reinforcement learning, EvoReplay refers to the same algorithmic idea as Experience-replay Innovative Dynamics (ERID), an experience replay-based update rule that uses revision protocols to realize innovative evolutionary dynamics such as Brown–von Neumann–Nash and Smith dynamics (Zhang et al., 21 Jan 2025). A related replay-centered line formalizes “experience replayable conditions” and derives stabilization tricks that make experience replay applicable to Advantage Actor-Critic in on-policy settings (Kobayashi, 2024).
1. Terminological scope and shared structure
The term is therefore not tied to a single invariant algorithm. In one usage, EvoReplay is a diagnostic framework for analyzing evolutionary coding traces; in another, it is a policy-update mechanism for MARL grounded in evolutionary game theory. A related body of work on experience replayable conditions supplies a stricter criterion for when replay can be used without inducing instability or bias in on-policy learning (Kobayashi, 2024).
| Usage | Primary object | Core mechanism |
|---|---|---|
| EvoReplay in evolutionary coding | Search traces over candidate programs | Replay, perturbation, retuning, ablation |
| EvoReplay as ERID in MARL | Policy simplex over actions | Experience replay with revision protocols |
| ERC-aware replay design | Replayed transitions in on-policy RL | Counteraction and mining of indistinguishable experiences |
A shared structural motif is that replay transforms logged trajectories into objects of intervention. In evolutionary coding, replay is attached to the original evaluator harness so that alternative edits, constants, model substitutions, and prompt contexts remain directly comparable to the recorded trajectory. In MARL, replay supplies empirical average rewards that substitute for fitness in revision protocols, allowing discrete-time updates to track continuous-time innovative dynamics. This suggests a family resemblance: replay is used either diagnostically, to attribute gains and failure modes, or algorithmically, to shape the policy dynamics themselves.
2. EvoReplay in evolutionary coding: objectives, data substrate, and workflow
In “What Do Evolutionary Coding Agents Evolve?”, EvoReplay is explicitly diagnostic rather than purely performance-oriented. Its objective is to distinguish structural innovations from parametric tuning, measure recombination and refactoring versus net helpfulness, detect deterministic cycling, and separate evaluator-overfitting from generalization (Pelleriti et al., 19 May 2026). The motivation is that a headline best-in-run score may conflate genuine algorithmic innovation, re-tuning of known strategies, recombination of ideas the model already “knows,” and evaluator-specific overfitting.
The framework operates on EvoTrace, a unified dataset of evolutionary coding runs normalized to a replayable schema. Each candidate program is stored with its full source, its parent(s), an exact unified diff to its parent, the LLM prompt and retrieved context used to generate it, and the complete evaluator outputs, including logs, metrics, and errors. EvoTrace comprises 121 runs across four evolutionary frameworks—OpenEvolve, GEPA, EvoX, and ShinkaEvolve—over 16 tasks spanning 6 Python mathematical discovery problems and 10 C++ ALE-bench competitive programming tasks. Five LLMs generate mutations: deepseek-reasoner, claude-sonnet-4.6, claude-haiku-4.5, gemini-3-flash-preview, and deepseek-chat. Each run consists of 100 iterations, for an aggregate scale of 10,672 unique programs, 18,400 LLM calls, and 274.7M prompt tokens.
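The replayable schema described above can be pictured as one record per candidate program. The following is a minimal sketch, not the EvoTrace format itself: the class name `CandidateRecord`, its field names, and the helper `make_child` are hypothetical, and only the list of stored items (source, parents, unified diff, prompt and context, evaluator outputs) comes from the text.

```python
from dataclasses import dataclass, field
import difflib


@dataclass
class CandidateRecord:
    """One candidate program in a replayable trace (field names are hypothetical)."""
    program_id: str
    source: str                  # full program source
    parent_ids: list             # one or more parent identifiers
    prompt: str                  # LLM prompt plus retrieved context
    evaluator_output: dict = field(default_factory=dict)  # logs, metrics, errors
    diff_to_parent: str = ""     # unified diff against the parent


def make_child(parent, program_id, source, prompt, evaluator_output):
    """Build a child record and attach the unified diff to its parent."""
    diff = "".join(difflib.unified_diff(
        parent.source.splitlines(keepends=True),
        source.splitlines(keepends=True),
        fromfile=parent.program_id,
        tofile=program_id,
    ))
    return CandidateRecord(program_id, source, [parent.program_id],
                           prompt, evaluator_output, diff)


root = CandidateRecord("p0", "x = 1\n", [], "seed prompt", {"score": 0.1})
child = make_child(root, "p1", "x = 2\n", "mutate the constant", {"score": 0.2})
```

Storing the diff alongside the full source is redundant by design in this sketch: the source supports re-execution, while the diff supports edit-level analysis without recomputation.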
EvoReplay’s methodology has four capabilities. Static analysis of traces normalizes parent→child edits into a table of parent, child, prompt, score, and unified diff, enabling measures such as lineage depth, best-so-far timing, program length, numeric-literal counts, and deterministic cycling detection. LLM-as-judge edit annotation applies a nine-type taxonomy to every edit with batching, schema validation, caching, and re-annotation under alternative judge models. Bayesian optimization over a single program’s exposed constants estimates how much of an evolutionary gain is recoverable by tuning alone. Replay stability of breakthroughs re-executes the saved generating prompt for a candidate under the original or a substituted model and summarizes outcomes as parse success, evaluation success, and score conditional on success.
Algorithmically, the package is built on SkyDiscover, ingests EvoTrace JSONL tables, and assumes replayability under the original evaluator, byte-identical prompts and sources, deterministic diffing, and scalar or comparable evaluator scores. Typical 100-iteration runs produce about 100 programs, about 134 LLM calls, and about 2.2M tokens, while replay interventions are comparatively modest: replay uses 10 resamples per target, Bayesian optimization uses 24 calls per target, and static analysis is negligible relative to run generation.
3. Formalization, notation, and controlled interventions
EvoReplay formalizes programs as executable artifacts $p$ and evaluators as scoring functions $E : p \mapsto E(p) \in \mathbb{R}$, or task-specific metrics convertible to a scalar fitness. A run induces a search graph with nodes for candidates and directed edges for parent→child edits. The local search state at iteration $t$ is

$$S_t = (p_t,\; c_t,\; E,\; \Delta_t,\; M),$$

where $p_t$ is the current program, $c_t$ is the byte-identical prompt/context seen by the LLM, $E$ is the evaluator, $\Delta_t$ encodes the set of edit operations applied in the parent→child diff $d_t$, and $M$ denotes the generator model (Pelleriti et al., 19 May 2026).
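For diff-based analysis, $\mathrm{diff}(p_{\text{parent}}, p_{\text{child}})$ returns multisets $A$ and $R$ of added and removed lines. EvoReplay defines a literal recycling indicator, a tuning-recycling indicator that collapses numeric literals via a placeholder NUM, and a trivial-recycling indicator for whitespace or comment-only changes. Writing $A_k$ and $R_k$ for the added and removed lines of the $k$-th edit, the per-edit cycling rate is the share of added lines that re-introduce a previously deleted line,

$$\gamma_k = \frac{\bigl|\{\ell \in A_k : \ell \in \bigcup_{j<k} R_j\}\bigr|}{|A_k|},$$

summarized by the per-run median $\tilde{\gamma}$ and linear-fit slope $\beta$ over the iterations of the run. For score attribution, an edit edge $e$ with score change $\Delta s_e$ receives a multi-label set $L_e$ from the edit taxonomy, and run-level contribution is aggregated per label over the edges of a run. Helpfulness is summarized through odds ratios rather than means because score deltas are heavy-tailed and failure modes are bimodal.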
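The cycling measurement above can be sketched with the standard library. This is an illustrative reading, not the package's code: the function names, the regular expression used to collapse numeric literals into `NUM`, and the choice to count a line as tuning-recycled only when it is not already a literal match are all assumptions, and the trivial (whitespace or comment-only) class is left out for brevity.

```python
import difflib
import re

# Assumed pattern for numeric literals that are not part of an identifier.
NUM = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def added_removed(parent, child):
    """Return (added, removed) line lists from a parent→child line diff."""
    added, removed = [], []
    for line in difflib.ndiff(parent.splitlines(), child.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


def skeleton(line):
    """Collapse numeric literals so lines differing only in constants match."""
    return NUM.sub("NUM", line.strip())


def cycling_rates(versions):
    """Per-edit (literal, tuning) shares of added lines seen among earlier deletions."""
    deleted, deleted_skeletons = set(), set()
    rates = []
    for parent, child in zip(versions, versions[1:]):
        added, removed = added_removed(parent, child)
        added = [l for l in added if l.strip()]          # skip blank lines
        literal = sum(l in deleted for l in added)
        tuning = sum(l not in deleted and skeleton(l) in deleted_skeletons
                     for l in added)
        n = len(added)
        rates.append((literal / n, tuning / n) if n else (0.0, 0.0))
        # Update history only after scoring, so an edit never matches itself.
        deleted.update(removed)
        deleted_skeletons.update(skeleton(l) for l in removed)
    return rates


def slope(ys):
    """Least-squares slope of ys against the iteration index 0..n-1."""
    n = len(ys)
    mx, my = (n - 1) / 2, sum(ys) / n
    den = sum((x - mx) ** 2 for x in range(n))
    return sum((x - mx) * (y - my) for x, y in enumerate(ys)) / den if den else 0.0


versions = ["a = 1\nstep = 0.1\n", "a = 1\nstep = 0.2\n",
            "a = 1\nstep = 0.1\n", "a = 1\nstep = 0.3\n"]
rates = cycling_rates(versions)
```

In this toy lineage the second edit restores a deleted line verbatim (literal recycling), while the third introduces a new constant on a previously deleted skeleton (tuning recycling).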
The structural-versus-parametric decomposition writes a program as $p = (\sigma, \theta)$, with structure $\sigma$ and exposed hyperparameters $\theta$. A single-program tuning ceiling is

$$s^{*}(\sigma) = \max_{\theta} E(\sigma, \theta),$$

where $E$ is the evaluator score. It is implemented with 24 evaluator calls, specifically 8 random starts and 16 Bayesian-optimization acquisitions over a Gaussian-process surrogate. The tuning gap is the difference between this ceiling and the run’s final-best score.
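To make the 8 + 16 evaluation budget concrete, here is a small pure-Python sketch of a budgeted tuning search. Only the budget split and the use of a Gaussian-process surrogate come from the text; the RBF kernel, its length scale, the upper-confidence-bound acquisition, the random candidate pool, the unit-box parameter domain, and every function name are assumptions made for illustration.

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two parameter vectors."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2 * length ** 2))


def cholesky(K):
    """Lower-triangular Cholesky factor of a small symmetric matrix."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(X, y, noise=1e-4):
    """Fit a GP with constant mean to (X, y); return predict(x) -> (mean, std)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    mean_y = sum(y) / len(y)
    alpha = chol_solve(L, [v - mean_y for v in y])

    def predict(x):
        k = [rbf(a, x) for a in X]
        mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
        var = 1.0 - sum(ki * vi for ki, vi in zip(k, chol_solve(L, k)))
        return mu, math.sqrt(max(var, 0.0))

    return predict


def tuning_ceiling(evaluate, dim, n_random=8, n_bo=16, pool=64, seed=0):
    """Approximate the max of evaluate over [0, 1]^dim in n_random + n_bo calls."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(dim)] for _ in range(n_random)]
    y = [evaluate(x) for x in X]
    for _ in range(n_bo):
        predict = fit_gp(X, y)

        def ucb(c):
            mu, sd = predict(c)
            return mu + 2.0 * sd      # favour high mean or high uncertainty

        candidates = [[rng.random() for _ in range(dim)] for _ in range(pool)]
        x_next = max(candidates, key=ucb)
        X.append(x_next)
        y.append(evaluate(x_next))
    best = max(range(len(y)), key=y.__getitem__)
    return X[best], y[best], len(y)


# Toy stand-in for an evaluator: a single exposed constant with optimum at 0.3.
theta, best_score, calls = tuning_ceiling(lambda t: -(t[0] - 0.3) ** 2, dim=1)
```

In a real setting `evaluate` would rewrite the exposed constants into the fixed program structure and run the original evaluator; the toy quadratic here only exercises the budget logic.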
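Replay reproducibility is summarized by a four-part tuple $(f_{\mathrm{run}}, f_{\mathrm{eval}}, f_{\mathrm{exact}}, s_{\mathrm{ok}})$, where $f_{\mathrm{run}}$ is the fraction of replays producing runnable code, $f_{\mathrm{eval}}$ is the fraction that pass the evaluator, $f_{\mathrm{exact}}$ is the fraction matching the original program byte-for-byte, and $s_{\mathrm{ok}}$ is the score of replayed candidates conditional on evaluation success.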
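A replay summary of this kind reduces to a few counts over the resampled candidates. The sketch below is one possible bookkeeping, with hypothetical names (`ReplayOutcome`, `summarize_replays`) and with the mean chosen as the aggregate for the conditional score, which is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ReplayOutcome:
    source: str              # code produced by re-running the saved prompt
    runnable: bool           # did the output parse and run?
    score: Optional[float]   # evaluator score, or None if evaluation failed


def summarize_replays(original_source, outcomes):
    """Fractions of runnable, evaluated, and byte-identical replays, plus score given success."""
    n = len(outcomes)
    ok_scores = [o.score for o in outcomes if o.runnable and o.score is not None]
    return {
        "runnable": sum(o.runnable for o in outcomes) / n,
        "eval_pass": len(ok_scores) / n,
        "exact_match": sum(o.source == original_source for o in outcomes) / n,
        "score_given_success": mean(ok_scores) if ok_scores else None,
    }


outcomes = [
    ReplayOutcome("variant a", True, 0.8),
    ReplayOutcome("variant b", True, None),     # ran, but the evaluator rejected it
    ReplayOutcome("original", True, 1.0),       # byte-identical reproduction
    ReplayOutcome("variant c", False, None),    # did not parse
]
summary = summarize_replays("original", outcomes)
```

Keeping exact match separate from score recovery is what allows a breakthrough to be called structurally reproducible even when no replay reproduces the original text.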
These abstractions support several interventions. Parametric retuning uses one LLM call to identify tunable numeric literals and rewrites the program to expose them through a PARAMS block or #define macros. Pruning and repair remove phases or guards, then retune or add missing guard checks and sentinel validation. Model or context substitution re-executes the exact saved context under the original generator model or a substituted one. Label-guided transformations alter external dependencies or efficiency-critical primitives. The reported examples are concrete: on a Heilbronn placement program, BO on an intermediate structure recovered and exceeded the run’s final-best, 0.886 versus 0.521; one ALE program gained +56.2 rating points by checking INF_COST and function success; and both deleting a final “global shake” phase while retuning annealing constants and introducing jax and optax for a numerical optimizer are reported as score-improving interventions.
4. Empirical findings in evolutionary coding
EvoReplay’s central empirical contribution is to show that score improvement mechanisms are heterogeneous and not reducible to end-of-run best fitness (Pelleriti et al., 19 May 2026). The edit taxonomy contains nine recurring edit types: Hyperparameter tuning, Local refinement, Architectural change, Composition, Efficiency, Bug fix, Pruning, Refactor, and External dependency. Edits are typically multi-label: 67.4% have at least two labels, 52.4% have exactly two, and 32.4% are single-label. The LLM-as-judge pipeline is validated against blind human re-annotation on 200 parent→child edits, with agreement reported as macro Cohen’s $\kappa$, micro-F1 = 0.90, and exact-match accuracy 74.5%.
By frequency, Hyperparameter tuning dominates the search distribution. By per-edit helpfulness, however, External dependency has an odds ratio of 3.58×, Efficiency 1.61×, and Architectural change 1.55×. Best-so-far updates and final-best lineages are enriched in Efficiency, External dependency, and Hyperparameter tuning relative to the base rate. The resulting frequency–utility gap indicates that evolutionary systems spend most of their effort on edits that rarely drive large improvements.
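The per-label helpfulness figures are odds ratios from a 2×2 table of label presence against edit outcome. A minimal sketch follows, assuming that "helpful" means a strictly positive score change and using a Haldane-Anscombe continuity correction of 0.5 by default; both choices, the function name, and the toy edits are illustrative assumptions, not the paper's procedure or data.

```python
def label_odds_ratio(edits, label, correction=0.5):
    """Odds of a helpful edit with `label` divided by the odds without it.

    `edits` is an iterable of (labels, score_delta) pairs, where `labels`
    is a set of taxonomy labels attached to one parent→child edit.
    """
    with_help = with_not = without_help = without_not = 0
    for labels, delta in edits:
        helpful = delta > 0          # assumed definition of "helpful"
        if label in labels:
            with_help += helpful
            with_not += not helpful
        else:
            without_help += helpful
            without_not += not helpful
    a, b = with_help + correction, with_not + correction
    c, d = without_help + correction, without_not + correction
    return (a / b) / (c / d)


# Synthetic edits for illustration only.
edits = (
    [({"External dependency"}, 0.1)] * 3
    + [({"External dependency", "Refactor"}, -0.2)]
    + [({"Hyperparameter tuning"}, 0.05)] * 2
    + [({"Hyperparameter tuning"}, 0.0)] * 6
)
ratio = label_odds_ratio(edits, "External dependency", correction=0.0)
```

Because edits are multi-label, each label gets its own table; an edit tagged with two labels contributes to the "with" row of both.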
Deterministic cycling is another major result. About 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. The cycling rate grows monotonically over the run in 118/121 cases, with median per-iteration slope +0.0030 and median 5 iterations between deletion and re-introduction. A three-way classifier separates literal, tuning-skeleton, and trivial forms; the median per-run tuning-skeleton share is 8%, with range 2–44%. The diff-vs-no-diff prompting mode is the strongest predictor: for example, deepseek-reasoner’s tuning share drops from 20% to 2% when diff generation is turned off.
Replay establishes that breakthroughs are structurally rather than lexically reproducible. Across 36 breakthroughs, medians are reported for the fraction of runnable replays, the fraction passing the evaluator, the fraction matching the original byte-for-byte, and the score conditional on success. Replays almost never reproduce the exact program, but typically recover most of the score from different code. The authors interpret this as evidence that score jumps reflect robust structural patterns in the prompt context rather than fragile lexical artifacts.
The tuning-gap analysis shows that, on math tasks, much of late-run improvement can be recovered post hoc by tuning a mid-run structure. With only 24 evaluator calls per target, Bayesian optimization improves in 22/36 cases and matches or exceeds the run’s final-best score on 13/15 intermediate programs, with median delta +0.025. The largest gain is the 1.70× improvement from 0.521 to 0.886 on a Heilbronn problem.
Evaluator overfitting is especially salient on ALE. Public scores are not the held-out judging metric. Re-scoring each run’s public best-so-far chain on AtCoder’s private test set shows that two of four frameworks overfit on at least 30% of the problems they were scored on, and the same problem can flip generalization sign between frameworks; the reported example is ahc024, with OpenEvolve at +1,606 private gain versus ShinkaEvolve at −1,610 despite public gains. Public best-so-far chains are therefore unreliable single-number summaries on ALE.
5. EvoReplay as Experience-replay Innovative Dynamics in MARL
In “Experience-replay Innovative Dynamics,” EvoReplay is identified with ERID, a replay-based MARL algorithm designed to move beyond the limits of replicator dynamics in null-stable games such as zero-sum games (Zhang et al., 21 Jan 2025). The paper begins from the observation that MARL suffers from instability and nonstationarity, and that continuous-time replicator dynamics, although central in EGT analyses of MARL, produce closed orbits around Nash equilibria in null-stable games. Time-averaging can sometimes recover convergence, but adapts poorly to environmental changes and can diffuse toward the simplex boundary.
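ERID instead realizes innovative dynamics through replay-averaged rewards. With $x_i$ the probability of action $i$, $u_i(x)$ its expected payoff, and $\bar{u}(x) = \sum_j x_j u_j(x)$ the average payoff, replicator dynamics are written as

$$\dot{x}_i = x_i \bigl(u_i(x) - \bar{u}(x)\bigr).$$

Brown–von Neumann–Nash dynamics use positive-part excess payoffs,

$$\dot{x}_i = [u_i(x) - \bar{u}(x)]_+ - x_i \sum_j [u_j(x) - \bar{u}(x)]_+,$$

and Smith dynamics use pairwise positive-part comparisons,

$$\dot{x}_i = \sum_j x_j\,[u_i(x) - u_j(x)]_+ - x_i \sum_j [u_j(x) - u_i(x)]_+ .$$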
The paper states that both BNN and Smith dynamics are known to converge globally to equilibrium sets in null-stable or zero-sum games.
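The qualitative difference between these dynamics can be checked numerically on standard Rock–Paper–Scissors. The sketch below integrates the textbook single-population forms of the three vector fields with explicit Euler steps; the step size, horizon, initial state, and function names are arbitrary illustrative choices, and Euler integration is only a rough stand-in for the continuous-time flows.

```python
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]


def payoffs(A, x):
    """Expected payoff of each action against population state x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def replicator(x, u):
    avg = sum(xi * ui for xi, ui in zip(x, u))
    return [xi * (ui - avg) for xi, ui in zip(x, u)]


def bnn(x, u):
    avg = sum(xi * ui for xi, ui in zip(x, u))
    excess = [max(ui - avg, 0.0) for ui in u]        # positive-part excess payoff
    total = sum(excess)
    return [e - xi * total for e, xi in zip(excess, x)]


def smith(x, u):
    n = len(x)
    return [sum(x[j] * max(u[i] - u[j], 0.0) for j in range(n))    # inflow to i
            - x[i] * sum(max(u[j] - u[i], 0.0) for j in range(n))  # outflow from i
            for i in range(n)]


def integrate(field, A, x0, dt=0.01, steps=20000):
    """Explicit Euler integration of dx/dt = field(x, payoffs(A, x))."""
    x = list(x0)
    for _ in range(steps):
        dx = field(x, payoffs(A, x))
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


def distance_to_uniform(x):
    return max(abs(xi - 1 / len(x)) for xi in x)


x0 = [0.6, 0.3, 0.1]
final = {f.__name__: integrate(f, RPS, x0) for f in (replicator, bnn, smith)}
```

Comparing `distance_to_uniform` across the three entries of `final` shows the BNN and Smith trajectories shrinking toward the mixed equilibrium, while the replicator trajectory keeps circling it.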
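ERID implements these dynamics via a generic replay-based policy update, which in standard revision-protocol form reads

$$x_i(t+1) = x_i(t) + \eta \Bigl[\sum_j x_j(t)\,\rho_{ji} - x_i(t) \sum_j \rho_{ij}\Bigr],$$

where $\eta$ is a step size and $\rho_{ij}$ is the revision protocol, evaluated on replay averages, that gives the rate of switching from action $i$ to action $j$. A replay buffer of size $M$ stores the last $M$ rewards, action index sets $\mathcal{T}_a$ partition those timesteps by action, and replay averages are

$$\bar{r}_a = \frac{1}{|\mathcal{T}_a|} \sum_{t \in \mathcal{T}_a} r_t .$$

Choosing $\rho_{ij} = \bigl[\bar{r}_j - \sum_k x_k \bar{r}_k\bigr]_+$ yields ERID-BNN, while $\rho_{ij} = [\bar{r}_j - \bar{r}_i]_+$ yields ERID-Smith. A constrained Smith–replicator-based pairwise revision protocol introduces lower and upper bounds on policy coordinates.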
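A replay-averaged revision-protocol update can be sketched in a few lines. This is a hedged illustration rather than the paper's algorithm: the inflow-minus-outflow update is the standard mean-dynamic form, and the treatment of actions absent from the buffer (average set to 0), the step size, the buffer size, and all function names are assumptions.

```python
import random
from collections import deque


def replay_averages(buffer, n_actions):
    """Mean stored reward per action; unseen actions default to 0.0 (assumption)."""
    totals, counts = [0.0] * n_actions, [0] * n_actions
    for action, reward in buffer:
        totals[action] += reward
        counts[action] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def bnn_protocol(x, rbar):
    """rho[i][j]: positive part of action j's excess over the policy average."""
    avg = sum(xi * ri for xi, ri in zip(x, rbar))
    gain = [max(r - avg, 0.0) for r in rbar]
    return [gain[:] for _ in x]


def smith_protocol(x, rbar):
    """rho[i][j]: positive part of the pairwise advantage of j over i."""
    return [[max(rj - ri, 0.0) for rj in rbar] for ri in rbar]


def erid_step(x, buffer, protocol, lr):
    """One policy update: inflow minus outflow under the chosen protocol."""
    n = len(x)
    rho = protocol(x, replay_averages(buffer, n))
    return [x[i] + lr * (sum(x[j] * rho[j][i] for j in range(n))
                         - x[i] * sum(rho[i][j] for j in range(n)))
            for i in range(n)]


def self_play(protocol, steps=2000, buffer_size=200, lr=0.01, seed=0):
    """Two learners in Matching Pennies, each with its own replay buffer."""
    rng = random.Random(seed)
    row_payoff = [[1.0, -1.0], [-1.0, 1.0]]
    policies = [[0.8, 0.2], [0.3, 0.7]]
    buffers = [deque(maxlen=buffer_size), deque(maxlen=buffer_size)]
    for _ in range(steps):
        a = rng.choices([0, 1], weights=policies[0])[0]
        b = rng.choices([0, 1], weights=policies[1])[0]
        r = row_payoff[a][b]
        buffers[0].append((a, r))       # row player's reward
        buffers[1].append((b, -r))      # column player's reward (zero-sum)
        policies = [erid_step(policies[k], buffers[k], protocol, lr)
                    for k in range(2)]
    return policies


bnn_policies = self_play(bnn_protocol)
smith_policies = self_play(smith_protocol)
```

Because inflow and outflow cancel in aggregate, each update keeps the policy on the simplex without an explicit projection, provided the step size is small relative to the reward range.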
The theoretical statements are explicit. Theorem 1, Theorem 2, and Theorem 3 show that, for two-player normal-form games and under the two conditions stated in the theorems, ERID trajectories converge to the trajectories of the corresponding BNN, Smith, and constrained Smith–replicator-based pairwise dynamics. The proof intuition is that replay averages approximate expected payoffs under the current policies, and the stochastic discrete-time updates then track the deterministic ODEs.
The empirical demonstrations use NashConv and relative NashConv. In stationary validation, ERID-BNN and ERID-Smith match the corresponding ODE trajectories in Matching Pennies and Biased Rock–Paper–Scissors. In nonstationary RPS, where payoff scaling changes every 3000 iterations, ERID-BNN quickly re-centers around the new equilibrium while replicator-based learners slow dramatically due to cumulative averaging bias. Under smoothly varying nonstationarity, ERID-Smith adapts faster but oscillates more, whereas ERID-BNN converges more slowly but tends to stay closer to equilibrium in the long run. The practical recommendation follows directly: use BNN for smoother convergence and better long-run proximity to equilibrium in zero-sum or null-stable settings, Smith when rapid adaptation is critical, and constrained Smith–replicator when action proportions must satisfy bounds.
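NashConv, used above as the distance-to-equilibrium metric, is the sum over players of how much each could gain by switching to a best response. A minimal sketch for a two-player zero-sum matrix game follows; the function name and the restriction to the zero-sum case are choices made here, and relative NashConv is not covered.

```python
def nash_conv(A, x, y):
    """Sum of both players' best-response gains; A is the row player's payoff matrix."""
    n, m = len(x), len(y)
    row_values = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    col_values = [-sum(x[i] * A[i][j] for i in range(n)) for j in range(m)]
    value = sum(x[i] * row_values[i] for i in range(n))   # row player's expected payoff
    row_gain = max(row_values) - value
    col_gain = max(col_values) - (-value)                 # column player earns -value
    return row_gain + col_gain


RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
rock = [1.0, 0.0, 0.0]
```

Here `nash_conv(RPS, uniform, uniform)` is zero because the uniform profile is the equilibrium, while `nash_conv(RPS, rock, rock)` is 2 because each player could gain 1 by switching to paper.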
6. Experience replayable conditions and replay stabilization in on-policy RL
A separate but closely related replay-centered line re-examines the assumption that experience replay is applicable only to off-policy algorithms. “Revisiting Experience Replayable Conditions” argues for stricter experience replayable conditions, or ERC, and identifies instability of policy improvements as a pivotal factor (Kobayashi, 2024). Off-policy algorithms such as SAC naturally satisfy ERC because their acceptable set coincides with all empirical data. For on-policy algorithms such as A2C, ERC requires controlling instability and mitigating distribution shift so that replayed data behaves as if on-policy.
The paper’s control-as-inference analysis defines optimality likelihoods through value functions and derives positive and negative policies around the behavior policy. Policy improvement is expressed as a triplet-like objective that pulls the learned policy toward the positive policy and pushes it away from the negative one.
Two instability factors then emerge: repulsive forces from negative samples and replays of inappropriate experiences. The corresponding stabilization tricks are counteraction of deviations from non-optimal policies and mining of indistinguishable experiences.
On-policyness is measured by a density-ratio-based quantity, and a discriminator is trained alongside it. Counteraction is applied with a PI-controlled gain, and mining applies a stochastic dropout rule to replayed experiences.
These mechanisms are used to modify A2C. The value loss is left unchanged, the policy loss is modified to incorporate the two tricks, and replayed transitions store behavior likelihoods at collection time. The implementation uses target networks for both the policy and the value function, CAT-soft updates, the AdaTerm optimizer, a FIFO replay buffer, and a fixed batch size and per-episode replay ratio. In the reported ablations over 12 seeds on Reacher, Hopper, and HalfCheetah, only the condition with both tricks enabled achieved stable learning on all three MuJoCo tasks. In dm_control, SAC solved QuadrupedWalk at a high level while ERC-enabled A2C performed less consistently, whereas on Swimmer15D the ERC-enabled variant achieved high performance and SAC did not consistently solve the task.
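The bookkeeping side of this design, a FIFO buffer whose transitions carry the behavior likelihood recorded at collection time, can be sketched as follows. The class and field names are hypothetical, and the plain likelihood ratio computed here is only a generic density ratio between the current and behavior policies; the paper's on-policyness measure, discriminator, counteraction gain, and mining rule are not reproduced.

```python
import math
import random
from collections import deque, namedtuple

# behavior_logp is the log-likelihood of the action under the policy that collected it.
Transition = namedtuple(
    "Transition", "state action reward next_state behavior_logp")


class FIFOReplay:
    """Fixed-capacity buffer that discards the oldest transition first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)


def density_ratio(current_logp, transition):
    """Likelihood of the stored action under the current policy relative to the behavior policy."""
    return math.exp(current_logp - transition.behavior_logp)


buffer = FIFOReplay(capacity=2)
for step, logp in enumerate([math.log(0.25), math.log(0.5), math.log(0.8)]):
    buffer.add(Transition(state=step, action=0, reward=1.0,
                          next_state=step + 1, behavior_logp=logp))

oldest = buffer.data[0]
ratio = density_ratio(math.log(0.5), oldest)
```

A ratio near 1 indicates that the stored action is about as likely under the current policy as it was at collection time, which is the sense in which replayed data can behave as if on-policy.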
Taken together, these replay-centered lines show that “EvoReplay” names a broader methodological orientation rather than a single canonical system. In evolutionary coding, replay is a diagnostic instrument for decomposing score gains into structural, parametric, and evaluator-specific mechanisms. In MARL, replay is the substrate by which empirical rewards instantiate innovative evolutionary dynamics. In ERC-aware on-policy RL, replay becomes viable only after explicit control of hard-negative repulsion and distribution shift. The common lesson is that replay is not merely a storage convenience: it determines what trajectories can be reconstructed, what dynamics can be realized, and what forms of stability or overfitting can be measured.