Core War: Adversarial Program Evolution
- Core War is a Turing-complete adversarial game where assembly-like programs, known as warriors, contest control of a virtual machine.
- Recent work applies self-play strategies, specifically the Digital Red Queen (DRQ) algorithm with MAP-Elites, to continually evolve robust and adaptive warriors.
- Empirical studies demonstrate convergent evolution toward generalist strategies, validated by human-designed benchmarks and diverse performance metrics.
Core War is a computational environment and competitive programming game in which assembly-like programs, known as warriors, compete for control of a virtual machine. Originating in the field of artificial life, Core War offers a Turing-complete, fully sandboxed testbed for studying adversarial program evolution, self-play, and the dynamics of continual adaptation. The environment has found renewed significance as a model for open-ended adversarial processes and as a benchmark for evolutionary algorithms, including those driven by LLMs (Kumar et al., 6 Jan 2026).
1. Formalizing the Adversarial Objective in Core War
Traditional evolutionary or program synthesis frameworks typically employ static optimization—searching for a solution that maximizes a fixed fitness function. In contrast, Core War research, and specifically the Digital Red Queen (DRQ) algorithm, instantiates a continually shifting adversarial “Red Queen” arms race. Rather than optimizing against a single static objective, each new warrior is evolved to outperform an ever-expanding set of prior champions. Formally:
$$w_t = \arg\max_{w} \; \mathbb{E}\big[\mathrm{Fitness}(w;\; \{w_0, w_1, \ldots, w_{t-1}\})\big],$$

where $w_0$ denotes the seed warrior, $w_1, \ldots, w_{t-1}$ are previous champions, and the expectation averages over randomized battle initializations. In contrast, static-target optimization seeks

$$w^{*} = \arg\max_{w} \; \mathbb{E}\big[\mathrm{Fitness}(w;\; w_{\text{fixed}})\big]$$

with a fixed opponent $w_{\text{fixed}}$.
Fitness in DRQ is context-dependent and based on survival and elimination: in an $N$-way match of up to $T$ timesteps, each living warrior shares the per-timestep reward equally with the other survivors, yielding

$$\mathrm{Fitness}(w_i) = \sum_{t=1}^{T} \frac{a_i(t)}{\sum_{j=1}^{N} a_j(t)},$$

where $a_i(t) = 1$ if warrior $i$ is alive at time $t$, and 0 otherwise.
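The survival-sharing rule above can be sketched directly. This is an illustrative implementation, assuming each match is summarized as a boolean matrix where `alive[t][i]` indicates whether warrior `i` still has a live process at timestep `t` (the matrix representation is an assumption, not the paper's interface):

```python
def survival_fitness(alive: list[list[bool]]) -> list[float]:
    """Per-warrior fitness: each timestep's unit reward is split
    equally among the warriors still alive at that step."""
    if not alive:
        return []
    n = len(alive[0])
    scores = [0.0] * n
    for row in alive:
        living = sum(row)
        if living == 0:
            continue  # everyone dead: no reward to share at this step
        for i, is_alive in enumerate(row):
            if is_alive:
                scores[i] += 1.0 / living
    return scores
```

Note how the reward is zero-sum per timestep among survivors, which is what makes fitness context-dependent: the same warrior scores differently against different opponent sets.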
2. Algorithmic Structure: The Digital Red Queen Self-Play Loop
The DRQ algorithm is structured as a multilevel self-play loop operating over a sequence of evolutionary rounds. At each round, a new warrior is evolved to defeat the current population of opponents selected from history. Optimization within a round uses MAP-Elites to preserve quality-diversity. The high-level pseudocode is summarized below:
```
Input: initial champion w₀, rounds T, MAP-Elites grid C, history length K
History H = [w₀]
for t in 1…T:                            # Outer (Red Queen) loop
    Opponents = last K warriors in H     # Select recent champions
    A.initialize_empty(C)                # Fresh MAP-Elites archive
    for each w in Opponents: A.add_elite(w)
    for iter in 1…I:                     # Inner evolutionary search
        w_parent = A.random_cell_sample()
        w_child  = LLM_mutate(w_parent)
        f  = average_{s=1…S} Fitness(w_child; Opponents, seed=s)
        bd = BD(w_child)                 # Behavior descriptor
        A.try_update_cell(bd, w_child, f)
    wₜ = A.get_overall_best()            # Select round champion
    append wₜ to H
end for
Output: lineage H = [w₀, w₁, …, w_T]
```
Key hyperparameters include the total number of rounds (e.g., 10), the number of inner iterations (e.g., 1,000), opponent-history length (e.g., 1, 3, or all previous), and the number of stochastic seeds per evaluation (e.g., 20) (Kumar et al., 6 Jan 2026).
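The archive operations used in the inner loop can be sketched as a minimal MAP-Elites grid keyed by discretized behavior descriptors. The class and parameter names below (`Archive`, `bin_size`) are illustrative, not taken from the paper:

```python
import random

class Archive:
    """Minimal MAP-Elites archive: one elite per discretized cell."""

    def __init__(self, bin_size: float = 10.0):
        self.bin_size = bin_size
        self.cells: dict[tuple, tuple] = {}  # cell -> (elite, fitness)

    def _cell(self, bd: tuple) -> tuple:
        # Discretize a continuous behavior descriptor into a grid cell.
        return tuple(int(x // self.bin_size) for x in bd)

    def try_update_cell(self, bd, elite, fitness):
        """Keep the candidate only if its cell is empty or it beats the
        incumbent's fitness -- the core MAP-Elites update rule."""
        cell = self._cell(bd)
        if cell not in self.cells or fitness > self.cells[cell][1]:
            self.cells[cell] = (elite, fitness)

    def random_cell_sample(self):
        # Uniform over occupied cells, so low-fitness niches still breed.
        return random.choice(list(self.cells.values()))[0]

    def get_overall_best(self):
        return max(self.cells.values(), key=lambda ef: ef[1])[0]
```

Sampling parents uniformly over occupied cells (rather than by fitness) is what preserves behavioral diversity within a round.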
3. Warrior Representation and LLM Mutation
Core War warriors are encoded in Redcode, a low-level language with approximately 20 opcodes (such as DAT, SPL, MOV, ADD), a set of modifiers, and diverse addressing modes. In the DRQ framework:
- New generation: An LLM (GPT-4.1-mini) is prompted with the Redcode specification and tasked with producing a novel warrior.
- Mutation: Given a parent warrior, the LLM receives the parent’s Redcode and is asked to produce a variant aimed at improved performance.
No fine-tuning is applied; the model leverages its pretrained knowledge augmented by context-specific instructions. The LLM thus samples from a conditional distribution $p(w_{\text{child}} \mid w_{\text{parent}}, \text{spec})$, guiding search over the high-dimensional space of Redcode programs.
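The conditioning structure can be sketched as prompt construction. The prompt wording and the abridged specification string below are illustrative assumptions; the paper's exact prompts are not reproduced here:

```python
# Abridged stand-in for the full Redcode language reference that the
# LLM receives as context (illustrative, not the actual spec text).
REDCODE_SPEC = "Opcodes: MOV, ADD, SUB, SPL, DAT, JMP, ... with modifiers and addressing modes."

def build_mutation_prompt(parent_redcode: str) -> str:
    """Condition the LLM on the language spec and the parent program,
    i.e. sample from p(child | parent, spec)."""
    return (
        "You are writing Core War warriors in Redcode.\n"
        f"Language reference:\n{REDCODE_SPEC}\n\n"
        "Here is the parent warrior:\n"
        f"{parent_redcode}\n\n"
        "Produce a mutated variant likely to perform better. "
        "Output only valid Redcode."
    )
```

Swapping the parent block for a bare "write a novel warrior" instruction gives the new-generation variant of the same scheme.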
4. Experimental Protocol and Evaluation Metrics
The Core War simulation in DRQ is specified as follows:
- Core size: 8,000 memory cells arranged circularly.
- Maximum timesteps: 80,000 per match.
- Thread cap: 8,000 per warrior.
- Program constraint: 100 instructions.
- Placement: Warriors seeded 100 cells apart, evaluated over 20 random initializations.
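The circular-core geometry in this protocol can be sketched as follows; addresses wrap modulo the core size, and initial offsets are spaced accordingly. This is an illustrative sketch of the address arithmetic, not the simulator's actual code:

```python
CORE_SIZE = 8000  # circular memory, matching the DRQ protocol

def wrap(addr: int) -> int:
    """All addresses are taken modulo the core size (circular memory)."""
    return addr % CORE_SIZE

def place_warriors(n: int, spacing: int = 100) -> list[int]:
    """Illustrative starting offsets for n warriors seeded `spacing`
    cells apart on the circular core."""
    return [wrap(i * spacing) for i in range(n)]
```

Because the core is circular, relative addressing means a warrior's code behaves identically wherever it is placed, which is why randomized placement probes robustness rather than position-specific tricks.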
Baseline performance is assessed using a held-out set of 317 human-designed warriors. Generality of a warrior $w$ is the fraction of these human opponents defeated or tied by $w$ in zero-shot 1-on-1 matches.
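The generality metric reduces to a simple fraction. In this sketch, `match_result` is an assumed callback returning "win", "tie", or "loss" from the candidate warrior's perspective:

```python
def generality(w, human_benchmarks, match_result) -> float:
    """Fraction of human benchmark warriors that w defeats or ties
    in zero-shot 1-on-1 matches."""
    ok = sum(1 for h in human_benchmarks
             if match_result(w, h) in ("win", "tie"))
    return ok / len(human_benchmarks)
```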
MAP-Elites employs two-dimensional behavior descriptors $\mathrm{BD}(w) = (b_1, b_2)$, where:
- $b_1$: total number of threads spawned (via SPL) by $w$.
- $b_2$: number of unique memory addresses written or read.
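The two descriptors can be computed from an execution trace. The `(opcode, address)` event format below is an assumption made for illustration; the simulator's real trace interface may differ:

```python
def behavior_descriptor(events: list[tuple[str, int]]) -> tuple[int, int]:
    """(threads spawned via SPL, count of unique addresses touched)."""
    spawns = sum(1 for op, _ in events if op == "SPL")
    touched = {addr for _, addr in events}
    return (spawns, len(touched))
```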
Additional metrics include:
- Phenotype: Warrior’s fitness vector against all 317 human benchmarks.
- Genotype: Text embedding of Redcode via OpenAI text-embedding-3.
- Across-run diversity: Principal component and variance analyses on phenotype/genotype.
- Cycle counts: Number of triplets forming rock–paper–scissors cycles in dominance relations.
- Rate of change: the distance between consecutive champions’ phenotype vectors, $\Delta_t = \lVert \phi(w_t) - \phi(w_{t-1}) \rVert$.
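The rate-of-change metric over a lineage can be sketched as consecutive distances between phenotype vectors. Treating the norm as Euclidean is an assumption for illustration:

```python
import math

def change_rate(phenotypes: list[list[float]]) -> list[float]:
    """Distance between consecutive champions' phenotype vectors
    (fitness profiles over the benchmark set)."""
    return [
        math.dist(phenotypes[t], phenotypes[t - 1])
        for t in range(1, len(phenotypes))
    ]
```

A declining sequence of these distances is what the paper reads as phenotypic convergence across rounds.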
5. Empirical Observations and Analysis
A. Static vs. Red Queen Baselines
In one-round (static) settings:
- Zero-shot LLM: Defeats of 294 human warriors.
- Best-of-8: Defeats .
- Evolved specialist collectives: can defeat .
- Single evolved specialist: Defeats on average.
B. Dynamics of Continual DRQ
Across 96 runs of multi-round DRQ:
- Generality of warriors increases monotonically with round index, following an approximately log-linear trend.
- Across-run phenotypic variance and phenotype-change rates decrease as DRQ progresses, while genotypic variance remains roughly stable.
- This pattern suggests convergent evolution toward generalist strategies—distinct underlying codes, but increasingly similar phenotypic profiles.
C. Opponent History Length
- K = 1 (last opponent only): many dominance cycles observed.
- K = 3 or full history: cycles are reduced; the arms race stabilizes.
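Counting rock–paper–scissors triplets in a dominance relation can be sketched as follows, assuming `beats[i][j]` is `True` when warrior `i` beats warrior `j` (the matrix representation is an assumption):

```python
from itertools import combinations

def count_cycles(beats: list[list[bool]]) -> int:
    """Count unordered triplets {i, j, k} whose dominance relations
    form a cycle (i beats j beats k beats i, in either orientation)."""
    n = len(beats)
    cycles = 0
    for i, j, k in combinations(range(n), 3):
        if (beats[i][j] and beats[j][k] and beats[k][i]) or \
           (beats[j][i] and beats[k][j] and beats[i][k]):
            cycles += 1
    return cycles
```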
D. Quality-Diversity Role
Replacing MAP-Elites with a greedy single-cell approach degrades champion quality, especially in later rounds. This highlights the necessity of intra-round exploration and diversity preservation in adversarial evolution.
E. Code-Generality Predictiveness
Linear regression from text embeddings to final generality achieves nontrivial predictive accuracy, evidencing structure in the code underlying robustness and offering a foothold for future surrogate modeling and interpretability.
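The probe's fit-and-score structure can be sketched with ordinary least squares. The paper regresses from full text embeddings; the pure-Python, single-feature version below only illustrates the mechanics:

```python
def fit_ols_1d(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def r_squared(xs, ys, slope, intercept) -> float:
    """Coefficient of determination of the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```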
6. Broader Implications and Limitations
The DRQ approach positions Core War as a controllable, Turing-complete arena for adversarial program evolution. The algorithm demonstrates that even minimal self-play—sequential LLM-based generation, MAP-Elites-based search, and historical evaluation—yields robust generalist solutions. Advantages include clear parallels to cybersecurity arms races, safe sandboxed experimentation, and insight into open-ended adaptation (Kumar et al., 6 Jan 2026).
Several limitations are acknowledged:
- The linear lineage (one champion per round) does not fully recapitulate the diversity and concurrency of complex ecosystems.
- Computational expense may constrain scalability; surrogate-based fitness approximation may be required.
- Behavioral descriptor and core parameter choices may bias evolutionary search; domain-agnostic alternatives remain an open question.
Ultimately, Core War’s integration with modern LLM-driven and quality-diversity techniques illustrates the transition from brittle specialization under static objectives to robust generalism via continual adaptation, and suggests that similar minimal Red Queen dynamics may have application in cybersecurity defense, red-teaming, and even models of biological resistance.