
Core War: Adversarial Program Evolution

Updated 8 January 2026
  • Core War is a Turing-complete adversarial game where assembly-like programs, known as warriors, contest control of a virtual machine.
  • It employs innovative self-play strategies—specifically the Digital Red Queen algorithm with MAP-Elites—to continually evolve robust and adaptive warriors.
  • Empirical studies demonstrate convergent evolution toward generalist strategies, validated by human-designed benchmarks and diverse performance metrics.

Core War is a computational environment and competitive programming game in which assembly-like programs, known as warriors, compete for control of a virtual machine. Originating in the field of artificial life, Core War offers a Turing-complete, fully sandboxed testbed for studying adversarial program evolution, self-play, and the dynamics of continual adaptation. The environment has found renewed significance as a model for open-ended adversarial processes and as a benchmark for evolutionary algorithms, including those driven by LLMs (Kumar et al., 6 Jan 2026).

1. Formalizing the Adversarial Objective in Core War

Traditional evolutionary or program synthesis frameworks typically employ static optimization—searching for a solution that maximizes a fixed fitness function. In contrast, Core War research, and specifically the Digital Red Queen (DRQ) algorithm, instantiates a continually shifting adversarial “Red Queen” arms race. Rather than optimizing against a single static objective, each new warrior is evolved to outperform an ever-expanding set of prior champions. Formally:

w_t = \arg\max_{w} \; \mathbb{E}_{\text{seed}}\left[ \mathrm{Fitness}(w; \{w_0, \dots, w_{t-1}\}) \right]

where w_0 denotes the seed warrior, \{w_1, \dots, w_{t-1}\} are the previous champions, and the expectation averages over randomized battle initializations. In contrast, static-target optimization seeks

w^* = \arg\max_{w} \; \mathbb{E}\left[ \mathrm{Fitness}(w; \{w_{\text{target}}\}) \right]

with a fixed opponent.

Fitness in DRQ is context-dependent and based on survival and elimination: in an N-way match lasting up to \mathcal{T} timesteps, a pool of N/\mathcal{T} points is shared among the living warriors at each timestep, yielding

\mathrm{Fitness}(i; \text{opponents}) = \sum_{\tau=1}^{\mathcal{T}} \frac{N}{\mathcal{T}} \cdot \frac{A_i(\tau)}{\sum_j A_j(\tau)}

where A_i(\tau) = 1 if warrior i is alive at time \tau, and 0 otherwise.
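The shared-reward fitness above can be computed directly from per-timestep alive flags. A minimal sketch (function and argument names are illustrative, not from the paper):

```python
def shared_fitness(alive, i, T):
    """Fitness of warrior i under the DRQ shared-reward rule.

    alive[tau][j] is 1 if warrior j is alive at timestep tau, else 0.
    At each timestep a pool of N/T points is split among the living
    warriors, so the points handed out over a full match sum to N.
    """
    N = len(alive[0])          # number of warriors in the match
    score = 0.0
    for row in alive:          # one row per simulated timestep
        living = sum(row)
        if living:             # nothing to split if everyone is dead
            score += (N / T) * row[i] / living
    return score
```

For example, in a two-warrior match of four timesteps where warrior 1 dies halfway through, warrior 0 collects 0.25 per shared step and 0.5 per solo step, totaling 1.5 of the 2.0 points available.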

2. Algorithmic Structure: The Digital Red Queen Self-Play Loop

The DRQ algorithm is structured as a multilevel self-play loop operating over a sequence of evolutionary rounds. At each round, a new warrior is evolved to defeat the current population of opponents selected from history. Optimization within a round uses MAP-Elites to preserve quality-diversity. The high-level pseudocode is summarized below:

Input: initial champion w₀, rounds T, MAP-Elites grid C, history length K
History H = [w₀]
for t in 1…T:                                  # Outer (Red Queen) loop
  Opponents = last K warriors in H             # Select recent champions
  A.initialize_empty(C)                        # MAP-Elites archive
  for each w in Opponents: A.add_elite(w)

  for iter in 1…I:                             # Inner evolutionary search
    w_parent = A.random_cell_sample()
    w_child = LLM_mutate(w_parent)
    f = average_{s=1…S} Fitness(w_child; Opponents, seed=s)
    bd = BD(w_child)                           # Behavior descriptor
    A.try_update_cell(bd, w_child, f)
    
  wₜ = A.get_overall_best()                    # Select champion
  append wₜ to H
end for
Output: lineage H = [w₀, w₁, …, w_T]

Key hyperparameters include the total number of rounds T (e.g., 10), the number of inner iterations I (e.g., 1,000), the opponent-history length K (e.g., 1, 3, or all previous champions), and the number of stochastic seeds S per evaluation (e.g., 20) (Kumar et al., 6 Jan 2026).
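The MAP-Elites archive used in the inner loop can be sketched as a grid keyed by a discretized behavior descriptor, where each cell keeps only its highest-fitness occupant. The class and method names below mirror the pseudocode but the binning scheme is an assumption:

```python
import random

class MapElitesArchive:
    """Minimal MAP-Elites grid keyed by a discretized 2-D descriptor.

    Each cell stores only its best (highest-fitness) warrior; sampling
    parents uniformly across occupied cells preserves diversity.
    """
    def __init__(self, bins):
        self.bins = bins          # number of bins per descriptor axis
        self.cells = {}           # (i, j) -> (fitness, warrior)

    def _cell(self, bd):
        # Clamp each descriptor coordinate into [0, bins - 1]
        return tuple(min(self.bins - 1, max(0, int(x))) for x in bd)

    def try_update_cell(self, bd, warrior, fitness):
        key = self._cell(bd)
        if key not in self.cells or fitness > self.cells[key][0]:
            self.cells[key] = (fitness, warrior)

    def random_cell_sample(self):
        return random.choice(list(self.cells.values()))[1]

    def get_overall_best(self):
        return max(self.cells.values(), key=lambda fw: fw[0])[1]
```

Two warriors whose descriptors fall in the same cell compete for that cell, while warriors in different cells coexist regardless of relative fitness, which is what keeps low-fitness but behaviorally novel stepping stones alive.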

3. Warrior Representation and LLM Mutation

Core War warriors are encoded in Redcode, a low-level language with approximately 20 opcodes (such as DAT, SPL, MOV, ADD), a set of modifiers, and diverse addressing modes. In the DRQ framework:

  • New generation: An LLM (GPT-4.1-mini) is prompted with the Redcode specification and tasked with producing a novel warrior.
  • Mutation: Given a parent w_{\text{parent}}, the LLM receives the parent's Redcode and is asked for a variant aimed at improved performance.

No fine-tuning is applied; the model leverages its pretrained knowledge augmented by context-specific instructions. The LLM thus samples from a conditional distribution P_{\mathrm{LLM}}(w \mid w_{\text{parent}}), guiding search over the high-dimensional space of Redcode programs.
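A mutation step of this kind amounts to assembling a prompt from the language specification and the parent program. The wording below is hypothetical (the paper's exact prompts are not reproduced here), and the LLM call itself is left abstract:

```python
def build_mutation_prompt(parent_redcode, spec_excerpt):
    """Assemble a mutation prompt for an LLM.

    Hypothetical phrasing: the point is only that the model is
    conditioned on (a) the Redcode specification and (b) the parent
    warrior, and asked for an improved variant.
    """
    return (
        "You are writing Core War warriors in Redcode.\n"
        f"Language reference:\n{spec_excerpt}\n\n"
        "Here is the parent warrior:\n"
        f"{parent_redcode}\n\n"
        "Produce a modified variant likely to perform better. "
        "Return only valid Redcode."
    )
```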

4. Experimental Protocol and Evaluation Metrics

The Core War simulation in DRQ is specified as follows:

  • Core size: 8,000 memory cells arranged circularly.
  • Maximum timesteps: 80,000 per match.
  • Thread cap: 8,000 per warrior.
  • Program constraint: \leq 100 instructions.
  • Placement: Warriors seeded \geq100 cells apart, evaluated over 20 random initializations.
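The simulation settings above can be collected into a single configuration mapping. The field names here are illustrative, not the simulator's actual option flags:

```python
# Match settings from the DRQ experimental protocol.
# Key names are illustrative; a real simulator (e.g., pMARS) uses
# its own option names for these quantities.
DRQ_COREWAR_CONFIG = {
    "core_size": 8000,        # circular memory cells
    "max_timesteps": 80000,   # match length cap
    "thread_cap": 8000,       # max concurrent threads per warrior
    "max_instructions": 100,  # program length limit
    "min_separation": 100,    # minimum cells between start positions
    "num_seeds": 20,          # random initializations per evaluation
}
```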

Baseline performance is assessed using a held-out set of 317 human-designed warriors. The generality of a warrior w is the fraction of these human opponents defeated or tied by w in zero-shot 1-on-1 matches.
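The generality metric is a simple ratio over per-opponent match outcomes. A sketch (outcome labels are assumed, not from the paper):

```python
def generality(results):
    """Fraction of human benchmark opponents defeated or tied.

    results: per-opponent outcomes for zero-shot 1-on-1 matches,
    each one of "win", "tie", or "loss" (labels assumed here).
    """
    if not results:
        return 0.0
    good = sum(1 for r in results if r in ("win", "tie"))
    return good / len(results)
```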

MAP-Elites employs behavior descriptors BD(w) = (\log(\#\text{spawned\_threads}), \log(\text{memory\_coverage})), where:

  • \#\text{spawned\_threads}: Total number of threads spawned (via SPL) by w.
  • \text{memory\_coverage}: Number of unique memory addresses written to or read by w.
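The two-dimensional descriptor can be computed from the two raw counts. The +1 offset below is an assumption to guard against log(0) for warriors that never spawn or touch memory; the paper's exact scaling may differ:

```python
import math

def behavior_descriptor(spawned_threads, memory_coverage):
    """Log-scaled 2-D behavior descriptor for MAP-Elites.

    The +1 offsets are an assumption here, keeping the descriptor
    finite for warriors with zero SPLs or zero memory accesses.
    """
    return (
        math.log(spawned_threads + 1),
        math.log(memory_coverage + 1),
    )
```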

Additional metrics include:

  • Phenotype: Warrior’s fitness vector against all 317 human benchmarks.
  • Genotype: Text embedding of Redcode via OpenAI text-embedding-3.
  • Across-run diversity: Principal component and variance analyses on phenotype/genotype.
  • Cycle counts: Number of (a, b, c) triplets forming rock–paper–scissors cycles in dominance relations.
  • Rate of change: \| \text{phenotype}(w_t) - \text{phenotype}(w_{t-1}) \|.
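Counting rock–paper–scissors triplets reduces to scanning all unordered triples of warriors for a directed 3-cycle in the dominance relation. A sketch under the assumption that dominance is recorded as a pairwise "beats" mapping:

```python
from itertools import combinations

def count_rps_cycles(beats):
    """Count unordered triplets (a, b, c) forming a dominance cycle:
    a beats b, b beats c, and c beats a (in either orientation).

    beats: dict mapping (x, y) -> True if x beats y (assumed format).
    """
    warriors = sorted({w for pair in beats for w in pair})
    cycles = 0
    for a, b, c in combinations(warriors, 3):
        fwd = beats.get((a, b)) and beats.get((b, c)) and beats.get((c, a))
        rev = beats.get((b, a)) and beats.get((c, b)) and beats.get((a, c))
        if fwd or rev:
            cycles += 1
    return cycles
```

A transitive triple (a beats b, b beats c, a beats c) contributes no cycle, which is exactly the distinction the stability analysis in Section 5C relies on.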

5. Empirical Observations and Analysis

A. Static vs. Red Queen Baselines

In one-round (static) settings:

  • Zero-shot LLM: Defeats \approx 1.7\% of 294 human warriors.
  • Best-of-8: Defeats \approx 22.1\%.
  • Evolved specialist collectives: Can defeat \approx 96.3\%.
  • Single evolved specialist: Defeats \approx 28\% on average.

B. Dynamics of Continual DRQ

Across 96 runs of multi-round DRQ:

  • Generality of warriors increases monotonically with round index t, with p \ll 0.001 under a log-linear fit.
  • Across-run phenotypic variance and phenotype-change rates decrease as DRQ progresses, while genotypic variance remains roughly stable.
  • This pattern suggests convergent evolution toward generalist strategies—distinct underlying codes, but increasingly similar phenotypic profiles.

C. Opponent History Length

  • K = 1 (last opponent only): Many dominance cycles observed.
  • K > 1 or full history: Reduces cycles by 77\%; the arms race stabilizes.

D. Quality-Diversity Role

Replacing MAP-Elites with a greedy single-cell approach degrades champion quality, especially in later rounds. This highlights the necessity of intra-round exploration and diversity preservation in adversarial evolution.

E. Code-Generality Predictiveness

Linear regression from text embeddings to final generality achieves R^2 \approx 0.46, evidencing nontrivial structure in the code underlying robustness and offering a foothold for future surrogate modeling and interpretability.

6. Broader Implications and Limitations

The DRQ approach positions Core War as a controllable, Turing-complete arena for adversarial program evolution. The algorithm demonstrates that even minimal self-play—sequential LLM-based generation, MAP-Elites-based search, and historical evaluation—yields robust generalist solutions. Advantages include clear parallels to cybersecurity arms races, safe sandboxed experimentation, and insight into open-ended adaptation (Kumar et al., 6 Jan 2026).

Several limitations are acknowledged:

  • The linear lineage (one champion per round) does not fully recapitulate the diversity and concurrency of complex ecosystems.
  • Computational expense may constrain scalability; surrogate-based fitness approximation may be required.
  • Behavioral descriptor and core parameter choices may bias evolutionary search; domain-agnostic alternatives remain an open question.

Ultimately, Core War’s integration with modern LLM-driven and quality-diversity techniques illustrates the transition from brittle specialization under static objectives to robust generalism via continual adaptation, and suggests that similar minimal Red Queen dynamics may have application in cybersecurity defense, red-teaming, and even models of biological resistance.
