Foundation-Model Self-Play

Updated 10 July 2025
  • Foundation-Model Self-Play is a set of algorithms that use pretrained models to generate and refine code-based policies, promoting strategic diversity and innovation.
  • It replaces traditional neural updates with FM-driven code generation and evaluation in simulated contests to overcome local optima.
  • Applications include AI safety, multi-agent games, and automated capability discovery, demonstrating scalable, open-ended exploration.

Foundation-Model Self-Play (FMSP) refers to a collection of algorithmic frameworks that leverage the capabilities of large-scale, pretrained ("foundation") models to drive open-ended, iterative improvement through self-play. Unlike traditional self-play algorithms that primarily focus on direct reinforcement learning of policies through repeated contests among fixed neural network architectures, FMSP exploits the code-generation, reasoning, and diverse knowledge encapsulated in foundation models (FMs) to facilitate broad innovation, strategic diversity, and scalable self-improvement across complex environments and tasks (Dharna et al., 9 Jul 2025). This approach has been proposed as a means to accelerate strategy discovery, promote robustness, and automate capability and safety evaluation within multi-agent and decision-making domains.

1. Foundational Principles and Motivation

FMSP extends classic self-play (SP) by embedding foundation models, such as transformers pretrained on code, language, or multimodal data, directly within the self-play loop. The principal premise is that FMs can generate, critique, and refine high-level representations of policies (notably, composable code or modular agents), which can be evaluated in simulated or real competitive settings.

Key motivations for FMSP include:

  • Overcoming local optima inherent in traditional SP by enabling “leaps” between fundamentally different strategies using FM-driven code or policy innovation;
  • Curating not only high-performing agents but also a diverse spectrum of strategies via open-endedness and novelty-seeking mechanisms;
  • Exploiting foundation models' breadth of prior knowledge to generate richer and more creative policies that extend beyond those discoverable through incremental neural policy updates.

These principles distinguish FMSP from prior self-play approaches that predominantly focus on neural policy improvement through direct gradient-based or evolutionary updates in weight space (Dharna et al., 9 Jul 2025).

2. Methodological Variants

FMSP algorithms typically operationalize self-play at the code or symbolic policy level, rather than exclusively over neural network parameters. The main variants are:

a. Vanilla Foundation-Model Self-Play (vFMSP)

vFMSP follows the classic self-play paradigm but replaces neural network policy updates with FM-generated code-policy revisions. Seed policies (often hand-designed) are iteratively refined through FM-driven proposals, evaluated by head-to-head contest outcomes, and adopted if they demonstrate improved performance.
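A minimal sketch of this loop in Python, with the FM proposal and the head-to-head evaluation abstracted as caller-supplied functions (both are illustrative stand-ins, not the paper's exact interface):

```python
from typing import Callable

def vfmsp(seed_policy_code: str,
          propose_revision: Callable[[str], str],   # FM call: current code -> revised code
          win_rate: Callable[[str, str], float],    # head-to-head eval: (a, b) -> a's win rate
          iterations: int = 100) -> str:
    """Vanilla FM self-play: keep an FM-proposed revision only if it
    beats the incumbent in direct contests."""
    champion = seed_policy_code
    for _ in range(iterations):
        challenger = propose_revision(champion)
        if win_rate(challenger, champion) > 0.5:  # adopt only on demonstrated improvement
            champion = challenger
    return champion
```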

b. Novelty-Search Self-Play (NSSP)

NSSP leverages foundation models to produce strategies that maximize representational novelty. A policy archive is maintained, and the FM is tasked with creating new policy code that is maximally distinct (according to embedding distance metrics) from those already present. NSSP eschews performance-based replacement, focusing only on expanding strategic diversity.
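A sketch of the novelty criterion, assuming a hypothetical `embed` function that maps policy code to a vector (e.g., via a text-embedding model):

```python
import numpy as np

def most_novel(candidates: list[str], archive: list[str], embed) -> str:
    """NSSP selection sketch: prefer the candidate whose embedding is
    farthest from its nearest neighbor in the archive."""
    archive_embs = [embed(code) for code in archive]

    def novelty(code: str) -> float:
        # Classic novelty score: distance to the nearest archived policy.
        emb = embed(code)
        if not archive_embs:
            return float("inf")
        return min(float(np.linalg.norm(emb - e)) for e in archive_embs)

    return max(candidates, key=novelty)
```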

c. Quality-Diversity Self-Play (QDSP)

QDSP integrates elements of both competitive and novelty-driven approaches. It maintains an archive of policy code embeddings; newly generated policies from the FM are added if they are sufficiently novel, or replace similar archive members if they outperform them in direct evaluation. QDSP instantiates a “dimensionless” MAP-Elites process, removing the need for domain-specific behavioral descriptors, with embeddings providing a general measure of policy diversity (Dharna et al., 9 Jul 2025).

3. Technical Implementation

FMSP deploys foundation models as search or mutation operators, outputting explicit code modules (functions or classes) that define agent behavior.

The core process for each variant typically consists of:

  • Policy Representation: Policies are represented as code (e.g., Python classes or functions) mapping environment states $s$ to actions $a$, i.e., $\pi(s) = a$ (a minimal example follows this list).
  • Mutation/Generation: An FM receives context (existing policies, performance metrics, task descriptions), then proposes a new or modified policy implementation.
  • Safety and Execution: Generated code is sandboxed and validated via unit tests before competitive rollout.
  • Archive Update: The new policy’s embedding is computed (often via text-embedding models). For QDSP, if the policy is novel or dominates its closest neighbor in performance, it is added or replaces that neighbor.
  • Evaluation: Continuous tournaments or batch evaluations are run to assess policy strength (Elo ratings, win rates) and diversity (embedding coverage, QD-Score).
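To make the first three steps concrete, below is a hand-written heuristic of the kind an FM might generate for a pursuer-evader task, together with a minimal pre-rollout validation check; the state and action conventions here are illustrative assumptions:

```python
import math

class GreedyPursuer:
    """pi(s) = a: a code policy that steers directly toward the evader."""

    def act(self, state: dict) -> float:
        dx = state["evader_x"] - state["x"]
        dy = state["evader_y"] - state["y"]
        return math.atan2(dy, dx)  # action = heading angle toward the evader

def validate(policy_cls) -> bool:
    """Minimal sandbox-style unit test run before any competitive rollout:
    the policy must return a finite action on a dummy state without raising."""
    dummy = {"x": 0.0, "y": 0.0, "evader_x": 1.0, "evader_y": 1.0}
    try:
        action = policy_cls().act(dummy)
        return isinstance(action, float) and math.isfinite(action)
    except Exception:
        return False
```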

An illustrative update for archive management in QDSP is:

$$\text{if } d_{\text{embed}}(\pi_{\text{new}}, \text{archive}) > \delta \text{, add } \pi_{\text{new}} \text{; else if } R(\pi_{\text{new}}) > R(\pi_{\text{neighbor}}) \text{, replace } \pi_{\text{neighbor}}.$$

where $d_{\text{embed}}$ is the embedding distance metric and $R$ is the performance metric.
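A direct Python rendering of this rule, again treating `embed` and `win_rate` as assumed operators (a text-embedding model and a head-to-head evaluator, respectively):

```python
import numpy as np

def qdsp_update(archive: list[dict], new_code: str,
                embed, win_rate, delta: float) -> None:
    """QDSP archive step: add the policy if novel, otherwise let it
    challenge and possibly replace its nearest archive neighbor."""
    new_emb = embed(new_code)
    if not archive:
        archive.append({"code": new_code, "emb": new_emb})
        return
    dists = [float(np.linalg.norm(new_emb - m["emb"])) for m in archive]
    i = int(np.argmin(dists))
    if dists[i] > delta:                                 # d_embed > delta: new niche
        archive.append({"code": new_code, "emb": new_emb})
    elif win_rate(new_code, archive[i]["code"]) > 0.5:   # R(new) > R(neighbor)
        archive[i] = {"code": new_code, "emb": new_emb}  # replace dominated neighbor
```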

4. Empirical Results and Applications

FMSP approaches have been evaluated in domains that test both control and safety:

  • Car Tag: A continuous-control pursuer-evader task where FM-generated policies span reinforcement learning, tree search, heuristics, genetic algorithms, and model-predictive control modules. In this setting, QDSP and vFMSP outperformed strong human-engineered baselines by discovering effective and diverse strategies; NSSP achieved greater diversity at the expense of performance (Dharna et al., 9 Jul 2025).
  • Gandalf: An AI safety simulation in which FMSP-generated attackers evolve prompts and code to jailbreak LLM-based defenders, progressively defeating “defense levels” equipped with regex, classifier, and LLM-based filters. FMSP is capable of innovating new red-teaming tactics and, in two-sided settings, auto-generating defender patches in response.
  • Automated Capability Discovery (ACD): FMs are used in a scientist-subject framework to generate, cluster, and score thousands of novel task families, substantially extending the breadth of benchmarks and aiding safety evaluation (Lu et al., 11 Feb 2025).

In all experiments, foundation models' abilities to “leap” in policy space and synthesize code from vast prior knowledge proved advantageous for escaping locally optimal but narrow solutions and for accelerating open-ended exploration.

5. Theoretical and Practical Implications

FMSP introduces a new axis to multi-agent learning, bridging foundation model research and open-endedness:

  • Strategic Innovation: By working in code and symbolic space, FMSP departs from the incremental improvements of classic self-play, permitting the discovery of fundamentally new policy classes.
  • Dimensionless Diversity: Use of embedding-based novelty metrics enables domain-agnostic quality-diversity search, circumventing the need for expert-crafted behavioral descriptors.
  • Automated Self-Red-Teaming and Patching: Continuous cycles of attack and defense show practical utility for AI safety, with the system automatically exposing and patching vulnerabilities in LLMs (Dharna et al., 9 Jul 2025).
  • Overcoming Performance Plateaus: Through QDSP, FMSP maintains exploration pressure while still prioritizing incremental gains in policy quality.

A plausible implication is that FMSP can generalize beyond games to settings such as real-world robotics or negotiation, provided that policies can be encoded as executable modules and evaluated in controlled environments.

6. Limitations, Challenges, and Future Directions

FMSP also raises key challenges:

  • Evaluation and Metric Design: The abstraction of strategies as code necessitates general, scalable measures of both performance and diversity (e.g., embedding-based QD maps, Elo ratings across nontraditional tasks).
  • Computational Overhead: Executing and validating code-generated policies in the loop is more resource-intensive than neural weight updates. This may require distributed or cloud-based infrastructures.
  • Safe Execution: Sandboxing and rigorous unit-testing are necessary to prevent adversarial or unsafe code during rollouts.
  • Non-Stationarity and Convergence: As policies shift dramatically between iterations, ensuring stable arms-race dynamics without collapse or cycling remains an open problem.
  • Scalability: Extending FMSP to the scale of state-of-the-art FMs and highly complex, partially observable environments (e.g., real-world dialogue or control) remains a core area for future research.

Open questions include developing theoretical convergence guarantees for open-ended code-policy spaces and integrating FMSP with foundation model self-evaluation and automated capability discovery pipelines (Lu et al., 11 Feb 2025).

7. Relation to Broader Research and Adjacent Methods

FMSP draws on ideas from:

  • Quality-diversity (QD) and open-endedness research, notably dimensionless MAP-Elites approaches for maintaining diverse policy repertoires;
  • Population-based training, archival self-play, and curriculum learning as in policy space response oracles (PSRO) and neural evolution;
  • Large-model-based automated evaluation and benchmarking, as demonstrated in capability discovery frameworks that use FMs as both task generators and evaluators (Lu et al., 11 Feb 2025);
  • Recent proposals in self-play for preference optimization and direct Nash equilibrium search in LLM alignment (Wu et al., 1 May 2024).

The paradigm is distinct in leveraging foundation models both as generative search operators and as judges, and in representing policies as executable, human-readable modules rather than pure neural parameters.


In sum, Foundation-Model Self-Play comprises a new family of self-play algorithms in which foundation models function as generators and critics of high-level, code-based policies. By explicitly encouraging both performance and strategic novelty, and by utilizing embedding-based diversity without manual behavioral descriptors, FMSP enables open-ended strategy innovation, ranging from multi-agent games and AI safety red-teaming to systematic capability discovery. This suggests FMSP is a promising direction for advancing both the autonomy and creative breadth of modern AI systems, subject to scaling, stability, and evaluation challenges yet to be fully addressed (Dharna et al., 9 Jul 2025, Lu et al., 11 Feb 2025).