Novelty-Search Self-Play (NSSP) in Multi-Agent Systems

Updated 10 July 2025
  • NSSP is an algorithmic paradigm that emphasizes behavioral diversity by valuing the novelty of strategies over immediate performance.
  • It employs dynamic archives and embedding-based novelty metrics to quantify and retain distinct policies in multi-agent environments.
  • Empirical studies demonstrate NSSP’s ability to explore unconventional tactics, offering advantages in reinforcement learning, adversarial simulations, and AI safety.

Novelty-Search Self-Play (NSSP) is an algorithmic paradigm that integrates behavioral diversity objectives into self-play systems, typically within multi-agent or adversarial domains. Unlike traditional self-play, which iteratively refines policies based on performance-driven criteria such as win rates or reward maximization, NSSP deliberately prioritizes the discovery and retention of novel strategies, regardless of their immediate effectiveness. The core premise is that open-ended exploration, rather than exclusive focus on exploitation, enables the emergence of diverse, sometimes unforeseen, solutions and can overcome limitations posed by local optima or premature convergence inherent to classical self-play frameworks (Dharna et al., 9 Jul 2025).

1. Conceptual Foundations and Motivation

NSSP is rooted in the principle of searching for agent behaviors that are maximally distinct from those previously encountered, rather than those that directly maximize an externally supplied objective function (Woolley et al., 2012). This approach is motivated by findings in evolutionary computation demonstrating that reliance on explicit objectives can be deceptive—leading to search processes that ignore promising stepping stones merely because they do not immediately appear to contribute toward the end goal (Woolley et al., 2012). By leveraging mechanisms that reward novelty—formally measured as sparseness in a defined behavior or policy space—NSSP seeks to maintain and expand the diversity of strategies, which classical self-play often fails to guarantee due to its inherent competitive convergence (Dharna et al., 9 Jul 2025).

The paradigm finds practical justification in competitive and coevolutionary systems, where arms races between agents can stall when evolution converges on narrow sets of behaviors. NSSP has therefore been adopted to promote continual innovation and deeper coverage of the strategic policy space (Gomes et al., 2014). Foundation Model Self-Play (FMSP), as recently proposed, recognizes that even powerful self-play systems powered by foundation models may stagnate in local optima unless explicit diversity objectives are interleaved with competitive improvement (Dharna et al., 9 Jul 2025).

2. Algorithmic Mechanisms and Novelty Metrics

The operational core of NSSP lies in the use of a novelty metric that quantifies the behavioral or representational distinctness of candidate strategies relative to an archival population. A canonical formulation is as follows:

$$\text{novelty}(p) = \frac{1}{k}\sum_{i=1}^{k} \| E(p) - E(p_i) \|_2$$

where $p$ is a candidate policy, $E(\cdot)$ is an embedding function (for instance, realized via a code or behavior embedding model), and $\{p_i\}_{i=1}^{k}$ are the $k$-nearest neighbors of $p$ in the archive (Dharna et al., 9 Jul 2025).

In FMSP, candidate code-policies generated by the foundation model are embedded into an $n$-dimensional space (e.g., using OpenAI’s text-embedding-3-small model), enabling efficient nearest-neighbor search and scalable novelty determination (Dharna et al., 9 Jul 2025). Policies are added to the archive if their novelty exceeds a pre-specified threshold or rank, independent of their performance.
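
A minimal sketch of this novelty computation is shown below, assuming policy embeddings have already been produced as NumPy vectors; the function names and the threshold-based acceptance rule are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def novelty_score(candidate_emb: np.ndarray, archive_embs: np.ndarray, k: int = 5) -> float:
    """Mean Euclidean distance from the candidate to its k nearest archive members."""
    if len(archive_embs) == 0:
        return float("inf")  # an empty archive makes any candidate maximally novel
    dists = np.linalg.norm(archive_embs - candidate_emb, axis=1)
    nearest = np.sort(dists)[: min(k, len(dists))]
    return float(nearest.mean())

def accept(candidate_emb: np.ndarray, archive_embs: np.ndarray,
           threshold: float, k: int = 5) -> bool:
    # Acceptance in pure NSSP depends only on the novelty score, never on performance.
    return novelty_score(candidate_emb, archive_embs, k) > threshold
```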

Traditional novelty search in evolutionary computation follows similar principles, relying on behavioral characterizations specific to the domain (e.g., trajectories in a maze, swarm clustering properties) and using Euclidean or domain-specific distances for novelty assessment (Woolley et al., 2012); (Gomes et al., 2013). The use of dynamic archives, regularly updated with generated behaviors, was found to be critical for preventing cyclical search and for counterbalancing exploration biases that arise from inappropriate metrics or nonlinear mappings between genotype and behavior space (Salehi et al., 2022).
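
One common mechanism for keeping such a dynamic archive well behaved is to adapt the entry threshold to the rate of recent additions; the sketch below illustrates that heuristic as an assumption, not the specific rule used in the cited works.

```python
def adapt_threshold(threshold: float, added_this_gen: int,
                    max_added: int = 10, min_added: int = 1,
                    up: float = 1.2, down: float = 0.95) -> float:
    """Raise the novelty threshold when the archive grows too fast, lower it when it stagnates."""
    if added_this_gen > max_added:
        return threshold * up    # too permissive: demand more novelty for entry
    if added_this_gen < min_added:
        return threshold * down  # too strict: relax the entry criterion
    return threshold
```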

3. Diversity-Driven Self-Play in Practice

Applied within foundation-model-driven self-play, NSSP operates by prompting the code-generation model with not only current population policies but also archived exemplars. The foundation model is encouraged to synthesize policies that are dissimilar to both recently generated and historical ones, increasing the probability of uncovering radically different strategies (e.g., exploration of reinforcement learning, tree search, or heuristic frameworks within the same domain) (Dharna et al., 9 Jul 2025).

A typical NSSP iteration proceeds as follows (a minimal code sketch appears after the list):

  1. Sample a candidate policy from the foundation model, utilizing a context composed of current and archived policies.
  2. Embed this policy in a vector space via $E(\cdot)$.
  3. Locate the $k$ nearest archive neighbors and compute the mean distance to them as the novelty score.
  4. If the score surpasses the threshold, archive the policy without regard to its performance.
  5. Repeat, accumulating a library of highly diverse strategies across the policy space.
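
The loop above can be condensed into a short, self-contained sketch; `generate_policy` and `embed` stand in for the foundation-model call and the embedding model, and all class and function names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

class PolicyArchive:
    """Toy policy archive keyed by embeddings; names and structure are illustrative."""
    def __init__(self):
        self.policies = []      # e.g., source-code strings of candidate policies
        self.embeddings = []    # matching embedding vectors

    def novelty(self, emb: np.ndarray, k: int = 5) -> float:
        # Mean Euclidean distance to the k nearest stored embeddings (the formula in Section 2).
        if not self.embeddings:
            return float("inf")
        dists = np.linalg.norm(np.stack(self.embeddings) - emb, axis=1)
        return float(np.sort(dists)[: min(k, len(dists))].mean())

    def add(self, policy, emb: np.ndarray) -> None:
        self.policies.append(policy)
        self.embeddings.append(emb)

def nssp_loop(generate_policy, embed, archive: PolicyArchive,
              threshold: float, k: int = 5, iterations: int = 100) -> PolicyArchive:
    for _ in range(iterations):
        candidate = generate_policy(context=archive.policies)  # step 1: condition on archive
        emb = embed(candidate)                                  # step 2: embed the policy
        if archive.novelty(emb, k=k) > threshold:               # step 3: novelty score
            archive.add(candidate, emb)                         # step 4: archive on novelty alone
    return archive                                              # step 5: diverse strategy library
```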

This procedure contrasts with "Vanilla FMSP," which maintains only the current best policy per population and updates solely via competitive outcomes, and with "Quality-Diversity Self-Play" (QDSP), which mixes novelty with performance: QDSP replaces an archive entry if and only if a newly generated, sufficiently similar policy exhibits superior game performance (Dharna et al., 9 Jul 2025).
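
The difference between the two acceptance rules can be made explicit in a small sketch; the branch of `qdsp_decision` that adds dissimilar policies as new entries follows the usual quality-diversity pattern and is an assumption beyond the description above.

```python
def nssp_decision(novelty: float, threshold: float) -> str:
    # Pure NSSP: admit on novelty alone; performance is never consulted.
    return "add" if novelty > threshold else "discard"

def qdsp_decision(dist_to_nearest: float, similarity_radius: float,
                  candidate_perf: float, incumbent_perf: float) -> str:
    # QDSP-style rule: a sufficiently similar newcomer replaces the incumbent
    # only if it performs better; a dissimilar newcomer occupies a new niche.
    if dist_to_nearest >= similarity_radius:
        return "add"
    if candidate_perf > incumbent_perf:
        return "replace"
    return "discard"
```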

4. Empirical Outcomes and Strategic Implications

Experimental evaluation in tasks such as Car Tag (a continuous-control pursuer-evader setting) and Gandalf (an AI safety simulation red-teaming LLM defenses) reveals that NSSP delivers broad coverage of the policy space, rapidly populating the "QD-Map" with unique strategies. This indicates a high rate of exploratory innovation in both algorithmic design and emergent tactics. For example, in Car Tag, NSSP produced varied controllers using Monte Carlo tree search, Q-Learning, and hand-coded heuristics, some of which were previously unexplored by competitive self-play (Dharna et al., 9 Jul 2025).

However, an observed trade-off arises: while NSSP ensures strategic diversity, individual policies in the archive may not be locally optimized for performance, and the archive may contain many low-quality or unrefined strategies. By design, the algorithm forgoes performance-driven selection, focusing exclusively on maximizing behavioral novelty. Empirical evidence in the cited paper shows that pure NSSP often achieves superior diversity but may have lower mean win rates than approaches that mix or alternate novelty and quality optimization (such as QDSP) (Dharna et al., 9 Jul 2025); (Gomes et al., 2014).

In adversarial and AI safety domains, NSSP’s capacity to generate unorthodox or creative attacks (e.g., successful LLM jailbreaking strategies in Gandalf) demonstrates the importance of exploration across policy space, particularly when facing systems robust to conventional attack patterns (Dharna et al., 9 Jul 2025).

5. Relationships to Prior Evolutionary and Multi-Agent Research

Novelty-Search Self-Play is a direct extension of behavioral novelty search from evolutionary computation to self-play and multi-agent learning. Foundational work established the effectiveness of novelty-based search in solving deceptive or multi-modal problems where objective-driven methods quickly stagnate (Woolley et al., 2012); (Gomes et al., 2013). In coevolutionary systems, hybridizations that combine fitness (exploitation) and novelty (exploration)—such as Progressive Minimal Criteria Novelty Search (PMCNS) and linear scalarization—have been shown to balance coverage and competitiveness, preventing excessive "fruitless" exploration (Gomes et al., 2014).
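
As an illustration of such hybridization, a linear scalarization simply blends the two signals; the weight `rho` and the implicit normalization of both terms are illustrative choices rather than values from the cited paper.

```python
def scalarized_score(fitness: float, novelty: float, rho: float = 0.5) -> float:
    """Blend exploitation (fitness) and exploration (novelty).

    rho = 0 recovers pure objective-driven search; rho = 1 recovers pure novelty search.
    Both inputs are assumed to be normalized to a comparable range.
    """
    return (1.0 - rho) * fitness + rho * novelty
```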

Further research reveals that the use of a dynamic archive—rather than population-only reference—plays a corrective role, enabling both expansion into new areas and controlled backtracking as required to cover the behavior space thoroughly (Salehi et al., 2022). Intrinsic motivation techniques, such as novelty-producing synaptic plasticity (Yaman et al., 2020) and information-theoretic shaping of representational spaces (Tao et al., 2020), have also provided complementary mechanisms to enhance diversity-driven self-play in sparse- or deceptive-reward settings.

Recent hybrid frameworks, exemplified by adaptive allocation between exploratory and exploitative niches (e.g., the Explore-Exploit $\gamma$-Adaptive Learner, EyAL (Segal et al., 2022)), suggest that dynamically shifting the population focus based on progress measurements can maximize overall system utility, potentially informing future NSSP variants that selectively interleave novelty-driven exploration and performance optimization.

6. Implementation Considerations and Computational Aspects

Implementation of NSSP, especially at the foundation model scale, requires system infrastructure for (a) code- or policy-embedding generation, (b) efficient nearest-neighbor novelty computation in high-dimensional archival space, and (c) archive management and thresholding to prevent uncontrolled archive growth.

Efficient embedding models such as text-embedding-3-small enable scalable policy representation and retrieval in code-based NSSP applications, while vector indices (e.g., using FAISS or ScaNN) allow high-speed $k$-nearest-neighbor queries on large archives (Dharna et al., 9 Jul 2025). The archive must retain all policies regardless of outcome, as performance is not used as a filter or replacement criterion (contrasting QDSP). A plausible implication is that additional sub-archival mechanisms or time-based pruning may be needed if computational resources are constrained, though this is not detailed in the underlying paper.
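
A minimal sketch of such a FAISS-backed archive index is given below; the 1536-dimensional embedding size matches OpenAI's documented default for text-embedding-3-small, but that value and all names here are assumptions for illustration.

```python
import numpy as np
import faiss  # pip install faiss-cpu

DIM = 1536                      # assumed embedding dimensionality
index = faiss.IndexFlatL2(DIM)  # exact L2 index; search returns squared distances

def add_to_archive(embedding: np.ndarray) -> None:
    index.add(embedding.reshape(1, DIM).astype("float32"))

def novelty(embedding: np.ndarray, k: int = 5) -> float:
    """Mean Euclidean distance to the k nearest archived embeddings."""
    if index.ntotal == 0:
        return float("inf")
    sq_dists, _ = index.search(embedding.reshape(1, DIM).astype("float32"),
                               min(k, index.ntotal))
    return float(np.sqrt(sq_dists[0]).mean())
```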

The chosen novelty threshold and batch size govern the rate of strategic turnover and the density of coverage in policy space. In high-dimensional or open-ended settings, appropriate metric selection, ensuring alignment between embedding-space geometry and actual behavioral difference, is critical to effective novelty computation (Salehi et al., 2022); (Woolley et al., 2012).

7. Broader Implications, Limitations, and Future Directions

NSSP provides an algorithmic solution to the well-known exploration-exploitation dilemma in multi-agent learning, offering a means to seed open-ended innovation in strategic domains where conventional self-play is limited by local optima. Its utility is pronounced in adversarial, deceptive, and safety-critical domains, as well as in creative co-design applications such as procedural content generation (Beukman et al., 2022).

A key limitation, consistently reported, is that pure exploration often leads to archives saturated with unrefined or sub-optimal strategies. Hybridizations (e.g., QDSP, PMCNS, EyAL) appear necessary to combine the creative strength of NSSP with the strategic rigor of iteration-driven self-play (Dharna et al., 9 Jul 2025); (Gomes et al., 2014); (Segal et al., 2022). Optimal management of the novelty archive, metric learning in high-dimensional code or behavior spaces, and adaptation to high-stakes, resource-constrained settings remain significant directions for further research.

In sum, Novelty-Search Self-Play expands the frontier of multi-agent and foundation-model-driven strategy innovation by formalizing strategic diversity as a first-class objective. While the approach can drastically improve coverage and the serendipitous discovery of agent behaviors, it must often be combined with quality-oriented selection mechanisms to produce solutions that are both novel and effective (Dharna et al., 9 Jul 2025); (Gomes et al., 2014); (Woolley et al., 2012).