Catanatron Framework: AI Strategic Simulator
- Catanatron is a Python-based simulator that emulates Settlers of Catan, supporting AI research in strategic planning and benchmarking.
- It integrates custom AI agents using robust APIs, controlled randomness, and partial observability for systematic evaluation of long-horizon reasoning.
- The platform features autonomous multi-agent evolution, iterative self-improvement, and role specialization to refine in-game strategies without human intervention.
The Catanatron framework is an open-source, Python-based simulator that replicates the full rules and strategic landscape of the Settlers of Catan board game. Designed to expose long-horizon reasoning challenges, the framework enables integration of custom AI agents via robust APIs, structured data access, and fair benchmarking protocols. Catanatron supports core features such as partial observability, randomness (dice rolls), and multi-agent game play, providing a comprehensive environment for evaluating and evolving strategic planning abilities in LLM-based agents.
1. Experimental Architecture and Agent Progression
Catanatron has facilitated systematic benchmarking of increasingly sophisticated LLM agent architectures, which are explicitly defined and compared in a developmental sequence:
- BaseAgent: Receives raw structured game state as input and directly outputs an action, functioning without any prompt engineering or supplementary strategic context.
- StructuredAgent: Utilizes human-crafted prompts that arrange input information, enumerate available actions, and provide basic strategic principles such as resource prioritization and turn planning. These prompts are designed externally by experts and guide the LLM to improve upon naive decision-making.
- PromptEvolver: Implements a multi-agent iterative prompt evolution loop, wherein an Evolver agent systematically analyzes game outcomes and modifies the Player agent's prompt based on empirical feedback, summaries, and external resources. This process continues for up to 10 evolution cycles, iteratively optimizing prompt strategies to enhance long-term in-game planning.
- AgentEvolver: Orchestrates a full multi-agent self-evolving architecture, distributing the roles of Analyzer, Researcher, Strategizer, Coder, Player, and Evolver among specialized LLM agents. This architecture introduces autonomous prompt and code self-modification, enabling persistent adaptation beyond human-crafted guidance.
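The progression above can be illustrated with a minimal sketch. The class and field names below (`GameState`, `decide`, `PRIORITIES`) are hypothetical simplifications, not Catanatron's actual API: a BaseAgent maps raw state straight to an action, while a StructuredAgent encodes a human-crafted priority ordering.

```python
import random
from dataclasses import dataclass

# Hypothetical, simplified stand-ins for the paper's first two agent tiers;
# the real framework exposes a richer game/action API.

@dataclass
class GameState:
    """Minimal structured state: resources plus the actions legal this turn."""
    resources: dict
    playable_actions: list

class BaseAgent:
    """Maps raw state directly to an action, with no strategic context."""
    def decide(self, state: GameState) -> str:
        return random.choice(state.playable_actions)

class StructuredAgent(BaseAgent):
    """Wraps the state in a human-crafted priority order, a toy version of
    prompt-encoded principles like resource prioritization."""
    PRIORITIES = ["build_city", "build_settlement", "build_road", "end_turn"]

    def decide(self, state: GameState) -> str:
        for action in self.PRIORITIES:  # prefer higher-value builds first
            if action in state.playable_actions:
                return action
        return state.playable_actions[0]

state = GameState(resources={"wood": 2, "brick": 1},
                  playable_actions=["build_road", "end_turn"])
print(StructuredAgent().decide(state))  # prefers building over ending the turn
```

PromptEvolver and AgentEvolver then replace the fixed `PRIORITIES` heuristic with prompts and code that the agents revise themselves.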
2. Self-Evolution and Strategic Optimization
Self-evolution within Catanatron is defined as an autonomous process where LLM agents analyze outcomes, diagnose weaknesses, propose and implement improvements to prompts or code, and redeploy for further testing without human intervention. Operationally, the iterative evolution loop is instantiated as follows:
Let $P_t$ be the Player agent at evolution step $t$, and let $S(P_t)$ denote the corresponding average strategic score (such as victory points):

$$P_{t+1} = U(P_t, O_t), \qquad t = 0, 1, \dots$$

Performance improvement is monitored with $S(P_{t+1}) > S(P_t)$ as a target. Agents aim to achieve adaptive behaviors such as prioritizing high-value resources and adjusting to recurrent failure modes through repeated autonomous self-revision.
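The evolution loop can be sketched as follows. Both `avg_score` (a stand-in for benchmarking a prompt over a batch of games) and `evolve_prompt` (a stand-in for the Evolver's revision step) are hypothetical placeholders; only the loop structure, with up to 10 cycles and improvement-gated updates, mirrors the described process.

```python
# Illustrative sketch of the iterative evolution loop; avg_score and
# evolve_prompt are toy placeholders, not the paper's actual components.

def avg_score(prompt: str) -> float:
    """Placeholder benchmark: word count as a toy proxy for the average
    strategic score S(P_t) over a batch of games."""
    return float(len(prompt.split()))

def evolve_prompt(prompt: str, score: float) -> str:
    """Placeholder Evolver step: append a 'lesson' derived from feedback."""
    return prompt + " prioritize-high-value-resources"

def evolution_loop(prompt: str, max_cycles: int = 10) -> tuple[str, float]:
    best_prompt, best_score = prompt, avg_score(prompt)
    for t in range(max_cycles):
        candidate = evolve_prompt(best_prompt, best_score)
        score = avg_score(candidate)
        if score > best_score:  # keep only improvements: S(P_{t+1}) > S(P_t)
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

prompt, score = evolution_loop("play catan well")
print(score)  # → 13.0 (3 initial words + one appended token per cycle)
```

In the real system, `avg_score` would require replaying many games per cycle, which is where the computational cost discussed later comes from.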
3. Multi-Agent Collaboration and Role Specialization
The AgentEvolver architecture is characterized by explicit division of cognitive labor among LLM agents with specialized functions:
- Analyzer: Reviews gameplay logs to diagnose specific tactical and strategic deficiencies (e.g., failing to upgrade settlements, inefficient trading patterns).
- Researcher: Employs documentation and optional web tools to answer game strategy or code-related questions, leveraging broader knowledge resources.
- Strategizer: Synthesizes analytical outputs into coherent, actionable high-level plans, focusing on overarching objectives such as maximizing resource access or pursuing Largest Army bonuses.
- Coder: Translates strategic recommendations into concrete code or prompt modifications, thus affecting the low-level behavioral policy of the Player agent.
- Player: Executes the current agent logic within Catanatron to generate fresh gameplay data for ongoing analysis.
- Evolver: Integrates all findings, triggers update cycles, and maintains the evolutionary “memory” of prior game states, code versions, and learned strategies.
Agents maintain persistent contextual state between cycles, allowing for meta-learning and iterative refinement across generations. Collaborative modularization enables more systematic exploration and exploitation of strategic improvements relative to monolithic agent architectures.
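One full AgentEvolver cycle can be sketched by reducing each role to a function over a shared persistent memory. Everything here (function names, the `memory` dict, the stubbed log lines) is a hypothetical simplification; in the paper each role is a separate LLM agent.

```python
# Hypothetical sketch of one AgentEvolver cycle over a shared memory dict.

def player(memory):      # execute current logic, generating fresh gameplay data
    memory["log"] = ["lost: no city upgrades", "lost: inefficient trades"]

def analyzer(memory):    # diagnose deficiencies from the gameplay logs
    memory["issues"] = [line.split(": ")[1] for line in memory["log"]]

def researcher(memory):  # answer strategy questions about each issue (stubbed)
    memory["notes"] = {issue: f"advice on {issue}" for issue in memory["issues"]}

def strategizer(memory): # synthesize analysis into a high-level plan
    memory["plan"] = "upgrade settlements early; diversify trades"

def coder(memory):       # turn the plan into a prompt/code revision
    memory["prompt"] = memory.get("prompt", "") + " | " + memory["plan"]

def evolver_cycle(memory):
    """The Evolver triggers the update cycle; memory persists across cycles."""
    for role in (player, analyzer, researcher, strategizer, coder):
        role(memory)
    memory["generation"] = memory.get("generation", 0) + 1
    return memory

memory = evolver_cycle({})
print(memory["generation"], memory["plan"])
```

Because `memory` survives across calls to `evolver_cycle`, later generations can build on earlier diagnoses, which is the mechanism behind the meta-learning claim above.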
4. Empirical Evaluation and Strategic Behaviors
Quantitative performance is assessed using metrics such as average victory points (VP), win rate, and in-game development milestones (e.g., number of settlements, roads, armies, cities). The following table, reproduced from the paper, summarizes relative improvement over BaseAgent for different models and agent configurations:
| Agent | GPT-4o Δ% | Claude 3.7 Δ% | Mistral Δ% |
|---|---|---|---|
| BaseAgent | 0% | 0% | 0% |
| StructuredAgent | +6% | +11% | -31% |
| PromptEvolver | +22% | +95% | +3% |
| AgentEvolver | +36% | +40% | +34% |
PromptEvolver and AgentEvolver configurations achieved significant gains, with Claude 3.7 showing up to 95% improvement over BaseAgent by autonomously evolving prompts that capture more effective long-horizon strategic plans. All self-evolving agents developed emergent behaviors such as earlier city-building, diversified resource targeting, and improved negotiation, capabilities largely absent from static or prompt-engineered baselines.
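The Δ% figures have the standard relative-improvement form. The snippet below illustrates the computation with hypothetical average-VP values chosen only to reproduce the GPT-4o column; the paper reports the derived percentages, not these raw numbers.

```python
def delta_pct(vp_agent: float, vp_base: float) -> int:
    """Relative improvement over BaseAgent, in percent (rounded)."""
    return round(100.0 * (vp_agent - vp_base) / vp_base)

# Hypothetical average-VP values, for illustration only.
vp = {"BaseAgent": 4.0, "StructuredAgent": 4.24, "PromptEvolver": 4.88}
for name, v in vp.items():
    print(name, f"{delta_pct(v, vp['BaseAgent']):+d}%")
```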
Despite these advances, AgentEvolver did not surpass hand-crafted deterministic AlphaBeta bots, a limitation attributed in the paper to agent memory constraints and incomplete abstraction of long-term strategies.
5. Methodological and Algorithmic Considerations
Catanatron offers APIs and data structures that support systematic logging, fair comparison, and open-ended exploration. The framework's stochastic elements (dice, trade randomness) can be controlled for experimental reproducibility. The self-modifying LLM agent algorithms resemble evolutionary search, but leverage the LLMs’ capacity for both natural language and code generation, enabling self-analysis and revision entirely within the AI system.
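Controlling stochastic elements for reproducibility amounts to seeding the relevant RNGs. The sketch below uses a hypothetical seeded dice roller rather than Catanatron's actual configuration API: two experiments given the same seed observe the identical sequence of rolls.

```python
import random

def make_dice(seed: int):
    """Return a 2d6 roller with its own seeded RNG, so experiments sharing
    a seed see the identical sequence of dice rolls."""
    rng = random.Random(seed)
    return lambda: rng.randint(1, 6) + rng.randint(1, 6)

roll_a, roll_b = make_dice(42), make_dice(42)
rolls_a = [roll_a() for _ in range(5)]
rolls_b = [roll_b() for _ in range(5)]
print(rolls_a == rolls_b)  # identical sequences: True
```

Isolating each stochastic source in its own `random.Random` instance (rather than seeding the global RNG) keeps agent-side randomness from perturbing the game's dice stream across runs.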
At a high level, agent self-improvement is formalized as

$$P_{t+1} = U(P_t, O_t),$$

where $U$ represents the composite update: analyzing outcomes $O_t$, proposing a strategy, and revising code or prompts.
6. Limitations, Generalization, and Future Directions
The AgentEvolver architecture's success in outperforming static LLM agents is counterbalanced by several constraints:
- Plateauing below the performance of optimized search-based agents, indicating potential ceiling effects related to context window limitations or current LLM planning abstractions.
- Absence of persistent, scalable long-term memory mechanisms, which may limit multi-stage strategic execution.
- Significant computational and resource demands associated with repeated multi-agent iterative evolution.
All improvements were achieved autonomously, with no human-in-the-loop intervention after the initial architecture setup. The modular, specialization-driven framework is suggested to be generalizable to domains beyond Catan and can, in principle, exploit a wide array of LLM models and other complex strategic environments.
7. Synthesis and Implications
Catanatron serves as a high-fidelity platform for probing and advancing LLM strategic planning via autonomous, self-evolving multi-agent systems. Explicit division of labor among Analyzer, Researcher, Coder, Strategizer, Player, and Evolver agents enables the system to repeatedly diagnose weaknesses, devise solutions, revise its operating logic, and empirically validate progress over adaptive cycles. The approach demonstrates that LLM-based agents can progress from static “game players” to entities capable of recursively designing and improving both their own prompts and functional behaviors, opening new avenues for scalable and interpretable AI research in long-horizon strategic reasoning.