LLMs in Wargames: Human vs Machine
- The paper demonstrates that LLMs achieve domain-specific superhuman reasoning speed in menu-driven, analytical wargames while underperforming in creative, adversarial tasks.
- The topic is defined by a taxonomy of wargames that contrasts deterministic, analytical simulations with open-ended narrative exercises, highlighting quantifiable metrics like win-rate differentials and Elo ratings.
- LLM agents use structured pipelines with strategic planners and tactical executors, yet show limits in situation synthesis and opponent modeling, and they produce inconsistent narratives in open-ended simulations.
Human vs. Machine: LLMs and Wargames
LLMs have emerged as autonomous agents capable of participating in, adjudicating, and even designing wargames ranging from rule-bound analytical simulations (e.g., chess) to open-ended, narrative-driven strategic exercises. The critical question is whether, and under what conditions, LLMs can rival or surpass human performance in the cognitive, social, and adversarial dimensions of wargaming. Rigorous quantitative and qualitative comparisons across several experimental platforms reveal both marked capabilities and prominent failure modes. As of 2026, LLMs are superhuman in specific respects (reasoning speed and breadth) yet remain substantially inferior to humans in creative, adversarial, and high-order strategic contexts.
1. Taxonomies and Types of Wargames: Analytical vs. Creative Dimensions
An ontology of wargames is defined by the axes of player flexibility (analytical: menu actions; creative: open-ended natural language) and adjudicator flexibility (analytical: deterministic, rules-based; creative: narrative, expert, or generative) (Matlin et al., 21 Sep 2025). This 2×2 classification yields four canonical classes:
- Analytical/Analytical: Chess, Go, StarCraft II. Human–machine comparison emphasizes search/exploration efficiency and tactical reasoning. LLMs, when combined with search (ChessGPT), underperform state-of-the-art RL agents by Elo differentials of 200–400 (Matlin et al., 21 Sep 2025, Feldman et al., 2020).
- Analytical/Creative: Free Kriegsspiel, naval Fleet Problems. Players operate within constrained menus; outcomes are narrated by humans or LMs based on expertise or scenario rules.
- Creative/Analytical: Diplomacy with rule-driven adjudication; social-deduction games like Avalon. LLMs can achieve near-parity in negotiation-rich games (Meta's Cicero matches strong humans on supply centers), but trail in social-deduction success rates (ChatGPT: 35% vs. human: 70%) (Matlin et al., 21 Sep 2025).
- Creative/Creative: Tabletop RPGs, matrix wargames, seminar exercises; both narrative and action are expressed in unconstrained language. LLMs as player or adjudicator introduce coherence and safety risks: human DMs show <5% rule violations, while LMs require extensive correction (20–40% hallucinations) (Matlin et al., 21 Sep 2025, Hogan et al., 2024).
These distinctions clarify that LLMs perform best as players when actions are menu-driven or adjudication is analytical, and that their performance and safety degrade as creative or narrative degrees of freedom are granted to either role.
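The 2×2 classification can be written down as a small lookup, which makes the two axes explicit. This is only an illustrative sketch: the names `Flexibility`, `QUADRANTS`, `classify`, and `creative_degrees` are hypothetical and do not come from the cited paper; the example games are the ones listed above.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Flexibility(Enum):
    ANALYTICAL = "analytical"  # menu actions / deterministic, rules-based adjudication
    CREATIVE = "creative"      # open-ended language / narrative or generative adjudication


# Example games per quadrant, keyed by (player flexibility, adjudicator flexibility).
QUADRANTS: Dict[Tuple[Flexibility, Flexibility], List[str]] = {
    (Flexibility.ANALYTICAL, Flexibility.ANALYTICAL): ["Chess", "Go", "StarCraft II"],
    (Flexibility.ANALYTICAL, Flexibility.CREATIVE): ["Free Kriegsspiel", "naval Fleet Problems"],
    (Flexibility.CREATIVE, Flexibility.ANALYTICAL): ["Diplomacy", "Avalon"],
    (Flexibility.CREATIVE, Flexibility.CREATIVE): ["Tabletop RPGs", "matrix wargames", "seminar exercises"],
}


def classify(player: Flexibility, adjudicator: Flexibility) -> str:
    """Return the quadrant label in the 'Player/Adjudicator' form used in the list."""
    return f"{player.value.capitalize()}/{adjudicator.value.capitalize()}"


def creative_degrees(player: Flexibility, adjudicator: Flexibility) -> int:
    """Count how many of the two roles are granted creative freedom (0, 1, or 2)."""
    return sum(f is Flexibility.CREATIVE for f in (player, adjudicator))
```

For example, `classify(Flexibility.CREATIVE, Flexibility.ANALYTICAL)` yields the label used for Diplomacy, and `creative_degrees` gives a rough ordinal for how much open-endedness a setup grants.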
2. Agent Architectures and Decision-Making Methodologies
LLM-based wargame agents employ structured pipelines that parallel human strategic cognition. For quantitative, multi-agent games, a two-layer architecture is standard (Sun et al., 2023):
- Strategic Planner: synthesizes the global state, memory, and reflection into high-level plans.
- Tactical Executor: for each agent, refines the collective plan using that agent's local information.
Decision cycles integrate chain-of-thought decomposition, memory/RAG (Retrieval-Augmented Generation) support, adversarial simulation (e.g., light forward sim for Unciv/CivAgent (Wang et al., 28 Feb 2025)), and social skill libraries for diplomacy/deception.
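A minimal sketch of the two-layer decision cycle is given below, assuming the model is available as a plain prompt-to-text function. The class names, prompt wording, three-entry memory window, and the `echo_llm` stand-in are all hypothetical and are not taken from Sun et al.; a real agent would also add retrieval, forward simulation, and skill libraries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt text in, completion text out


@dataclass
class StrategicPlanner:
    llm: LLM
    memory: List[str] = field(default_factory=list)

    def plan(self, global_state: str, reflection: str) -> str:
        """Combine global state, recent memory, and reflection into one high-level plan."""
        prompt = (
            f"Global state: {global_state}\n"
            f"Memory: {' | '.join(self.memory[-3:])}\n"
            f"Reflection: {reflection}\n"
            "Produce a high-level plan."
        )
        plan = self.llm(prompt)
        self.memory.append(plan)  # keep the plan so later cycles can refer back to it
        return plan


@dataclass
class TacticalExecutor:
    llm: LLM
    agent_id: str

    def act(self, plan: str, local_obs: str) -> str:
        """Refine the shared plan into one action using this agent's local view."""
        prompt = f"Plan: {plan}\nAgent {self.agent_id} sees: {local_obs}\nChoose one action."
        return self.llm(prompt)


def decision_cycle(
    planner: StrategicPlanner,
    executors: List[TacticalExecutor],
    global_state: str,
    local_obs: Dict[str, str],
    reflection: str = "",
) -> Dict[str, str]:
    """Run one planner step, then one executor step per agent."""
    plan = planner.plan(global_state, reflection)
    return {e.agent_id: e.act(plan, local_obs[e.agent_id]) for e in executors}


def echo_llm(prompt: str) -> str:
    """Stand-in model for testing only: returns the last line of the prompt."""
    return prompt.splitlines()[-1]
```

Keeping the planner and executors behind the same `LLM` callable means the pipeline can be exercised with a stub before any real model is attached.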
In open-ended qualitative environments, architectures such as Snow Globe (Hogan et al., 2024) treat the environment as a dynamic history sequence, with agent policies and LLM-based Control for scenario adjudication. Stochasticity is injected via sampling temperature, and persona conditioning modulates narrative but shows limited efficacy in simulating genuine inter-human disagreement or group trait effects (Lamparth et al., 2024).
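To make the role of sampling temperature concrete, the sketch below applies a temperature-scaled softmax to a set of action scores. This is the standard technique in general form; in practice the temperature acts on token logits inside the model, and the action-level scores and function names here are illustrative assumptions only.

```python
import math
import random
from typing import Dict, Optional


def temperature_softmax(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {action: score / temperature for action, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {action: math.exp(score - peak) for action, score in scaled.items()}
    total = sum(weights.values())
    return {action: weight / total for action, weight in weights.items()}


def sample_action(
    logits: Dict[str, float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one action from the temperature-scaled distribution."""
    rng = rng or random.Random()
    probs = temperature_softmax(logits, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
```

With a low temperature the top-scored action is chosen almost every time, so repeated runs of a scenario look alike; raising the temperature spreads probability across alternatives and produces more varied playthroughs.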
3. Human–Machine Behavioral Comparisons in Strategic Simulations
Empirical studies across quantitative and qualitative wargames reveal both convergences and divergences between LLM and human play:
| Domain | Human Performance | LLM Performance | Key Differences |
|---|---|---|---|
| Chess Moves (Feldman et al., 2020) | Baseline: 188,324 moves, legality >99% | GPT-2: 155,394 moves, legality 99.7%, r=0.978 corr. | Slight pawn/queen bias in LLM; rare illegal moves |
| Analytical Wargame (Sun et al., 2023) | RL baselines: DQN/PPO win-rates 30–50% | LLM agent: 60–75% win-rate (prompt-optimized) | LLMs > RL in understandability, adaptation |
| Team Crisis Sim (Lamparth et al., 2024) | 21 teams, N=107 experts, agg. score 0.28 | GPT-3.5: Agg. 0.35 (no dialog); GPT-4: 0.30 | LLMs more aggressive, insensitive to persona |
| Diplomacy (Matlin et al., 21 Sep 2025) | Experienced teams, SC ≈ 24 per game | LM+search: ≈25 SC/game; Avalon ChatGPT 35% | LM success in negotiation; poor in social deduction |
| Policy Gen (WGSR) (Yin et al., 12 Jun 2025) | Elite: 92.3; Professional: 80.7 (score) | GPT-4.1: 60.0; coalition gap: 33 points | LLMs lag humans in high-order reasoning |
LLMs approximate human aggregate move distributions in chess and menu-driven games, and generate plausible English for strategic plans and orders. In crisis simulations, they agree with humans on 50–60% of choices (F1_micro ≈ 0.54), but diverge on escalation, sequence consistency, and trait representation (Solopova et al., 2 Mar 2026, Lamparth et al., 2024). In fully open-ended wargames, LLM output coherence, creativity, and safety remain substantially below human moderators and players.
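One way to read the micro-F1 agreement figure is as a pooled overlap between the sets of actions chosen by humans and by the model on each move. The sketch below assumes multi-select action menus and treats the human choices as the reference; the cited studies' exact protocol is not reproduced here, and the action names are invented for illustration.

```python
from typing import Iterable, Set, Tuple


def micro_f1(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1 over (human_choices, model_choices) pairs, one pair per move."""
    tp = fp = fn = 0
    for human, model in pairs:
        tp += len(human & model)   # actions both selected
        fp += len(model - human)   # actions only the model selected
        fn += len(human - model)   # actions only the humans selected
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Three hypothetical moves with invented action labels.
example_moves = [
    ({"sanctions", "talks"}, {"sanctions", "blockade"}),
    ({"talks"}, {"talks"}),
    ({"hold"}, {"strike"}),
]
agreement = micro_f1(example_moves)  # 2 shared, 2 model-only, 2 human-only -> 0.5
```

Because counts are pooled before the ratio is taken, moves with many selected actions weigh more heavily than moves with a single choice.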
4. Strengths, Limitations, and Emergent Properties of LLMs as Wargame Agents
Strengths:
- Rapid Adaptation: In iterated Prisoner's Dilemma (IPD) tournaments, LLMs rival or surpass the best classical strategies in payoff (≈ 2.85), adapting to opponent strategy switches faster than humans (3.7 vs. 5.4 rounds) (Singh et al., 5 Sep 2025).
- Natural-Language Explainability: Agents produce readable, audit-friendly rationales, enabling transparent oversight and human-AI mixed initiative (Sun et al., 2023).
- Scenario Generality: Prompt-tuned LLMs generalize across map layouts and tasks without retraining, unlike RL agents.
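The payoff and adaptation figures in the Rapid Adaptation item can be made concrete with a toy IPD harness. The sketch assumes the conventional payoff matrix (T=5, R=3, P=1, S=0), which the text above does not specify, and measures adaptation as the number of rounds after an opponent's switch before the first defection; both choices, and all function names, are illustrative rather than the cited study's protocol.

```python
from typing import Callable, List, Optional, Tuple

# Assumed conventional IPD payoffs: (row player, column player).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

Strategy = Callable[[List[str], List[str]], str]  # (own history, opponent history) -> "C" or "D"


def tit_for_tat(own: List[str], opp: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return opp[-1] if opp else "C"


def switcher(switch_round: int) -> Strategy:
    """Opponent that cooperates, then defects from `switch_round` (0-indexed) onward."""
    def strategy(own: List[str], opp: List[str]) -> str:
        return "C" if len(own) < switch_round else "D"
    return strategy


def play(a: Strategy, b: Strategy, rounds: int) -> Tuple[float, float, List[str]]:
    """Return mean per-round payoffs for a and b, plus a's move history."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    total_a = total_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a, hist_b), b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a / rounds, total_b / rounds, hist_a


def adaptation_lag(history: List[str], switch_round: int) -> Optional[int]:
    """Rounds played after the switch before the first defection in `history`."""
    for offset, move in enumerate(history[switch_round:]):
        if move == "D":
            return offset
    return None
```

Running `play(tit_for_tat, switcher(5), 10)` and passing the returned history to `adaptation_lag(..., 5)` shows tit-for-tat reacting one round after the switch; an LLM-backed strategy could be dropped in behind the same `Strategy` signature for comparison.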
Limitations:
- Situation Synthesis: Performance drops catastrophically when >3 environment variables must be integrated (synthesis: 85.9% human; 37.5% best AI) (Yin et al., 12 Jun 2025).
- Opponent Modeling: Proficient at trait induction but poor at deductive simulation and high-risk counterfactual reasoning (Yin et al., 12 Jun 2025).
- Multi-Agent Coordination: Coalition planning scores lag humans by ≈33 points; interdependent goal reasoning remains unsolved (Yin et al., 12 Jun 2025).
- Adversarial and Escalatory Bias: LLMs are sensitive to prompt and pretraining artifacts, sometimes exhibiting over-escalation, insensitivity to explicit persona cues, and lack of argumentative conflict in simulated dialogs (Lamparth et al., 2024, Solopova et al., 2 Mar 2026).
- Hallucination & Fidelity: In open-ended settings, narrative outcomes may be inconsistent, ahistorical, or divorced from underlying scenario logic (Hogan et al., 2024, Matlin et al., 21 Sep 2025).
5. Prompt Engineering, Safety, and Evaluative Best Practices
LLM intelligence and behavior in wargames are acutely prompt-dependent (Sun et al., 2023, Lamparth et al., 2024). Prompt structure, persona specificity, and example inclusion directly affect performance, escalation propensity, and dialog structure. Explicit adjudication prompts, persona-visible histories, and structured reasoning blocks enhance output quality, but cannot fully compensate for underlying model limitations.
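As a rough illustration of the prompt elements mentioned above (a persona, a visible history, and a structured reasoning block), a prompt could be assembled along the following lines. The section labels, the function name `build_prompt`, and the response format are hypothetical and are not the templates used in the cited studies.

```python
from typing import List


def build_prompt(persona: str, history: List[str], options: List[str]) -> str:
    """Assemble a prompt with persona, numbered history, options, and a reasoning block."""
    history_block = "\n".join(f"{i + 1}. {event}" for i, event in enumerate(history)) or "(none)"
    option_block = "\n".join(f"- {option}" for option in options)
    return (
        f"## Persona\n{persona}\n\n"
        f"## History\n{history_block}\n\n"
        f"## Options\n{option_block}\n\n"
        "## Response format\n"
        "Reasoning: <step-by-step assessment>\n"
        "Decision: <exactly one option from the list>"
    )
```

Fixing the layout in code keeps the prompt stable across runs, so that changes in behavior can be attributed to the persona or history rather than to incidental wording differences.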
Safety protocols recommended in the cited work include:
- Audit transparency: logging prompt, seed, model version (Matlin et al., 21 Sep 2025)
- Confidence calibration and adversarial robustness testing (paraphrases, adversarial prefixes)
- Cross-model critique (auxiliary LMs)
- Human-in-the-loop for red-line enforcement, scenario complexity, and narrative realism
- Limiting system-prompt mutability in production deployments
Failure to adhere to these practices can lead to rapid performance degradation, spurious escalation, or incoherent reasoning in crisis domains.
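The audit-transparency item (logging prompt, seed, and model version) could be implemented as an append-only JSON Lines log, sketched below. The record fields beyond those three, the prompt hash, and the names `AuditRecord` and `append_record` are assumptions made for illustration, not a format prescribed by the cited work.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    prompt: str
    seed: int
    model_version: str
    output: str

    def to_json(self) -> str:
        """Serialize the record with a SHA-256 digest of the prompt."""
        record = asdict(self)
        # The digest makes it easy to detect prompt changes between runs.
        record["prompt_sha256"] = hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()
        return json.dumps(record, sort_keys=True)


def append_record(path: str, record: AuditRecord) -> None:
    """Append one record as a single line to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(record.to_json() + "\n")
```

Storing one JSON object per line keeps the log easy to append to during a run and easy to filter afterwards, for example by model version or by prompt digest.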
6. Benchmarking and Future Research Directions
WGSR-Bench (Yin et al., 12 Jun 2025) and allied benchmarks now provide composite, multi-domain evaluation of strategic reasoning across environment awareness, risk modeling, and policy generation. Elite human scores in policy synthesis (92.3) remain significantly ahead of GPT-4.1 (60.0), especially in coalition and sequential games. Catastrophic drops in context-synthesis and opponent-modeling suggest architectural limitations—lack of memory augmentation, insufficient agent-specific attention, and absence of explicit causal or counterfactual modules.
Continued progress requires:
- Multi-module architectures with memory-augmented transformers and agent-specific subnets
- Training regime diversification—adversarial, high-risk, and coalition-centric curricula
- Human–AI collaborative meta-architectures with LLMs supporting trait inference and scenario breadth, operationalized by humans for final decision synthesis (Yin et al., 12 Jun 2025, Solopova et al., 2 Mar 2026)
- Standardized, domain-level benchmarking with human-expert evaluation for open-ended qualitative wargames (Matlin et al., 21 Sep 2025, Hogan et al., 2024)
Methodological improvements in transparency, agent design, and real-time hybrid deployments are key to closing—though not erasing—the current human–machine gap in adversarial, creative, and high-order strategic wargaming.
7. Implications for Operational, Policy, and Research Practice
LLMs as wargame agents deliver substantial scalability and rapid scenario exploration—generating hundreds of plausible outcomes for complex, real-world strategic dilemmas in less than the time required for a single human session (Hogan et al., 2024). However, their limitations in adversarial reasoning, escalation calibration, and argumentative diversity preclude direct substitution for human expertise in military, crisis, or high-policy domains. Cautious operational integration—emphasizing augmentation over automation, with rigorous pre-deployment stress-testing and real-time human oversight—is strongly recommended by expert studies (Lamparth et al., 2024).
Research open problems include: robust long-horizon planning; counterfactual and multi-agent simulation at scale; systematized benchmarks for qualitative, human-in-the-loop evaluation; adversarial robustness; and methods for extracting faithful, interpretable rationales for regulatory and operational audit.
In summary, LLMs have become valuable, if limited, tools for wargaming analysis and experimentation. Their role is best conceived as accelerators of breadth, draftsmanship, and alternative mapping, with humans retaining ultimate authority for depth, adversarial nuance, and public accountability in strategic decision-making.