Best LLM role-play granularity for wargame simulations

Determine whether instructing large language models (such as GPT-3.5 or GPT-4) to simulate a player team as a whole, to simulate multiple characters jointly within a single prompt, or to role-play as individual agents with nuanced roles yields better results for wargame experiments that compare LLM-simulated decisions against human expert teams in a U.S.–China National Security Council crisis scenario.

Background

The paper compares decisions made by 107 national security experts in a U.S.–China crisis wargame to those produced by LLM-simulated teams, finding broad similarities but also systematic differences in specific actions, aggressiveness, and dialogue quality.

The authors observe that LLM-simulated dialogues are unnaturally harmonious and insensitive to player background attributes, and that outcomes are sensitive to prompt design choices, including whether dialogue is simulated. These observations motivate the unresolved question of which prompting setup—team-level simulation, multi-character simulation, or individual agent role-play—produces superior results in similar wargaming experiments.
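The three prompting setups under comparison can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual prompts: the scenario text, role names, and prompt wording are all hypothetical, and the message format assumed is the common chat-style list of role/content dictionaries.

```python
# Hypothetical sketch of the three prompting granularities discussed above.
# All prompt wording, role names, and the scenario text are illustrative,
# not taken from the paper.

SCENARIO = "A U.S.-China crisis is unfolding; the NSC must choose a response."
ROLES = ["National Security Advisor", "Secretary of Defense", "Secretary of State"]

def team_level_prompt(scenario):
    """Team-level simulation: one prompt, the model answers as the whole
    team, with no simulated dialogue between members."""
    return [{"role": "system",
             "content": "You are a U.S. National Security Council player team. "
                        "Return the team's agreed course of action."},
            {"role": "user", "content": scenario}]

def multi_character_prompt(scenario, roles):
    """Multi-character simulation: one prompt, the model itself writes the
    dialogue among all characters before stating a joint decision."""
    cast = ", ".join(roles)
    return [{"role": "system",
             "content": f"Simulate a deliberation among: {cast}. "
                        "Write their dialogue, then state the joint decision."},
            {"role": "user", "content": scenario}]

def individual_agent_prompts(scenario, roles):
    """Individual agent role-play: one prompt per agent; a separate
    orchestration loop (not shown) would relay messages between them."""
    return [[{"role": "system",
              "content": f"You are the {r}. Argue from your role's perspective."},
             {"role": "user", "content": scenario}]
            for r in roles]
```

The key design difference is where the deliberation happens: inside a single model call (the first two setups) or across several calls coordinated externally (the third), which is what allows each agent to hold a distinct, nuanced view.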

References

Also, it is still unclear whether tasking the LLMs to simulate a player team, a combination of characters, or to role-play as individuals with a more nuanced view would yield better results for similar experiments \citep{Shanahan2023}.

Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations  (2403.03407 - Lamparth et al., 2024) in Section: Discussions, paragraph 2