- The paper introduces a multi-agent testbed, WereWolf-Plus, that provides refined role-specific metrics and enhanced evaluation compared to DSGBench.
- It employs diverse roles and modular configurations to assess AI agents’ social reasoning, cooperation, and strategic decision-making.
- Experimental results show that the Deepseek-V3 model outperforms the other evaluated LLMs across roles, and that experience-retrieval augmentation further strengthens agents' strategic reasoning.
Highlights of "WereWolf-Plus: An Update of Werewolf Game setting Based on DSGBench"
The paper "WereWolf-Plus: An Update of Werewolf Game setting Based on DSGBench" introduces an advanced multi-agent testbed, WereWolf-Plus, designed to evaluate the strategic reasoning of LLM-based agents in complex social settings. WereWolf-Plus extends the limitations of existing platforms by providing a multi-model, multi-dimensional evaluation of agents engaged in the Werewolf game. Using a framework that supports heterogeneous role assignments and enriched evaluation metrics, the paper aims to assess key capabilities such as social reasoning, cooperation, and strategic influence.
Framework Overview and Features
The foundation of WereWolf-Plus is a robust simulation environment that adheres closely to the standard rules of the Werewolf game. The framework supports scenarios with special roles such as Seer, Witch, Hunter, Guard, and Sheriff, enabling a nuanced analysis of agents' skills across differing roles.
Figure 1: The Complete Game Flow of WereWolf-Plus.
Key features include role-specific evaluations and agent models configured to reflect a spectrum of reasoning and cooperation strategies. Crucially, the simulation environment is highly extensible, supporting a wide array of configurations and customized model assignments.
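To make this concrete, below is a minimal sketch of what such a configurable, heterogeneous setup could look like. The class, field names, and model identifiers are illustrative assumptions for this summary, not the framework's actual API.

```python
# Hypothetical sketch of a configurable Werewolf game setup.
# Class and field names are illustrative, not WereWolf-Plus's actual code.
from dataclasses import dataclass, field


@dataclass
class GameConfig:
    """Declares the role set and which LLM backs each player."""
    num_players: int = 9
    roles: list[str] = field(default_factory=lambda: [
        "Werewolf", "Werewolf", "Werewolf",
        "Seer", "Witch", "Hunter", "Guard",
        "Villager", "Villager",
    ])
    # Heterogeneous assignment: different players can run on different models.
    model_assignments: dict[int, str] = field(default_factory=lambda: {
        i: "deepseek-v3" for i in range(9)
    })
    enable_sheriff_election: bool = True


config = GameConfig()
config.model_assignments[3] = "gpt-4o-mini"  # e.g., pit different models against each other
assert len(config.roles) == config.num_players
```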
Quantitative Evaluation Metrics
The paper proposes distinct character-oriented and player-oriented metrics that effectively capture the intricate dynamics of Werewolf gameplay. Metrics include:
- Seer Success Rate: Tracks how often the Seer's night checks correctly identify a werewolf.
- Witch Effectiveness: Assesses the quality of the Witch's decisions when saving or poisoning players.
- Sheriff Influence Score: Measures how much the Sheriff's recommendations shift other players' votes.
These metrics offer comprehensive insight into role-specific performance, helping to identify strengths and weaknesses in the agents' decision-making.
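As a rough illustration, the sketch below shows how metrics of this kind could be computed from recorded game logs. The log schema, field names, and exact formulas are assumptions made for the example and may differ from the paper's definitions.

```python
# Illustrative metric computations over a list of per-game records.
# The log schema below is assumed for this sketch, not taken from the paper.

def seer_success_rate(games: list[dict]) -> float:
    """Fraction of Seer night checks that landed on a werewolf."""
    checks = [c for g in games for c in g["seer_checks"]]
    if not checks:
        return 0.0
    return sum(c["target_is_werewolf"] for c in checks) / len(checks)


def witch_effectiveness(games: list[dict]) -> float:
    """Fraction of Witch potion uses (save or poison) that helped the village,
    e.g., saving a villager or poisoning a werewolf."""
    uses = [u for g in games for u in g["witch_actions"]]
    if not uses:
        return 0.0
    return sum(u["benefited_village"] for u in uses) / len(uses)


def sheriff_influence(games: list[dict]) -> float:
    """Average shift in votes toward the Sheriff's recommended target after
    the Sheriff speaks (positive means the Sheriff swayed the vote)."""
    shifts = [g["votes_after_recommendation"] - g["votes_before_recommendation"]
              for g in games if g.get("sheriff_alive")]
    return sum(shifts) / len(shifts) if shifts else 0.0
```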
Improvements Over DSGBench
WereWolf-Plus addresses significant gaps identified in previous benchmarks such as DSGBench. It ensures enhanced evaluation depth through:
- Full inclusion of additional roles and a broader set of game configurations.
- A set of refined assessment metrics tailored for each role, improving diagnostic precision.
- Improved modularity and rule adherence, thereby ensuring robustness and accuracy of simulations.
Experimental Insights
Experiments demonstrate that the Deepseek-V3 model performs consistently better across varied roles than other evaluated LLMs such as Doubao and GPT-4o-mini. An experience-retrieval augmentation mechanism further boosted agents' strategic reasoning by leveraging historical interaction data to inform future decisions.
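The sketch below illustrates one way such an experience-retrieval loop could be wired up: before acting, the agent retrieves the most similar past situations and prepends them to its decision prompt. The embedding function, similarity measure, and prompt format are assumptions for illustration rather than the paper's actual mechanism.

```python
# Minimal sketch of experience-retrieval augmentation. The embedding callable
# and prompt wording are placeholders, not the paper's implementation.
import numpy as np


class ExperienceStore:
    def __init__(self, embed):
        self.embed = embed          # callable: str -> np.ndarray
        self.keys: list[np.ndarray] = []
        self.experiences: list[str] = []

    def add(self, situation: str, outcome: str) -> None:
        """Record a past game situation and how it turned out."""
        self.keys.append(self.embed(situation))
        self.experiences.append(f"Situation: {situation}\nOutcome: {outcome}")

    def retrieve(self, situation: str, k: int = 3) -> list[str]:
        """Return the k stored experiences most similar to the current situation."""
        if not self.keys:
            return []
        q = self.embed(situation)
        sims = [float(q @ key / (np.linalg.norm(q) * np.linalg.norm(key)))
                for key in self.keys]
        top = np.argsort(sims)[-k:][::-1]
        return [self.experiences[i] for i in top]


def build_prompt(game_state: str, store: ExperienceStore) -> str:
    """Augment the current decision prompt with retrieved past experiences."""
    context = "\n\n".join(store.retrieve(game_state))
    return (f"Relevant past experiences:\n{context}\n\n"
            f"Current game state:\n{game_state}\nDecide your action.")
```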
Conclusion and Future Directions
WereWolf-Plus stands as a critical advancement for testing LLM agents within a social deduction game environment. By enabling precise evaluation and strategic refinement, this platform opens avenues for exploring cooperative and competitive interactions in AI. Future research could expand by incorporating more diverse strategic scenarios, exploring cross-agent dynamics, and refining retrieval strategies for enhanced decision-making.
In summary, the paper provides a comprehensive framework for evaluating AI's social intelligence, offering a path forward in developing agents capable of nuanced interaction and strategic reasoning.