Werewolf Arena Framework

Updated 16 November 2025
  • Werewolf Arena Framework is a suite of systems combining algorithmic design, reinforcement learning, and modular role definitions for social deduction games.
  • It employs hybrid architectures integrating language-based inference, external reasoning, and multimodal inputs to handle deception and partial observability.
  • The framework sets standardized evaluation protocols with metrics such as win-rate control, role inference accuracy, and decision alignment to benchmark agent performance.

The Werewolf Arena Framework encompasses a suite of algorithmic, modular, and benchmarking systems aimed at evaluating and training LLM agents within the social deduction game Werewolf and its variants. The framework captures the central challenges—language-mediated inference, deception, coordination, and adaptive strategic play—and provides both agent design methodologies and standardized evaluation protocols. Its development reflects a progression from pure language-generation agents to hybrid architectures involving reinforcement learning, modular role specialization, external reasoning engines, and multimodal components.

1. Formal Structure and Problem Definition

The underlying game, Werewolf (and its subvariants such as One Night Ultimate Werewolf [ONUW]), is formalized as a multi-agent, asymmetric, partially observable environment. Each agent is privately assigned a role from a finite set (Werewolf, Villager, Seer, Witch, Hunter, Guard, etc.), with action spaces including language utterances, targeted skills (kills, investigations, saves, poisons), and voting. The state for each agent consists of their private knowledge, the evolving public transcript, observable actions, and—in advanced variants—retrieved memory, experience, and multimodal cues (e.g., facial expressions and prosody). The overall objective functions are team-dependent and asymmetric: Werewolves win by outnumbering or eliminating all non-wolves; Villagers and special roles win by eliminating all wolves.
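
To make the formalization concrete, a minimal per-agent observation structure might look like the sketch below; the type and field names (AgentObservation, retrieved_memory, multimodal_cues) are illustrative assumptions rather than any cited framework's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional

class Role(Enum):
    WEREWOLF = auto()
    VILLAGER = auto()
    SEER = auto()
    WITCH = auto()
    HUNTER = auto()
    GUARD = auto()

@dataclass
class AgentObservation:
    """Per-agent view of the partially observable game state (illustrative)."""
    player_id: int
    private_role: Role                      # hidden from all other agents
    public_transcript: List[str]            # utterances visible to everyone
    observed_actions: List[str]             # votes, deaths, announced skill results
    retrieved_memory: List[str] = field(default_factory=list)   # optional experience pool
    multimodal_cues: Optional[dict] = None  # e.g. facial-expression / prosody features

def team_of(role: Role) -> str:
    """Asymmetric team objectives: werewolves versus everyone else."""
    return "werewolf" if role is Role.WEREWOLF else "village"
```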

Architecturally, most frameworks decompose agent behavior into sequential or parallel modules, typically:

  • Predictor: infers hidden roles from language and action traces.
  • Decider: generates the next action or strategic intent, often conditioned on agent beliefs, performance constraints, or a reward signal.
  • Discussor/Presenter: produces human-interpretable utterances aligned with Decider output.
  • Thinker module (in dual-system frameworks): performs deliberate, System 2-style reasoning over structured state, with outputs guiding the language-generating modules.

Game progress is discretized into day and night phases, with public discussion, role-privileged actions, and voting, proceeding until a terminal win condition.
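
A highly simplified sketch of this loop is given below; the env and agent interfaces are invented for illustration, but the module calls follow the Predictor/Decider/Discussor decomposition listed above.

```python
def run_game(agents, env):
    """Skeleton of the discretized day/night loop (hypothetical interfaces)."""
    while not env.terminal():
        # Night phase: role-privileged actions (kills, investigations, saves).
        for agent in env.alive(agents):
            obs = env.observe(agent)
            beliefs = agent.predictor(obs)                    # infer hidden roles
            env.apply(agent, agent.decider(obs, beliefs, phase="night"))

        # Day phase: public discussion, then a vote on whom to eliminate.
        for agent in env.alive(agents):
            obs = env.observe(agent)
            intent = agent.decider(obs, agent.predictor(obs), phase="day")
            env.broadcast(agent, agent.discussor(obs, intent))  # human-readable utterance

        votes = {agent: agent.decider(env.observe(agent),
                                      agent.predictor(env.observe(agent)),
                                      phase="vote")
                 for agent in env.alive(agents)}
        env.resolve_votes(votes)

    return env.winner()
```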

2. Agent Architectures: Modular and Hybrid Approaches

Recent frameworks implement specialized agent modules to address the core inference, planning, and communication demands:

  • DVM (Dynamic Victory Manager): Separates the pipeline into (i) Predictor (ChatGLM3-6B, fine-tuned on FanLang-9, further refined via self-play); (ii) Decider (policy head over role-action embeddings, with illegal-action masking and win-rate constraint); and (iii) Discussor (prompted, untrained ChatGLM3-6B). Only the Predictor and Decider are updated via RL and supervised learning. DVM uniquely enables explicit proficiency adjustment by using win-rate-constrained reward shaping. The explicit softmax action filtering and LoRA adapter-based fine-tuning enable efficient training and policy control (Zhang et al., 12 Jan 2025).
  • Language Agents with Reinforcement Learning: Each agent first deduces roles (LLM-parsed, with confidence and citation extraction), then generates diverse candidate actions/utterances, and finally invokes an RL policy (MAPPO) to select among these based on encoded context. The use of candidate diversity, RL-based selection, and population-based training mitigates bias and improves strategic robustness (Xu et al., 2023); a sketch of this candidate-then-select pattern appears after this list.
  • Dual-System Reasoning Frameworks: Here, a System 1 LLM is responsible for speech ingestion and generation (Listener/Presenter), while an external “Thinker” (System 2)—optimized via behavioral cloning and PPO—conducts discrete reasoning, role attribution, and strategic planning. Information is exchanged between modules via serialized “language feature” and “speech instruction” matrices, ensuring consistency and compliance with legal move sets (Wu et al., 4 Feb 2024).
  • Multimodal and Theory of Mind Modules: The MultiMind framework introduces a Perceiver module for extracting multimodal features (face, tone, verbal actions), a Transformer-based ToM block that models agent suspicions as a belief matrix, and an MCTS planner that searches for strategies minimizing suspicion. This approach leverages both neural and symbolic inference atop text, video, and audio signals (Zhang et al., 25 Apr 2025).
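
The candidate-then-select pattern from the RL-augmented language agents (Xu et al., 2023) can be sketched as follows; the llm and policy objects are hypothetical interfaces, so this illustrates the decoupling of language diversity from strategic selection rather than reproducing the paper's code.

```python
import torch

def select_action(llm, policy, context, num_candidates=5):
    """Generate diverse candidates with the LLM, then let an RL-trained scorer choose."""
    # 1. Diverse candidate utterances/actions from the language model.
    candidates = [llm.generate(context, temperature=1.0) for _ in range(num_candidates)]

    # 2. Encode each (context, candidate) pair and score it with the learned policy head.
    features = torch.stack([policy.encode(context, c) for c in candidates])
    scores = policy.score(features)          # shape: (num_candidates,)

    # 3. Pick the highest-scoring candidate at evaluation time.
    return candidates[torch.argmax(scores).item()]
```

During training one would typically sample from a distribution over the scores rather than take the argmax, so that the policy update (e.g., MAPPO) retains exploration.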

3. Reinforcement Learning and Policy Optimization

Most Werewolf Arena frameworks employ RL to move beyond static, prompt-driven LLM outputs:

  • Policy Optimization: Decider/Policy modules are updated using PPO, MAPPO, or conservative Q-learning (CQL). For DVM, the PPO objective incorporates win-rate-constrained reward shaping, per-step survival bonuses, and a decision-chain reward sourced from a database of human games.
  • Reward Engineering: Rewards range from raw win/loss signals (+100/–100), skill and deduction bonuses, and individualized behavioral scores to fine-grained, chain-level outcome-based payoffs. DVM specifically introduces a reward term that penalizes deviation from a user-specified win rate, allowing explicit control over agent difficulty and performance variability (Zhang et al., 12 Jan 2025); a sketch of such shaping appears after this list.
  • Candidate Generation and Selection: Agents may synthesize multiple candidate utterances per context, with RL-trained scoring layers selecting the most contextually advantageous move, thus decoupling language diversity from strategic optimization (Xu et al., 2023).
  • Curriculum and Population-Based Learning: Some frameworks incorporate gradual tightening of constraints or variable opponent pools (with hand-crafted “styles”) to robustify agent strategies and support curriculum-like developmental progression.
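
The win-rate-constrained shaping mentioned above can be written as a single scalar reward; the functional form and coefficients below are assumptions in the spirit of DVM (Zhang et al., 12 Jan 2025), not the paper's exact objective.

```python
def shaped_reward(win, target_win_rate, empirical_win_rate,
                  survival_bonus=0.0, chain_reward=0.0, lam=1.0):
    """Illustrative reward: game outcome plus bonuses, minus a win-rate-deviation penalty."""
    outcome = 100.0 if win else -100.0
    # Penalize deviation of the agent's running win rate from the user-specified target,
    # which lets the trainer dial proficiency up or down rather than always maximizing it.
    win_rate_penalty = lam * abs(empirical_win_rate - target_win_rate)
    return outcome + survival_bonus + chain_reward - win_rate_penalty
```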

4. Evaluation Protocols and Benchmarking

Evaluation is grounded in tournament play, synthetic self-play, and, in some frameworks, human-agent trials. The main metrics include:

  • Role inference accuracy: ACC@k for werewolf or identity prediction (e.g., DVM achieves ACC@1 = 0.908 for werewolf identification, surpassing GPT-4 at 0.805); a sketch of this metric appears after this list.
  • Win-rate controllability: Empirical alignment of actual win rate with target rates under explicit constraints; only DVM demonstrates monotonic controllability (e.g., win rate tracking target from 0.2 to 0.8).
  • Speech and Decision Evaluation: Multiple-choice and alignment-based task sets, e.g. WereAlign/WereBench (Song et al., 13 Oct 2025), test agents’ ability to match human or expert strategies on social-deduction dimensions (role inference, deception, persuasion, etc.), and on concrete action choices (votes, inferred opponent identities).
  • Opinion leadership: The Sheriff framework quantifies reliability (R) and influence (I) of recommendation, highlighting LLMs’ limited capacity to shift peer actions in multi-agent settings (Du et al., 2 Apr 2024).
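
One plausible reading of ACC@k for werewolf identification is the fraction of games in which a true werewolf appears among an agent's top-k suspects; the helper below encodes that reading and is an illustrative assumption, not a definition taken verbatim from the cited papers.

```python
def acc_at_k(ranked_suspects_per_game, true_wolves_per_game, k=1):
    """Game-level top-k hit rate for werewolf identification (illustrative definition)."""
    hits = 0
    for ranked, wolves in zip(ranked_suspects_per_game, true_wolves_per_game):
        if any(player in wolves for player in ranked[:k]):
            hits += 1
    return hits / len(ranked_suspects_per_game)

# Example: three games, top-1 suspect correct in two of them -> ACC@1 = 0.667
preds = [[4, 2], [1, 3], [6, 0]]
wolves = [{4}, {5}, {6}]
print(round(acc_at_k(preds, wolves, k=1), 3))
```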

Experimental setups typically simulate 7–12 player boards with standard role distributions, various LLM agent types and versions, and alternation of team assignments over dozens to hundreds of games per configuration.

Performance Table (DVM vs. Baseline):

| Agent | Werewolf Win Rate | Villager Win Rate | Other Role Win Rate |
|---|---|---|---|
| DVM | 66.6% | 63.3% | 53.3% |
| Thinker (baseline) | 63.3% | 36.6% | 36.6% |

Emergent behaviors reported include trust formation, confrontation, camouflage, leadership, and, in stronger frameworks, self-calibration of skill level.

5. Extensibility, Modularity, and Best Practices

Frameworks such as WereWolf-Plus (Xia et al., 15 Jun 2025) emphasize extensibility:

  • Role definitions are modular (RoleConfig), with constraints, parameters, and allowed skills as explicit fields. New roles, custom team sizes, and board setups can be scripted in config files or DSLs; an illustrative schema appears after this list.
  • Model assignment can be heterogeneous: different LLM endpoints per role, facilitating ablation studies and comparative benchmarks.
  • Retrieval-augmented generation (RAG) is supported for history and experience pool lookup, enabling more context-sensitive decision-making in large-scale simulations.
  • Evaluation pipelines are designed for multi-dimensional metrics, including skill effectiveness (SeerScore, WitchScore, etc.), player-oriented ability (IRP, KRE, VSS), and social influence (SheriffScore).
  • Experience pools and reflection mechanisms are used for non-parametric improvement, with retrieval based on similarity to current game state or belief summaries (Xu et al., 2023).
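
An illustrative role schema in the spirit of WereWolf-Plus's modular RoleConfig is sketched below; the field names, defaults, and example board are assumptions for exposition, not the framework's actual configuration API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RoleConfig:
    """Modular role definition: team, skills, constraints, and model assignment."""
    name: str
    team: str                                # "werewolf" or "village"
    allowed_skills: List[str] = field(default_factory=list)
    constraints: Dict[str, int] = field(default_factory=dict)   # e.g. per-game skill uses
    llm_endpoint: str = "default-llm"        # heterogeneous model assignment per role

# Example board fragment scripted entirely from config objects.
EXAMPLE_BOARD = [
    RoleConfig("Werewolf", "werewolf", ["night_kill"]),
    RoleConfig("Seer", "village", ["investigate"]),
    RoleConfig("Witch", "village", ["save", "poison"], {"save": 1, "poison": 1}),
    RoleConfig("Villager", "village"),
]
```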

Best practices include enforcing game rules via masking, comprehensive logging, ablation diagnostics (decision-chain reward, predictor, diversity), and formalizing fairness auditing for intra-camp or cross-role imbalances.
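
Rule enforcement via masking, mentioned above, is commonly implemented by removing illegal actions from the policy's distribution before sampling; the sketch below is a generic pattern rather than any specific framework's code.

```python
import torch

def masked_action_distribution(logits, legal_actions):
    """Set illegal-action logits to -inf so the softmax assigns them zero probability."""
    mask = torch.full_like(logits, float("-inf"))
    mask[legal_actions] = 0.0
    return torch.softmax(logits + mask, dim=-1)

# Example: five possible vote targets, but only players 1 and 3 are alive and targetable.
logits = torch.randn(5)
probs = masked_action_distribution(logits, legal_actions=[1, 3])
print(probs)   # zero probability mass on players 0, 2, and 4
```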

6. Limitations and Open Challenges

Despite advances, several challenges remain:

  • Agent Ceiling: Most frameworks can only modulate win rates up to the intrinsic skill ceiling of the agent/architecture; safe, reliable performance control near maximal or minimal levels remains non-trivial (Zhang et al., 12 Jan 2025).
  • Reward Table Coverage: Decision-chain reward mechanisms depend on sufficiently dense coverage of historical strategy-outcome pairs; novel or rare strategies may be under-incentivized.
  • Limited Persuasion: Current LLMs struggle to sway peer agents solely via argument quality, reflecting both architectural and data limitations (Du et al., 2 Apr 2024).
  • Human vs. Agent Discrepancy: While some frameworks meet or exceed average human-level play in isolated metrics, qualitative gaps in deception and counterfactual reasoning persist, as revealed in fine-grained speech/decision alignment benchmarks (Song et al., 13 Oct 2025).

Proposed directions include explicit memory modules, multi-call pipelines to separate public/private reasoning, curriculum- and regret-based training paradigms, robust reward shaping, and cross-game extensibility (e.g., Avalon, Resistance, Diplomacy).

7. Summary Table of Key Frameworks

| Framework | Modularization Approach | RL Paradigm | Distinctive Feature | Citation |
|---|---|---|---|---|
| DVM | Predictor/Decider/Discussor | PPO, win-rate control | Explicit performance shaping | (Zhang et al., 12 Jan 2025) |
| Language Agents + RL | LLM deduce + candidate + RL select | MAPPO | Candidate diversity | (Xu et al., 2023) |
| MultiMind | Perceiver/ToM/MCTS/Actor | MCTS, ToM-guided | Multimodal, suspicion minimization | (Zhang et al., 25 Apr 2025) |
| Dual-System Reasoning | Listener/Thinker/Presenter | BC + PPO (Thinker) | External symbolic reasoner | (Wu et al., 4 Feb 2024) |
| WereWolf-Plus | Configurable roles & LLMs | Flexible | Role extensibility, RAG, metrics | (Xia et al., 15 Jun 2025) |
| Alignment Evaluation | N/A (benchmark) | N/A | Human-aligned QA/decision eval | (Song et al., 13 Oct 2025) |

Each of these frameworks operationalizes the vision of a "Werewolf Arena" as both a training ground for strategic, communicative LLMs and a methodological platform for rigorous, multi-faceted performance benchmarking in language-rich, incomplete-information environments.
