DSGBench: Multi-Genre LLM Agent Benchmark

Updated 20 May 2026
  • DSGBench is a comprehensive evaluation platform that benchmarks LLM-based agents across six strategic game genres using a POMDP formalization.
  • It integrates automated trajectory tracking and fine-grained metrics to assess dimensions such as strategic planning, real-time decision-making, social reasoning, team collaboration, and adaptive learning.
  • Advanced extensions like WereWolf-Plus demonstrate its capability for role-specific evaluations and retrieval-augmented reasoning to diagnose nuanced agent performance.

DSGBench ("Diverse Strategic Game Benchmark") is a comprehensive evaluation platform designed to rigorously assess LLM-based agents in complex, multi-faceted decision-making environments. Its architecture spans six distinct genres of strategic games, integrating fine-grained metrics and an automated trajectory tracking mechanism to supply multidimensional diagnostics of agent capability. The benchmark also serves as the foundation for extensions such as WereWolf-Plus, which increases the granularity and extensibility of LLM agent evaluation in multi-agent settings (Tang et al., 8 Mar 2025, Xia et al., 15 Jun 2025).

1. System Architecture and Formalization

DSGBench models each game instance as a Partially Observable Markov Decision Process (POMDP), utilizing the tuple $\langle W, S, A, O, T \rangle$, where $W$ denotes victory conditions or strategic objectives, $S$ the (partial) game state, $A$ the legal action space, $O$ the agent's observation space, and $T$ the environment's state transition function. Two distinct agent reasoning protocols are supported:

  • Single-level reasoning, with trajectory likelihood $p_\pi(\tau) = p(s_0)\prod_{t=0}^{T-1} p(a_t \mid s_t, f_t)\, T(s_{t+1} \mid s_t, a_t, f_t)$.
  • Two-level reasoning, separating high-level plan generation under global context from subsequent low-level action refinement, as formalized in $p_\pi(\tau)$ accordingly.
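The single-level likelihood above is a product of per-step policy and transition probabilities, which is usually evaluated in log space to avoid underflow. The following is a minimal sketch of that computation, not code from DSGBench: the function name and the toy probabilities are assumptions, and the conditioning term $f_t$ (not defined in this section) is treated as already folded into the supplied probabilities.

```python
import math


def trajectory_log_likelihood(p_s0, policy_probs, transition_probs):
    """Return log p_pi(tau) for one trajectory.

    p_s0             -- probability of the initial state, p(s_0)
    policy_probs     -- p(a_t | s_t, f_t) for t = 0 .. T-1
    transition_probs -- T(s_{t+1} | s_t, a_t, f_t) for t = 0 .. T-1
    """
    if len(policy_probs) != len(transition_probs):
        raise ValueError("need one policy and one transition probability per step")
    log_p = math.log(p_s0)
    for p_action, p_next in zip(policy_probs, transition_probs):
        # Each step contributes one policy factor and one transition factor.
        log_p += math.log(p_action) + math.log(p_next)
    return log_p


# Toy three-step trajectory; every probability here is made up for illustration.
log_p = trajectory_log_likelihood(
    p_s0=0.5,
    policy_probs=[0.8, 0.5, 0.25],
    transition_probs=[1.0, 0.5, 1.0],
)
likelihood = math.exp(log_p)  # 0.5 * 0.8 * 0.5 * 0.25 * 1.0 * 0.5 * 1.0
```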

The architectural core comprises three modules:

  • GameManager: Coordinates game execution, agent invocation, and logging.
  • GameEnv: Standardizes game APIs (following Gym conventions), state/action spaces, and rule enforcement.
  • HistoryTracker: Records all agent decisions, observations, and corresponding environment feedback for downstream metric calculations and behavior analysis.
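How the three modules might fit together can be sketched as follows. This is an illustrative toy, not the DSGBench implementation: the class bodies, method names (`run_episode`, `log`), the `CountdownEnv` environment, and the older Gym-style four-value `step` return are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryTracker:
    """Stores one record per step for later metric computation."""
    records: list = field(default_factory=list)

    def log(self, step, observation, action, reward):
        self.records.append(
            {"step": step, "observation": observation,
             "action": action, "reward": reward}
        )


class CountdownEnv:
    """Toy Gym-style environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return {"t": self.t}

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == "advance" else 0.0
        done = self.t >= self.horizon
        return {"t": self.t}, reward, done, {}


class GameManager:
    """Runs the game loop, invokes the agent, and logs every step."""

    def __init__(self, env, agent, tracker):
        self.env = env
        self.agent = agent
        self.tracker = tracker

    def run_episode(self):
        obs = self.env.reset()
        done, step, total = False, 0, 0.0
        while not done:
            action = self.agent(obs)  # the agent is any callable: obs -> action
            next_obs, reward, done, _info = self.env.step(action)
            self.tracker.log(step, obs, action, reward)
            obs, total, step = next_obs, total + reward, step + 1
        return total


tracker = HistoryTracker()
manager = GameManager(CountdownEnv(horizon=3), lambda obs: "advance", tracker)
total_reward = manager.run_episode()
```

Keeping the tracker separate from the environment means the same log format can be reused across games, which is what makes cross-genre metric computation possible.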

2. Strategic Game Coverage and Task Parameterization

DSGBench encompasses six games, each chosen to probe orthogonal cognitive faculties:

| Game Title | Genre/Emphasis | Key Features |
|---|---|---|
| StarCraft II | Real-Time Strategy (RTS) | Partial observability, macro/micro management, economy vs. combat tradeoffs |
| Civilization | Turn-Based 4X | Multi-objective, large action/state space, long-term planning |
| Street Fighter III | Real-Time Fighting | Reflex/testing, sequence optimization, limited context |
| Diplomacy | Social Strategy/Negotiation | Hidden intentions, alliance formation, text negotiation |
| Werewolf | Social Deduction/Team Reasoning | Private roles, iterative inference, cooperative/antagonistic play |
| Stratego | Hidden-Info Board Game | Uncertainty, deception, asymmetric information |

Each game instance supports scenario-level customization of difficulty, agent roles, parameter spaces (e.g., map, team composition), and access to individualized prompting templates, thereby facilitating systematic stress-testing along targeted axes. For instance, StarCraft II allows evaluation under "macro" vs. "rush" opponent strategies, and Werewolf supports flexible player/role configurations.
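Scenario-level customization of this kind is often expressed as a small configuration object that can be expanded into a grid of test conditions. The sketch below is hypothetical: `ScenarioConfig`, its field names, and `scenario_grid` are invented for illustration and do not reflect DSGBench's actual configuration schema.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ScenarioConfig:
    """One test condition (hypothetical fields, for illustration only)."""
    game: str
    difficulty: str
    opponent_strategy: str
    prompt_template: str = "default"


def scenario_grid(game, difficulties, opponent_strategies):
    """Enumerate every difficulty x opponent-strategy combination for a game."""
    return [
        ScenarioConfig(game, difficulty, strategy)
        for difficulty, strategy in product(difficulties, opponent_strategies)
    ]


# Stress-test along two axes: difficulty and the opponent's strategy.
grid = scenario_grid("StarCraft II", ["easy", "hard"], ["macro", "rush"])
```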

3. Fine-Grained Evaluation Metrics

DSGBench introduces a unified, normalized scoring formalism across five core cognitive dimensions:

  • Strategic Planning: Production/economy in RTS (e.g., Resource Per Minute), expansion in 4X, control in Diplomacy and Stratego.
  • Real-Time Decision-Making: APM/EPM in RTS, attack and combo rates in fighters.
  • Social Reasoning: Social inference precision (e.g., IRP in Werewolf, betrayal rates in Diplomacy).
  • Team Collaboration: Alliance formation/duration (Diplomacy), role coordination (Werewolf: KSR, VSS).
  • Adaptive Learning: Cross-scenario win rate, context/grounding accuracy.

The overall capability score $T$ aggregates normalized component metrics:

$$T = \sum_{i=1}^m W_i \cdot \beta_i \cdot \left( \sum_{j=1}^n w_j \cdot \frac{ (1/k_j)\sum_{k=1}^{k_j} R_{y_{j_k}} - \min_j R_{y_j} }{ \max_j R_{y_j} - \min_j R_{y_j}} \right)$$

Each sub-task in a game (run WW0 of scenario WW1, metric WW2) is weighted locally and globally.
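A small worked example may help make the aggregation concrete. The code below follows one reading of the formula: each outer term has a global weight $W_i$ and a coefficient $\beta_i$, and each inner term averages repeated runs, min-max normalizes the average, and applies a local weight $w_j$. The data layout, the function names, the interpretation of $\beta_i$ as a plain scaling factor, and all numbers are assumptions for illustration.

```python
def normalized_mean(runs, lo, hi):
    """Average the repeated runs, then min-max normalize with bounds lo and hi."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (sum(runs) / len(runs) - lo) / (hi - lo)


def capability_score(groups):
    """Compute T = sum_i W_i * beta_i * sum_j w_j * normalized_mean_j."""
    total = 0.0
    for group in groups:
        inner = sum(
            metric["w"] * normalized_mean(metric["runs"], metric["lo"], metric["hi"])
            for metric in group["metrics"]
        )
        total += group["W"] * group["beta"] * inner
    return total


# Two outer groups with made-up weights, bounds, and run results.
groups = [
    {"W": 0.5, "beta": 1.0, "metrics": [
        {"w": 0.5, "runs": [40, 60], "lo": 0, "hi": 100},    # normalized 0.50
        {"w": 0.5, "runs": [0.5, 1.0], "lo": 0, "hi": 1},    # normalized 0.75
    ]},
    {"W": 0.5, "beta": 1.0, "metrics": [
        {"w": 1.0, "runs": [2, 4], "lo": 0, "hi": 4},        # normalized 0.75
    ]},
]
score = capability_score(groups)  # 0.5 * 0.625 + 0.5 * 0.75 = 0.6875
```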

Specific to Werewolf, three metrics were originally defined, all normalized in WW3:

  • Identification Recognition Proficiency (IRP):

WW4

  • Key Role Survival Rate (KSR):

WW5

  • Voting Selection Score (VSS):

WW6

4. Automated Decision Trajectory Tracking

To go beyond static end-of-game metrics, DSGBench logs complete decision trajectories. At every time step, the system records:

  • Action Type (e.g., "TRAIN PROBE", "VOTE: X")
  • Context (WW7 snapshot, current goal)
  • Immediate and eventual outcomes (metric deltas, reward propagation)

This dataset enables analysis of:

  • Decision frequency by action class.
  • Context/action–outcome correlation (e.g., whether a defensive build is coupled to subsequent improvement in EPM).
  • Anomaly detection by surfacing irregular sequences, such as contradictory loops or regressive play.

This design supports diagnosis of agent failure modes and comparative benchmarking under tight experimental control (Tang et al., 8 Mar 2025).
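Two of the analyses listed above, decision frequency by action class and detection of repetitive loops, can be sketched on a logged action sequence. This is an illustrative approach rather than DSGBench's own tooling: the rule that the action class is the first word of the action type, the cycle-based loop detector, and the sample trajectory are all assumptions.

```python
from collections import Counter


def action_class(action_type):
    """Assumed convention: the class is the first word, e.g. 'VOTE: X' -> 'VOTE'."""
    return action_type.split()[0].rstrip(":")


def action_frequencies(actions):
    """Count how often each action class appears in a trajectory."""
    return Counter(action_class(a) for a in actions)


def find_repeated_cycles(actions, cycle_len=2, min_repeats=3):
    """Return start indices where a cycle_len pattern repeats back to back."""
    starts = []
    window = cycle_len * min_repeats
    for i in range(len(actions) - window + 1):
        pattern = actions[i:i + cycle_len]
        if all(
            actions[i + k * cycle_len:i + (k + 1) * cycle_len] == pattern
            for k in range(min_repeats)
        ):
            starts.append(i)
    return starts


# Made-up trajectory containing an ATTACK/RETREAT loop.
actions = [
    "TRAIN PROBE", "BUILD PYLON",
    "ATTACK", "RETREAT", "ATTACK", "RETREAT", "ATTACK", "RETREAT",
    "VOTE: X",
]
freqs = action_frequencies(actions)
loops = find_repeated_cycles(actions, cycle_len=2, min_repeats=3)
```

Here `loops` flags index 2, where the agent alternates between attacking and retreating three times in a row, the kind of contradictory loop the tracker is meant to surface.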

5. Extensibility and Recent Developments: WereWolf-Plus

WereWolf-Plus (Xia et al., 15 Jun 2025) is an extension platform rooted in DSGBench, targeting deeper evaluation of LLM-based social and strategic reasoning in the Werewolf game:

  • Expanded Role Set: In addition to Seer and Guard (Doctor), the platform now supports Witch (limited-use heal/poison), Hunter (posthumous shot), and Sheriff (elevated voting/initiative powers).
  • Flexible Model Assignment: Each role can be independently mapped to distinct LLMs (e.g., GPT-4o-mini, Deepseek-V3), enabling heterogeneity analyses across agent architectures and head-to-head matchups.
  • Retrieval-Augmented Reasoning: Integration of an "Experience Pool" employing RAG exclusively during voting stages. At each vote, current debate summaries are compared (using cosine similarity, threshold WW8) to past reward-annotated rounds, informing decision making.
  • Expanded Metrics: WereWolf-Plus introduces role-specific quantitative functions (e.g., Seer Accuracy, Witch Effectiveness, Hunter Kill Yield, Guard Protection Quality, Sheriff Influence, Werewolf Survival Strategy), as well as a composite Key Role Skill Effectiveness (KRE). All are normalized to WW9 and have closed-form expressions.

For example, Witch Effectiveness is:

SS0
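The voting-stage retrieval described in the Retrieval-Augmented Reasoning item can be illustrated with a threshold-filtered cosine-similarity lookup. The sketch below is a simplified stand-in, not the WereWolf-Plus code: embeddings are toy 2-D vectors rather than sentence embeddings, the pool layout and function names are hypothetical, and the threshold of 0.7 is a placeholder because the value used in the source is not recoverable here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_experiences(query_vec, pool, threshold, top_k=3):
    """Return up to top_k (similarity, entry) pairs at or above the threshold."""
    scored = [(cosine_similarity(query_vec, entry["embedding"]), entry) for entry in pool]
    hits = [(sim, entry) for sim, entry in scored if sim >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return hits[:top_k]


# Toy experience pool: past debate summaries with reward annotations.
pool = [
    {"summary": "Seer accused player 3; vote succeeded", "embedding": [1.0, 0.0], "reward": 1.0},
    {"summary": "Village split the vote", "embedding": [0.0, 1.0], "reward": -1.0},
    {"summary": "Late accusation, partial consensus", "embedding": [1.0, 1.0], "reward": 0.5},
]
# threshold=0.7 is a placeholder value, not the one used by WereWolf-Plus.
hits = retrieve_experiences([1.0, 0.0], pool, threshold=0.7)
retrieved_rewards = [entry["reward"] for _sim, entry in hits]
```

The retrieved summaries and their rewards would then be added to the voting prompt, so that past outcomes in similar debates can inform the current vote.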

6. Experimental Protocols and Comparative Results

  • Setup: Test configurations include 10-game series on both 12-player (no sheriff) and 8-player (with sheriff) boards, allocating roles to LLMs from a defined pool (GPT-4o-mini, Deepseek-V3, Doubao).
  • Retrieval Module: Voting-phase RAG uses the multi-qa-mpnet-base-cos-v1 embedding backbone for similarity estimation.
  • Outcomes: Deepseek-V3 attains superior IRP (SS1), KRE (SS2), and outperforms other models across nearly all dimensions; GPT-4o-mini consistently records lower scores. Use of the experience pool yields measurable gains in VSS (from SS3 to SS4 for Doubao) and enhances sheriff influence.
  • Granularity: The expanded role metrics in WereWolf-Plus reveal subtleties of skilled play that the original DSGBench two-role metrics cannot capture, underscoring the value of role-specific and aggregated reporting for future research on the social intelligence of LLM-based agents.

7. Significance and Research Implications

DSGBench unifies and formalizes cross-genre strategic game evaluation for LLM-based agents, providing a domain-general, multi-dimensional framework for analysis. Its logging and metric infrastructure facilitates both agent diagnosis and the development of hybrid architectures. WereWolf-Plus demonstrates how a domain-specific extension can sharpen this diagnostic power, enabling precise evaluation of role-bound skills and of the effect of retrieval-based reasoning augmentation.

Empirical findings suggest that RAG-based enhancements and role-heterogeneous LLM assignment expose important gradients in agent performance and cooperation. No current LLM architecture dominates all DSGBench dimensions; hybrid or modular training (e.g., division between high-level reasoning and low-level tactics) appears necessary to approach human-like proficiency in complex interactive environments (Tang et al., 8 Mar 2025, Xia et al., 15 Jun 2025).
