StarCraft II Battle Arena (SC2BA)
- SC2BA is a multi-agent algorithm-versus-algorithm benchmarking environment designed to evaluate MARL and LLM agents in dynamic, adversarial StarCraft II settings.
- It features a modular architecture with configuration, interaction, and control modules that ensure fairness, customizability, and extensive state-action representations.
- SC2BA supports dual and mixed adversarial training modes, enabling rigorous evaluation of policy co-evolution, robustness, and generalization across diverse competitive scenarios.
StarCraft II Battle Arena (SC2BA) is a rigorously designed multi-agent algorithm-versus-algorithm (AvA) benchmarking environment for StarCraft II, created to facilitate and objectively evaluate progress in multi-agent reinforcement learning (MARL) and generalist LLM agents under adversarial and complex real-time decision-making conditions. It addresses crucial limitations in existing benchmarks by enforcing strict fairness, usability, and customizability while incorporating a broad action and state space, agent-vs-agent protocols, and explicit support for dynamic strategy co-evolution and generalization assessment (Shen et al., 14 Aug 2025, Li et al., 18 Dec 2025).
1. Motivations and Benchmarking Limitations
Standard MARL environments, notably SMAC, typically pit algorithms against fixed built-in AIs, leading to evaluations that lack competitive diversity and the adaptive complexity necessary for real progress in adversarial intelligence. Likewise, prior LLM-in-the-loop StarCraft II systems suffer from constraints such as reduced action spaces, limited race and map support, or absence of agent-vs-agent gameplay. SC2BA’s creation is motivated by the need for:
- True adversarial benchmarking: Training and evaluating algorithms in environments where both sides adapt, exploit weaknesses, and demand continuous strategy revision.
- Fairness and comparability: Ensuring matched combat power, spatial symmetry, and identical partial observability for unbiased evaluation.
- Configurable, extensible scenarios: Allowing systematic study of algorithmic performance across symmetric/asymmetric team compositions, heterogeneous unit types, and task complexities.
- Robustness and policy diversity measurement: Assessing not only win rates but also adaptability, scenario dominance, and generalization to new opponents (Li et al., 18 Dec 2025).
2. SC2BA Environment and Architecture
The SC2BA environment is an open-source AvA extension to SMAC, architected around three core modules:
- Configuration Module: Parses YAML/JSON scene files to generate balanced teams, unit parameters (health, shields, attack, etc.), and assigns MARL controllers by team. A unified, text-editable map template allows rapid modification of scenarios—changing agent numbers, unit types, or initial regions with minimal friction.
- Interaction Module: Provides a standard OpenAI Gym-style API, exposing step and reset methods. Each call processes paired team actions and returns per-agent local observations, action masks, and the global state for training. The symmetric structure guarantees impartial initial conditions and observability (a minimal interaction-loop sketch appears at the end of this section).
- Bottom-Level Control Module: Wraps the native StarCraft II binary (via s2client-proto), translating discrete agent actions—movement directions, attack macros, no-op—into native game commands and handling full episode lifecycle without rendering overhead.
On top of these, the adversarial PyMARL (APyMARL) library supplies:
- A universal MARL interface abstracting environment- and agent-specific details,
- Dual-team adversarial controllers, permitting independent and simultaneous policy deployment and logging,
- Flexible experiment configuration for automated parameter sweeps and scenario diversity (Li et al., 18 Dec 2025).
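A minimal sketch of how a paired-team episode might be driven through this Gym-style interface is shown below; the class and method names, return shapes, and tuple structure are illustrative assumptions rather than the exact SC2BA/APyMARL API.

```python
# Illustrative paired-team episode loop (assumptions: the method names, return shapes,
# and (red, blue) tuple structure are not the exact SC2BA/APyMARL interface).
from typing import Callable, Sequence

def run_paired_episode(env,
                       red_policy: Callable[[Sequence, Sequence], Sequence[int]],
                       blue_policy: Callable[[Sequence, Sequence], Sequence[int]]) -> dict:
    """Run one adversarial episode in which both teams act simultaneously each step."""
    obs_red, obs_blue = env.reset()                     # symmetric initial conditions
    totals = {"red": 0.0, "blue": 0.0}
    done = False
    while not done:
        mask_red, mask_blue = env.get_avail_actions()   # per-agent action masks
        a_red = red_policy(obs_red, mask_red)           # one discrete action per agent
        a_blue = blue_policy(obs_blue, mask_blue)
        (obs_red, obs_blue), (r_red, r_blue), done, info = env.step((a_red, a_blue))
        totals["red"] += r_red
        totals["blue"] += r_blue
    return totals
```

In the dual-algorithm mode both policies are learners; in the mixed mode one side cycles through frozen opponents, as described in Section 3.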
3. Adversarial Training Modes
SC2BA enables two principal AvA MARL schemes:
- Dual-Algorithm Paired Adversary: Two learning agents (A and B) are trained in repeated head-to-head matches. Each policy receives immediate per-step rewards, with terminal outcomes penalizing losses or draws (a plausible SMAC-style reward form is sketched after this list). The paired mode induces dynamic co-evolutionary learning, requiring strategic and tactical adaptation over millions of environment steps.
- Multi-Algorithm Mixed Adversary: The red team is trained online while blue cycles through a set of fixed pre-trained policies (e.g., QMIX, VDN, QPLEX, QTRAN, COMA, IQL, FOP, DOP, and the built-in AI). Each episode, a random blue policy is selected, and the red agent must optimize against a stochastic mixture of adversaries. This incentivizes generalized, robust policy formation and efficient sampling (typically on the order of a million environment steps suffices); an illustrative episode-sampling loop is sketched below.
Both protocols support automated cross-validation, per-algorithm tournaments, and rigorous agent-versus-agent Elo tracking (Li et al., 18 Dec 2025).
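A plausible per-step reward for team $i$ in the paired mode, following the standard SMAC shaping convention (an assumption; the coefficients $\lambda$, $c_{\text{kill}}$, $R_{\text{win}}$, and $R_{\text{loss}}$ are illustrative, not SC2BA's published values), is

$$
r_t^{(i)} = \Delta H_t^{\text{enemy}} - \lambda\,\Delta H_t^{\text{ally}} + c_{\text{kill}}\, n_t^{\text{kill}} +
\begin{cases}
+R_{\text{win}} & \text{if team } i \text{ wins at step } t,\\
-R_{\text{loss}} & \text{if team } i \text{ loses or draws at step } t,\\
0 & \text{otherwise,}
\end{cases}
$$

where $\Delta H_t^{\text{enemy}}$ is damage dealt to enemy health and shields, $\Delta H_t^{\text{ally}}$ is damage received, and $n_t^{\text{kill}}$ counts enemy units destroyed at step $t$. For the mixed mode, the per-episode opponent sampling can be sketched as follows; the learner and policy interfaces are assumptions, and `run_paired_episode` refers to the sketch in Section 2.

```python
# Illustrative mixed-adversary training loop (assumption, not APyMARL's exact API):
# each episode, the blue team plays one randomly drawn frozen policy while the
# red learner is updated online against this stochastic mixture of opponents.
import random

def mixed_adversary_training(env, red_learner, frozen_blue_pool, n_episodes: int) -> None:
    for _ in range(n_episodes):
        blue_policy = random.choice(frozen_blue_pool)              # e.g., a frozen QMIX, VDN, QPLEX, ...
        run_paired_episode(env, red_learner.act, blue_policy.act)  # sketch from Section 2
        red_learner.update()                                       # only red's parameters change
```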
4. Observations, State Representations, and Action Spaces
When SC2BA is used as a substrate for LLM-based agents via SC2Arena, the state space is converted as follows:
- Structured textual observations are deterministically derived from the underlying Markovian state (feature layers, unit/resource lists, etc.) by a fixed serialization procedure.
Key abstraction techniques:
- Proximity-based unit ordering: Units are greedily ordered by a nearest-neighbor heuristic starting at the main base (TSP_greedy), encoding adjacency and spatial relations (a sketch of this ordering appears at the end of this section).
- Worker aggregation: Homogeneous worker units are grouped by type, reporting only count and average health, thereby streamlining high-frequency state elements.
- Low-level action space: Every StarCraft II action is directly exposed as a structured JSON command (build, train, attack, move, etc.), as illustrated in the sketch below.
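A hypothetical command in this format might look like the following; the field names and values are illustrative assumptions, not the exact SC2Arena schema.

```python
# Hypothetical structured action command (field names are illustrative assumptions,
# not the exact SC2Arena schema): one atomic "train" action serialized as JSON.
import json

command = {
    "action": "train",       # one of the ~300 atomic actions (build, train, attack, move, ...)
    "unit_type": "Marine",   # the unit to produce
    "building": "Barracks",  # the structure that issues the order
}
print(json.dumps(command))
```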
The resulting API exposes roughly 300 distinct, atomic StarCraft II actions, capturing the full micro- and macro-level decision structure and supporting macro/micro interplay, partial observability, and complex tech-tree reasoning (Shen et al., 14 Aug 2025).
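The proximity-based ordering mentioned above can be approximated by a greedy nearest-neighbor pass; the following is an illustrative implementation under assumed 2-D coordinates, not SC2Arena's exact TSP_greedy code.

```python
# Greedy nearest-neighbor ordering of units starting from the main base,
# approximating the TSP_greedy proximity ordering (illustrative implementation).
import math
from typing import Dict, List, Tuple

def proximity_order(base: Tuple[float, float],
                    units: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return unit tags ordered so that spatially adjacent units appear together."""
    remaining = dict(units)
    order: List[str] = []
    current = base
    while remaining:
        tag = min(remaining, key=lambda t: math.dist(current, remaining[t]))
        order.append(tag)
        current = remaining.pop(tag)
    return order
```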
5. Core Metrics and Evaluation Protocols
SC2BA and SC2Arena support a suite of explicit evaluation metrics:
| Metric | Definition / Purpose | Modality |
|---|---|---|
| Win Rate (WR) | Fraction of matches won over the evaluation episodes | MARL, LLM |
| Elo rating | Agent-vs-agent strength measurement, updated after each match | LLM |
| Scenario Dominance Count | Number of scenarios in which an algorithm performs best among the compared methods | MARL |
| Resource Utilization Ratio (RUR) | Proportion of spent to acquired resources | LLM |
| Supply Block Ratio (SBR) | Fraction of time supply capped | LLM |
| Tokens per Decision (TPD) | LLM efficiency (text-based only) | LLM |
| Valid Action Ratio (VAR) | Fraction of syntactically and contextually valid actions | LLM |
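For reference, the Elo rating is typically maintained with the standard logistic update; the specific K-factor used in SC2Arena is not stated here and is treated as a free parameter:

$$
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A \leftarrow R_A + K\,(S_A - E_A),
$$

where $S_A \in \{1, 0.5, 0\}$ records a win, draw, or loss by agent $A$ against agent $B$, and $K$ controls the update step size.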
A plausible implication is that the dominance and robustness of policies can be systematically probed across both micro-battles and long-horizon, full-length games, especially when exploiting scenario heterogeneity, asymmetric force allocation, and extensive cross-play (Shen et al., 14 Aug 2025, Li et al., 18 Dec 2025).
6. Empirical Outcomes and Algorithmic Insights
Extensive benchmarks with SC2BA, including matched dual-algorithm and mixed-opponent adversary settings, reveal several robust findings:
- Co-evolutionary volatility: Dynamic opponent strategies induce win-rate fluctuations; no method achieves persistent dominance, but value-based learners (VDN, QMIX, QPLEX) tend toward greater long-term stability, while policy-gradient methods display sharper but less sustained peaks.
- Scenario sensitivity: Heterogeneous scenarios (e.g., MMM, 3s5z, 25m) expose more pronounced differences between classes of MARL methods than symmetric ones; even a single unit disadvantage dramatically reduces performance.
- Policy diversity and generalization: AvA-trained agents, even when subsequently evaluated against static AIs, consistently outperform counterparts trained in single-sided regimes. PCA analysis of action distributions confirms that adversarial training fosters higher policy diversity, which correlates with broad robustness (a minimal sketch of such an analysis follows this list).
- LLM-based agent intelligence: In SC2Arena/StarEvolve experiments, Qwen2.5 models achieve substantial improvements in win rate and action validity, from 55% WR / 48% VAR to 71% WR / ~85% VAR after supervised fine-tuning. The verifier-driven self-correction loop drives these gains by focusing fine-tuning on high-impact, verified trajectories (Shen et al., 14 Aug 2025, Li et al., 18 Dec 2025).
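The action-distribution PCA can be reproduced along the following lines; this is a sketch assuming per-policy action-frequency vectors and a scikit-learn PCA, not the authors' exact analysis pipeline.

```python
# Sketch: embed per-policy action-frequency vectors with PCA to compare policy
# diversity (illustrative; not the authors' exact analysis pipeline).
import numpy as np
from sklearn.decomposition import PCA

def policy_embedding(action_counts: np.ndarray) -> np.ndarray:
    """action_counts: (n_policies, n_actions) matrix of action-usage counts."""
    freqs = action_counts / action_counts.sum(axis=1, keepdims=True)  # normalize to distributions
    coords = PCA(n_components=2).fit_transform(freqs)                 # 2-D embedding per policy
    return coords  # larger spread of these points indicates greater behavioral diversity
```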
7. Recommendations and Future Directions
Adopting SC2BA for advanced research entails:
- Scenario configuration: Emphasize hard and heterogeneous maps for distinguishing algorithmic capabilities; use symmetric maps only for diagnostics.
- Evaluation rigor and extensibility: Deploy all metrics, including scenario dominance and VAR, to comprehensively delineate agent competence.
- Hierarchical agent structures: Separate high-level strategic planning from low-level execution, using independent modules (as in StarEvolve's Planner-Executor-Verifier pipeline) to boost both tactical correctness and long-horizon coherence (an illustrative skeleton follows this list).
- Algorithmic robustness: Promote generalist behaviors via adversarial multi-opponent pools, curriculum self-play, and explicit handling of asymmetric forces and partial observability.
- Research extensibility: The architecture’s modularity makes it feasible to extend SC2BA to cover richer asymmetries (terrain, vision) and integrate new fairness indices, such as resource-normalized win rates.
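As an illustration of the hierarchical separation recommended above, a Planner-Executor-Verifier loop might be organized as follows; the class and method names are assumptions, not StarEvolve's actual interfaces.

```python
# Minimal Planner-Executor-Verifier skeleton (illustrative; interfaces are
# assumptions, not StarEvolve's actual implementation).
from typing import List, Protocol

class Planner(Protocol):
    def plan(self, observation: str) -> str: ...                   # high-level strategic goal

class Executor(Protocol):
    def act(self, observation: str, goal: str) -> List[dict]: ...  # low-level JSON commands

class Verifier(Protocol):
    def check(self, goal: str, actions: List[dict]) -> bool: ...   # accept or reject proposed actions

def decide(obs: str, planner: Planner, executor: Executor, verifier: Verifier,
           max_retries: int = 2) -> List[dict]:
    """Plan once, then retry execution until the verifier accepts or retries run out."""
    goal = planner.plan(obs)
    for _ in range(max_retries + 1):
        actions = executor.act(obs, goal)
        if verifier.check(goal, actions):   # only verified actions reach the game
            return actions
    return []                               # fall back to a no-op if nothing verifies
```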
This suggests that SC2BA and its associated tools (such as APyMARL, SC2Arena, and StarEvolve) mark a significant advance in the precise, adversarial benchmarking of both MARL and LLM-based agents, providing a reproducible, extensible, and highly discriminative arena for the next generation of AI systems (Shen et al., 14 Aug 2025, Li et al., 18 Dec 2025).