
Adversarial PyMARL Library

Updated 25 December 2025
  • APyMARL is a modular framework for benchmarking and advancing deep multi-agent reinforcement learning in adversarial scenarios using StarCraft II.
  • It features a three-layer architecture including environment wrappers, adversarial training controllers, and logging modules to ensure fairness and reproducibility.
  • The library supports dual and mixed adversarial modes, integrates various MARL algorithms, and uses unified YAML/JSON configuration for rapid experimentation.

The Adversarial PyMARL (APyMARL) library is a modular research framework dedicated to benchmarking and advancing deep multi-agent reinforcement learning (MARL) in adversarial settings. Developed in conjunction with the StarCraft II battle arena (SC2BA) environment, APyMARL enables algorithm-vs-algorithm evaluation, targeting scenarios that move beyond fixed built-in AI opponents by supporting both dual and mixed adversarial paradigms. APyMARL offers standardized interfaces for scenario definition, training, and evaluation of classic and novel MARL algorithms, along with explicit support for fairness, reproducibility, and extensibility (Li et al., 18 Dec 2025).

1. System Structure and Workflow

APyMARL is architected around three principal layers: environment wrappers, adversarial training controllers, and configuration/IO/logging. The workflow is as follows. The Configurator parses a unified YAML or JSON configuration to instantiate the SC2BAEnv environment and the relevant Trainer, which may be DualTeamTrainer (pairwise live algorithm competition) or MixedTeamTrainer (rotation among a pool of pre-trained opponents). Each training loop involves resetting the environment, collecting observations, interfacing with two policies, stepping with both teams' actions, storing transitions, updating policies, and invoking the DataCollector for benchmarking outputs.

| Layer | Main Components | Functionality |
| --- | --- | --- |
| Configurator | ConfigParser | Experiment parsing, scenario instantiation |
| Environment Wrappers | SC2BAEnv, standardization wrappers | Observation and action handling; built on PySC2 |
| Training Controllers | DualTeamTrainer, MixedTeamTrainer | Orchestrate adversarial self-play/self-testing |
| Logging & Checkpointing | TensorBoard, CSV, model snapshot utility | Metric/baseline logging, reproducibility |

The key interactions follow the structure above: the Configurator provisions the environment and trainer, which run the adversarial loops, while the DataCollector and Evaluator standardize benchmarking and analysis.
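
The loop orchestrated by the trainers can be pictured with the simplified sketch below. It is illustrative only: the select_actions/store_transition/train calls mirror the BaseAlgo interface described in Section 2, while the reset()/step() return structure and the collector.log_episode call are assumptions, not the library's exact API.

def run_dual_training(env, red_algo, blue_algo, collector, total_steps):
    # Illustrative dual-adversary loop; the actual DualTeamTrainer internals may differ.
    steps = 0
    while steps < total_steps:
        obs_red, obs_blue = env.reset()          # two-sided observations (assumed structure)
        done = False
        while not done and steps < total_steps:
            # Each live policy selects actions for its own team.
            actions_red = red_algo.select_actions(obs_red)
            actions_blue = blue_algo.select_actions(obs_blue)
            # Step the environment with both teams' joint actions.
            (obs_red_next, obs_blue_next), (r_red, r_blue), done, info = env.step(
                actions_red, actions_blue)
            # Store transitions separately so each algorithm trains on its own view.
            red_algo.store_transition(obs_red, actions_red, r_red, obs_red_next, done)
            blue_algo.store_transition(obs_blue, actions_blue, r_blue, obs_blue_next, done)
            obs_red, obs_blue = obs_red_next, obs_blue_next
            steps += 1
        # Update both policies and log episodic metrics for benchmarking.
        red_algo.train()
        blue_algo.train()
        collector.log_episode(info)   # hypothetical DataCollector hook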

2. Core Modules and API Interfaces

APyMARL's extensible Python package is organized into four core modules:

  • env.SC2BAEnv: Provides a StarCraft II multi-agent simulation with custom maps, two-sided API (reset(), step(actions_red, actions_blue)), reward shaping, partial/full observability, and seeded stochasticity. Properties expose Discrete(6) action spaces and matching observation spaces for both red and blue teams.
  • trainer.DualTeamTrainer / MixedTeamTrainer: Encapsulate agent policy orchestration. DualTeamTrainer trains two live algorithms head-to-head, while MixedTeamTrainer samples adversary models across episodes for robust mixed-behavior testing.
  • algorithms: Implements QMIX, VDN, QPLEX, QTRAN (value-based), COMA, IQL, FOP, DOP (policy-based). All derive from BaseAlgo, which standardizes select_actions, store_transition, train, save, and load interfaces.
  • utils: ConfigParser validates YAML/JSON scenario definitions; DataCollector tracks and logs all episodic metrics; Evaluator facilitates batch evaluation and reporting.

A minimal instantiation involves using ConfigParser for scenario definition, constructing the environment and algorithms, and passing these to the trainers. The training API exposes granular access to episodic training, evaluation, and model persistence.
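
The BaseAlgo contract can be sketched as an abstract base class. The five method names come from the package description above; the argument lists and docstrings are illustrative assumptions, not the library's exact signatures.

from abc import ABC, abstractmethod

class BaseAlgo(ABC):
    """Illustrative sketch of the shared algorithm interface (signatures assumed)."""

    @abstractmethod
    def select_actions(self, observations):
        """Return one discrete action per agent for the team's current observations."""

    @abstractmethod
    def store_transition(self, obs, actions, reward, next_obs, done):
        """Buffer a transition for later training updates."""

    @abstractmethod
    def train(self):
        """Run gradient updates from buffered transitions."""

    @abstractmethod
    def save(self, path):
        """Serialize network weights, e.g. for checkpoints or opponent pools."""

    @abstractmethod
    def load(self, path):
        """Restore network weights from a saved checkpoint."""

Any new algorithm that implements these five methods can be passed to either trainer.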

3. Scenario Specification and Adversarial Modes

All APyMARL experiments are specified through a single YAML or JSON configuration file. Two primary adversarial paradigms are supported:

  • Dual-Algorithm Paired Adversary: Both red and blue teams are governed by independent, live-learning algorithms, facilitating pure algorithm-vs-algorithm research. Example YAML:

env:
  map_name: "3m"
  adversary_mode: "dual"
  max_steps: 200
agents:
  red:
    algorithm: "QMIX"
    hyperparams:
      lr: 0.0005
      gamma: 0.99
  blue:
    algorithm: "COMA"
    hyperparams:
      lr: 0.0007
      gamma: 0.99
training:
  total_steps: 10000000
  eval_interval: 50000
  seed: 42

  • Multi-Algorithm Mixed Adversary: The red team learns against a randomized or strategically rotated pool of pre-trained opponents. A typical configuration specifies the pool as a path list and a mixing strategy (e.g., uniform random selection); a minimal sampling sketch follows the example configuration below.

env:
  map_name: "MMM"
  adversary_mode: "mixed"
  opponent_pool:
    - "models/qmix_3m.pt"
    - "models/vdn_3m.pt"
    - "models/coma_3m.pt"
  mixing_strategy: "uniform"
  max_steps: 200
agents:
  red:
    algorithm: "DOP"
    hyperparams:
      lr: 0.0003
      gamma: 0.99
training:
  total_steps: 2000000
  eval_interval: 20000
  seed: 123

This schema enables reproducible, custom, and extensible adversarial benchmarking.
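
Under the uniform mixing strategy, the mixed mode amounts to drawing one frozen opponent checkpoint from the pool at the start of each episode. The helper below is a minimal illustration of that idea, not MixedTeamTrainer's actual code; the function name and error handling are assumptions.

import random

def sample_opponent(opponent_pool, mixing_strategy="uniform", rng=random):
    # Pick one pre-trained opponent checkpoint for the next episode.
    if mixing_strategy == "uniform":
        return rng.choice(opponent_pool)  # e.g. "models/vdn_3m.pt"
    raise ValueError(f"Unknown mixing strategy: {mixing_strategy}")

A strategically rotated pool could replace the uniform draw with, for example, prioritized selection of the opponents that currently defeat the learner most often.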

4. Supported Algorithms and Loss Formulations

APyMARL incorporates canonical mixing and policy-gradient-based MARL algorithms, directly exposing loss formulations:

  • QMIX-style critic loss:

L_{\rm critic} = \mathbb{E}_{(s,\mathbf{u},r,s')}\Big[\big(r + \gamma \max_{\mathbf{u}'} Q_{\rm tot}(s', \mathbf{u}') - Q_{\rm tot}(s, \mathbf{u})\big)^2\Big]

  • COMA-style actor loss (counterfactual advantage):

L_{\rm actor}^{i} = -\mathbb{E}_{\pi^i}\Big[A^{i}(o^i, a^i)\,\log \pi^i(a^i \mid o^i)\Big]

A^{i}(o, a) = Q_{\rm tot}\big(s, (a^i, \mathbf{a}^{-i})\big) - \sum_{a'^i} \pi^i(a'^i \mid o^i)\, Q_{\rm tot}\big(s, (a'^i, \mathbf{a}^{-i})\big)

  • Fairness penalty (optional regularizer for per-agent reward equality):

R_{\rm fairness} = \frac{1}{N}\sum_{i=1}^{N}\big(r_i - \bar{r}\big)^2

This penalty is added to the actor or critic loss, weighted by a hyperparameter λ.

  • DOP loss (tree backup with λ-return):

L = \mathbb{E}\Big[\big(y^{(\lambda)} - Q(o, a)\big)^2\Big]

where y^{(λ)} denotes the TD(λ) return.

This formulation catalog enables direct method comparison on genuinely adversarial testbeds; a minimal PyTorch sketch of the critic loss with the optional fairness term follows.
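
As a concrete illustration, the QMIX-style critic loss with the optional fairness regularizer can be written in a few lines of PyTorch. This is a sketch under assumed tensor shapes, not APyMARL's implementation; the mixing network that produces Q_tot is taken as given.

import torch
import torch.nn.functional as F

def critic_loss_with_fairness(q_tot, q_tot_target_max, rewards, gamma,
                              per_agent_rewards=None, lam_fairness=0.0):
    # q_tot:             Q_tot(s, u) for the taken joint action, shape [batch]
    # q_tot_target_max:  max_{u'} Q_tot(s', u') from the target network, shape [batch]
    # rewards:           team reward r, shape [batch]
    # per_agent_rewards: optional per-agent rewards, shape [batch, n_agents]
    td_target = rewards + gamma * q_tot_target_max
    loss = F.mse_loss(q_tot, td_target.detach())
    if per_agent_rewards is not None and lam_fairness > 0:
        # Mean squared deviation of per-agent rewards from their mean, matching R_fairness above.
        mean_r = per_agent_rewards.mean(dim=1, keepdim=True)
        loss = loss + lam_fairness * ((per_agent_rewards - mean_r) ** 2).mean()
    return loss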

5. Installation, Environment, and Example Usage

APyMARL requires a Linux environment with StarCraft II v4.6.2.6923, Python ≥3.7, PyTorch ≥1.9.0, and dependencies including PySC2 and Blizzard s2client-proto. Setup proceeds as:

git clone https://github.com/dooliu/SC2BA.git
cd SC2BA
pip install -r requirements.txt
python setup.py install

A minimal training workflow (e.g., dual adversary QMIX vs COMA) involves composing config files, instantiating environment and algorithms, and invoking the trainer:

from apymarl.config import ConfigParser
from apymarl.env import SC2BAEnv
from apymarl.trainer import DualTeamTrainer
from apymarl.algos.qmix import QMIX
from apymarl.algos.coma import COMA

# Parse the unified experiment configuration.
config = ConfigParser("configs/dual_qmix_coma_3m.yaml").to_dict()

# Instantiate the two-sided StarCraft II battle-arena environment.
env = SC2BAEnv(
    map_name=config['env']['map_name'],
    adversary_mode=config['env']['adversary_mode'],
    max_steps=config['env']['max_steps']
)

# One live-learning algorithm per team (3m map: 3 agents, 37-dim observations, 6 discrete actions).
red_algo = QMIX(n_agents=3, obs_dim=37, act_dim=6, **config['agents']['red']['hyperparams'])
blue_algo = COMA(n_agents=3, obs_dim=37, act_dim=6, **config['agents']['blue']['hyperparams'])

# Run head-to-head adversarial training and checkpoint both policies.
trainer = DualTeamTrainer(env, red_algo, blue_algo, config['training'])
trainer.train(num_steps=config['training']['total_steps'])
trainer.save_models("checkpoints/qmix_red.pt", "checkpoints/coma_blue.pt")

TensorBoard, CSV, and model snapshot logging are available throughout.
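
Because episodic metrics are also written to CSV, results can be inspected offline with standard tooling. The path and column name below are hypothetical and depend on the experiment configuration.

import pandas as pd

# Hypothetical log location and column; adjust to the CSV actually produced by the run.
log = pd.read_csv("results/dual_qmix_coma_3m/metrics.csv")
smoothed = log["red_win_rate"].rolling(window=20).mean()
print("final smoothed red-team win rate:", smoothed.iloc[-1])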

6. Benchmarking Results and Computational Considerations

Comprehensive baselines are provided for eight SMAC scenarios in both adversary modes, with the following summary for dual adversary (symmetric maps):

| Algorithm | Avg. Win Rate | Scenarios Won | Convergence Steps (millions) |
| --- | --- | --- | --- |
| DOP | 0.65 | 4 / 7 | 8.0 |
| QMIX | 0.63 | 2 / 7 | 9.0 |
| QPLEX | 0.60 | 0 / 7 | 10.0 |
| VDN | 0.58 | 0 / 7 | 9.5 |
| FOP | 0.57 | 1 / 7 | 9.2 |
| QTRAN | 0.52 | 0 / 7 | 10.0 |
| COMA | 0.50 | 0 / 7 | 10.0 |
| IQL | 0.45 | 0 / 7 | 10.0 |

In mixed adversary mode (ten maps), DOP again leads with an average win rate of approximately 0.62. Training takes 30–40 minutes of wall-clock time per million environment steps on an NVIDIA Tesla V100, depending on map complexity (3m vs. 25m), so a 10-million-step dual-adversary run requires roughly 5–7 hours (Li et al., 18 Dec 2025).

7. Customization, Extension, and Best Practices

  • Scenario Customization: Rapid prototyping via edits to map_name, env.max_steps, and unified map layouts in sc2ba/maps/.
  • Hyperparameter Tuning: Recommended exploration ranges are learning rate in [1e-4, 1e-3], entropy coefficient in [0, 0.01], and discount factor in [0.95, 0.99]; a sweep sketch follows this list.
  • Fairness / Regularization: Toggle reward_shaping: true in configs, or experiment with direct regularization via the λ_fairness weight.
  • Novel Maps / Asymmetric Layouts: Integrate via SC2BAEnv._map_registry and adhere to symmetric spawn rules for statistical fairness.
  • Algorithm Extension: Inherit and implement new algorithms via BaseAlgo, update API registration, and expose via YAML interface.
  • Future Directions: Dynamic mixed adversary settings, with both teams evolving, are suggested as an open extension by registering custom trainers.
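
The recommended tuning ranges translate naturally into a small configuration sweep. The sketch below writes one derived YAML file per setting; the entropy_coef key and the sweep file layout are illustrative assumptions rather than APyMARL conventions.

import copy
import itertools
import os
import yaml  # PyYAML

# Grid over the ranges recommended above (values chosen for illustration).
learning_rates = [1e-4, 3e-4, 1e-3]
entropy_coefs = [0.0, 0.005, 0.01]
gammas = [0.95, 0.99]

with open("configs/dual_qmix_coma_3m.yaml") as f:
    base = yaml.safe_load(f)

os.makedirs("configs/sweep", exist_ok=True)
for i, (lr, ent, gamma) in enumerate(itertools.product(learning_rates, entropy_coefs, gammas)):
    cfg = copy.deepcopy(base)
    cfg["agents"]["red"]["hyperparams"].update({"lr": lr, "entropy_coef": ent, "gamma": gamma})
    with open(f"configs/sweep/red_{i:02d}.yaml", "w") as out:
        yaml.safe_dump(cfg, out)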

The modular, standardized, and open architecture situates APyMARL as a research-oriented and extensible testbed for adversarial multi-agent learning, underpinned by explicit reproducibility and benchmarking design (Li et al., 18 Dec 2025).
