Adversarial PyMARL Library
- APyMARL is a modular framework for benchmarking and advancing deep multi-agent reinforcement learning in adversarial scenarios using StarCraft II.
- It features a three-layer architecture including environment wrappers, adversarial training controllers, and logging modules to ensure fairness and reproducibility.
- The library supports dual and mixed adversarial modes, integrates various MARL algorithms, and uses unified YAML/JSON configuration for rapid experimentation.
The Adversarial PyMARL (APyMARL) library is a modular research framework dedicated to benchmarking and advancing deep multi-agent reinforcement learning (MARL) in adversarial settings. Developed in conjunction with the StarCraft II battle arena (SC2BA) environment, APyMARL enables algorithm-vs-algorithm evaluation, targeting scenarios that move beyond fixed built-in AI opponents by supporting both dual and mixed adversarial paradigms. APyMARL offers standardized interfaces for scenario definition, training, and evaluation of classic and novel MARL algorithms, along with explicit support for fairness, reproducibility, and extensibility (Li et al., 18 Dec 2025).
1. System Structure and Workflow
APyMARL is architected around three principal layers: environment wrappers, adversarial training controllers, and configuration/IO/logging. The workflow is as follows. The Configurator parses a unified YAML or JSON configuration to instantiate the SC2BAEnv environment and the relevant Trainer, which may be DualTeamTrainer (pairwise live algorithm competition) or MixedTeamTrainer (rotation among a pool of pre-trained opponents). Each training loop involves resetting the environment, collecting observations, interfacing with two policies, stepping with both teams' actions, storing transitions, updating policies, and invoking the DataCollector for benchmarking outputs.
| Layer | Main Components | Functionality |
|---|---|---|
| Configurator | ConfigParser | Experiment parsing, scenario instantiation |
| Environment Wrappers | SC2BAEnv, standardization wrappers | Observation, action handling; built on PySC2 |
| Training Controllers | DualTeamTrainer, MixedTeamTrainer | Orchestrate adversarial self-play/self-testing |
| Logging & Checkpointing | TensorBoard, CSV, model snapshot utility | Metric/baseline logging, reproducibility |
Key interactions follow this structure: the Configurator provisions the environment and trainer, the trainer runs the adversarial loops, and the DataCollector and Evaluator standardize benchmarking and analysis.
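The adversarial loop run by the trainers can be pictured with a minimal sketch. The `reset()`/`step()` and BaseAlgo method names below follow the interfaces documented in Section 2; the return structure of `step()`, the reward bookkeeping, and the `collector.log` call are illustrative assumptions rather than the library's exact signatures.

```python
# Minimal sketch of one dual-adversary episode; return shapes and the
# DataCollector call are assumptions, method names follow Section 2.
def run_dual_episode(env, red_algo, blue_algo, collector=None):
    obs_red, obs_blue = env.reset()                      # two-sided reset
    done, red_return = False, 0.0
    while not done:
        actions_red = red_algo.select_actions(obs_red)
        actions_blue = blue_algo.select_actions(obs_blue)
        (next_red, next_blue), (r_red, r_blue), done, info = env.step(
            actions_red, actions_blue
        )
        red_algo.store_transition(obs_red, actions_red, r_red, next_red, done)
        blue_algo.store_transition(obs_blue, actions_blue, r_blue, next_blue, done)
        obs_red, obs_blue = next_red, next_blue
        red_return += r_red                              # assumed scalar team reward
    red_algo.train()                                     # update both live policies
    blue_algo.train()
    if collector is not None:
        collector.log(info)                              # hand metrics to DataCollector
    return red_return
```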
2. Core Modules and API Interfaces
APyMARL's extensible Python package is organized into the following modules:
- `env.SC2BAEnv`: Provides a StarCraft II multi-agent simulation with custom maps, a two-sided API (`reset()`, `step(actions_red, actions_blue)`), reward shaping, partial/full observability, and seeded stochasticity. Properties expose `Discrete(6)` action spaces and matching observation spaces for both red and blue teams.
- `trainer.DualTeamTrainer` / `MixedTeamTrainer`: Encapsulate agent policy orchestration. DualTeamTrainer trains two live algorithms head-to-head, while MixedTeamTrainer samples adversary models across episodes for robust mixed-behavior testing.
- `algorithms`: Implements QMIX, VDN, QPLEX, QTRAN, and IQL (value-based), plus COMA, FOP, and DOP (policy-based). All derive from `BaseAlgo`, which standardizes the `select_actions`, `store_transition`, `train`, `save`, and `load` interfaces.
- `utils`: `ConfigParser` validates YAML/JSON scenario definitions; `DataCollector` tracks and logs all episodic metrics; `Evaluator` facilitates batch evaluation and reporting.
A minimal instantiation uses ConfigParser for scenario definition, constructs the environment and algorithms, and passes these to a trainer. The training API exposes granular access to episodic training, evaluation, and model persistence.
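Read as an abstract contract, the `BaseAlgo` interface above might be sketched as follows; only the five method names come from the documented interface, while the signatures and docstrings are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class BaseAlgo(ABC):
    """Illustrative sketch of the BaseAlgo contract; signatures are assumed."""

    @abstractmethod
    def select_actions(self, observations):
        """Map per-agent observations to one discrete action per agent."""

    @abstractmethod
    def store_transition(self, obs, actions, reward, next_obs, done):
        """Append a single environment transition to the replay/episode buffer."""

    @abstractmethod
    def train(self):
        """Run one update over stored transitions and return loss metrics."""

    @abstractmethod
    def save(self, path):
        """Serialize policy/critic parameters to disk."""

    @abstractmethod
    def load(self, path):
        """Restore parameters from a checkpoint written by save()."""
```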
3. Scenario Specification and Adversarial Modes
All APyMARL experiments are specified through a single YAML or JSON file. Two primary adversarial paradigms are supported:
- Dual-Algorithm Paired Adversary: Both red and blue teams are governed by independent, live-learning algorithms, facilitating pure algorithm-vs-algorithm research. Example YAML:
```yaml
env:
  map_name: "3m"
  adversary_mode: "dual"
  max_steps: 200
agents:
  red:
    algorithm: "QMIX"
    hyperparams:
      lr: 0.0005
      gamma: 0.99
  blue:
    algorithm: "COMA"
    hyperparams:
      lr: 0.0007
      gamma: 0.99
training:
  total_steps: 10000000
  eval_interval: 50000
  seed: 42
```
- Multi-Algorithm Mixed Adversary: The red team learns against a randomized or strategically rotated pool of pre-trained opponents. A typical configuration specifies the pool as a list of checkpoint paths and a mixing strategy (e.g., uniform random selection). Example YAML:
```yaml
env:
  map_name: "MMM"
  adversary_mode: "mixed"
  opponent_pool:
    - "models/qmix_3m.pt"
    - "models/vdn_3m.pt"
    - "models/coma_3m.pt"
  mixing_strategy: "uniform"
  max_steps: 200
agents:
  red:
    algorithm: "DOP"
    hyperparams:
      lr: 0.0003
      gamma: 0.99
training:
  total_steps: 2000000
  eval_interval: 20000
  seed: 123
```
This schema enables reproducible, custom, and extensible adversarial benchmarking.
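The only piece of the mixed-adversary loop not spelled out by the schema is how the `mixing_strategy` is realized; a minimal reading of `uniform` selection over the checkpoint pool is sketched below (class and method names are illustrative, not the library's API).

```python
import random

class UniformOpponentPool:
    """Illustrative sketch: rotate uniformly over pre-trained opponent policies."""

    def __init__(self, checkpoint_paths, build_opponent):
        # build_opponent is an assumed factory that loads a frozen policy
        # from a checkpoint path (e.g., via BaseAlgo.load)
        self.opponents = [build_opponent(path) for path in checkpoint_paths]

    def sample(self):
        # "uniform" mixing: pick one pre-trained opponent per episode
        return random.choice(self.opponents)
```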
4. Supported Algorithms and Loss Formulations
APyMARL incorporates canonical mixing and policy-gradient-based MARL algorithms, directly exposing loss formulations:
- QMIX-style critic loss:

  $$\mathcal{L}_{\text{QMIX}}(\theta) = \mathbb{E}_{(\boldsymbol{\tau}, \mathbf{u}, r, \boldsymbol{\tau}') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{\mathbf{u}'} Q_{tot}(\boldsymbol{\tau}', \mathbf{u}'; \theta^{-}) - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}; \theta)\big)^{2}\Big]$$
- COMA-style actor loss (counterfactual advantage):

  $$A^{a}(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^{a}} \pi^{a}(u'^{a} \mid \tau^{a})\, Q\big(s, (\mathbf{u}^{-a}, u'^{a})\big), \qquad \mathcal{L}_{\text{COMA}} = -\mathbb{E}_{\pi}\Big[\sum_{a} \log \pi^{a}(u^{a} \mid \tau^{a})\, A^{a}(s, \mathbf{u})\Big]$$
- Fairness penalty (optional regularizer for per-agent reward equality), e.g. a squared deviation of each agent's reward from the team mean:

  $$\mathcal{L}_{\text{fair}} = \frac{1}{N}\sum_{i=1}^{N}\big(r_{i} - \bar{r}\big)^{2}, \qquad \bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_{i}$$

  It is included in the actor or critic losses via a weighting hyperparameter (denoted here as $\lambda_{\text{fair}}$).
- DOP loss (tree backup with $\lambda$-return):

  $$\mathcal{L}_{\text{DOP}}(\theta) = \mathbb{E}_{\mathcal{D}}\Big[\big(y^{\lambda} - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}; \theta)\big)^{2}\Big]$$

  where $y^{\lambda}$ is the TD($\lambda$) return.
This formulation catalog enables direct method comparison under genuinely adversarial testbeds.
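For a concrete feel of the first formulation, the QMIX-style TD loss can be computed along the following lines in PyTorch; tensor shapes and the surrounding training code are assumptions for the sketch, not the library's internals.

```python
import torch
import torch.nn.functional as F

def qmix_td_loss(q_tot, q_tot_target_max, rewards, dones, gamma=0.99):
    """Squared TD error for the mixed joint action-value (QMIX-style loss above).

    q_tot:            Q_tot(tau, u) for the taken joint actions,     shape [batch]
    q_tot_target_max: max_u' Q_tot(tau', u') from the target mixer,  shape [batch]
    rewards, dones:   per-transition rewards and terminal flags,     shape [batch]
    """
    targets = rewards + gamma * (1.0 - dones) * q_tot_target_max
    return F.mse_loss(q_tot, targets.detach())

# Shapes-only example with dummy tensors:
loss = qmix_td_loss(
    q_tot=torch.randn(32),
    q_tot_target_max=torch.randn(32),
    rewards=torch.randn(32),
    dones=torch.zeros(32),
)
```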
5. Installation, Environment, and Example Usage
APyMARL requires a Linux environment with StarCraft II v4.6.2.6923, Python ≥3.7, PyTorch ≥1.9.0, and dependencies including PySC2 and Blizzard s2client-proto. Setup proceeds as:
```bash
git clone https://github.com/dooliu/SC2BA.git
cd SC2BA
pip install -r requirements.txt
python setup.py install
```
A minimal training workflow (e.g., dual adversary QMIX vs COMA) involves composing config files, instantiating environment and algorithms, and invoking the trainer:
```python
from apymarl.config import ConfigParser
from apymarl.env import SC2BAEnv
from apymarl.trainer import DualTeamTrainer
from apymarl.algos.qmix import QMIX
from apymarl.algos.coma import COMA

config = ConfigParser("configs/dual_qmix_coma_3m.yaml").to_dict()

env = SC2BAEnv(
    map_name=config['env']['map_name'],
    adversary_mode=config['env']['adversary_mode'],
    max_steps=config['env']['max_steps']
)

red_algo = QMIX(n_agents=3, obs_dim=37, act_dim=6, **config['agents']['red']['hyperparams'])
blue_algo = COMA(n_agents=3, obs_dim=37, act_dim=6, **config['agents']['blue']['hyperparams'])

trainer = DualTeamTrainer(env, red_algo, blue_algo, config['training'])
trainer.train(num_steps=config['training']['total_steps'])
trainer.save_models("checkpoints/qmix_red.pt", "checkpoints/coma_blue.pt")
```
TensorBoard, CSV, and model snapshot logging are available throughout.
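Checkpoints written by `save_models` can later be restored through the `load` interface from Section 2, e.g. for standalone evaluation; the constructor arguments below simply repeat the training example and are otherwise assumptions.

```python
# Rebuild the red-team policy and restore its trained parameters
eval_algo = QMIX(n_agents=3, obs_dim=37, act_dim=6, lr=0.0005, gamma=0.99)
eval_algo.load("checkpoints/qmix_red.pt")
```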
6. Benchmarking Results and Computational Considerations
Comprehensive baselines are provided for eight SMAC scenarios in both adversary modes, with the following summary for dual adversary (symmetric maps):
| Algorithm | Avg. Win Rate | #Scenarios Won | Convergence Steps (Millions) |
|---|---|---|---|
| DOP | 0.65 | 4 / 7 | 8.0 |
| QMIX | 0.63 | 2 / 7 | 9.0 |
| QPLEX | 0.60 | 0 / 7 | 10.0 |
| VDN | 0.58 | 0 / 7 | 9.5 |
| FOP | 0.57 | 1 / 7 | 9.2 |
| QTRAN | 0.52 | 0 / 7 | 10.0 |
| COMA | 0.50 | 0 / 7 | 10.0 |
| IQL | 0.45 | 0 / 7 | 10.0 |
In mixed adversary mode (ten maps), DOP again leads with an average win rate of ~0.62. Empirical convergence time is 30–40 minutes per million steps on an NVIDIA Tesla V100, dependent on map complexity (3m vs 25m) (Li et al., 18 Dec 2025).
7. Customization, Extension, and Best Practices
- Scenario Customization: Rapid prototyping via edits to `map_name`, `env.max_steps`, and unified map layouts in `sc2ba/maps/`.
- Hyperparameter Tuning: Recommended exploration ranges are learning rates in [1e-4, 1e-3], entropy coefficients in [0, 0.01], and discount factors in [0.95, 0.99].
- Fairness / Regularization: Toggle `reward_shaping: true` in configs, or experiment with direct regularization.
- Novel Maps / Asymmetric Layouts: Integrate via `SC2BAEnv._map_registry` and adhere to symmetric spawn rules for statistical fairness.
- Algorithm Extension: Inherit from `BaseAlgo` to implement new algorithms, update the API registration, and expose them via the YAML interface (see the sketch after this list).
- Future Directions: Dynamic mixed-adversary settings, with both teams evolving, are suggested as an open extension via registering custom trainers.
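To make the extension point concrete, a new algorithm would subclass `BaseAlgo` and fill in the five standardized methods; the placeholder below (a uniform random policy) is a hypothetical illustration, and the import path is an assumption rather than the library's actual module layout.

```python
import random
import torch

from apymarl.algos.base import BaseAlgo  # assumed import path


class RandomPolicy(BaseAlgo):
    """Hypothetical minimal extension: a uniform random policy satisfying BaseAlgo."""

    def __init__(self, n_agents, obs_dim, act_dim, **hyperparams):
        self.n_agents, self.act_dim = n_agents, act_dim
        self.buffer = []

    def select_actions(self, observations):
        # One discrete action per agent, sampled uniformly
        return [random.randrange(self.act_dim) for _ in range(self.n_agents)]

    def store_transition(self, obs, actions, reward, next_obs, done):
        self.buffer.append((obs, actions, reward, next_obs, done))

    def train(self):
        # Nothing to learn; clear the buffer and report a dummy loss
        self.buffer.clear()
        return {"loss": 0.0}

    def save(self, path):
        torch.save({"n_agents": self.n_agents, "act_dim": self.act_dim}, path)

    def load(self, path):
        state = torch.load(path)
        self.n_agents, self.act_dim = state["n_agents"], state["act_dim"]
```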
The modular, standardized, and open architecture situates APyMARL as a research-oriented and extensible testbed for adversarial multi-agent learning, underpinned by explicit reproducibility and benchmarking design (Li et al., 18 Dec 2025).