
PyMARL: Modular MARL Research Framework

Updated 8 March 2026
  • PyMARL is a modular framework based on PyTorch that standardizes training and evaluation protocols for cooperative MARL on the SMAC benchmark.
  • It supports centralized training with decentralized execution by implementing algorithms like IQL, VDN, QMIX, COMA, and QTRAN with rigorous benchmarking protocols.
  • Its layered architecture and YAML-configured setups facilitate scalable experiment configuration, efficient data flow, and consistent reproducibility through integrated TensorBoard logging.

PyMARL is a modular, PyTorch-based framework specifically designed for deep cooperative multi-agent reinforcement learning (MARL) research, introduced in conjunction with the StarCraft Multi-Agent Challenge (SMAC). It provides a research-grade infrastructure for algorithm implementation, training, evaluation, and rigorous benchmarking in the context of partially observable, cooperative, multi-agent tasks, with particular integration for the SMAC benchmark suite based on StarCraft II micromanagement scenarios (Samvelyan et al., 2019).

1. Architectural Organization and Data Flow

PyMARL adopts a layered, modular architecture structured around standardized interfaces for reproducibility and extensibility. Its environment layer wraps the SC2LE Raw API to present a Gym-like interface (obs_n, state, avail_actions_n, reward, done, info) via envs/sc2_wrapper.py. The episode agent-environment loop is managed by the Runner (runners/episode_runner.py), which orchestrates sampling, trajectory packaging, and data transfer to a configurable [Replay Buffer](https://www.emergentmind.com/topics/replay-buffer) (memory/replay_buffer.py), typically storing the most recent 5,000 episodes.

The core learning pipeline decouples environment interaction from policy improvement. At the end of each episode, trajectories are pushed to the buffer; after a warm-up period, batches of episodes are sampled for learning. The Learner module (learners/*.py)—specific to each algorithm—computes losses and adapts network parameters. Action selection is managed by the Controller (controllers/basic_controller.py), which encapsulates policy networks and hidden-state management for recurrent setups, applying ε-greedy during training and arg-max in evaluation. YAML/config-based specification enables reproducible experiment configuration and multi-seed execution, with training and evaluation statistics logged via TensorBoard.
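The decoupled interact/learn loop described above can be sketched as follows. This is a toy stand-in: the class names mirror the real modules (runners/episode_runner.py, memory/replay_buffer.py), but every signature and internal detail here is illustrative, not PyMARL's actual API.

```python
import random

class Runner:
    """Toy stand-in for runners/episode_runner.py: returns one episode."""
    def run(self):
        return {"length": random.randint(5, 10)}

class ReplayBuffer:
    """FIFO episode buffer in the spirit of memory/replay_buffer.py."""
    def __init__(self, capacity=5000):
        self.capacity, self.episodes = capacity, []

    def insert(self, episode):
        self.episodes.append(episode)
        if len(self.episodes) > self.capacity:
            self.episodes.pop(0)

    def sample(self, n):
        return random.sample(self.episodes, min(n, len(self.episodes)))

def train(runner, buffer, total_steps, batch_size=32, warmup=8):
    """Collect whole episodes, then sample episode batches for gradient
    updates once the buffer has passed its warm-up period."""
    steps = updates = 0
    while steps < total_steps:
        episode = runner.run()              # environment interaction
        buffer.insert(episode)              # trajectory goes to the buffer
        steps += episode["length"]
        if len(buffer.episodes) >= warmup:
            batch = buffer.sample(batch_size)
            updates += 1                    # learner.train(batch) would run here
    return steps, updates
```

Note how learning never touches the environment directly; it only ever sees sampled episode batches, which is what lets the Learner and Runner evolve independently.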

Centralized training with decentralized execution (CTDE) is a defining paradigm: during training, the learner and runner access the global state, full action-observation histories, and joint actions, enabling centralized critics or mixing networks. During execution, each agent operates solely on its private observation-action trajectory.

2. Supported Algorithms: Implementations and Losses

PyMARL delivers reference implementations of several state-of-the-art MARL algorithms, with explicit support for centralized learning and decentralized execution:

  • IQL (Independent Q-Learning): Each agent learns an independent Q-function $Q_i(\tau^i, u^i; \theta_i)$ updated via a temporal-difference loss, ignoring other agents' actions and the global state.
  • VDN (Value-Decomposition Networks): Decomposes the joint action-value $Q_{tot}$ as a sum of agent-wise Q-values. The TD target is computed analogously, with gradient updates over the sum.
  • QMIX: Introduces a monotonic mixing network $f_{mix}$ combining agent-wise Q's into $Q_{tot}$, guaranteeing $\frac{\partial Q_{tot}}{\partial Q_i} \geq 0$ for valid credit assignment. The loss compares the output to a bootstrapped TD target, with separate target networks for stability.
  • COMA (Counterfactual Multi-Agent Policy Gradients): Implements a centralized actor-critic with a counterfactual baseline for credit assignment. The advantage for agent $i$ is $A^i(s, u) = Q(s, u) - \sum_{u'^i} \pi^i(u'^i \mid \tau^i)\, Q(s, (u^{-i}, u'^i))$. Policy gradients are estimated accordingly.
  • QTRAN: Maximizes a transformed $Q_{tot}(s, u; \theta)$ subject to affine constraints relating it to the sum of individual per-agent Q's, with a composite loss incorporating TD error and constraint penalties (Samvelyan et al., 2019).
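As an illustration of the value-decomposition family, here is a minimal numpy sketch of a VDN-style TD loss. The function name and array layout are assumptions for this example, not PyMARL code; the real learners operate on batched PyTorch episode tensors.

```python
import numpy as np

def vdn_td_loss(chosen_qs, target_next_qs, rewards, dones, gamma=0.99):
    """VDN-style TD loss: Q_tot is the sum of per-agent Q-values.

    chosen_qs:       (batch, n_agents)            Q_i for the actions taken
    target_next_qs:  (batch, n_agents, n_actions) target-network Q_i at t+1
    rewards, dones:  (batch,)                     shared team reward, terminal flags
    """
    q_tot = chosen_qs.sum(axis=1)                          # sum decomposition
    next_tot = target_next_qs.max(axis=2).sum(axis=1)      # greedy per agent
    targets = rewards + gamma * (1.0 - dones) * next_tot   # bootstrapped target
    return float(np.mean((q_tot - targets) ** 2))
```

QMIX follows the same TD structure but replaces the plain sum with the monotonic mixing network described in Section 3.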

3. Network Architectures and Hyperparameter Settings

Policy and value networks use normalized concatenated feature vectors as input, comprising local spatial and unit features, last action one-hot encodings, and agent identity. Architectures across IQL/VDN/QMIX/QTRAN employ:

  • FC(128, ReLU) → GRU(64) → FC(|U|), where |U| is the cardinality of the agent’s action space.
  • COMA’s actor network parallels this form but parameterizes a softmax policy.
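The FC → GRU → FC data flow can be illustrated shape-for-shape with a torch-free numpy sketch. Weights are random and biases are omitted for brevity: this shows only the forward data flow, not PyMARL's actual PyTorch implementation (which uses nn.GRUCell).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DRQNAgent:
    """Shape-level sketch of FC(128, ReLU) -> GRU(64) -> FC(|U|)."""
    def __init__(self, obs_dim, n_actions, fc=128, hid=64, seed=0):
        rng = np.random.default_rng(seed)
        g = lambda n_in, n_out: rng.normal(0, 0.1, (n_in, n_out))
        self.Wf = g(obs_dim, fc)            # input fully connected layer
        self.Wz = g(fc + hid, hid)          # GRU update gate
        self.Wr = g(fc + hid, hid)          # GRU reset gate
        self.Wh = g(fc + hid, hid)          # GRU candidate state
        self.Wo = g(hid, n_actions)         # Q-value head, one output per action
        self.hid = hid

    def init_hidden(self):
        return np.zeros(self.hid)

    def forward(self, obs, h):
        x = np.maximum(obs @ self.Wf, 0.0)               # FC + ReLU
        z = sigmoid(np.concatenate([x, h]) @ self.Wz)    # update gate
        r = sigmoid(np.concatenate([x, h]) @ self.Wr)    # reset gate
        h_cand = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        h_new = (1.0 - z) * h + z * h_cand               # GRU step
        return h_new @ self.Wo, h_new                    # Q-values, new state
```

The recurrent hidden state is what lets each agent condition on its full observation-action history under partial observability; the Controller carries it across timesteps within an episode.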

Mixing networks for QMIX employ two hyper-networks (each FC(64, ReLU)) to produce nonnegative weights and biases, feeding a single-hidden-layer mixer (32 ELU units). Critics (COMA/QTRAN) use FC(128) → FC(128) → FC(|U|^n); all layers use ReLU unless specified otherwise.
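A minimal numpy sketch of the QMIX mixer follows, assuming an absolute-value transform enforces the nonnegative hypernetwork outputs (as in the original QMIX paper). Dimensions follow the text above; all weights are random and purely illustrative.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

class QMixer:
    """Sketch of the QMIX mixer: hypernetworks map the global state to the
    mixer's weights; abs() keeps them nonnegative, which is what enforces
    dQ_tot/dQ_i >= 0 (monotonic credit assignment)."""
    def __init__(self, n_agents, state_dim, embed=32, hyper=64, seed=0):
        rng = np.random.default_rng(seed)
        g = lambda n_in, n_out: rng.normal(0, 0.1, (n_in, n_out))
        self.W1a, self.W1b = g(state_dim, hyper), g(hyper, n_agents * embed)
        self.W2a, self.W2b = g(state_dim, hyper), g(hyper, embed)
        self.B1 = g(state_dim, embed)       # state-dependent bias, layer 1
        self.B2a, self.B2b = g(state_dim, embed), g(embed, 1)
        self.n_agents, self.embed = n_agents, embed

    def forward(self, agent_qs, state):
        relu = lambda x: np.maximum(x, 0.0)
        w1 = np.abs(relu(state @ self.W1a) @ self.W1b)   # nonnegative weights
        w1 = w1.reshape(self.n_agents, self.embed)
        w2 = np.abs(relu(state @ self.W2a) @ self.W2b)
        hidden = elu(agent_qs @ w1 + state @ self.B1)    # 32 ELU units
        bias = relu(state @ self.B2a) @ self.B2b         # scalar output bias
        return float(hidden @ w2 + bias)
```

Because the weights applied to the agent Q-values are nonnegative and ELU is monotone increasing, raising any single agent's Q-value can never decrease the mixed $Q_{tot}$.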

Default hyperparameters are: RMSProp optimizer with learning rate $5 \times 10^{-4}$, $\alpha = 0.99$, and $\epsilon = 10^{-5}$; discount $\gamma = 0.99$; replay buffer of 5,000 episodes; batch size 32; exploration $\epsilon$ annealed from 1.0 to 0.05 over 50,000 environment steps; and target network updates every 200 training steps. Entire episodes are unrolled for a single gradient update per training iteration.
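The exploration schedule above corresponds to a simple linear anneal, sketched here as a hypothetical helper (not PyMARL's actual scheduler class):

```python
def epsilon(t, start=1.0, finish=0.05, anneal_steps=50_000):
    """Linear epsilon-greedy schedule matching the defaults above:
    anneal from `start` to `finish` over `anneal_steps` environment steps,
    then hold at `finish` for the remainder of training."""
    frac = min(t / anneal_steps, 1.0)
    return start + frac * (finish - start)
```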

4. Software Engineering and Experimentation Workflow

The codebase is organized for modular extensibility and clear research prototyping:

  • Entry points (scripts/): train.py and test.py parse command-line or YAML configs, configure agents/maps/hyperparameters, and launch training or evaluation.
  • src/envs: SC2 environment wrappers
  • src/controllers: Policy and baseline controllers (random, basic)
  • src/runners: Episode and parallel runners
  • src/memory: Replay buffer
  • src/learners: Algorithm-specific learners
  • src/models: Agent encoders and mixing networks (DRQN, mixing/hyper nets)
  • src/utils: Logging, deterministic seeding, evaluation utilities

A typical experiment initialization relies on reproducible seed setting, environment setup, replay buffer instantiation, controller/learner construction, and an episode-centric training loop with periodic evaluation and checkpointing.
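The seed-setting step might look like the following minimal sketch; a real PyMARL setup would additionally seed PyTorch (e.g. torch.manual_seed), omitted here to keep the example torch-free.

```python
import random

import numpy as np

def set_seed(seed):
    """Seed the stdlib and numpy RNGs so repeated runs with the same seed
    produce identical sampling and initialization on the Python side."""
    random.seed(seed)
    np.random.seed(seed)
```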

Launching experiments is achieved via command-line interface, e.g.:

python scripts/train.py --config=qmix --env-config=sc2 --map=3s5z --seed=42 --buffer-size=5000 ...

Experiment configuration is fully specified by YAML files (e.g. configs/qmix.yaml), controlling the algorithm, environment, and all hyperparameters (Samvelyan et al., 2019).
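A hypothetical excerpt of such a file, restating the defaults from Section 3; the key names here are chosen for illustration and may differ from the repository's actual keys:

```yaml
# Illustrative excerpt in the style of configs/qmix.yaml (key names assumed).
name: qmix
env: sc2
env_args:
  map_name: 3s5z
buffer_size: 5000            # episodes retained in the replay buffer
batch_size: 32               # episodes per gradient update
lr: 0.0005                   # RMSProp learning rate
optim_alpha: 0.99            # RMSProp alpha
optim_eps: 0.00001           # RMSProp epsilon
gamma: 0.99
epsilon_start: 1.0
epsilon_finish: 0.05
epsilon_anneal_time: 50000   # environment steps
target_update_interval: 200  # training steps between target syncs
test_interval: 10000         # greedy evaluation cadence
test_nepisode: 32            # episodes per evaluation
```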

5. Best Practices and Benchmarking Protocols

PyMARL is structured to ensure rigorous, reproducible benchmarking in SMAC and compatible multi-agent environments:

  • Maintain fixed environment parameters (map files, AI difficulty, reward shaping) across runs.
  • Use shaped rewards (damage+kill+win bonus) for comparability.
  • Run periodic greedy evaluation every 10,000 environment steps on at least 32 episodes.
  • Report median learning curves over ≥5 seeds with 25–75% percentiles, and visualize via TensorBoard.
  • Log wall-clock training time, GPU specifications, and approximate runtime per map (Samvelyan et al., 2019).
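The median-and-percentile aggregation in the protocol above can be computed with a small numpy helper (not part of PyMARL itself, which delegates plotting to TensorBoard):

```python
import numpy as np

def summarize_runs(curves):
    """Aggregate per-seed evaluation curves (shape: n_seeds x n_evals) into
    the median learning curve plus the 25-75% percentile band."""
    curves = np.asarray(curves, dtype=float)
    return (np.median(curves, axis=0),
            np.percentile(curves, 25, axis=0),
            np.percentile(curves, 75, axis=0))
```

Reporting the median with an interquartile band, rather than mean and standard deviation, reduces the influence of outlier seeds, which is why the SMAC protocol recommends it.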

6. Ecosystem and Derivative Frameworks

PyMARL forms the foundation of an evolving ecosystem of MARL research infrastructure:

  • APyMARL (Adversarial PyMARL): Extends PyMARL and SMAC by introducing algorithm-vs-algorithm evaluation in the SC2 Battle Arena (SC2BA), supporting symmetric/asymmetric adversarial training, modular support for various value-decomposition and policy-gradient algorithms, and advanced logging, extensibility, and fairness protocols (Li et al., 18 Dec 2025).
  • PyMARLzoo+: Builds on (E)PyMARL; offers integration with a broader set of fully cooperative benchmarks (PettingZoo, Overcooked, PressurePlate, etc.) and implements modules for pre-trained image encoding and intrinsic/exploration-driven learning. It enables unified training and evaluation across diverse domains and reports wall-clock times for scalable benchmarks (Papadopoulos et al., 7 Feb 2025).

These derivatives preserve the modular training pipeline and configuration philosophy of PyMARL, while augmenting environment compatibility and algorithmic breadth.

7. Scientific Impact and Practical Significance

PyMARL set a reproducible gold standard for empirically benchmarking deep cooperative MARL algorithms. It enabled standardized comparison and progress assessment on challenging partially observable tasks, catalyzing the development and critical evaluation of key algorithms (IQL, VDN, QMIX, COMA, QTRAN). PyMARL’s emphasis on CTDE, batch-episode dataflow, reproducible configuration, and best-practice evaluation protocols has influenced subsequent benchmark frameworks and continues to underpin experimental baselines for multi-agent deep RL (Samvelyan et al., 2019, Li et al., 18 Dec 2025, Papadopoulos et al., 7 Feb 2025).
