DIAMBRA Arena: Advanced RL Research Platform
- DIAMBRA Arena is an open-source reinforcement learning platform providing episodic, configurable environments built on the OpenAI Gym API for both single and multi-agent scenarios.
- The platform supports advanced research methodologies including self-play, human-agent interaction, and imitation learning, enabling rapid integration with common RL frameworks.
- Benchmark results demonstrate robust performance on commodity hardware, with agents exhibiting emergent human-like strategies across various retro fighting game environments.
DIAMBRA Arena is an open-source software platform for reinforcement learning (RL) research, providing a suite of high-quality, episodic environments with extensive support for single and multi-agent scenarios, human-agent interaction, imitation learning, and advanced RL workflows. It exposes these environments via a Python API that fully complies with the OpenAI Gym programming model, enabling rapid integration with common RL frameworks and facilitating research across a wide spectrum of contemporary RL problems (Palmas, 2022).
1. Platform Design, Goals, and Architecture
DIAMBRA Arena is designed to address limitations in existing RL benchmarks, which quickly become less informative once their challenges are solved. The platform prioritizes ongoing research needs such as competitive/cooperative multi-agent play, multi-modal observation spaces, human-in-the-loop paradigms, and transfer across games and difficulty levels. Its architecture emphasizes:
- Modest hardware requirements: Usable on a single GPU or even on CPU-only systems.
- Compliance with OpenAI Gym: The API supports the standard Gym methods (make, reset, step, render, close).
- Highly configurable environments: Observation space includes raw pixels and a set of redundant “fair” RAM state features; discrete action spaces support fine-grained control and attack-button combinations.
- First-class two-player support: All environments expose single and two-player modes, natively enabling agent-vs-agent, agent-vs-human, and self-play training out of the box.
Integration with OpenAI Gym is direct; “hello world” RL code runs unchanged except for instantiating environments via diambra.arena.make().
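As a rough illustration of that claim, the sketch below runs the classic Gym interaction loop against a stand-in environment. StubFightEnv, run_episode, and every parameter shown are hypothetical names invented for this example so that it runs without the platform installed; with DIAMBRA Arena available, the environment would instead come from diambra.arena.make(), whose arguments are not covered here.

```python
import random


class StubFightEnv:
    """Hypothetical stand-in exposing the classic Gym methods (not DIAMBRA code)."""

    def __init__(self, n_actions=12, max_steps=50, seed=0):
        self.n_actions = n_actions
        self.max_steps = max_steps
        self._rng = random.Random(seed)
        self._t = 0

    def reset(self):
        self._t = 0
        return {"frame": [[0]], "ram": {"stage": 1}}

    def step(self, action):
        assert 0 <= action < self.n_actions
        self._t += 1
        obs = {"frame": [[self._t]], "ram": {"stage": 1}}
        reward = self._rng.uniform(-1.0, 1.0)  # placeholder reward signal
        done = self._t >= self.max_steps
        return obs, reward, done, {}

    def close(self):
        pass


def run_episode(env, policy):
    """Run one episode with the classic Gym loop and return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total += reward
    return total


env = StubFightEnv()
rng = random.Random(1)
# A random policy: ignore the observation and pick any discrete action.
episode_return = run_episode(env, lambda obs: rng.randrange(env.n_actions))
env.close()
```

Because run_episode only relies on reset and step, the same function should work with any environment that follows the classic Gym four-tuple convention.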
2. Environments, Observations, and Actions
The first release consists of multiple 2D retro fighting games (e.g., Dead Or Alive ++, Street Fighter III, Tekken Tag Tournament), each supporting multiple characters, outfits, and difficulty settings. The environment features are as follows:
| Property | Details / Options | Notes |
|---|---|---|
| Pixel Input Shape | Up to 480×512×3, configurable to 128×128 | RGB or grayscale |
| RAM State Vector | Health bars, stage index, last actions, etc. | Redundant with pixels; can be dropped in “hardcore” mode |
| Action Spaces | Discrete, Discrete+Combos, MultiDiscrete | Each available for single/multi-agent modes |
| Modes | 1P (single), 2P (dual/competitive) | Episodic games, zero-sum competition |
Observation spaces default to a gym.spaces.Dict containing both "frame" and "ram" entries. The action space may be a simple discrete set, a union of discrete combos, or split via a MultiDiscrete structure to separate moves and attacks. In 2P mode, the action space is a Gym Dict keyed by "P1" and "P2".
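The structures described above can be pictured with plain Python containers. This is only a sketch: the RAM key names, the array size, and the move/attack counts are assumptions chosen for illustration, and is_valid_two_player_action is a hypothetical helper rather than part of the platform.

```python
# Hypothetical sizes for illustration; the real values depend on the game.
N_MOVES = 9     # assumed: no-move plus eight directions
N_ATTACKS = 4   # assumed: no-attack plus three buttons

# Dict observation with the two entries named in the text.
observation = {
    "frame": [[0] * 4 for _ in range(4)],  # tiny stand-in for a pixel array
    "ram": {"own_health": 1.0, "opp_health": 1.0, "stage": 1},  # assumed keys
}

# 2P action: one MultiDiscrete-style (move, attack) pair per player slot.
two_player_action = {"P1": (3, 1), "P2": (0, 2)}


def is_valid_two_player_action(action, nvec=(N_MOVES, N_ATTACKS)):
    """Return True if `action` has exactly P1/P2 keys with in-range components."""
    if not isinstance(action, dict) or set(action) != {"P1", "P2"}:
        return False
    for components in action.values():
        if len(components) != len(nvec):
            return False
        if any(not 0 <= c < n for c, n in zip(components, nvec)):
            return False
    return True
```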
Episodes are defined in two distinct ways:
- 1P mode: Runs from game start to completion of all stages or exhaustion of continues.
- 2P mode: Terminates after a single fight (one stage), which is decided by the required number of round victories.
3. Operational Modes and Research Workflows
DIAMBRA Arena explicitly targets advanced RL research settings:
- Single-agent RL: Standard episodic RL where the agent maximizes cumulative reward via exploration and exploitation.
- Two-agent competitive RL: Simultaneous training or evaluation of separate policies for each player slot, leveraging zero-sum dynamics.
- Self-play: Enabled by instantiating the same agent in both slots; supports curricula where learning progresses against past agent versions.
- Human-agent interaction: Humans can play as either P1 or P2 via gamepad at runtime, and human-in-the-loop wrappers allow interventions for feedback, control, and reward shaping.
- Imitation learning: Trajectory recording tools allow collection of human demonstrations (observation/action/reward sequences stored on disk as NPZ files); these are replayable in an RL-compatible manner via the ImitationLearning wrapper.
This design enables integrated workflows for transfer learning (policy transfer across difficulty, characters, or even separate fighting games) and generalization studies.
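To make the human-in-the-loop idea concrete, here is a minimal sketch of a manual-override wrapper. HumanOverride and EchoEnv are hypothetical classes written for this example and do not reproduce the platform's own wrappers; the human input is simulated by an iterator instead of a gamepad.

```python
class HumanOverride:
    """Hypothetical wrapper: a human action, when present, replaces the agent's."""

    def __init__(self, env, get_human_action):
        self.env = env
        self.get_human_action = get_human_action  # returns an action or None
        self.interventions = 0

    def reset(self):
        return self.env.reset()

    def step(self, agent_action):
        human_action = self.get_human_action()
        overridden = human_action is not None
        action = human_action if overridden else agent_action
        if overridden:
            self.interventions += 1
        obs, reward, done, info = self.env.step(action)
        # Expose what was actually executed so a learner can account for it.
        info = dict(info, executed_action=action, human_override=overridden)
        return obs, reward, done, info


class EchoEnv:
    """Hypothetical 5-step environment whose reward equals the executed action."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t >= 5, {}


# Simulated human input: intervene on the second and fifth steps only.
human_inputs = iter([None, 9, None, None, 9])
wrapped = HumanOverride(EchoEnv(), lambda: next(human_inputs))
wrapped.reset()
executed, done = [], False
while not done:
    _, _, done, info = wrapped.step(1)  # the agent always proposes action 1
    executed.append(info["executed_action"])
```

Recording the override flag in info is one possible design; it lets downstream code separate agent-chosen transitions from human-chosen ones.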
4. Mathematical Framework and RL Algorithms
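The reward at time $t$ for each environment is formulated based on changes in health bars for each agent and opponent. The generic time-step reward is

$$R_t = \sum_{i=1}^{N_c} \left[ \left( \bar{H}_i^{t^-} - \bar{H}_i^{t} \right) - \left( \hat{H}_i^{t^-} - \hat{H}_i^{t} \right) \right]$$

where $\bar{H}$ represents the opponent’s health, $\hat{H}$ denotes the agent’s health, $t^-$ and $t$ mark the instants before and after the environment step, and $N_c$ is the number of characters per player. This structure is designed to reflect score differentials, driving agents toward strategies maximizing their own survival while minimizing their opponent’s.
Episode reward bounds are given by:

$$\min \sum_t R_t = -N_c \left[ (N_s - 1)(N_r - 1) + N_r \right] \Delta H, \qquad \max \sum_t R_t = N_c \, N_s \, N_r \, \Delta H$$

with $\Delta H = H_{\max} - H_{\min}$, $N_s$ as number of stages, and $N_r$ as rounds-to-win.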
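The health-based reward described in this section (the opponent's health loss minus the agent's health loss, summed over each player's characters) can be sketched in a few lines. The function name, the list-based interface, and the 0 to 100 health range are assumptions made for illustration only.

```python
def step_reward(own_before, own_after, opp_before, opp_after):
    """Health-delta reward: opponent's loss minus agent's loss, summed over characters."""
    lengths = {len(own_before), len(own_after), len(opp_before), len(opp_after)}
    if len(lengths) != 1:
        raise ValueError("all health lists must have one entry per character")
    opp_loss = sum(before - after for before, after in zip(opp_before, opp_after))
    own_loss = sum(before - after for before, after in zip(own_before, own_after))
    return opp_loss - own_loss


# Single character per player, hypothetical health range 0..100.
trade = step_reward([100], [95], [100], [80])    # agent deals 20, takes 5
perfect = step_reward([100], [100], [100], [0])  # flawless knockout in one step

# Two characters per player (tag-team style): deal 30, take 10.
tag = step_reward([100, 100], [100, 90], [100, 100], [70, 100])
```

In this toy range, a flawless knockout yields the full health span as reward, which is why a perfect run accumulates one health span per character for every round won.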
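Empirical studies employ Proximal Policy Optimization (PPO) as the baseline RL algorithm, defined by the clipped surrogate objective

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio, $\epsilon$ is the clip parameter, and $\hat{A}_t$ is the advantage estimate.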
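For readers who prefer code to notation, the standard PPO clipped surrogate can be evaluated on a small batch as follows. This is a plain-Python sketch of the textbook objective, not the training code used in the reported experiments; the function name and the log-probability interface are assumptions.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps):
    """Mean clipped surrogate over a batch (to be maximized)."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum keeps the pessimistic estimate of the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Two samples: ratio 1.5 with positive advantage, ratio 0.5 with negative advantage.
objective = ppo_clip_objective(
    new_logp=[math.log(1.5), math.log(0.5)],
    old_logp=[0.0, 0.0],
    advantages=[1.0, -1.0],
    clip_eps=0.15,
)
```

With a clip value of 0.15, the first sample contributes 1.15 (clipped from 1.5) and the second contributes -0.85 (the more pessimistic of -0.5 and -0.85), so the batch mean is about 0.15.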
5. Implementation Details, Network Architectures, and Empirical Results
A concrete experimental setup for Dead Or Alive ++ includes:
- Input: grayscale frames, 4-frame stacking, action stacking (last 12 actions)
- Actions: 12 discrete available actions, no combos, random starting side
- Reward normalization and observation scaling applied; no reward clipping, no-op resets, or action sticking
- Policy/value architecture:
- PPO hyperparameters: 16 parallel environments, 128 steps per update, batch size 256, 4 epochs, an annealed learning rate, and a clip parameter annealed from 0.15 to 0.025
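The listed PPO settings imply some simple arithmetic, worked through below. Two assumptions are made here and are not stated in the source: that "batch size 256" refers to the minibatch size used within each epoch, and that the clip parameter is annealed linearly with training progress.

```python
N_ENVS = 16        # parallel environments
N_STEPS = 128      # steps collected per environment per update
BATCH_SIZE = 256   # assumed to be the minibatch size
N_EPOCHS = 4       # passes over each collected rollout
CLIP_START, CLIP_END = 0.15, 0.025

samples_per_update = N_ENVS * N_STEPS                         # 2048 transitions
minibatches_per_epoch = samples_per_update // BATCH_SIZE      # 8 minibatches
gradient_steps_per_update = minibatches_per_epoch * N_EPOCHS  # 32 gradient steps


def linear_schedule(start, end, progress):
    """Interpolate from `start` to `end` as `progress` goes from 0 to 1 (assumed linear)."""
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress


clip_at_start = linear_schedule(CLIP_START, CLIP_END, 0.0)
clip_at_half = linear_schedule(CLIP_START, CLIP_END, 0.5)
clip_at_end = linear_schedule(CLIP_START, CLIP_END, 1.0)
```

Under these assumptions each update consumes 2048 transitions and performs 32 gradient steps, while the clip range tightens from 0.15 to 0.025 over the run.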
Performance benchmarks on commodity hardware report training throughput in env-steps/day on systems ranging from an Intel i5 + GTX 1050 to a Ryzen 9 + GTX 1080 Ti. Over the course of training, average episode reward rises above the random-agent baseline, with PPO agents showing emergent human-like play patterns such as delayed attacks and timed counters. Results are robust across Street Fighter III and Tekken Tag Tournament, which supports the generality of the environments (Palmas, 2022).
6. Advanced Features and Research Facilitation
DIAMBRA Arena’s built-in wrappers and multi-agent abstractions make it possible to:
- Construct self-play curricula where agents continually adapt against their own historical policy snapshots
- Support human-in-the-loop learning: real-time human interventions (manual override, evaluative feedback) via integrated wrappers
- Enable imitation learning through expert demonstration collection and replay
- Conduct transfer and generalization studies by altering characters, outfits, difficulty, or game title
- Prototype cooperative multi-agent settings, with competitive fights as the current default and cooperative scenarios anticipated in future releases
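One common way to build the self-play curriculum mentioned in the list above is to keep a bounded pool of past policy snapshots and sample opponents from it. The sketch below assumes policies can be represented as deep-copyable parameter objects; SnapshotPool and its methods are hypothetical names, and the actual fight and policy update are left as a comment.

```python
import copy
import random


class SnapshotPool:
    """Hypothetical bounded pool of frozen policy snapshots for self-play."""

    def __init__(self, max_size=5, seed=0):
        self.max_size = max_size
        self._snapshots = []
        self._rng = random.Random(seed)

    def add(self, policy_params):
        # Deep-copy so later training updates do not alter the stored snapshot.
        self._snapshots.append(copy.deepcopy(policy_params))
        if len(self._snapshots) > self.max_size:
            self._snapshots.pop(0)  # discard the oldest snapshot

    def sample_opponent(self):
        if not self._snapshots:
            raise RuntimeError("snapshot pool is empty")
        return self._rng.choice(self._snapshots)

    def __len__(self):
        return len(self._snapshots)


pool = SnapshotPool(max_size=3)
params = {"version": 0}  # stand-in for real network weights
for iteration in range(5):
    pool.add(params)
    opponent = pool.sample_opponent()
    # ... here the learner (P1) would fight `opponent` (P2) and be updated ...
    params["version"] += 1
```

Sampling uniformly from recent snapshots is only one option; weighting toward stronger or more recent opponents is an equally plausible design.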
Code usage patterns are identical to Gym, including native support for trajectory replay.
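The idea of replaying recorded demonstrations through a Gym-style interface can be sketched as follows. This is not the platform's replay API: TrajectoryReplay and record_trajectory are hypothetical, and JSON is swapped in for the NPZ format so the example needs only the standard library.

```python
import json
import os
import tempfile


def record_trajectory(path, steps):
    """Write a list of {"obs", "action", "reward"} records to disk."""
    with open(path, "w") as f:
        json.dump(steps, f)


class TrajectoryReplay:
    """Hypothetical Gym-style replayer: step() ignores its argument and returns recorded data."""

    def __init__(self, path):
        with open(path) as f:
            self._steps = json.load(f)
        self._i = 0

    def reset(self):
        self._i = 0
        return self._steps[0]["obs"]

    def step(self, _action=None):
        record = self._steps[self._i]
        self._i += 1
        done = self._i >= len(self._steps)
        # No terminal observation is stored, so the last one is repeated at the end.
        next_obs = record["obs"] if done else self._steps[self._i]["obs"]
        return next_obs, record["reward"], done, {"recorded_action": record["action"]}


demo = [
    {"obs": [0, 0], "action": 3, "reward": 0.0},
    {"obs": [0, 1], "action": 7, "reward": 15.0},
    {"obs": [1, 1], "action": 0, "reward": -5.0},
]
path = os.path.join(tempfile.mkdtemp(), "demo_trajectory.json")
record_trajectory(path, demo)

replay = TrajectoryReplay(path)
obs = replay.reset()
actions, total, done = [], 0.0, False
while not done:
    obs, reward, done, info = replay.step()
    actions.append(info["recorded_action"])
    total += reward
```

Returning the demonstrated action through info keeps the loop identical to an ordinary Gym rollout, so an imitation learner can consume observation and action pairs without special-casing the data source.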
This tight integration with common RL APIs significantly lowers the startup cost for researchers entering advanced RL and imitation learning domains.
7. Significance and Impact within RL Research
DIAMBRA Arena’s introduction marks a shift in benchmark philosophy for RL, prioritizing extensibility, human-agent collaboration, and challenging, previously unsolved tasks. By fully integrating with OpenAI Gym, supporting multi-agent and human-level research, and enabling rapid prototyping of custom RL workflows, the software accelerates research into self-play, human-in-the-loop learning, imitation, generalization, and beyond. Its results to date—emergent human-like play across multiple retro fighting titles—demonstrate its value for both evaluating algorithms and conducting novel RL research (Palmas, 2022).