DIAMBRA Arena: Advanced RL Research Platform

Updated 13 April 2026

DIAMBRA Arena is an open-source reinforcement learning platform providing episodic, configurable environments built on the OpenAI Gym API for both single and multi-agent scenarios.
The platform supports advanced research methodologies including self-play, human-agent interaction, and imitation learning, enabling rapid integration with common RL frameworks.
Benchmark results demonstrate robust performance on commodity hardware, with agents exhibiting emergent human-like strategies across various retro fighting game environments.

DIAMBRA Arena is an open-source software platform for reinforcement learning (RL) research, providing a suite of high-quality, episodic environments with extensive support for single and multi-agent scenarios, human-agent interaction, imitation learning, and advanced RL workflows. It exposes these environments via a Python API that fully complies with the OpenAI Gym programming model, enabling rapid integration with common RL frameworks and facilitating research across a wide spectrum of contemporary RL problems (Palmas, 2022).

1. Platform Design, Goals, and Architecture

DIAMBRA Arena is designed to address limitations in existing RL benchmarks, which quickly become less informative once their challenges are solved. The platform prioritizes ongoing research needs such as competitive/cooperative multi-agent play, multi-modal observation spaces, human-in-the-loop paradigms, and transfer across games and difficulty levels. Its architecture emphasizes:

Modest hardware requirements: Usable on single GPU or even CPU-only systems.
Compliance with OpenAI Gym: The API supports the standard Gym methods (make, reset, step, render, close).
Highly configurable environments: Observation space includes raw pixels and a set of redundant “fair” RAM state features; discrete action spaces support fine-grained control and attack-button combinations.
First-class two-player support: All environments expose single and two-player modes, natively enabling agent-vs-agent, agent-vs-human, and self-play training out of the box.

Integration with OpenAI Gym is direct; “hello world” RL code runs unchanged except for instantiating environments via diambra.arena.make().
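The Gym-style interaction loop referred to above can be sketched as follows. To keep the example runnable without an emulator, it uses `StubFightEnv`, a hypothetical stand-in that is not part of DIAMBRA Arena; its frame shape, RAM vector size, and random rewards are placeholders. With the platform installed, the environment object would instead come from `diambra.arena.make()`. The sketch assumes the classic Gym convention in which `step` returns a 4-tuple.

```python
import random

import numpy as np


class StubFightEnv:
    """Hypothetical stand-in exposing the Gym methods (reset, step, render, close).

    NOT part of DIAMBRA Arena: it only lets the loop below run without an emulator.
    """

    def __init__(self, n_actions=12, max_steps=50, seed=0):
        self.n_actions = n_actions
        self.max_steps = max_steps
        self._rng = random.Random(seed)
        self._t = 0

    def _obs(self):
        # Dict observation with "frame" (pixels) and "ram" (state features).
        return {
            "frame": np.zeros((128, 128, 1), dtype=np.uint8),
            "ram": np.zeros(4, dtype=np.float32),
        }

    def reset(self):
        self._t = 0
        return self._obs()

    def step(self, action):
        if not 0 <= action < self.n_actions:
            raise ValueError("action out of range")
        self._t += 1
        reward = self._rng.uniform(-1.0, 1.0)  # placeholder reward
        done = self._t >= self.max_steps
        return self._obs(), reward, done, {}

    def render(self):
        pass  # nothing to draw in the stub

    def close(self):
        pass


def run_episode(env, policy):
    """Run one episode with the standard Gym loop and return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total += reward
    env.close()
    return total


env = StubFightEnv()
episode_return = run_episode(env, policy=lambda obs: random.randrange(env.n_actions))
```

Because `run_episode` only relies on `reset`, `step`, and `close`, the same function should work with any environment that follows the Gym programming model.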

2. Environments, Observations, and Actions

The first release consists of multiple 2D retro fighting games (e.g., Dead Or Alive ++, Street Fighter III, Tekken Tag Tournament), each supporting multiple characters, outfits, and difficulty settings. The environment features are as follows:

Property	Details / Options	Notes
Pixel Input Shape	Up to 480×512×3, configurable to 128×128	RGB or grayscale
RAM State Vector	Health bars, stage index, last actions, etc.	Redundant with pixels (optional for “hardcore”)
Action Spaces	Discrete, Discrete+Combos, MultiDiscrete	Each available for single/multi-agent modes
Modes	1P (single), 2P (dual/competitive)	Episodic games, zero-sum competition

Observation spaces default to a gym.spaces.Dict containing both "frame" and "ram" entries. The action space may be a simple discrete set, a union of discrete combos, or split via a MultiDiscrete structure to separate moves and attacks. In 2P mode, the action space is a Gym Dict keyed by "P1" and "P2".

Episodes are defined in two distinct ways:

1P mode: Runs from game start to completion of all stages or exhaustion of continues.
2P mode: Terminates after a single fight (one stage). In the notation used below, an episode spans $N_s$ stages, each requiring $N_r$ round victories, so a 2P episode corresponds to $N_s = 1$.

3. Operational Modes and Research Workflows

DIAMBRA Arena explicitly targets advanced RL research settings:

Single-agent RL: Standard episodic RL where the agent maximizes cumulative reward via exploration and exploitation.
Two-agent competitive RL: Simultaneous training or evaluation of separate policies for each player slot, leveraging zero-sum dynamics.
Self-play: Enabled by instantiating the same agent in both slots; supports curricula where learning progresses against past agent versions.
Human-agent interaction: Humans can play as either P1 or P2 via gamepad at runtime, and human-in-the-loop wrappers allow interventions for feedback, control, and reward shaping.
Imitation Learning: Trajectory recording tools allow collection of human demonstrations (observation/action/reward sequences in disk NPZ files); these are replayable in an RL-compatible manner via the ImitationLearning wrapper.

This design enables integrated workflows for transfer learning (policy transfer across difficulty, characters, or even separate fighting games) and generalization studies.
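A self-play curriculum against past agent versions is often organized around a pool of policy snapshots. The following is a minimal sketch of that idea, not DIAMBRA Arena code: `SnapshotPool`, its parameters, and the 50/50 sampling rule are all hypothetical choices made for illustration.

```python
import copy
import random


class SnapshotPool:
    """Hypothetical bounded pool of past policy snapshots for self-play."""

    def __init__(self, max_size=10, seed=0):
        self.max_size = max_size
        self.snapshots = []
        self._rng = random.Random(seed)

    def add(self, policy_params):
        # Store a frozen copy so later training does not mutate the snapshot.
        self.snapshots.append(copy.deepcopy(policy_params))
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)  # drop the oldest snapshot

    def sample_opponent(self, latest_prob=0.5):
        """Return the latest snapshot with probability latest_prob, else a uniform pick."""
        if not self.snapshots:
            raise ValueError("snapshot pool is empty")
        if self._rng.random() < latest_prob:
            return self.snapshots[-1]
        return self._rng.choice(self.snapshots)


pool = SnapshotPool(max_size=3)
for version in range(5):
    pool.add({"version": version})  # stand-in for real policy weights
opponent = pool.sample_opponent()
```

In a training loop, the learning agent would occupy one player slot while the sampled snapshot drives the other; mixing the latest snapshot with older ones is one common way to reduce overfitting to a single opponent.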

4. Mathematical Framework and RL Algorithms

The reward at time $t$ for each environment is formulated based on changes in health bars for each agent and opponent. The generic time-step reward is

$R_t = \sum_{i=1}^{N_c} \left[\, (\bar{H}_i^{\,t^-} - \bar{H}_i^{\,t}) - (\hat{H}_i^{\,t^-} - \hat{H}_i^{\,t}) \right],$

where $\bar{H}_i$ represents the opponent’s health, $\hat{H}_i$ denotes the agent’s health, and $N_c$ is the number of characters per player. This structure is designed to reflect score differentials, driving agents toward strategies maximizing their own survival while minimizing their opponent’s.
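As a worked illustration of the per-step reward, the sketch below evaluates the sum directly from health values before and after a step. The function name and the numeric health values are hypothetical; only the formula comes from the text.

```python
def step_reward(opp_prev, opp_now, own_prev, own_now):
    """Per-step reward: opponent health lost minus own health lost, summed over characters."""
    if not (len(opp_prev) == len(opp_now) == len(own_prev) == len(own_now)):
        raise ValueError("all health sequences must have one entry per character")
    return sum(
        (o_prev - o_now) - (a_prev - a_now)
        for o_prev, o_now, a_prev, a_now in zip(opp_prev, opp_now, own_prev, own_now)
    )


# Single character per player: the agent deals 20 damage and takes 5.
single = step_reward([100], [80], [100], [95])

# Two characters per player: 10 damage dealt and 10 damage taken cancel out.
tag_team = step_reward([100, 100], [90, 100], [100, 50], [100, 40])
```

Here `single` evaluates to 15 and `tag_team` to 0, matching the reading of the reward as damage dealt minus damage received.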

Episode reward bounds are given by:

$\min\sum_{t=0}^{T_s}R_t = -N_c[(N_s-1)(N_r-1)+N_r]\Delta H,\quad \max\sum_{t=0}^{T_s}R_t = N_c N_s N_r \Delta H,$

with $\Delta H = H_{max} - H_{min}$, $N_s$ as the number of stages, and $N_r$ as the number of rounds to win.
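The bounds can be checked numerically with a small helper. This is a sketch of the two formulas only; the function name and the example settings (8 stages, 2 rounds to win, health range 0 to 100) are hypothetical and not taken from any specific game.

```python
def episode_reward_bounds(n_c, n_s, n_r, h_max, h_min=0.0):
    """Return (min, max) cumulative episode reward from the bounds above."""
    delta_h = h_max - h_min
    lower = -n_c * ((n_s - 1) * (n_r - 1) + n_r) * delta_h
    upper = n_c * n_s * n_r * delta_h
    return lower, upper


# Hypothetical 1P setting: 1 character, 8 stages, 2 rounds to win, health in [0, 100].
bounds_1p = episode_reward_bounds(n_c=1, n_s=8, n_r=2, h_max=100)

# Hypothetical single-stage setting (n_s = 1): the bounds become symmetric.
bounds_single_stage = episode_reward_bounds(n_c=1, n_s=1, n_r=2, h_max=100)
```

With these inputs, `bounds_1p` is `(-900.0, 1600.0)` and `bounds_single_stage` is `(-200.0, 200.0)`. The asymmetry in the multi-stage case reflects that a losing agent stops after its final defeat, while a winning agent plays every round of every stage.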

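Empirical studies employ Proximal Policy Optimization (PPO) as the baseline RL algorithm, defined by the clipped surrogate objective

$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\epsilon$ is the clip parameter, and $\hat{A}_t$ is the advantage estimate.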
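For reference, the standard PPO clipped surrogate takes the minimum of an unclipped and a clipped policy-ratio term, each weighted by the advantage estimate. The NumPy sketch below evaluates that objective on a batch; the function name and the sample inputs are hypothetical, and the default clip value of 0.15 is simply the starting value quoted in the experimental setup.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.15):
    """Mean clipped surrogate objective (to be maximized) over a batch."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # The element-wise minimum makes the objective a pessimistic bound.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Ratio of 2 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = ppo_clip_objective([np.log(2.0)], [0.0], [1.0])

# Ratio of 2 with a negative advantage: the penalty is not clipped.
uncapped = ppo_clip_objective([np.log(2.0)], [0.0], [-1.0])
```

The two cases show the asymmetry of the objective: improvements beyond the clip range earn no extra credit, while moves that make a bad action more likely are penalized in full.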

5. Implementation Details, Network Architectures, and Empirical Results

A concrete experimental setup for Dead Or Alive ++ includes:

Input: grayscale frames, 4-frame stacking, action stacking (last 12 actions)
Actions: 12 discrete available actions, no combos, random starting side
Reward normalization and observation scaling applied, no reward clipping, no-op resets, or action sticking
Policy/value architecture:
- Frame encoder: Conv(8×8,32) → Conv(4×4,64) → Conv(3×3,64) → FC(256) with ReLU
- RAM encoder: FC(64) → FC(64) (tanh)
- Latent merge: 320 units, dual heads (policy: FC(12) + softmax, value: FC(1))
PPO hyperparameters: 16 parallel environments, 128 steps per update, batch size 256, 4 epochs, an annealed learning rate, and a clip parameter annealed from 0.15 to 0.025
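The annealing of the clip parameter from 0.15 to 0.025 can be sketched as a schedule over training progress. The text does not state the shape of the schedule, so the linear form used here is an assumption, and `linear_anneal` is a hypothetical helper name.

```python
def linear_anneal(start, end, progress):
    """Linearly interpolate from start to end as progress goes from 0 to 1.

    The linear shape is an assumption; the source only gives the endpoints.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * progress


clip_at_start = linear_anneal(0.15, 0.025, 0.0)
clip_halfway = linear_anneal(0.15, 0.025, 0.5)
clip_at_end = linear_anneal(0.15, 0.025, 1.0)
```

Here `progress` would typically be the fraction of total training steps completed, so the policy update becomes more conservative as training proceeds.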

Performance benchmarks on commodity hardware report training throughput in env-steps/day on machines ranging from an Intel i5 + GTX 1050 to a Ryzen 9 + GTX 1080 Ti. Over the course of training, average episode rewards climb from the random-policy baseline, with PPO agents showing emergent human-like play patterns such as delayed attacks and timed counters. Similar results are reported for Street Fighter III and Tekken Tag Tournament, supporting the generality of the environments (Palmas, 2022).

6. Advanced Features and Research Facilitation

DIAMBRA Arena’s built-in wrappers and multi-agent abstractions make it possible to:

Construct self-play curricula where agents continually adapt against their own historical policy snapshots
Support human-in-the-loop learning: real-time human interventions (manual override, evaluative feedback) via integrated wrappers
Enable imitation learning through expert demonstration collection and replay
Conduct transfer and generalization studies by altering characters, outfits, difficulty, or game title
Prototype cooperative multi-agent settings, with competitive fights as the current default and cooperative scenarios anticipated in future releases

Code usage patterns are identical to Gym, including native support for trajectory replay.
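The idea of storing demonstrations as observation/action/reward sequences in NPZ files and replaying them step by step can be sketched with plain NumPy. This is an illustration only: the key names, helper functions, and file name are hypothetical, the layout is not the format written by DIAMBRA Arena's recording tools, and the sketch does not use the ImitationLearning wrapper.

```python
import os
import tempfile

import numpy as np


def save_trajectory(path, observations, actions, rewards):
    """Write one episode to a compressed NPZ file (hypothetical key names)."""
    np.savez_compressed(
        path,
        observations=np.asarray(observations),
        actions=np.asarray(actions),
        rewards=np.asarray(rewards),
    )


def replay_trajectory(path):
    """Yield (observation, action, reward) tuples in recorded order."""
    with np.load(path) as data:
        obs, act, rew = data["observations"], data["actions"], data["rewards"]
    for t in range(len(act)):
        yield obs[t], int(act[t]), float(rew[t])


# Record a tiny fake episode, then replay it.
path = os.path.join(tempfile.mkdtemp(), "demo_episode.npz")
save_trajectory(
    path,
    observations=np.zeros((3, 4)),
    actions=[1, 0, 2],
    rewards=[0.0, 1.5, -0.5],
)
steps = list(replay_trajectory(path))
```

Replaying transitions in this tuple form makes it straightforward to feed recorded human play into behavioral cloning or to pre-fill a replay buffer.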

This tight integration with common RL APIs significantly lowers the startup cost for researchers entering advanced RL and imitation learning domains.

7. Significance and Impact within RL Research

DIAMBRA Arena’s introduction marks a shift in benchmark philosophy for RL, prioritizing extensibility, human-agent collaboration, and challenging, previously unsolved tasks. By fully integrating with OpenAI Gym, supporting multi-agent and human-level research, and enabling rapid prototyping of custom RL workflows, the software accelerates research into self-play, human-in-the-loop learning, imitation, generalization, and beyond. Its results to date—emergent human-like play across multiple retro fighting titles—demonstrate its value for both evaluating algorithms and conducting novel RL research (Palmas, 2022).


References

Palmas, A. (2022). DIAMBRA Arena: a New Reinforcement Learning Platform for Research and Experimentation.
