
DIAMBRA Arena: Advanced RL Research Platform

Updated 13 April 2026
  • DIAMBRA Arena is an open-source reinforcement learning platform providing episodic, configurable environments built on the OpenAI Gym API for both single and multi-agent scenarios.
  • The platform supports advanced research methodologies including self-play, human-agent interaction, and imitation learning, enabling rapid integration with common RL frameworks.
  • Benchmark results demonstrate robust performance on commodity hardware, with agents exhibiting emergent human-like strategies across various retro fighting game environments.

DIAMBRA Arena is an open-source software platform for reinforcement learning (RL) research, providing a suite of high-quality, episodic environments with extensive support for single and multi-agent scenarios, human-agent interaction, imitation learning, and advanced RL workflows. It exposes these environments via a Python API that fully complies with the OpenAI Gym programming model, enabling rapid integration with common RL frameworks and facilitating research across a wide spectrum of contemporary RL problems (Palmas, 2022).

1. Platform Design, Goals, and Architecture

DIAMBRA Arena is designed to address limitations in existing RL benchmarks, which quickly become less informative once their challenges are solved. The platform prioritizes ongoing research needs such as competitive/cooperative multi-agent play, multi-modal observation spaces, human-in-the-loop paradigms, and transfer across games and difficulty levels. Its architecture emphasizes:

  • Modest hardware requirements: Usable on a single-GPU or even a CPU-only system.
  • Compliance with OpenAI Gym: The API supports the standard Gym methods (make, reset, step, render, close).
  • Highly configurable environments: Observation space includes raw pixels and a set of redundant “fair” RAM state features; discrete action spaces support fine-grained control and attack-button combinations.
  • First-class two-player support: All environments expose single and two-player modes, natively enabling agent-vs-agent, agent-vs-human, and self-play training out of the box.

Integration with OpenAI Gym is direct; “hello world” RL code runs unchanged except for instantiating environments via diambra.arena.make().

2. Environments, Observations, and Actions

The first release consists of multiple retro fighting games (e.g., Dead Or Alive ++, Street Fighter III, Tekken Tag Tournament), each supporting multiple characters, outfits, and difficulty settings. The environment features are as follows:

| Property | Details / Options | Notes |
|---|---|---|
| Pixel Input Shape | Up to 480×512×3, configurable to 128×128 RGB or grayscale | |
| RAM State Vector | Health bars, stage index, last actions, etc. | Redundant with pixels (optional for “hardcore”) |
| Action Spaces | Discrete, Discrete+Combos, MultiDiscrete | Each available for single/multi-agent modes |
| Modes | 1P (single), 2P (dual/competitive) | Episodic games, zero-sum competition |

Observation spaces default to a gym.spaces.Dict containing both "frame" and "ram" entries. The action space may be a simple discrete set, a union of discrete combos, or split via a MultiDiscrete structure to separate moves and attacks. In 2P mode, the action space is a Gym Dict keyed by "P1" and "P2".
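To make the MultiDiscrete layout more concrete, the following minimal sketch shows how an action factors into independent move and attack components by relating it to a flat index over all (move, attack) pairs, and how a two-player action can be bundled in a dict keyed by "P1" and "P2". The sizes N_MOVES and N_ATTACKS and all helper names are hypothetical, chosen only for illustration; the flat enumeration is not the platform's own Discrete space, and real sizes depend on the game.

```python
# Hypothetical sizes for illustration only; real values are game-specific.
N_MOVES = 9     # e.g. no-op plus 8 directions (assumption)
N_ATTACKS = 4   # e.g. no-op plus 3 attack buttons (assumption)


def flat_to_multi(index):
    """Split a flat index in [0, N_MOVES * N_ATTACKS) into (move, attack)."""
    if not 0 <= index < N_MOVES * N_ATTACKS:
        raise ValueError("index out of range")
    # divmod returns (index // N_ATTACKS, index % N_ATTACKS)
    return divmod(index, N_ATTACKS)


def multi_to_flat(move, attack):
    """Inverse of flat_to_multi."""
    if not (0 <= move < N_MOVES and 0 <= attack < N_ATTACKS):
        raise ValueError("component out of range")
    return move * N_ATTACKS + attack


def two_player_action(p1_action, p2_action):
    """Bundle per-player actions in a dict keyed by player slot."""
    return {"P1": p1_action, "P2": p2_action}
```

Keeping moves and attacks as separate components lets a policy use one output head per component rather than a single head over every combination.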

Episodes are defined in two distinct ways:

  • 1P mode: Runs from game start to completion of all stages or exhaustion of continues.
  • 2P mode: Terminates after a single fight (stage); each fight consists of $N_s$ stages with $N_r$ required round victories.

3. Operational Modes and Research Workflows

DIAMBRA Arena explicitly targets advanced RL research settings:

  • Single-agent RL: Standard episodic RL where the agent maximizes cumulative reward via exploration and exploitation.
  • Two-agent competitive RL: Simultaneous training or evaluation of separate policies for each player slot, leveraging zero-sum dynamics.
  • Self-play: Enabled by instantiating the same agent in both slots; supports curricula where learning progresses against past agent versions.
  • Human-agent interaction: Humans can play as either P1 or P2 via gamepad at runtime, and human-in-the-loop wrappers allow interventions for feedback, control, and reward shaping.
  • Imitation Learning: Trajectory recording tools allow collection of human demonstrations (observation/action/reward sequences stored on disk as NPZ files); these are replayable in an RL-compatible manner via the ImitationLearning wrapper.

This design enables integrated workflows for transfer learning (policy transfer across difficulty, characters, or even separate fighting games) and generalization studies.

4. Mathematical Framework and RL Algorithms

The reward at time $t$ for each environment is formulated based on changes in health bars for each agent and opponent. The generic time-step reward is

$$R_t = \sum_{i=1}^{N_c} \left[ \left(\bar{H}_i^{\,t^-} - \bar{H}_i^{\,t}\right) - \left(\hat{H}_i^{\,t^-} - \hat{H}_i^{\,t}\right) \right],$$

where $\bar{H}_i$ represents the opponent’s health, $\hat{H}_i$ denotes the agent’s health, and $N_c$ is the number of characters per player. This structure is designed to reflect score differentials, driving agents toward strategies maximizing their own survival while minimizing their opponent’s.
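The time-step reward above can be sketched in a few lines of Python. The function name and the list-based representation of per-character health values are assumptions made for illustration, not the platform's implementation.

```python
def step_reward(opp_prev, opp_now, own_prev, own_now):
    """Sum over characters of (opponent health lost) - (own health lost).

    Each argument is a sequence with one health value per character,
    taken just before (prev) and just after (now) the time step.
    """
    if not (len(opp_prev) == len(opp_now) == len(own_prev) == len(own_now)):
        raise ValueError("all health sequences must have the same length")
    return sum(
        (o_prev - o_now) - (a_prev - a_now)
        for o_prev, o_now, a_prev, a_now
        in zip(opp_prev, opp_now, own_prev, own_now)
    )


# One character per player: the opponent loses 12, the agent loses 5.
print(step_reward([100], [88], [100], [95]))  # 7
```

The reward is positive when the agent deals more damage than it receives during the step, and negative in the opposite case.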

Episode reward bounds are given by:

$$\min \sum_{t=0}^{T_s} R_t = -N_c \left[(N_s-1)(N_r-1)+N_r\right] \Delta H, \qquad \max \sum_{t=0}^{T_s} R_t = N_c N_s N_r \Delta H,$$

with $\Delta H = H_{\max} - H_{\min}$, $N_s$ the number of stages, and $N_r$ the number of rounds needed to win.
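As a quick numerical check of these bounds, the sketch below evaluates both expressions for one configuration. The function name and the example values (one character, eight stages, two rounds to win, a health range of 100) are hypothetical and not taken from any specific game.

```python
def episode_reward_bounds(n_c, n_s, n_r, delta_h):
    """Return (lower, upper) bounds on the cumulative episode reward."""
    lower = -n_c * ((n_s - 1) * (n_r - 1) + n_r) * delta_h
    upper = n_c * n_s * n_r * delta_h
    return lower, upper


# Hypothetical configuration: N_c = 1, N_s = 8, N_r = 2, Delta H = 100.
print(episode_reward_bounds(1, 8, 2, 100.0))  # (-900.0, 1600.0)
```

The asymmetry between the two bounds follows directly from the formulas: the upper bound counts $N_r$ rounds in every stage, while the lower bound counts $N_r - 1$ rounds in all but one stage.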

Empirical studies employ Proximal Policy Optimization (PPO) as the baseline RL algorithm, defined by

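$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\epsilon$ is the clip parameter, and $\hat{A}_t$ is the advantage estimate.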
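For reference, the standard clipped surrogate objective of PPO (Schulman et al., 2017) can be sketched in plain Python as follows. This is a minimal illustration over precomputed probability ratios and advantage estimates, not the training code used in the study; the function name is an assumption.

```python
def clipped_surrogate(ratios, advantages, eps):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    ratios:     new-policy / old-policy action probabilities, one per sample
    advantages: advantage estimates, one per sample
    eps:        clip parameter
    """
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("need equally sized, non-empty inputs")
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        total += min(r * a, clipped_r * a)
    return total / len(ratios)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the ratio outside the interval $[1-\epsilon, 1+\epsilon]$, which keeps each policy update conservative.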

5. Implementation Details, Network Architectures, and Empirical Results

A concrete experimental setup for Dead Or Alive ++ includes:

  • Input: frame shape NrN_r3 (grayscale), 4-frame stacking, action stacking (last 12 actions)
  • Actions: 12 discrete available actions, no combos, random starting side
  • Reward normalization and observation scaling applied, no reward clipping, no-op resets, or action sticking
  • Policy/value architecture:
    • Frame encoder: Conv(8×8,32) → Conv(4×4,64) → Conv(3×3,64) → FC(256) with ReLU
    • RAM encoder: FC(64) → FC(64) (tanh)
    • Latent merge: 320 units, dual heads (policy: FC(12) + softmax, value: FC(1))
  • PPO hyperparameters: 16 parallel environments, 128 steps per update, batch size 256, 4 epochs, NrN_r8, learning rate NrN_r9 annealed to tt0, clip parameter from 0.15 to 0.025
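The update schedule implied by these hyperparameters can be worked out with a little arithmetic. The sketch below assumes that "batch size 256" refers to the minibatch size and that the clip parameter is annealed linearly over training; neither detail is stated in the text, and linear_schedule is a hypothetical helper.

```python
# Values taken from the setup described above.
N_ENVS = 16
N_STEPS = 128
BATCH_SIZE = 256   # assumed to be the minibatch size
N_EPOCHS = 4

samples_per_update = N_ENVS * N_STEPS                         # 2048
minibatches_per_epoch = samples_per_update // BATCH_SIZE      # 8
gradient_steps_per_update = minibatches_per_epoch * N_EPOCHS  # 32

# The latent merge concatenates the frame encoding (256 units)
# with the RAM encoding (64 units).
latent_units = 256 + 64                                       # 320


def linear_schedule(start, end, progress):
    """Interpolate from start to end as progress goes from 0 to 1.

    A linear shape is an assumption; the text only gives the endpoints.
    """
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress


# Clip parameter halfway through training under the linear assumption.
clip_midway = linear_schedule(0.15, 0.025, 0.5)               # 0.0875
```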

Performance benchmarks on commodity hardware show tt1 env-steps/day (Intel i5 + GTX 1050) to tt2 env-steps/day (Ryzen 9 + GTX 1080 Ti). After tt3 million steps, average episode rewards climb from tt4 (random) to tt5, with PPO agents showing emergent human-like play patterns such as delayed attacks and timed counters. Results are robust across Street Fighter III and Tekken Tag Tournament, confirming environment generality (Palmas, 2022).

6. Advanced Features and Research Facilitation

DIAMBRA Arena’s built-in wrappers and multi-agent abstractions make it possible to:

  • Construct self-play curricula where agents continually adapt against their own historical policy snapshots
  • Support human-in-the-loop learning: real-time human interventions (manual override, evaluative feedback) via integrated wrappers
  • Enable imitation learning through expert demonstration collection and replay
  • Conduct transfer and generalization studies by altering characters, outfits, difficulty, or game title
  • Prototype cooperative multi-agent settings, with competitive fights as the current default and cooperative scenarios anticipated in future releases
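One common way to organize a self-play curriculum against historical snapshots is a bounded pool from which opponents are sampled. The sketch below is a generic illustration of that idea, not part of the DIAMBRA Arena API; the class name, its methods, and the p_latest mixing probability are all hypothetical.

```python
import random


class SnapshotPool:
    """Bounded pool of past policy snapshots for sampling self-play opponents."""

    def __init__(self, max_size, seed=None):
        self._snapshots = []
        self._max_size = max_size
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._snapshots)

    def add(self, policy):
        """Store a snapshot, discarding the oldest one when the pool is full."""
        self._snapshots.append(policy)
        if len(self._snapshots) > self._max_size:
            self._snapshots.pop(0)

    def sample_opponent(self, latest, p_latest=0.5):
        """Return the latest policy with probability p_latest, else a past snapshot."""
        if not self._snapshots or self._rng.random() < p_latest:
            return latest
        return self._rng.choice(self._snapshots)
```

In a two-player setting, the sampled opponent would fill one player slot while the learning agent fills the other. Mixing in older snapshots is a common heuristic to reduce overfitting to the most recent opponent.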

Code usage patterns are identical to Gym, including native support for trajectory replay.

This tight integration with common RL APIs significantly lowers the startup cost for researchers entering advanced RL and imitation learning domains.
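The Gym-style interaction pattern referred to here can be sketched with a stand-in environment. StubFightEnv below is a hypothetical placeholder that mimics the classic Gym reset/step interface and a dict observation with "frame" and "ram" entries; it is not the real platform. With DIAMBRA Arena installed, the environment would instead be created via diambra.arena.make(), whose arguments are not specified in this text.

```python
import random


class StubFightEnv:
    """Stand-in exposing the classic Gym interface; NOT the real platform."""

    def __init__(self, n_actions=12, max_steps=50, seed=0):
        self.n_actions = n_actions
        self.max_steps = max_steps
        self._rng = random.Random(seed)
        self._t = 0

    def _observe(self):
        # Dict observation with "frame" and "ram" entries, as in the text.
        frame = [[0] * 4 for _ in range(4)]
        return {"frame": frame, "ram": {"step": self._t}}

    def reset(self):
        self._t = 0
        return self._observe()

    def step(self, action):
        if not 0 <= action < self.n_actions:
            raise ValueError("invalid action")
        self._t += 1
        reward = self._rng.uniform(-1.0, 1.0)  # placeholder reward
        done = self._t >= self.max_steps
        return self._observe(), reward, done, {}

    def close(self):
        pass


def run_episode(env, policy):
    """Run one episode with the classic Gym loop; return (total reward, steps)."""
    obs = env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        steps += 1
    env.close()
    return total, steps
```

A call such as `run_episode(StubFightEnv(max_steps=10), lambda obs: 0)` exercises the loop with a constant policy; swapping in a real environment and a trained policy would leave run_episode unchanged, which is the point of the shared interface.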

7. Significance and Impact within RL Research

DIAMBRA Arena’s introduction marks a shift in benchmark philosophy for RL, prioritizing extensibility, human-agent collaboration, and challenging, previously unsolved tasks. By fully integrating with OpenAI Gym, supporting multi-agent and human-level research, and enabling rapid prototyping of custom RL workflows, the software accelerates research into self-play, human-in-the-loop learning, imitation, and generalization. Its results to date, namely emergent human-like play across multiple retro fighting titles, demonstrate its value for both evaluating algorithms and conducting novel RL research (Palmas, 2022).

References (1)
