
3D Doom Simulation Environment

Updated 10 February 2026
  • A 3D Doom-based environment is a first-person simulation platform built on the Doom engine, offering modular design and multi-modal observations.
  • It supports diverse tasks such as navigation, combat, semantic mapping, and active 3D reconstruction with efficient CPU-only rendering.
  • Leveraging both synchronous and asynchronous action interfaces, this platform establishes reproducible benchmarks for RL and vision research.

A 3D Doom-based environment is a first-person, visually rich, partially observed simulation platform built on the Doom game engine, widely adopted in reinforcement learning (RL), computer vision (CV), and embodied agent research. These environments expose raw pixel observations, versatile action spaces, and scalable scenario design mechanisms, supporting tasks from navigation and combat to semantic mapping and active 3D reconstruction. Doom-based environments are notable for their modularity, scenario flexibility, multi-modality (RGB, depth, semantic buffers), and high-throughput CPU-only simulation, making them integral as reproducible RL and active mapping benchmarks and vision dataset generators.

1. Core Architecture and Simulation Dynamics

3D Doom environments are historically implemented atop open-source forks of the Doom engine—primarily ZDoom (in ViZDoom (Kempka et al., 2016, Wydmuch et al., 2018)) or Chocolate Doom (in ResearchDoom (Mahendran et al., 2016)). The simulation runs as a separate process, interfaced via C/C++ APIs and language bindings (Python, Lua, Java, Julia). Agents control the game loop in synchronous or asynchronous modes:

  • Synchronous: The engine halts every frame until the agent returns its action, enabling perfect determinism and arbitrarily accelerated training.
  • Asynchronous: The game advances at its native simulation framerate (typically 35 FPS), requiring agents to process inputs in real time.

Off-screen rendering eliminates the need for display contexts, facilitating CPU-only batch simulation and deployment on headless servers. Resource efficiency is high: at 160×120 resolution, frame rates reach ≈7,000 FPS per core, and memory footprint per instance is ~10 MB (Wydmuch et al., 2018, Kempka et al., 2016). Up to 16 agents are supported in multiplayer mode with inter-agent TCP/IP communication.
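A quick back-of-the-envelope check of these resource figures (a sketch using the numbers quoted above; the core and instance counts are hypothetical):

```python
# Illustrative arithmetic based on the figures quoted above:
# ~7,000 FPS per core at 160x120, ~10 MB of memory per instance.
FPS_PER_CORE = 7_000
MB_PER_INSTANCE = 10

cores = 32          # hypothetical server core count
instances = 100     # hypothetical number of parallel environments

aggregate_fps = FPS_PER_CORE * cores          # upper bound, ignoring overhead
total_memory_mb = MB_PER_INSTANCE * instances

print(aggregate_fps)     # 224000
print(total_memory_mb)   # 1000
```

Even hundreds of instances fit comfortably in a few GB of RAM, which is why single-server batch experimentation (Section 6) is practical.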

2. Observations and Data Modalities

Agents’ observations are exposed at each time step as multi-modal streams:

  • RGB Screen Buffer: H×W×3 uint8 array, a direct first-person rendering of the environment with full fidelity to the Doom renderer’s field of view and display format.
  • Depth Buffer: Per-pixel float32 distances to the first visible surface; enabled on request in standard ViZDoom for 3D or SLAM use cases (Kempka et al., 2016, Bhatti et al., 2016).
  • Label Buffer: Per-pixel integer IDs identifying object instances, together with class labels, bounding boxes, and state (position, facing), enabling instance-level computer vision workloads (Wydmuch et al., 2018, Mahendran et al., 2016).
  • Top-Down Map: Bird’s-eye projections, showing either only the explored regions or the full map, with overlaid agent/object markers.
  • Game Variables: Float array encoding discrete sim state (health, armor, ammo, kill count, tick, etc.).
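The per-step streams above can be pictured as one bundled observation. A minimal sketch follows; the field names mirror ViZDoom's buffer terminology, but the container itself is illustrative, not the actual API:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical container mirroring the per-step observation streams
# described above; this is an illustration, not the ViZDoom state object.
@dataclass
class DoomObservation:
    screen_buffer: Sequence              # H x W x 3 uint8 RGB frame
    depth_buffer: Optional[Sequence]     # per-pixel float32 distances, if enabled
    labels_buffer: Optional[Sequence]    # per-pixel instance IDs, if enabled
    automap_buffer: Optional[Sequence]   # top-down map projection, if enabled
    game_variables: Sequence             # e.g. [health, armor, ammo, kills, tick]

obs = DoomObservation(
    screen_buffer=[[[0, 0, 0]] * 160] * 120,   # 120 x 160 x 3 placeholder frame
    depth_buffer=None,                          # optional buffers off by default
    labels_buffer=None,
    automap_buffer=None,
    game_variables=[100.0, 0.0, 50.0, 0.0, 1.0],
)
print(len(obs.screen_buffer), len(obs.screen_buffer[0]))  # 120 160
```

The optional buffers default to absent, matching the opt-in nature of depth and label rendering described above.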

Customized ResearchDoom forks extend metadata streams with deterministic logs of player pose, environment state, and full object lists per frame, supporting true frame-level synchrony between vision sequences and annotations (Mahendran et al., 2016).

3. Action Spaces and Decision Interface

The agent’s actuator interface encompasses the full discrete Doom key set and mouse look control:

  • Button Array: Each action (MOVE_FORWARD, TURN_LEFT, ATTACK, etc.) is a Boolean channel. Actions are submitted as Boolean arrays and held for a configurable number of ticks set through the API (Wydmuch et al., 2018, Kempka et al., 2016).
  • Mouse Look: Absolute or relative yaw/pitch deltas (continuous values) can be injected, if enabled.
  • Compound Action Encoding: For “controller” agents (neuroevolution, LLM-based), the spectrum of multi-key and temporal macro-actions (actions repeated for multiple ticks) is programmatically defined (Alvernaz et al., 2017, Wynter, 2024).
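The button-array and macro-action encodings above can be sketched in a few lines; the button list and helper names here are illustrative assumptions, not a fixed API contract:

```python
# Illustrative encoding of the Boolean button-array interface described
# above; the button order and helper names are assumptions for this sketch.
BUTTONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "ATTACK"]

def encode(*pressed: str) -> list:
    """One Boolean channel per available button."""
    return [b in pressed for b in BUTTONS]

def macro(action: list, n_ticks: int) -> list:
    """Temporal macro-action: hold the same button state for n_ticks frames."""
    return [action] * n_ticks

# A compound action: move forward while firing, held for 4 ticks.
forward_shoot = encode("MOVE_FORWARD", "ATTACK")
print(forward_shoot)                 # [True, False, False, True]
print(len(macro(forward_shoot, 4)))  # 4
```

Restricting a scenario's action set amounts to shrinking the `BUTTONS` list before encoding.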

Scenarios can restrict the action set for experimental control (e.g., reducing it to navigation-only or shooting-only primitives (Hafner, 2016, Kempka et al., 2016)).

4. Scenario Authoring, Procedural Generation, and Dataset Design

Scenario definition is highly modular:

  • Custom WAD/PK3 Maps: Authored in Doom Builder 2, SLADE 3, or procedurally generated using tools like Obsidian, with parameterized control over level difficulty, topology, connectivity, and textures (Li et al., 7 Feb 2025).
  • Scripting (ACS, DECORATE, ZScript): Arbitrary logic for enemies, pickups, rewards, and terminal conditions is set via built-in scripting languages. Reward shaping, episodic termination, and object spawning logic are fully programmable (Wydmuch et al., 2018, Kempka et al., 2016).
  • Procedural Scene Datasets: AiMDoom leverages Obsidian to generate diverse sets of indoor scenes (Simple/Normal/Hard/Insane), parametrically controlled for benchmarking active 3D mapping (Li et al., 7 Feb 2025).
  • Benchmark Dataset Generation: ResearchDoom/CocoDoom records synchronized RGB, depth, instance masks, all agent and object states, and builds MSCOCO-style JSONs with thousands of annotated categories for detection and segmentation, as well as logs for ego-motion, tracking, and scene parsing (Mahendran et al., 2016).
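A scenario is typically paired with a plain-text configuration file that ties the map, observation buffers, action set, and reward parameters together. A minimal sketch in the ViZDoom `.cfg` style (the path and values are illustrative examples, not from any cited benchmark):

```
# Illustrative scenario configuration (ViZDoom-style .cfg; values are examples)
doom_scenario_path = my_scenario.wad
living_reward = -1
episode_timeout = 2100

screen_resolution = RES_160X120
depth_buffer_enabled = true
labels_buffer_enabled = true

available_buttons =
    {
        MOVE_FORWARD
        TURN_LEFT
        TURN_RIGHT
        ATTACK
    }
available_game_variables = { HEALTH AMMO2 }
```

Reward shaping beyond simple per-tick and timeout terms is then layered on via the ACS/DECORATE/ZScript logic described above.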

5. Benchmarks, Algorithmic Workloads, and Reward Formulations

3D Doom-based environments support a spectrum of RL and vision benchmarks, from basic navigation and item gathering to deathmatch combat and active mapping.

Reward functions are programmable per scenario: dense shaping (health deltas, living bonuses) or sparse signals (kill events, room coverage). Q-learning/Bellman target updates are employed for value-based methods, alongside actor-critic variants and evolutionary strategies, as described in scenario-specific baselines (Kempka et al., 2016, Hafner, 2016, Bhatti et al., 2016).

Example reward function (health gathering):

$$\mathrm{reward}_t = \alpha\,(\mathrm{health}_t - \mathrm{health}_{t-1}) - \beta\,\mathbb{I}_{\mathrm{dead}} - \gamma\,\mathbb{I}_{\mathrm{timeout}}$$
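The health-gathering reward above translates directly into code. A minimal sketch; the coefficient values are illustrative choices, not taken from any cited paper:

```python
# Sketch of the health-gathering shaping reward: a health-delta term plus
# penalties for death and timeout. Coefficients are illustrative assumptions.
ALPHA, BETA, GAMMA = 1.0, 100.0, 50.0

def reward(health_t: float, health_prev: float,
           dead: bool, timeout: bool) -> float:
    return (ALPHA * (health_t - health_prev)
            - BETA * dead          # bool promotes to 0/1
            - GAMMA * timeout)

print(reward(80.0, 75.0, dead=False, timeout=False))  # 5.0
print(reward(0.0, 10.0, dead=True, timeout=False))    # -110.0
```

Sparse variants (kill events, room coverage) replace the dense health-delta term with event-triggered bonuses.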

6. Performance, Parallelism, and Experimentation Protocols

The efficiency of Doom-based simulation is a central design goal:

  • Frame Rate: On modern CPUs, roughly 2,500–7,000 FPS per instance, enabling massively parallel data collection and RL training (Wydmuch et al., 2018).
  • Frame Skipping: Agents can act only every K frames, with the chosen action repeated in between, modulating the compute/decision-time trade-off and accelerating convergence (Kempka et al., 2016, Wydmuch et al., 2018).
  • Batch Experimentation: Single servers can run hundreds of environments in parallel due to low per-instance resource overhead (Wydmuch et al., 2018).
  • Synchronous/Asynchronous Protocols: Synchronous mode supports deterministic, highly reproducible runs; asynchronous mode is primarily used for human play or real-time multi-agent arenas.

RL pipelines wrap DoomGame objects in OpenAI Gym-style interfaces, exposing standard reset/step APIs, and supporting both on-policy and off-policy learning frameworks (Wydmuch et al., 2018, Kempka et al., 2016).
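A Gym-style wrapper of this kind can be sketched as follows. The wrapper assumes a `DoomGame`-like object exposing `new_episode`, `get_state`, `make_action`, and `is_episode_finished` (the actual ViZDoom method names); the stub class below stands in for the engine so the sketch is self-contained:

```python
# Minimal Gym-style reset/step wrapper around a DoomGame-like object.
# StubDoomGame is a stand-in for the engine, used here for illustration only.
class StubDoomGame:
    def __init__(self): self.tick = 0
    def new_episode(self): self.tick = 0
    def get_state(self): return {"tick": self.tick}
    def make_action(self, action, tics=1):
        self.tick += tics              # action held for `tics` frames
        return 0.0                     # per-step reward from the scenario
    def is_episode_finished(self): return self.tick >= 10

class DoomEnv:
    def __init__(self, game, frame_skip=4):
        self.game, self.frame_skip = game, frame_skip
    def reset(self):
        self.game.new_episode()
        return self.game.get_state()
    def step(self, action):
        reward = self.game.make_action(action, self.frame_skip)
        done = self.game.is_episode_finished()
        obs = None if done else self.game.get_state()
        return obs, reward, done, {}

env = DoomEnv(StubDoomGame())
obs = env.reset()
obs, r, done, info = env.step([True, False, False])
print(obs["tick"], done)   # 4 False
```

Frame skipping falls out naturally here: the `frame_skip` parameter is forwarded to `make_action`, so the agent decides once per K engine ticks.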

7. Reproducibility, Datasets, and Impact on RL and Vision Research

Doom-based environments have driven progress in several subareas:

  • RL Generalization and Competitions: ViZDoom Competitions standardized agent evaluation in multi-agent, multi-scenario settings; RL agents, though competent, remain below human-level performance in direct comparison (Wydmuch et al., 2018).
  • SLAM and Memory-Augmented RL: Joint architectures combining convolutional encoders, attention-based recurrent optimizers, and pose-aggregation have improved pose estimation drift and mapping compared to vanilla odometry (Parisotto et al., 2018, Bhatti et al., 2016).
  • Active Mapping: Next-Best-Path methods in AiMDoom achieve superior 3D surface coverage across procedurally generated benchmarks compared to local Next-Best-View and prior state-of-the-art (e.g., NBP: 0.879 final coverage in “Simple” maps vs. 0.760 for FBE) (Li et al., 7 Feb 2025).
  • Annotated Vision Corpora: CocoDoom provides ~80k to 500k frames with synchronized masks and depth, spanning 94 categories with MSCOCO-compliant annotation for detection, depth, and tracking experiments (Mahendran et al., 2016).
  • LLM Interfacing Experiments: LLM-based agents interacting with Doom solely through vision-to-text prompt interfaces have established a viable LLM control paradigm, with performance tunable via hierarchical in-context planner prompting (Wynter, 2024).

These platforms continue to serve as a bridge between synthetic RL/CV research and scalable, repeatable, and richly annotated embodied agent experimentation.
