D4RL & NeoRL Benchmarks Overview

Updated 6 December 2025
  • D4RL and NeoRL benchmarks are evaluation suites for offline reinforcement learning that use realistic, diverse datasets and environments to mimic practical data collection constraints.
  • They employ unified APIs, normalized scoring metrics, and standardized evaluation protocols to systematically compare algorithm performance across various data regimes.
  • Their design addresses key issues like safety, partial observability, and non-Markovian dynamics, pushing advancements in robust, deployment-focused reinforcement learning research.

D4RL and NeoRL Benchmarks

D4RL (Datasets for Deep Data-Driven Reinforcement Learning) and the NeoRL benchmark family represent two of the most widely adopted suites for evaluating offline reinforcement learning (offline RL) algorithms. D4RL aims to provide challenging, realistic, and diverse datasets inspired by real-world data collection protocols, while the NeoRL series extends this paradigm by focusing on domains and data properties that directly address practical, deployment-driven RL challenges such as safety, conservative data regimes, partial observability, and non-Markovian dynamics. Both suites have driven advances in offline RL by providing unified APIs, established evaluation metrics, carefully constructed datasets, and community benchmarks that highlight prevailing algorithmic limitations on realistic tasks.

1. Core Structure and Design Principles

D4RL defines a broad set of environments explicitly tailored for the offline RL setting, emphasizing the realistic constraints and heterogeneity found in practical data (Fu et al., 2020). Datasets are generated using a variety of controllers—including learned policies at different levels of proficiency, hand-designed planners, and human demonstrators—and are designed to yield broad classes of offline learning challenges:

  • Bias due to narrow support (expert data)
  • High variance or under-constrained support (random/medium policies)
  • Sparse rewards and compositional “stitching” (e.g., AntMaze, Kitchen)
  • Undirected, multitask, or multimodal logs (e.g., Kitchen-mixed, FrankaKitchen)

NeoRL and its successor NeoRL-2 build upon these ideas but select domains and data generation protocols that accentuate deployment bottlenecks:

  • Conservative, low-coverage data from deterministic or PID controllers
  • Explicit time delays, partial observability, and non-Markovian transitions
  • Global safety constraints and exogenous factors (e.g., variable wind, friction)
  • Data sizes reflecting real collection limitations (100s–1,000s of samples)
  • Unreliable or high-variance off-policy evaluation

NeoRL imposes controlled dataset sizes, splits data between deterministic and stochastic behavior policies, and often provides explicit offline test sets, so as to mimic real-world hyperparameter tuning and policy validation more closely (Qin et al., 2021, Gao et al., 25 Mar 2025).
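As a concrete illustration of these controlled splits, the sketch below follows the usage pattern published with the open-source NeoRL package; the function names (neorl.make, get_dataset) and the data_type/train_num arguments are taken from that package's examples and may differ across releases, so treat them as assumptions to verify against the installed version.

```python
# Minimal sketch: pull a fixed-size, quality-controlled offline dataset from NeoRL.
# API names follow the public NeoRL examples and are not guaranteed for all versions.
import neorl

env = neorl.make("citylearn")          # one of the NeoRL domains
train_data, val_data = env.get_dataset(
    data_type="low",   # behavior-policy quality level: "low", "medium", or "high"
    train_num=10000,   # controlled dataset size, mimicking scarce real-world logs
)

# Each split is a dict of aligned transition arrays (observations, actions,
# rewards, next observations, terminal flags); exact key names depend on version.
print(sorted(train_data.keys()))
```

The separate validation split is what allows hyperparameters to be tuned without querying the online environment, which is precisely the deployment constraint NeoRL is designed to exercise.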

2. Dataset Composition and Task Taxonomy

D4RL organizes tasks and datasets into multiple domains and collection regimes, with the following prominent examples:

  • MuJoCo locomotion: Hopper-v2, HalfCheetah-v2, Walker2d-v2, each with “random,” “medium,” “medium-replay,” and “medium-expert” buffers, sizes typically 1–2 million transitions.
  • Sparse-reward and multitask navigation: Maze2D and AntMaze, with planner-generated or mixture datasets that stress stitching of suboptimal trajectories under sparse rewards.
  • Adroit hand manipulation: Human and robot demonstrations (e.g., pen, hammer, door), emphasizing high-dimensional control with limited sample regimes.
  • Traffic, robotics, and vision: Flow, CARLA (driving from images), FrankaKitchen for subtask composition with human demos.
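For the D4RL buffers above, a minimal loading sketch using the standard d4rl package is shown below; the environment name and array keys follow the released v2 locomotion datasets and should be checked against the installed package version.

```python
import gym
import d4rl  # importing d4rl registers the offline-RL dataset environments with gym

# The environment id encodes both the task and the data-collection regime.
env = gym.make("halfcheetah-medium-v2")

# Full logged buffer as a dict of aligned arrays
# (observations, actions, rewards, terminals, timeouts, ...).
dataset = env.get_dataset()
print(dataset["observations"].shape, dataset["actions"].shape)

# Convenience view with next_observations precomputed, as used by
# Q-learning-style offline methods.
qdata = d4rl.qlearning_dataset(env)
print(sorted(qdata.keys()))
```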

NeoRL extends the regime with tasks emphasizing industrial and financial control, stochastic and non-stationary environments, and energy management, incorporating domains such as the Industrial Benchmark, FinRL, and CityLearn (Qin et al., 2021). NeoRL-2 introduces seven "near-realistic" tasks modeling time delays, external factors, safety constraints, and data from deterministic policies (Table 1).

| Benchmark | Example Domains (Tasks) | Data Regimes (Examples) |
| --- | --- | --- |
| D4RL | MuJoCo, AntMaze, Adroit, Kitchen, CARLA | random, medium, expert, human demos |
| NeoRL | MuJoCo-v3, Industrial Benchmark, FinRL, CityLearn | low/medium/high behavior policies, deterministic/stochastic, held-out test sets |
| NeoRL-2 | Pipeline, Simglucose, Fusion, SafetyHC | PID controllers, time delays, exogenous variables |

This organizational framework enables systematic benchmarking for a spectrum of algorithmic capabilities: interpolation vs. extrapolation, safe generalization, multi-task and long-horizon planning, and robust control under data scarcity (Gao et al., 25 Mar 2025).

3. Evaluation Protocols and Metrics

Both D4RL and NeoRL use strictly offline protocols: agents train on a fixed dataset \mathcal{D} and are evaluated by deploying learned policies in the corresponding (often high-fidelity) simulation environment. Key aspects include:

  • Normalization: Returns are typically scaled relative to task-defined baselines (random, expert) using:

\mathrm{NormalizedScore}(\pi) = 100 \times \frac{R_\pi - R_\mathrm{rand}}{R_\mathrm{expert} - R_\mathrm{rand}}

  • Success rate: Used for sparse-reward navigation tasks, e.g., the percentage of AntMaze episodes in which the goal is reached (Park et al., 30 Jun 2024).
  • Safety metrics: For NeoRL-2, constraint violations (e.g., exceeding a velocity limit) trigger episode termination and severe negative rewards (Gao et al., 25 Mar 2025).
  • Stochasticity and Distributional Robustness: NeoRL and NeoRL-2 emphasize robustness by varying exogenous inputs and reporting means over large numbers of runs and seeds.

For algorithm development, the normalized score (0 = random, 100 = expert) enables fair comparison across domains and data regimes. NeoRL datasets frequently provide extra test sets (simulating held-out real data) to benchmark true offline policy evaluation (OPE) (Qin et al., 2021).
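The scoring rule above translates directly into code. The sketch below is a minimal illustration, assuming reference returns R_rand and R_expert taken from the benchmark's published tables and a policy exposed as a plain observation-to-action callable; the rollout loop uses the classic Gym step signature (obs, reward, done, info).

```python
import numpy as np

def normalized_score(return_pi, return_rand, return_expert):
    """D4RL/NeoRL-style normalized score: 0 = random policy, 100 = expert policy."""
    return 100.0 * (return_pi - return_rand) / (return_expert - return_rand)

def average_return(policy, env, n_episodes=10):
    """Deploy a learned policy in the simulator and average undiscounted returns
    over several episodes (and, in practice, over several training seeds)."""
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```

Packages such as d4rl also expose a per-environment helper for this normalization, though the exact name and scaling (0-1 versus 0-100) should be checked against the installed version.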

4. Empirical Findings and Algorithmic Impact

Adoption of D4RL and NeoRL has standardized empirical evaluation, revealing many limitations of contemporary offline RL methods:

  • Conservative/Distributional Shift: Even state-of-the-art offline RL algorithms (e.g., CQL, TD3+BC, MOPO, MOBILE, RAMBO, COMBO) often fail to exceed the deterministic behavior policy on narrow, low-coverage, or safety-critical datasets, especially in NeoRL(-2) scenarios (Gao et al., 25 Mar 2025, Qin et al., 2021).
  • Compounding Modeling Error: Model-based methods are vulnerable to rollout error accumulation, particularly in long-horizon or sparse-reward tasks (AntMaze, NeoRL/Fusion) (Park et al., 30 Jun 2024, Luo et al., 2023).
  • Partial Observability/Delayed Effects: NeoRL-2 benchmarks demonstrate that delayed reward propagation and external factors rapidly degrade the performance and reliability of offline RL, necessitating new architectures with memory and causal inference capabilities (Gao et al., 25 Mar 2025).
  • Robustness: The introduction of corrected and robustness-aware metrics (as in QDax) further reveals that standard one-shot evaluations can substantially overestimate actual policy quality under stochasticity (Flageat et al., 2022).

D4RL and NeoRL thus serve as de facto proving grounds for advances in penalty-based conservatism (CQL), reward-consistent modeling (MOREC), value-inconsistency penalization (VIPO), adaptive Bayesian RL (Neubay), and hybrid imitation/self-supervised architectures, each exhibiting unique strengths and vulnerabilities exposed by differing benchmark regimes (Chen et al., 16 Apr 2025, Ni et al., 4 Dec 2025, Luo et al., 2023).

5. Extensions, Specialized Benchmarks, and Unified APIs

The influence of D4RL has also driven the creation of specialized or extended benchmarks and toolkits:

  • Katakomba (NetHack): Adapts D4RL concepts (HDF5 datasets, Gym-API, normalized scoring) to complex, long-horizon, discrete, and highly variable tasks such as NetHack (Kurenkov et al., 2023).
  • Quality-Diversity Benchmarks: As in QDax, focuses on fitness-diversity tradeoffs in open-loop neuroevolution for RL tasks, reporting coverage, QD-score, and corrected profiles under stochasticity (Flageat et al., 2022).
  • Autonomous Driving (AD4RL): Introduces real-world and simulator-generated driving datasets, emphasizing partial observability, hybrid action spaces, and safety metrics, following unified evaluation designs akin to D4RL/NeoRL (Lee et al., 3 Apr 2024).

The widespread adoption of unified APIs (e.g., Gym-like registration, HDF5 dataset loaders with state/action/reward arrays, standard scripts for metrics and reporting) has further lowered the barrier to reproducible and extensible benchmarking, facilitating side-by-side comparisons for new offline RL algorithms (Fu et al., 2020, Kurenkov et al., 2023).
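To illustrate the shared dataset convention, the sketch below reads a D4RL-style HDF5 buffer directly with h5py; the file path is hypothetical, and the array keys follow the state/action/reward layout described above, which may vary slightly between benchmarks.

```python
import h5py
import numpy as np

# Hypothetical local copy of a benchmark buffer; in practice the benchmark
# packages download and cache these files themselves.
path = "halfcheetah_medium-v2.hdf5"

with h5py.File(path, "r") as f:
    observations = np.asarray(f["observations"])
    actions = np.asarray(f["actions"])
    rewards = np.asarray(f["rewards"])
    terminals = np.asarray(f["terminals"])

print(observations.shape, actions.shape, rewards.shape, terminals.shape)
```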

6. Significance, Limitations, and Future Directions

D4RL and NeoRL benchmarks have become central for assessing offline RL progress, but several open challenges persist:

  • Realism vs. Tractability: D4RL domains remain largely simulator-based and fully Markovian, underrepresenting real-world delays, constraints, and exogenous disturbances. NeoRL-2 and AD4RL address some of these gaps but also illustrate that contemporary algorithms struggle substantially under these conditions (Gao et al., 25 Mar 2025, Lee et al., 3 Apr 2024).
  • Policy Evaluation and OPE: Accurate offline policy evaluation remains unsolved, with FQE and model-based approaches often yielding poor correlation with actual returns (Qin et al., 2021).
  • Algorithmic Gaps: Model-based and model-free offline RL methods require significant advances in robust generalization under narrow data distributions, safety constraints, non-Markovianity, and system identification.
  • Adaptive, Plug-and-Play Benchmarks: There is increasing interest in benchmarks that facilitate rapid creation, augmentation, and stratification of tasks, e.g., in NetHack/Katakomba or the modular pipelines of AD4RL (Kurenkov et al., 2023, Lee et al., 3 Apr 2024).

A plausible implication is that future benchmarks will increasingly integrate real-world data collection constraints, stricter safety/robustness validation, and modular, extensible task APIs to accelerate the development of algorithms capable of reliable deployment in industrial and safety-critical RL domains.
