D4RL Benchmark: Offline RL Evaluation
- The D4RL benchmark is a suite of offline RL tasks built around static datasets that mirror real-world challenges such as biased data distributions, sparse rewards, and the need for trajectory stitching.
- It integrates diverse domains—from robotic manipulation to traffic simulations—enabling comprehensive comparisons of offline and batch RL methods.
- Its unified evaluation protocol with normalized scoring exposes algorithm deficiencies and drives advancements in sample efficiency and generalization.
The D4RL benchmark is a standardized suite of offline reinforcement learning (RL) tasks designed to enable rigorous evaluation and comparison of data-driven RL algorithms. It targets the full batch setting, where agents learn solely from static datasets without access to online environment interactions, providing realistic scenarios for domains where collecting new data is expensive, risky, or infeasible. D4RL is motivated by the need to accurately reflect the complexity, diversity, and limitations encountered in real-world offline RL, including biased, narrow, and heterogeneous data sources, tasks requiring trajectory composition or "stitching," and settings featuring non-Markovian behavior.
1. Motivation and Benchmark Design
D4RL was introduced to address critical deficiencies in prior benchmarks for offline RL, which predominantly used replay buffers from partially trained agents in online RL settings. These earlier benchmarks inadequately captured the unique challenges of the offline regime—most notably, extreme distributional shift and lack of exploratory trajectories. D4RL pursues a realistic representation of offline data by including:
- Datasets generated by hand-designed controllers and human demonstrators
- Multitask datasets requiring compositional skill transfer
- Data collected under mixtures of policies (expert, suboptimal, undirected)
Distinguishing itself from online RL benchmarks, D4RL restricts agents to pre-collected static datasets, emphasizing evaluation protocols and dataset properties that expose the shortcomings of contemporary offline RL algorithms. Tasks are chosen to test abilities such as trajectory "stitching," policy generalization, and robustness to varied data distributions.
2. Dataset Properties and Domain Coverage
D4RL comprises multiple domains with heterogeneous sources and collection strategies:
| Domain | Data Collection Methods | Notable Features |
|---|---|---|
| Maze2D / AntMaze | High-level planners + low-level controllers | "Stitching" of subtrajectories, sparse rewards |
| Gym-MuJoCo (HalfCheetah, Hopper, Walker2d) | RL agent policies (expert, medium, replay) | Controlled data coverage, graded policy quality |
| Adroit, FrankaKitchen | Human demonstrations, cloned policies | Non-Markovian behavior, demonstration bias |
| Flow (traffic), CARLA (driving) | Mixtures of behavior policies | Multi-agent interaction, realistic simulation |
By selecting datasets with narrow expert trajectories, multimodal distributions, and multitask control (e.g., FrankaKitchen), D4RL systematically exposes performance gaps in offline RL methods. The benchmark is explicitly designed to diagnose sample efficiency, distributional shift, reward sparsity, and algorithmic robustness.
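As a concrete illustration of how these datasets are accessed, the sketch below loads one task with the open-source d4rl Python package; the task names and dataset keys shown (e.g., 'maze2d-umaze-v1', 'observations', 'terminals') follow the package's published conventions, though version suffixes may differ across releases.

```python
import gym
import d4rl  # registers the D4RL environments and their datasets with gym

# Task names encode both the domain and the data-collection policy,
# e.g. 'hopper-medium-v2' or 'maze2d-umaze-v1'.
env = gym.make('maze2d-umaze-v1')

# Raw dataset: a dict of aligned arrays over all logged transitions.
dataset = env.get_dataset()
print(dataset['observations'].shape)  # (N, obs_dim)
print(dataset['actions'].shape)       # (N, act_dim)
print(dataset['rewards'].shape)       # (N,)
print(dataset['terminals'].shape)     # (N,)

# Convenience view that adds next_observations, suitable for Q-learning-style methods.
qdata = d4rl.qlearning_dataset(env)
```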
3. Evaluation Protocols and Scoring
D4RL recommends a unified evaluation protocol based on normalized scores. The normalization formula is:

$$\text{normalized score} = 100 \times \frac{\text{score} - \text{random score}}{\text{expert score} - \text{random score}}$$
This ensures a score of $0$ for random policy performance and $100$ for expert policy, enabling task-agnostic comparison. To discourage overfitting, tasks are partitioned into distinct sets for hyperparameter tuning ("training") and final evaluation ("test"). The protocol specifies strict reporting practices (e.g., consistent seeds, dataset documentation) and includes open-source examples for reproducibility.
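A minimal sketch of this normalization, with placeholder reference returns chosen purely for illustration; the d4rl package also exposes a per-environment helper, get_normalized_score, that applies the same formula using stored reference scores.

```python
def normalized_score(raw_return: float,
                     random_return: float,
                     expert_return: float) -> float:
    """D4RL-style normalization: 0 matches the random policy, 100 the expert."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns, for illustration only.
print(normalized_score(raw_return=2500.0,
                       random_return=-20.0,
                       expert_return=3230.0))  # ~77.5
```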
By spanning navigation (Maze2D, AntMaze), control (Gym-MuJoCo), robotic manipulation (Adroit, FrankaKitchen), and traffic (Flow, CARLA), D4RL enforces comprehensive, domain-agnostic algorithm assessment.
4. Algorithmic Evaluation and Observed Deficiencies
The benchmark provides performance evaluation for leading offline RL methods including:
- Behavioral Cloning (BC)
- Soft Actor-Critic (SAC) and offline-constrained actor-critic variants (BEAR, BRAC)
- Batch-Constrained Q-learning (BCQ)
- Advantage-Weighted Regression (AWR)
- Continuous Random Ensemble Mixture (cREM) and AlgaeDICE
In tasks using online RL-generated data (e.g., expert Gym-MuJoCo), many algorithms match or slightly exceed behavior policy performance. However, on tasks with narrow distributions, subtrajectory "stitching," or sparse rewards (Maze2D, AntMaze, Adroit), prominent methods frequently fail, especially those heavily relying on distribution matching. Conservative algorithms (e.g., BEAR, BCQ) outperform unconstrained ones under highly biased datasets, though overall sample efficiency and generalization remain problematic.
The benchmark's evaluations reveal previously underappreciated limitations in contemporary offline RL algorithms, such as heavy dependence on the coverage of the behavior data, brittle performance under distribution shift, and poor trajectory stitching.
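To ground the simplest of these baselines, below is a minimal behavioral-cloning sketch: supervised regression of a small network onto the logged actions. It is an illustrative PyTorch implementation under assumed conventions (actions normalized to [-1, 1], mean-squared-error loss), not the benchmark's reference code.

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Minimal deterministic behavioral-cloning policy (illustrative only)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # assumes actions normalized to [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def bc_update(policy: BCPolicy, optimizer: torch.optim.Optimizer,
              obs: torch.Tensor, act: torch.Tensor) -> float:
    """One supervised step: regress predicted actions onto the logged actions."""
    loss = ((policy(obs) - act) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```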
5. Extensions and Impact on Research
D4RL catalyzed a significant refocus in offline RL research, shifting emphasis toward:
- Algorithmic robustness to narrow, biased, and multitask data
- Improved off-policy evaluation, especially in low-coverage regimes
- Handling non-stationarity and partial observability
The paper suggests further extensions, including benchmark environments with high stochasticity (e.g., healthcare, finance) and domains featuring large action/observation spaces (e.g., recommender systems). There is a clear call for methods capable of reliable off-policy evaluation and sample-efficient learning from limited demonstration data. By design, the benchmark accelerates research progress by providing diagnostic tasks and reproducible protocols, and by highlighting persistent deficiencies in current offline RL algorithms.
6. Mathematical Framing and Data Visualization
All D4RL tasks are formulated as Markov Decision Processes (MDPs). The offline RL objective is to maximize the expected discounted return,

$$\max_{\pi}\; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$

with the policy $\pi$ learned entirely from a static dataset $\mathcal{D} = \{(s, a, r, s')\}$ of logged transitions, without further environment interaction.
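As a small worked example of this objective under the static-dataset constraint, the sketch below estimates the behavior policy's discounted returns directly from the flat transition arrays returned by env.get_dataset(); it splits episodes on the 'terminals' flag and, for simplicity, ignores the 'timeouts' flag D4RL uses to mark truncated episodes.

```python
import numpy as np

def behavior_policy_returns(rewards: np.ndarray,
                            terminals: np.ndarray,
                            gamma: float = 0.99) -> list:
    """Discounted return of each logged trajectory in a D4RL-style dataset."""
    returns, g, discount = [], 0.0, 1.0
    for r, done in zip(rewards, terminals):
        g += discount * float(r)
        discount *= gamma
        if done:  # episode boundary: record the return and reset the accumulators
            returns.append(g)
            g, discount = 0.0, 1.0
    return returns
```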
Figures and data visualizations in the benchmark illustrate complex task layouts (e.g., maze configurations, trajectory heatmaps, scene renderings from CARLA), which are essential for understanding the scale and diversity of the benchmark. These visualizations spotlight task-specific challenges, such as compositional navigation and manipulation complexity, and exemplify the need for algorithmic generalization.
7. Legacy and Future Directions
The introduction of D4RL established a rigorous platform for evaluating offline RL algorithms under realistic, high-stakes conditions where dataset limitations and domain complexity predominate. By highlighting gaps in existing algorithms and standardizing both datasets and evaluation protocols, D4RL has steered the field toward the development of more robust, generalizable, and practically applicable methods.
Current and future work aims to extend the benchmark with new domains, more diverse and realistic datasets, and enhanced protocols for assessing both reward maximization and constraint (e.g., safety, cost) satisfaction. The benchmark remains foundational for bridging academic RL with real-world deployment scenarios where large pre-collected datasets are available, but online exploration is impractical or hazardous.