D4RL Kitchen Benchmark: Offline RL Testbed
- D4RL Kitchen Benchmark is a collection of datasets and evaluation protocols for offline RL in multi-task, high-dimensional control settings using human demonstrations.
- It offers three diverse dataset variants—kitchen-complete, kitchen-partial, and kitchen-mixed—each presenting unique challenges in sub-task composition and reward sparsity.
- Baseline results, particularly from CQL, highlight the benchmark’s role in exposing algorithmic challenges like value overestimation in sparsely rewarded, multimodal tasks.
The D4RL Kitchen (FrankaKitchen) benchmark is a suite of datasets and evaluation protocols designed to assess offline reinforcement learning (RL) algorithms in high-dimensional, sparse-reward, multi-task continuous control settings. It features demonstrations of a 9-DOF Franka Panda robotic arm manipulating five objects in a simulated kitchen, providing a challenging testbed for generalization, sub-policy composition, and value conservatism in the offline RL regime (Fu et al., 2020).
1. Environment and Task Definition
The environment simulates a Franka Panda arm on a kitchen counter, interacting with five distinct objects: a microwave door, a sliding cabinet door, a kettle (requiring grasp and placement), an overhead light switch, and a stove knob. Each episode begins with all objects reset to their “home” configuration. The overarching objective is to bring the scene to a target configuration by sequentially manipulating some or all objects—successfully opening doors, flipping switches, and placing the kettle as specified. Because each episode involves five independent object interactions and any ordering is admissible, the resulting task is highly multi-modal.
The observation space generally comprises:
- 9 arm joint angles
- 9 arm joint velocities
- 5 real-valued variables, one describing the state of each object (e.g., door angle, switch position, knob rotation, kettle grip status)
Empirically, the flat observation vector therefore has dimension $9 + 9 + 5 = 23$, i.e., $s \in \mathbb{R}^{23}$. The continuous action space, $a \in \mathbb{R}^{9}$, corresponds to velocity commands for the arm's 9 joints.
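Under the component breakdown above, the observation layout can be sketched as follows (a minimal numpy sketch; `make_observation` and the constant names are illustrative, not the d4rl API):

```python
import numpy as np

# Sketch of the observation/action layout described above. Dimension names
# are assumptions based on the component list, not official d4rl fields.
N_JOINTS = 9    # Franka Panda actuated joints
N_OBJECTS = 5   # microwave, slide cabinet, kettle, light switch, stove knob

def make_observation(joint_pos, joint_vel, object_state):
    """Concatenate the listed components into one flat observation vector."""
    assert joint_pos.shape == (N_JOINTS,)
    assert joint_vel.shape == (N_JOINTS,)
    assert object_state.shape == (N_OBJECTS,)
    return np.concatenate([joint_pos, joint_vel, object_state])

obs = make_observation(np.zeros(9), np.zeros(9), np.zeros(5))
action = np.zeros(N_JOINTS)   # joint-velocity command, one entry per joint
print(obs.shape, action.shape)   # (23,) (9,)
```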
2. Dataset Design and Collection
Three datasets are distributed, all constructed from human teleoperation demonstrations; no RL agents or algorithmic controllers are involved. These demonstrations originated in earlier work (Gupta et al., 2019) and were repackaged for the D4RL release.
- kitchen-complete: Contains demonstrations where all five sub-tasks are successfully completed in sequence per episode (3,680 transitions).
- kitchen-partial: Comprises undirected human demonstrations; some episodic trajectories are complete, but many solve only a subset of the five sub-tasks (136,950 transitions).
- kitchen-mixed: Also undirected, but no episode completes all five sub-tasks, requiring agents to “stitch” distinct sub-trajectories to complete entire tasks (136,950 transitions).
Each dataset consists of transitions $(s, a, r, s')$. The average episode length is approximately 100–200 steps; thus, kitchen-complete likely contains dozens of complete demonstrations, while the other datasets contain on the order of a thousand shorter demonstrations. State-action coverage analysis (see Fu et al., 2020, Appendix E) shows that demonstration trajectories densely visit object “affordance zones,” ensuring that critical states are well-represented, particularly in datasets requiring sub-task composition.
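The episode-count estimates above follow directly from the transition totals; a quick back-of-the-envelope check, assuming the stated 100–200 step average episode length:

```python
# Episode counts implied by the transition totals listed above, assuming an
# average episode length in the stated 100-200 step range (a rough bound,
# not an official dataset statistic).
datasets = {
    "kitchen-complete": 3_680,
    "kitchen-partial": 136_950,
    "kitchen-mixed": 136_950,
}
for name, n_transitions in datasets.items():
    lo, hi = n_transitions // 200, n_transitions // 100
    print(f"{name}: roughly {lo}-{hi} episodes")
```

This yields roughly 18–36 episodes for kitchen-complete and roughly 684–1,369 for each of the larger datasets, consistent with "dozens" versus "on the order of a thousand."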
3. Reward Structure, Returns, and Metrics
A sparse reward schema is adopted: each sub-task yields a reward of $+1$ upon completion (e.g., successfully opening the microwave door), for a theoretical maximum per-episode return of $R_{\max} = 5$. In kitchen-mixed, most episodes achieve partial credit with $0 < R < R_{\max}$. The average return over $N$ trajectories is:

$$\bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i$$
The main evaluation metric is the normalized score (range $[0, 100]$), defined as:

$$\text{score} = 100 \cdot \frac{R_{\pi} - R_{\text{random}}}{R_{\text{expert}} - R_{\text{random}}},$$

where $R_{\pi}$ is the policy’s average (undiscounted) return over 100 evaluation episodes, $R_{\text{random}}$ is the corresponding value for a uniformly random policy, and $R_{\text{expert}}$ is the maximum attainable return (one point per sub-task) per episode. An optional metric, the success rate, measures the fraction of evaluation episodes in which all five sub-tasks are completed:

$$\text{success rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\text{all five sub-tasks completed in episode } i\right]$$
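A minimal sketch of both metrics as described above (the default reference values follow the text's one-point-per-sub-task convention and are illustrative, not the official D4RL normalization constants):

```python
# Illustrative implementations of the two evaluation metrics. The default
# random/expert returns are assumptions from the surrounding text, not
# d4rl's official per-environment reference scores.
def normalized_score(mean_return, random_return=0.0, expert_return=5.0):
    """Scale a policy's average return onto [0, 100]."""
    return 100.0 * (mean_return - random_return) / (expert_return - random_return)

def success_rate(subtasks_completed_per_episode, n_subtasks=5):
    """Fraction of episodes in which every sub-task was completed."""
    episodes = list(subtasks_completed_per_episode)
    return sum(1 for k in episodes if k == n_subtasks) / len(episodes)

print(normalized_score(2.5))        # 50.0
print(success_rate([5, 3, 5, 0]))   # 0.5
```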
No separate train/test splits are defined; each dataset is used for both training and evaluation. For kitchen-mixed, use of additional simulator rollouts for hyperparameter tuning is discouraged if the dataset is to serve as a held-out test set.
4. Baseline Results and Comparative Analysis
Baseline results (see the table below, sourced from Table A3 of Fu et al., 2020) cover several offline RL algorithms, reporting the normalized score (higher is better; averaged over three random seeds).
| Method | kitchen-complete | kitchen-partial | kitchen-mixed |
|---|---|---|---|
| BC (Behavioral Cloning) | 33.8 | 33.8 | 47.5 |
| SAC-off (Offline SAC) | 15.0 | 0.0 | 2.5 |
| BEAR | 0.0 | 13.1 | 47.2 |
| AWR | 0.0 | 15.4 | 10.6 |
| BCQ | 8.1 | 18.9 | 8.1 |
| CQL (Conservative Q-Learning) | 43.8 | 49.8 | 51.0 |
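The per-dataset margins between CQL and BC can be recomputed directly from the table:

```python
# CQL's margin over BC on each Kitchen dataset, taken from the normalized
# scores in the table above.
scores = {
    "BC":  {"complete": 33.8, "partial": 33.8, "mixed": 47.5},
    "CQL": {"complete": 43.8, "partial": 49.8, "mixed": 51.0},
}
margins = {k: round(scores["CQL"][k] - scores["BC"][k], 1) for k in scores["BC"]}
print(margins)   # {'complete': 10.0, 'partial': 16.0, 'mixed': 3.5}
```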
Several key patterns emerge:
- Conservative Q-Learning (CQL) achieves the highest scores across all Kitchen datasets, outperforming BC by 3.5–16 points, and demonstrates effective generalization and value control, particularly in multimodal, sparse-reward domains.
- Offline SAC generally collapses, especially in kitchen-partial and kitchen-mixed, indicating difficulty with distributional mismatch in the highly multimodal, sparse-reward regime.
- BEAR performs strongly on the mixed dataset (comparable to BC) but achieves near-zero score on kitchen-complete, presumably due to over-regularization when the data distribution is narrow.
- Behavioral Cloning (BC) is a strong baseline where full demonstrations exist but cannot recover “stitched” solutions in mixed datasets lacking entire sub-task completion trajectories. CQL’s penalty mechanism enables it to avoid over-optimistic value estimates, enabling better sub-task composition.
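As a toy illustration of the conservative penalty idea referenced above, here is a minimal numpy sketch of CQL-style regularization for a single state (simplified; the full CQL objective also includes the standard Bellman error terms):

```python
import numpy as np

def cql_penalty(q_sampled, q_dataset):
    """Toy version of CQL's conservative regularizer for one state:
    log-sum-exp over Q-values of sampled (possibly out-of-distribution)
    actions, minus the Q-value of the action actually taken in the dataset.
    Minimizing this pushes down inflated Q-estimates on OOD actions while
    keeping Q high on actions the data supports."""
    q_sampled = np.asarray(q_sampled, dtype=float)
    # Numerically stable log-sum-exp (a soft maximum over sampled actions).
    m = q_sampled.max()
    soft_max = m + np.log(np.sum(np.exp(q_sampled - m)))
    return soft_max - q_dataset

# An OOD action with an inflated Q-value makes the penalty large,
# so gradient descent on the penalty suppresses that estimate.
print(cql_penalty([1.0, 1.0, 8.0], q_dataset=1.0) >
      cql_penalty([1.0, 1.0, 1.0], q_dataset=1.0))   # True
```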
This performance spread highlights the challenge posed by FrankaKitchen: effective offline RL on these datasets requires balancing conservatism (to avoid Q-overestimation from out-of-distribution actions) with the capacity to generalize across and compose diverse sub-trajectories.
5. Significance and Research Implications
D4RL Kitchen advances evaluation in offline RL beyond earlier benchmarks by introducing:
- Multi-modal, multi-object tasks requiring complex planning and policy composition
- Human demonstration data, ensuring that trajectory distributions are realistic and that sub-task coverage is dense across affordance zones
- Dataset variants emphasizing “stitching,” challenging algorithms to generalize and merge sub-skills not present contiguously in any single trajectory
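A toy illustration of the stitching requirement (hypothetical mini-trajectories, not benchmark data): neither logged episode completes both sub-tasks, but they overlap at a shared state, so their segments can be composed into a full solution.

```python
# Hypothetical logged episodes as lists of (state, sub_task_completed) steps.
episode_a = [("start", None), ("at_microwave", "microwave"), ("mid", None)]
episode_b = [("mid", None), ("at_kettle", "kettle"), ("done", None)]

def stitch(ep1, ep2):
    """Join two episodes at the first state they share, if any."""
    states2 = [s for s, _ in ep2]
    for i, (s, _) in enumerate(ep1):
        if s in states2:
            return ep1[:i] + ep2[states2.index(s):]
    return None

stitched = stitch(episode_a, episode_b)
completed = {task for _, task in stitched if task}
print(completed)   # both sub-tasks recovered from partial episodes
```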
These characteristics expose limitations in prior offline RL algorithms and provide direct targets for improvement. The Kitchen tasks reflect authentic challenges associated with applying deep RL in structured, goal-conditioned robotic environments, particularly in the absence of online interaction for policy refinement.
6. Future Directions and Open Challenges
The D4RL Kitchen suite remains a significant testbed for several challenges:
- Sub-trajectory composition: Achieving high performance in mixed datasets requires algorithms to generalize beyond demonstrated behavior, composing sub-policies across tasks not present in individual episodes.
- Reward sparsity and hierarchy: The sparse, decomposable reward structure emphasizes the need for exploration and multi-step credit assignment under distributional shift.
- Evaluation under partial coverage: The absence of a predefined train/test split places the onus on rigorous methodology and fair model selection to avoid overfitting to held-out sub-tasks.
A plausible implication is that progress on D4RL Kitchen will extend to practical offline RL for real-world robotics, where data limitations and distributional mismatch are inherent. The benchmark’s adoption may drive the development of algorithms robust to multimodal trajectories and reward sparsity, while its emphasis on human demonstrations aligns it with industrial applications.
All technical details and performance metrics are as reported in "D4RL: Datasets for Deep Data-Driven Reinforcement Learning" (Fu et al., 2020).