
OGBench: Offline GCRL Benchmark

Updated 10 October 2025
  • OGBench is a benchmark suite for offline goal-conditioned reinforcement learning that encompasses diverse domains like locomotion, manipulation, drawing, and puzzles.
  • It rigorously evaluates algorithms on long-horizon planning, trajectory stitching, high-dimensional visual reasoning, and robustness to stochastic transitions using 85 datasets across 8 environments.
  • The suite enables multi-goal evaluation and comparative analysis, highlighting performance differences and guiding improvements in policy learning and subgoal extraction.

OGBench is a comprehensive benchmark suite developed to rigorously evaluate offline goal-conditioned reinforcement learning (GCRL) algorithms. It addresses the challenge of learning policies that can reach arbitrary goals from unlabeled, reward-free datasets, probing essential capabilities such as long-horizon planning, trajectory stitching, high-dimensional visual reasoning, and robustness to stochasticity. OGBench comprises a set of eight distinctive environment types, 85 datasets, and reference implementations of six representative algorithms, collectively designed to expose performance differences not apparent in prior RL benchmarks.

1. Design Objectives and Motivation

OGBench is purpose-built for the offline goal-conditioned setting, where the agent learns from static datasets without online environment interaction or explicit reward signals. The primary learning objective is to acquire a policy $\pi(a \mid s, g)$ that, given any start state $s$ and user-specified goal $g$, enables reliable navigation from $s$ to $g$ in minimal time. Unlike legacy benchmarks, OGBench requires multi-goal generalization and supports evaluation on diverse, challenging tasks to expose the strengths and weaknesses of candidate algorithms.
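
To make the learning objective concrete, the following is a minimal sketch of training such a policy with goal-conditioned behavioral cloning and hindsight goal relabeling; the network architecture, squared-error loss, and geometric goal-sampling distribution are illustrative assumptions, not OGBench's reference implementation.

```python
# Minimal GCBC sketch; the architecture, squared-error loss, and geometric
# hindsight goal sampling are illustrative assumptions, not OGBench's
# reference implementation.
import numpy as np
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, goal):
        # Condition the action prediction on the concatenated state and goal.
        return self.net(torch.cat([obs, goal], dim=-1))

def sample_hindsight_goal(trajectory_obs, t, geom_p=0.05, rng=np.random):
    # Relabel the goal as a state sampled from the future of the same
    # trajectory (geometric offset), a common choice in offline GCRL.
    offset = rng.geometric(geom_p)
    future_t = min(t + offset, len(trajectory_obs) - 1)
    return trajectory_obs[future_t]

def gcbc_update(policy, optimizer, obs, actions, goals):
    # One supervised step: regress dataset actions given (state, goal).
    pred = policy(obs, goals)
    loss = ((pred - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```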

The benchmark explicitly probes:

  • Long-horizon reasoning (multi-step planning over thousands of time steps)
  • Trajectory stitching (composition of short, disconnected behavioral segments)
  • Handling stochastic transitions and noise
  • Robustness to high-dimensional (pixel-based) observations

The offline evaluation protocol isolates sequencing and generalization effects, providing a robust foundation for comparative algorithm analysis.

2. Environments and Dataset Taxonomy

OGBench introduces eight major environment types, spanning three broad domains:

  • Locomotion: PointMaze, AntMaze, HumanoidMaze, AntSoccer. Variants cover navigation tasks for a point mass, a quadruped (Ant), and a humanoid agent, plus a soccer-playing Ant.
  • Manipulation: Cube (pick-and-place with cubes), Scene (multi-object manipulation with drawer/button interaction and articulated objects), and Puzzle (Lights Out, requiring combinatorial generalization over up to $2^{24}$ discrete states).
  • Drawing: Powderworld, where agents perform grid-based "painting" via stochastic brush physics.

Each environment yields multiple dataset types:

  • Navigate: Trajectories from noisy expert policies, covering complete task structures.
  • Stitch: Episodes comprise short local segments, requiring agents to stitch together behaviors for full traversal.
  • Explore: Data collected under highly random policies, emphasizing coverage and diversity over optimality.

Datasets include both low-dimensional state representations and high-dimensional pixel inputs (e.g., $64 \times 64 \times 3$ RGB images). Evaluation uses several start–goal pairs per task to prevent overfitting to fixed initializations.
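
As a concrete starting point, the sketch below shows how an environment and dataset might be loaded with the accompanying ogbench Python package; the helper name, dataset identifier, and dictionary keys follow the project's documented usage but should be treated as assumptions here.

```python
# Hedged loading sketch; the helper name, dataset identifier, and array keys
# are assumptions based on the project's documented usage.
import ogbench

# Dataset names combine environment, size, and dataset type, e.g.
# 'antmaze-large-navigate-v0'; pixel-based variants add a 'visual-' prefix.
env, train_dataset, val_dataset = ogbench.make_env_and_datasets(
    'antmaze-large-navigate-v0'
)

print(train_dataset['observations'].shape)  # (num_transitions, obs_dim)
print(train_dataset['actions'].shape)       # (num_transitions, action_dim)
print(train_dataset['terminals'].sum())     # number of episode boundaries
```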

3. Evaluation of Algorithms

Reference implementations of six representative offline GCRL algorithms are provided, each tuned through an extensive hyperparameter search:

  • Goal-Conditioned Behavioral Cloning (GCBC)
  • Goal-Conditioned Implicit V-Learning (GCIVL)
  • Goal-Conditioned Implicit Q-Learning (GCIQL)
  • Quasimetric RL (QRL)
  • Contrastive RL (CRL)
  • Hierarchical Implicit Q-Learning (HIQL)

Multi-goal evaluation reveals that algorithm rankings vary substantially across settings: for example, HIQL is strong in long-horizon locomotion and visual manipulation, CRL excels in standard locomotion, and GCIVL/GCIQL perform best in complex manipulation tasks. Performance differences are also substantial across the “navigate”, “stitch”, and “explore” datasets, and policy extraction strategies such as advantage-weighted regression (AWR) and DDPG+BC are benchmarked for their impact on stitching capability and robustness to suboptimal data.
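
To make the policy-extraction step concrete, here is a hedged sketch of the two objectives mentioned above for a deterministic actor; the temperature, behavioral-cloning coefficient, and value estimates are placeholders rather than the benchmark's tuned settings.

```python
# Hedged sketches of the two extraction objectives for a deterministic actor;
# the temperature, BC coefficient, and value estimates are placeholders.
import torch

def awr_loss(policy, obs, actions, goals, q_values, v_values, temperature=1.0):
    # Advantage-weighted regression: behavioral cloning weighted by
    # exp(advantage / temperature), clipped for numerical stability.
    adv = q_values - v_values
    weights = torch.exp(adv / temperature).clamp(max=100.0).detach()
    pred = policy(obs, goals)
    return (weights * ((pred - actions) ** 2).sum(dim=-1)).mean()

def ddpg_bc_loss(policy, critic, obs, actions, goals, bc_alpha=0.1):
    # DDPG+BC: maximize the critic's value of the policy action while
    # regularizing toward the dataset action.
    pi_actions = policy(obs, goals)
    q = critic(obs, pi_actions, goals).squeeze(-1)
    bc = ((pi_actions - actions) ** 2).sum(dim=-1)
    return (-q + bc_alpha * bc).mean()
```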

4. Key Technical Challenges Probed

OGBench specifically stresses several capabilities:

  • Long-horizon planning: Tasks like HumanoidMaze-giant (over 4,000 steps) and complex Puzzles demand deep lookahead and robust temporal credit assignment.
  • Behavioral stitching: Stitch datasets require agents to synthesize full tasks from disconnected, local, and sometimes suboptimal behaviors.
  • Stochasticity: Environments such as teleport mazes include non-deterministic, abrupt transitions (e.g., “black-hole” teleportation), which test susceptibility to value overestimation and the reliability of value-based planning under uncertainty.
  • Combinatorial generalization: Puzzle and Powderworld tasks contain immense configuration spaces (Lights Out with up to $2^{24}$ states).
  • Visual robustness: Pixel-based variants require high-dimensional representation learning and policy generalization from visual input.

The offline goal-conditioned RL objective is formalized as

$$\mathbb{E}_{\tau \sim p(\tau \mid g)}\left[\sum_t \gamma^t \, \delta_g(s_t)\right],$$

where $p(\tau \mid g)$ is the trajectory distribution for goal $g$, and $\delta_g$ is the (Dirac) indicator reward that is active only when $s_t = g$.
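
As a small numerical illustration of this objective, the sketch below computes the discounted goal-reaching return of a single trajectory under the sparse indicator reward; the exact-match test with a tolerance stands in for $\delta_g$ and is an assumption, since continuous-control environments typically define a task-specific success predicate.

```python
# Numerical illustration of the objective above; the exact-match goal test
# (with a tolerance) stands in for the indicator reward and is an assumption.
import numpy as np

def goal_reaching_return(states, goal, gamma=0.99, atol=1e-6):
    rewards = np.array([float(np.allclose(s, goal, atol=atol)) for s in states])
    discounts = gamma ** np.arange(len(states))
    return float(np.sum(discounts * rewards))
```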

5. Comparative Methodology and Prior Benchmarks

Compared to prior benchmarks such as D4RL (which are adapted from single-goal or non-goal-conditioned tasks), OGBench provides:

  • 85 datasets with multi-goal evaluation per task
  • Explicit focus on trajectory stitching and long-horizon challenges
  • Rich manipulation and drawing environments alongside classical navigation
  • Evaluation protocols that account for stochasticity, combinatorial complexity, and high-dimensional inputs

Legacy datasets often use only a single goal, limiting insight into generalization and stitching. OGBench addresses this by presenting multi-goal evaluation, diverse data sources, and specific dataset configurations to test subgoal composition and robustness.

| Benchmark | Multi-goal | Stitching probed | Visual tasks | Domains spanned |
| --- | --- | --- | --- | --- |
| D4RL | No | No | Yes | Locomotion, manipulation |
| OGBench | Yes | Yes | Yes | Locomotion, manipulation, drawing, puzzle |

6. Empirical Insights and Impact

OGBench results show that performance is highly algorithm-dependent across environment types and data sources. HIQL demonstrates strong results in long-horizon and hierarchical tasks, CRL excels in short-horizon locomotion, while GCIVL/GCIQL are more robust in manipulation domains. Success rates under multi-goal settings reveal weaknesses in single-shot planning and highlight the necessity for robust subgoal extraction, trajectory stitching, and effective value estimation (especially under dataset limitations and stochastic transitions).
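
For reference, a multi-goal success-rate evaluation loop along these lines might look like the following sketch; it assumes a Gymnasium-style interface in which reset accepts a task index selecting a start–goal pair and step info exposes 'goal' and 'success' entries, so these names are placeholders rather than OGBench's exact API.

```python
# Generic multi-goal evaluation sketch; the reset options and the 'goal' and
# 'success' info keys are assumptions, not OGBench's exact interface.
import numpy as np

def evaluate_multi_goal(env, policy, task_ids, episodes_per_task=20,
                        max_steps=1000):
    successes = []
    for task_id in task_ids:
        for _ in range(episodes_per_task):
            obs, info = env.reset(options={'task_id': task_id})
            goal = info['goal']  # assumed key for the commanded goal
            reached = 0.0
            for _ in range(max_steps):
                action = policy(obs, goal)
                obs, reward, terminated, truncated, info = env.step(action)
                reached = max(reached, float(info.get('success', 0.0)))
                if terminated or truncated:
                    break
            successes.append(reached)
    return float(np.mean(successes))  # overall success rate across goals
```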

Stochastic environments and combinatorial puzzles expose limitations of current value-only methods, which can display overoptimism or poor generalization. OGBench’s protocol and multiplicity of environments make it a rigorous platform for future algorithmic development and comparative analysis.

7. Future Directions and Research Opportunities

The paper identifies several promising future research directions:

  • Tackling unsolved tasks: HumanoidMaze-giant, complex puzzles, and Powderworld-hard remain challenging; algorithmic innovation is required.
  • Improving hierarchical planning: HIQL’s decomposition suggests the value of subgoal-centric policies; can simpler, non-hierarchical algorithms exploit recursive subgoal structure as effectively?
  • Data collection: Synthetic datasets with controlled noise and coverage highlight the importance of data quality; further work on the relationship between data suboptimality and downstream performance is suggested.
  • Hybrid approaches: Methods combining contrastive representation learning, hierarchical planning, and robust value estimation may yield improved cross-domain success.
  • Diagnostic metrics: Analysis of why certain domains (e.g., PointMaze) prove harder than expected, and of how subgoal representations are best trained.

These directions present open challenges both for the foundations of offline RL and for the design of future benchmarks that probe representational and planning abilities more effectively.


OGBench provides critical infrastructure for evaluating offline goal-conditioned RL algorithms across a spectrum of simulated domains and problem settings. Its structured environment taxonomy, robust evaluation protocol, and explicit focus on essential RL challenges position it as a key standard in current and future research. The benchmark directly informs the design and diagnosis of agents intended to generalize, stitch behaviors, plan over extended horizons, and reason from unlabeled, reward-free data.
