OGBench Task Suite: Offline GCRL Benchmark
- OGBench Task Suite is a standardized, multi-domain benchmark that evaluates offline goal-conditioned reinforcement learning algorithms using diverse environments and curated datasets.
- It encompasses eight environment types across locomotion, manipulation, and drawing to test long-horizon planning, skill composition, and robustness to stochasticity.
- The benchmark provides six reference algorithm implementations, revealing nuanced trade-offs and guiding future research into improved representation learning and hierarchical planning.
OGBench is a standardized, multi-domain benchmark designed to systematically evaluate the capabilities of algorithms in offline goal-conditioned reinforcement learning (GCRL). In the offline GCRL paradigm, agents receive a fixed, unlabeled set of trajectories and are tasked with learning a general-purpose goal-conditioned policy that reaches any state from any other state with maximal efficiency, absent extrinsic rewards. OGBench addresses the lack of a unifying benchmark in this area by organizing a diverse suite of environments, curated datasets, and reference algorithm implementations that directly probe algorithmic strengths and limitations, particularly in skill composition, long-horizon reasoning, and handling complex observations.
1. Scope of Offline Goal-Conditioned RL and Motivation for OGBench
Offline GCRL is characterized by the challenge of learning without external rewards from static datasets, relying solely on "reachability" for supervision. Unlike typical offline RL tasks that involve dense rewards and predefined targets, offline GCRL requires learning diverse, compositional behaviors purely from the structure latent in data. The learning task is domain-agnostic and unsupervised: the only implicit feedback is derived from successfully moving between arbitrary states. This framing positions OGBench as both a diagnostic tool and a basis for deeper algorithmic development by revealing nuanced strengths and weaknesses not captured by previous benchmarks.
2. Environment and Dataset Composition
OGBench encompasses eight primary environment types grouped across three domains: locomotion, manipulation, and drawing. Within these, 85 datasets are carefully curated to probe different facets of algorithmic performance:
| Domain | Environment Types | Notable Features |
|---|---|---|
| Locomotion | PointMaze, AntMaze, HumanoidMaze, AntSoccer | Maze sizes from "medium" to "giant"; teleporters for stochasticity; up to 4000 steps per episode; "navigate," "stitch," and "explore" variants |
| Manipulation | Cube, Scene, Puzzle | Multi-object pick-and-place; sequential "recipe" tasks; Lights-Out–style puzzle grids up to 4×6 |
| Drawing | Powderworld | 32×32 grid; diverse "powders"; high-dimensional, stochastic, and combinatorial tasks |
Each environment includes both state-based and pixel-based (e.g., RGB) observation modalities. Locomotion tasks cover a spectrum from simple mazes to extended 21-DoF humanoid navigation. Manipulation tasks require both atomic action sequencing and combinatorial reasoning over objects and grids. The Powderworld drawing task introduces highly stochastic, high-dimensional planning by requiring agents to reconstruct images from random, exploratory data. Dataset diversity is achieved through variations such as "navigate" versus "stitch" for locomotion, and "play" versus "noisy" variants for manipulation, systematically altering state coverage, optimality, and noise levels.
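To make the dataset organization concrete, the following minimal sketch loads one OGBench task and its offline dataset. It assumes the `ogbench` Python package and its `make_env_and_datasets` helper as described in the project repository; the chosen dataset name is just one example, and the printed fields are illustrative rather than exhaustive.

```python
# Minimal sketch: loading an OGBench environment and its offline dataset.
# Assumes the `ogbench` package (as described in the project repository);
# the dataset name below is one example from the locomotion domain.
import ogbench

dataset_name = "antmaze-large-navigate-v0"  # a "navigate"-variant locomotion dataset
env, train_dataset, val_dataset = ogbench.make_env_and_datasets(dataset_name)

# The datasets are expected to be dictionaries of transition arrays.
print(sorted(train_dataset.keys()))
print(env.observation_space, env.action_space)
```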
3. Reference Algorithms and Evaluation Protocol
OGBench provides reference implementations for six representative offline GCRL algorithms (a minimal sketch of the core objectives appears after this list):
- Goal-Conditioned Behavioral Cloning (GCBC): A supervised baseline that minimizes the negative log-likelihood $\mathbb{E}_{(s_t, a_t, g) \sim \mathcal{D}}\left[-\log \pi_\theta(a_t \mid s_t, g)\right]$, relabeling future states from the same trajectory as goals.
- Goal-Conditioned Implicit V-Learning (GCIVL) / Q-Learning (GCIQL): Adapt implicit Q-learning via expectile regression, optimizing $\mathbb{E}\big[\ell^2_\tau\big(r(s, g) + \gamma \bar{V}(s', g) - V(s, g)\big)\big]$, where $\ell^2_\tau(x) = |\tau - \mathbb{1}(x < 0)|\, x^2$ is the expectile loss.
- Quasimetric RL (QRL): Fits a quasimetric $d(s, g)$ satisfying the triangle inequality $d(s_1, s_3) \le d(s_1, s_2) + d(s_2, s_3)$, using temporal path distances (the optimal value is $V^*(s, g) = -d^*(s, g)$), and trains a dynamics model for policy extraction with continuous actions.
- Contrastive RL (CRL): Employs a binary noise-contrastive objective, $\mathbb{E}\big[\log \sigma\big(f(s, a, s_f^+)\big) + \log\big(1 - \sigma\big(f(s, a, s_f^-)\big)\big)\big]$, where $s_f^+$ is a future state from the same trajectory and $s_f^-$ is a randomly sampled state, to fit value functions, and uses DDPG+BC-style policy extraction in continuous spaces.
- Hierarchical Implicit Q-Learning (HIQL): Extends GCIVL with a two-level hierarchy: a high-level policy $\pi^{hi}(z \mid s, g)$ generates latent subgoals (via a learned subgoal representation such as $z = \phi(s_{t+k})$), and a low-level policy $\pi^{lo}(a \mid s, z)$ produces actions conditioned on these subgoals.
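To ground the objectives above, here is a minimal, self-contained sketch of the core per-batch losses: the GCBC negative log-likelihood under hindsight goal relabeling, the expectile regression loss of GCIVL/GCIQL, the binary noise-contrastive objective of CRL, and HIQL-style two-level action selection. All function signatures and arguments are placeholders for illustration, not the OGBench reference implementations.

```python
# Hedged sketch of the core objectives described above (placeholder signatures,
# not the OGBench reference code).
import torch
import torch.nn.functional as F


def gcbc_loss(log_probs: torch.Tensor) -> torch.Tensor:
    """GCBC: negative log-likelihood of dataset actions.

    `log_probs` holds log pi(a_t | s_t, g) for a batch in which each goal g is a
    future state sampled from the same trajectory (hindsight relabeling).
    """
    return -log_probs.mean()


def expectile_loss(td_error: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """GCIVL/GCIQL: expectile regression on a TD error such as
    r(s, g) + gamma * V_target(s', g) - V(s, g); tau > 0.5 upweights positive errors.
    """
    weight = torch.abs(tau - (td_error < 0).float())  # |tau - 1(x < 0)|
    return (weight * td_error ** 2).mean()


def crl_loss(logits_pos: torch.Tensor, logits_neg: torch.Tensor) -> torch.Tensor:
    """CRL: binary noise-contrastive objective. `logits_pos` score (s, a) against
    future states of the same trajectory; `logits_neg` against random states.
    """
    return -(F.logsigmoid(logits_pos) + F.logsigmoid(-logits_neg)).mean()


def hiql_act(high_policy, low_policy, obs, goal):
    """HIQL-style action selection: the high-level policy proposes a latent subgoal,
    and the low-level policy acts toward it (both policies are placeholders).
    """
    subgoal = high_policy(obs, goal)
    return low_policy(obs, subgoal)
```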
Each algorithm is evaluated across all tasks, with multi-goal rather than single-goal performance serving as the central metric to capture generalist competence.
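As an illustration of the multi-goal protocol, the sketch below averages success over several evaluation goals of one task. The `task_id` reset option and the `goal`/`success` info fields follow the interface described in the OGBench repository but should be read as assumptions here; `policy` is a hypothetical goal-conditioned policy callable.

```python
# Hedged sketch of multi-goal evaluation (interface fields assumed; `policy` is
# a hypothetical goal-conditioned policy callable returning an action).
import numpy as np


def evaluate_multi_goal(env, policy, task_ids=(1, 2, 3, 4, 5), episodes_per_task=10):
    """Average success rate over several evaluation goals of one OGBench task."""
    successes = []
    for task_id in task_ids:
        for _ in range(episodes_per_task):
            # Reset to a specific evaluation goal; the goal observation is
            # assumed to be returned in `info`.
            obs, info = env.reset(options=dict(task_id=task_id))
            goal, done = info["goal"], False
            while not done:
                action = policy(obs, goal)  # hypothetical pi(a | s, g)
                obs, _, terminated, truncated, info = env.step(action)
                done = terminated or truncated
            successes.append(float(info.get("success", 0.0)))
    return float(np.mean(successes))
```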
4. Probing Algorithmic Capabilities
OGBench's design targets four distinctive competencies:
- Goal Stitching: Datasets labeled as "stitch" require agents to compose previously unseen or disjoint trajectory fragments into long-range transitions, even though individual data segments are short and local (see the sketch at the end of this section).
- Long-Horizon Planning: Tasks such as HumanoidMaze and large Puzzle grids necessitate planning over hundreds to thousands of time steps, reflecting real-world RL challenges of temporal abstraction and long-horizon credit assignment.
- Representation Learning from High-Dimensional Inputs: Providing pixel-based and high-dimensional state observations (e.g., in Powderworld and Visual AntMaze) tests the ability of algorithms to extract relevant state features and generalize from sparse or lossy input streams.
- Robustness to Stochasticity: Stochasticity is introduced explicitly (e.g., teleporters in mazes) and inherently (e.g., Powderworld’s powder dynamics), evaluating algorithms' capabilities to account for environmental randomness and uncertainty.
The benchmark distinguishes itself by its systematic control over the above axes, offering a granular lens on generalization, sample efficiency, and robustness.
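The following hypothetical sketch makes the goal-stitching axis concrete: with "navigate"-style data, sampling a goal from the same trajectory as the state can already cover long transitions, whereas with "stitch"-style data (many short, local segments), reaching a distant goal forces the agent to compose behavior across trajectories. The toy data layout and helper names are illustrative and not part of OGBench.

```python
# Hypothetical illustration of why "stitch"-style data demands goal stitching:
# each trajectory is short and local, so distant goals never co-occur with the
# start state in a single trajectory and must be reached by composing fragments.
import numpy as np

rng = np.random.default_rng(0)


def sample_pair_within(trajectories):
    """Sample (state, goal) from the SAME short trajectory: goals stay local."""
    traj = trajectories[rng.integers(len(trajectories))]
    i, j = sorted(rng.integers(len(traj), size=2))
    return traj[i], traj[j]


def sample_pair_across(trajectories):
    """Sample state and goal from DIFFERENT trajectories: goals may be far away,
    so value estimates must be stitched across separately collected fragments."""
    a, b = rng.integers(len(trajectories), size=2)
    state = trajectories[a][rng.integers(len(trajectories[a]))]
    goal = trajectories[b][rng.integers(len(trajectories[b]))]
    return state, goal


# Toy "stitch"-style data: many short, overlapping 1-D segments covering [0, 50).
trajectories = [np.arange(start, start + 5, dtype=float) for start in range(0, 50, 3)]
print(sample_pair_within(trajectories))   # goal at most a few steps from the state
print(sample_pair_across(trajectories))   # goal possibly tens of steps away
```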
5. Empirical Outcomes and Comparative Performance
Empirical evaluation on OGBench exposes nuanced trade-offs among algorithms:
- HIQL consistently achieves top-tier results on locomotion and visual manipulation tasks, attesting to the advantage of explicit hierarchical policy extraction and subgoal modeling for both long-horizon and compositionally complex tasks.
- CRL demonstrates pronounced robustness in pixel-based situations (e.g., Visual AntMaze), leveraging its contrastive objectives for effective representation learning. Its advantage increases in high-dimensional or visual settings.
- GCIVL/GCIQL perform comparatively well in manipulation environments but tend to falter in high-stochasticity or extreme long-horizon domains such as "giant" and "teleport" mazes.
- In compositional puzzle and drawing environments, all value-learning and contrastive methods suffer clear performance degradation as task dimensionality grows. For instance, while 3×3 puzzles are tractable, 4×6 grids expose substantial generalization and planning bottlenecks.
A central insight is that apparent similarities in aggregate metrics on prior, less varied benchmarks are replaced in OGBench by a landscape of divergent strengths and weaknesses, enhancing diagnostic power for algorithmic progress.
6. Open Problems and Future Research Directions
Several open directions emerge from the OGBench paper:
- Unified Approaches: No single algorithm is uniformly optimal; HIQL's hierarchy is beneficial for locomotion/manipulation, but struggles on certain pixel-based drawing tasks. Combining strengths of robust representation learning (e.g., CRL) with hierarchical decomposition remains an open direction.
- Learning Subgoal Representations: For hierarchical algorithms, how to extract effective subgoal representations from observations with high intrinsic dimensionality or raw pixels remains unresolved.
- Dataset Construction: The influence of state coverage and dataset noise, controlled within OGBench by variants like "play," "stitch," and "explore," suggests further investigation into dataset design for optimal skill acquisition.
- Long-Horizon Planning: Performance bottlenecks in very long-horizon domains (HumanoidMaze-giant, large Puzzle grids) indicate the need for planning or hierarchical solutions that exploit recurring subtask structure.
- Necessity of Full RL: The competitive performance of behavioral cloning methods (when coupled with strong representation learning) raises the question of whether the full RL machinery is always necessary or whether one-step improvements suffice, especially on rich but unlabeled datasets.
7. Significance and Impact of OGBench
OGBench establishes itself as a foundational benchmark for offline goal-conditioned RL research by unifying diverse task types, calibrated datasets, and rigorous evaluation metrics. By decomposing task structure across environment types and dataset properties, and evaluating algorithms on multi-goal competence, OGBench enables principled assessment of generalization, compositionality, and robustness. Its fine-grained exposure of algorithmic strengths and deficiencies provides crucial guidance for future method development, particularly toward hybrid algorithms that combine complementary strengths, improved representation learning, and the study of dataset-induced generalization effects.