OGBench Suite for Offline GCRL
- OGBench Suite is a comprehensive benchmark for offline goal-conditioned RL that features diverse environments across locomotion, manipulation, and drawing domains.
- It provides 85 meticulously designed offline datasets to probe algorithm capabilities such as stitching, long-horizon planning, high-dimensional representation learning, and robustness to stochastic dynamics.
- The suite includes reference implementations of six state-of-the-art algorithms, ensuring reproducible, multifaceted comparisons in reinforcement learning research.
Offline Goal-Conditioned RL Benchmark (OGBench) is a systematic, large-scale benchmarking suite designed to evaluate and compare offline algorithms for goal-conditioned reinforcement learning (GCRL). OGBench provides a diverse set of realistic environments, comprehensive offline datasets, and reference implementations of state-of-the-art algorithms, enabling detailed probing of algorithmic capabilities such as stitching, long-horizon planning, high-dimensional representation learning, and robustness to stochastic dynamics. It establishes a reproducible foundation for offline GCRL algorithmic research, supporting direct, fair, and multifaceted comparisons (Park et al., 26 Oct 2024).
1. Environment Suite
OGBench comprises eight environment types distributed across three principal domains: locomotion, robotic manipulation, and drawing. The environments are unified by adopting the full MDP state as both the observation and goal space. Goals specify a complete desired state, and a trajectory is successful if the agent reaches the goal state within task-specific thresholds or time horizons.
The environments are as follows:
| Domain | Environment | State/Action Structure | Key Variants/Features |
|---|---|---|---|
| Locomotion | PointMaze | 2-D point mass | Medium/Large/Giant/Teleport layouts; path lengths up to 600 steps |
| Locomotion | AntMaze | 8-DoF quadruped ("Ant") | Same maze layouts; Teleport variant adds stochastic cells |
| Locomotion | HumanoidMaze | 21-DoF humanoid | Episodes of up to 3,000 steps |
| Locomotion | AntSoccer | Ant plus soccer ball | Ball dribbling; goal specified on the ball's position |
| Manipulation | Cube | Arm plus up to 4 cubes | Variants by number of cubes (up to 55-dim state) |
| Manipulation | Scene | Arm plus cube, drawer, window, two buttons | High-level configuration goals |
| Manipulation | Puzzle | Up to 115-dim state; button grid | "Lights Out" combinatorial task; deep generalization |
| Drawing | Powderworld | Grid-structured image tensor | Image synthesis with 2–8 powder types; stochastic physics |
Pixel-based variants (Visual AntMaze, Visual HumanoidMaze, Powderworld) use RGB or otherwise image-structured inputs. The environment design emphasizes scaling complexity along episode length, input dimensionality, and task stochasticity.
2. Dataset Design
OGBench provides 85 offline datasets spanning all environments. Each dataset is generated in simulation using controlled experimental procedures designed to probe key algorithmic challenges. For each environment type and variant:
- Locomotion datasets:
- navigate: Generated via a high-level waypoint planner with a soft actor-critic (SAC) low-level policy and added Gaussian action noise. 1M transitions, 1K episodes.
- stitch: Short path segments (up to 4 cells), requiring algorithms to stitch multiple segments. 1M transitions, 5K episodes.
- explore: High exploration noise, with movement directions randomized every 10 steps. 5M transitions, 10K episodes.
- Pixel datasets share labels with their state-based counterparts, but observations are images.
- Manipulation datasets:
- play: Open-loop scripted expert with temporally correlated noise for full coverage; 1M–5M transitions, scaling with task complexity.
- noisy: Closed-loop expert with episode-wise Gaussian noise for ablation.
- Drawing datasets:
- play: Random brush strokes and fills; 1M–5M transitions, increasing with variant difficulty.
Difficulty is systematically increased along axes such as episode length, number of subtasks, grid size, number of interactive elements, and intrinsic stochasticity (teleporters, powder physics).
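As an illustration of how these datasets are consumed, the minimal sketch below shows one common way an offline GCRL learner samples goal-conditioned batches from a flat transition buffer: goals are relabeled from future states of the same episode and mixed with uniformly random goals. The field names, geometric offset, and mixing probability are illustrative assumptions, not the benchmark's exact sampling procedure.

```python
import numpy as np

def sample_goal_conditioned_batch(observations, actions, terminals,
                                  batch_size=256, p_random_goal=0.3, geom_p=0.01,
                                  rng=np.random):
    """Hindsight goal sampling from a flat transition buffer (illustrative only).

    observations: (N, obs_dim) array; actions: (N, act_dim) array;
    terminals: (N,) array with 1 at the final step of each episode.
    """
    n = len(observations)
    idx = rng.randint(0, n, size=batch_size)          # sampled transition indices

    # Index of the final step of the episode containing each sampled transition.
    episode_ends = np.flatnonzero(terminals)
    ends = episode_ends[np.searchsorted(episode_ends, idx)]

    # "Future" goals: a geometrically distributed offset, clipped to the episode end.
    offsets = rng.geometric(geom_p, size=batch_size)
    future_idx = np.minimum(idx + offsets, ends)

    # Mix in uniformly random goals so distant, unrelated targets are also covered.
    random_idx = rng.randint(0, n, size=batch_size)
    use_random = rng.rand(batch_size) < p_random_goal
    goal_idx = np.where(use_random, random_idx, future_idx)

    return dict(
        observations=observations[idx],
        actions=actions[idx],
        goals=observations[goal_idx],
    )
```

Relabeling schemes of this kind underlie the goal sampling used by goal-conditioned offline methods, although the exact goal distributions differ per algorithm.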
3. Reference Algorithms
OGBench distributes tuned JAX implementations of six canonical offline GCRL algorithms, each probing different learning biases:
- GCBC (Goal-Conditioned Behavioral Cloning): supervised imitation of dataset actions conditioned on relabeled goals, $\max_\pi \; \mathbb{E}_{(s,a,g)\sim\mathcal{D}}\big[\log \pi(a \mid s, g)\big]$.
- GCIVL (Goal-Conditioned Implicit V-Learning):
  - Value estimation via expectile regression: $\min_V \; \mathbb{E}\big[\ell^2_\tau\big(r(s,g) + \gamma \bar{V}(s',g) - V(s,g)\big)\big]$, where $\ell^2_\tau(u) = |\tau - \mathbb{1}(u<0)|\,u^2$ and $\bar{V}$ is a target network.
  - Policy extraction by value-only advantage-weighted regression (AWR): $\max_\pi \; \mathbb{E}\big[\exp\big(\beta A(s,a,g)\big)\,\log \pi(a \mid s, g)\big]$ with $A(s,a,g) = r(s,g) + \gamma \bar{V}(s',g) - V(s,g)$ (see the sketch below).
- GCIQL (Goal-Conditioned Implicit Q-Learning):
  - Joint Q/V estimation (expectile and TD): $\min_V \; \mathbb{E}\big[\ell^2_\tau\big(\bar{Q}(s,a,g) - V(s,g)\big)\big]$ and $\min_Q \; \mathbb{E}\big[\big(r(s,g) + \gamma V(s',g) - Q(s,a,g)\big)^2\big]$.
  - DDPG+BC policy update: $\max_\pi \; \mathbb{E}\big[Q\big(s, \mu^\pi(s,g), g\big) + \alpha \log \pi(a \mid s, g)\big]$, balancing value maximization against behavioral cloning.
- QRL (Quasimetric RL):
  - Learns an asymmetric (quasimetric) distance $d(s,g)$ satisfying the triangle inequality, with the value function given by $V(s,g) = -d(s,g)$.
- CRL (Contrastive RL):
  - Binary noise-contrastive estimation of the discounted future-state occupancy: a critic $f(s,a,g)$ is trained to separate goals drawn from the future of the same trajectory from randomly sampled goals.
  - Policy improvement via AWR or DDPG+BC on $f(s,a,g)$.
- HIQL (Hierarchical IQL):
  - Hierarchical AWR with subgoal representations $z = \phi(s_{t+k})$, extracted from a single goal-conditioned value function trained as in GCIVL.
  - High-level: $\pi^h(z \mid s_t, g)$ proposes the representation of a state $k$ steps ahead; low-level: $\pi^\ell(a_t \mid s_t, z)$ steers toward it. Both are extracted by advantage-weighted regression.
Each baseline is parameterized following its official reference implementation, with documented training scripts per algorithm and task.
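To make the shared machinery concrete, here is a minimal NumPy sketch of the two components reused across GCIVL and GCIQL: the expectile loss and AWR weighting. The expectile $\tau$, temperature $\beta$, and clipping value are illustrative defaults, not the benchmark's tuned hyperparameters.

```python
import numpy as np

def expectile_loss(td_error, tau=0.9):
    """Expectile regression loss l^2_tau(u) = |tau - 1(u < 0)| * u^2.

    With tau > 0.5, positive residuals are weighted more heavily than negative
    ones, biasing the fit toward an in-distribution upper expectile, which is
    how implicit value learning approximates a maximum.
    """
    weight = np.where(td_error >= 0, tau, 1.0 - tau)
    return weight * td_error ** 2

def awr_weights(advantages, beta=3.0, max_weight=100.0):
    """Advantage-weighted regression coefficients exp(beta * A), clipped for stability.

    These weights multiply a goal-conditioned behavioral-cloning loss, so that
    actions with higher estimated advantage are imitated more strongly.
    """
    return np.minimum(np.exp(beta * advantages), max_weight)

# Example: residuals of r + gamma * V(s', g) - V(s, g) for a small batch.
td = np.array([0.5, -0.2, 1.0, -1.5])
print(expectile_loss(td, tau=0.7))
print(awr_weights(td, beta=3.0))
```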
4. Evaluation Protocol and Metrics
Policies are trained purely offline, with no held-out data. Evaluation uses:
- Five fixed (initial state, goal) pairs per task.
- Repeated rollouts per pair, evaluated at the last three training checkpoints.
- Success for a rollout is binary: 1 if the goal is reached within tolerance and before the time limit, else 0.
The aggregate success rate is the mean of these binary outcomes across goal pairs, rollouts, and checkpoints (see the sketch below). A discounted goal-reaching return is also computed, but the binary success rate is the primary metric. Evaluation settings are fixed and all training is conducted on the complete dataset, enforcing a uniform offline RL setup.
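The reported metric reduces to a plain average of binary outcomes; the sketch below makes the bookkeeping explicit. The number of rollouts per pair is left as a parameter, since the exact count is defined by the benchmark's evaluation scripts.

```python
import numpy as np

def aggregate_success_rate(successes):
    """Average binary rollout outcomes into the reported success rate (in %).

    `successes` is an array of shape (num_checkpoints, num_pairs, num_rollouts),
    e.g. (3, 5, R): the last three checkpoints, five fixed (state, goal) pairs,
    and R rollouts per pair. Each entry is 1 if the goal was reached within the
    tolerance and time limit, else 0.
    """
    return 100.0 * np.asarray(successes, dtype=np.float64).mean()
```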
5. Probing Capabilities
OGBench’s environment and dataset design targets probing distinct and essential capabilities for offline GCRL algorithms:
- Goal stitching: "stitch" datasets force agents to compose solutions from short, locally optimal trajectory segments (e.g., PointMaze-stitch, Puzzle tasks requiring multi-step state compositions).
- Long-horizon planning: Large mazes (episodes of up to 3,000 steps) and Puzzle environments demanding sequences of up to 24 actions.
- High-dimensional observations: Pixel-based maze and manipulation environments, together with Powderworld, challenge representation learning and policy abstraction.
- Handling stochasticity: Maze teleporters and stochastic powder physics inject risk and unpredictability, exposing algorithmic robustness.
6. Experimental Results and Insights
Empirical evaluations, summarized across all tasks and baselines in (Park et al., 26 Oct 2024), show differentiated strengths:
- HIQL achieves the highest overall performance, excelling on state-based locomotion and visual manipulation.
- CRL is especially robust in stitching challenges and stochastic environments, dominating locomotion and pixel variants.
- GCIQL is superior for state-based manipulation (Cube, Scene, Puzzle).
- GCIVL is particularly effective for Powderworld drawing tasks.
- Value-based methods (HIQL, QRL) exhibit optimistic bias under stochastic dynamics (teleporters), with CRL showing less susceptibility.
- Pixel-only tasks amplify representational bottlenecks; CRL and HIQL remain comparatively robust, while GCIQL and GCBC performance declines.
- Ablation on action noise during data generation in the manipulation suite reveals that moderate exploration noise is essential for high final success, highlighting that dataset coverage is more critical than expert optimality for offline GCRL.
7. Usage, Implementation, and Reproducibility
OGBench is distributed with installation scripts, environment wrappers, and reference implementations:
- Environments: Accessible via the Gymnasium API. Example:

```python
import gymnasium as gym
import ogbench  # assumed: importing ogbench registers its environments with Gymnasium

env = gym.make("antmaze-large-navigate-v0")
state, _ = env.reset()
```
- Datasets: Loaded as HDF5/NumPy arrays containing fields for observations, actions, next observations, and episode terminals (see the loading sketch after this list).
- Reference algorithms: Modular scripts under `ogbench/algos/`. Example for GCIQL:

```bash
python -m ogbench.algos.gciql \
    --env antmaze-large-navigate-v0 \
    --dataset-mode navigate \
    --batch-size 1024 \
    --gamma 0.995 \
    --lr 3e-4 \
    --steps 1000000 \
    --policy-extract-method ddpgbc \
    --bc-coef 0.1
```
- Reproducibility: All dependencies and configurations are locked; full suite runs are reproducible with provided scripts, enabling comparative experiments across 85 datasets and 6 methods (∼8 GPU-days for full sweep).
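As a complement to the command-line entry points, the following minimal sketch shows programmatic dataset loading via `ogbench.make_env_and_datasets`, the helper documented in the OGBench repository; the printed keys reveal the exact fields stored in a given dataset, and the return signature shown here should be checked against the installed version.

```python
import ogbench

# Build the environment together with its training and validation datasets,
# which are returned as dictionaries of NumPy arrays.
env, train_dataset, val_dataset = ogbench.make_env_and_datasets(
    "antmaze-large-navigate-v0"
)

print(env.observation_space, env.action_space)
print({key: arr.shape for key, arr in train_dataset.items()})  # inspect stored fields
```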
OGBench establishes a rigorous and diverse platform for evaluating offline goal-conditioned RL, allowing the community to systematically identify strengths and limitations of existing and novel algorithms under a unified protocol (Park et al., 26 Oct 2024).