Goal-Conditioned Offline RL

Updated 18 March 2026
  • Goal-conditioned offline RL is a framework that trains policies from fixed datasets to achieve arbitrary goals, contending with sparse rewards and multi-modal solution paths.
  • Techniques like advantage-weighted behavioral cloning, dual-advantage weighting, and reachability sampling address issues of distribution shift and long-horizon tasks.
  • Empirical studies show that integrating hierarchical planning, graph-based methods, and data augmentation boosts performance in safety-critical and high-dimensional tasks.

Goal-conditioned offline reinforcement learning (GCRL) is an area of reinforcement learning in which an agent is trained—using only fixed, pre-collected datasets—to produce policies capable of reaching user-specified goals. Unlike standard offline RL, in GCRL the agent must generalize over a space of goals, often under sparse rewards and challenging learning conditions. The field has undergone rapid development, driven by advances in value-function learning, representation theory, planning, generative modeling, data augmentation, and methods for handling long-horizon or safety-critical tasks.

1. Formal Problem Setting and Challenges

A goal-conditioned offline Markov decision process (MDP) is defined by the tuple

$$(\mathcal{S}, \mathcal{A}, \mathcal{G}, P, R, \gamma),$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $\mathcal{G}$ the goal space, $P(s'|s,a)$ the transition kernel, $R(s,a,g)$ a typically sparse goal-conditioned reward, and $\gamma$ a discount factor. The agent must learn a policy $\pi(a|s,g)$ from a fixed dataset $\mathcal{D}$, with no access to new environment interactions. The objective is to maximize the expected discounted return to arbitrary goals:

$$J_{\mathrm{GCRL}}(\pi) = \mathbb{E}_{g \sim P_g,\, s_0 \sim P_0,\, a_t \sim \pi(\cdot \mid s_t, g),\, s_{t+1} \sim P}\left[\sum_{t=0}^{T}\gamma^{t} R(s_{t+1}, g)\right].$$
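As a concrete illustration of this objective, the sketch below implements a sparse goal-conditioned reward and a Monte Carlo estimate of the discounted return along a single trajectory. The Euclidean success tolerance `eps` and the array layout are illustrative assumptions, not quantities fixed by the text.

```python
import numpy as np

def sparse_reward(next_state: np.ndarray, goal: np.ndarray, eps: float = 0.05) -> float:
    """Sparse reward: 1 if the achieved state lies within eps of the goal, else 0
    (an assumed instantiation of R; tasks may use other success predicates)."""
    return float(np.linalg.norm(next_state - goal) < eps)

def discounted_return(states: np.ndarray, goal: np.ndarray, gamma: float = 0.99) -> float:
    """Monte Carlo estimate of sum_t gamma^t R(s_{t+1}, g) for one trajectory,
    where `states` stacks s_0, ..., s_{T+1} along the first axis."""
    return sum(gamma ** t * sparse_reward(s_next, goal)
               for t, s_next in enumerate(states[1:]))
```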

Primary challenges in offline GCRL include:

  • Distribution shift: The policy may encounter state–action–goal distributions not well-covered in data, leading to overestimation in value learning and poor generalization.
  • Multi-modality: For distant goals, there may be multiple, potentially conflicting routes; finding policies that reliably select optimal or feasible solutions is non-trivial.
  • Sparse rewards and long horizons: Sparse rewards provide little learning signal, and long horizons make credit assignment harder, degrading both value estimation and behavioral cloning.

2. Core Methods in Offline GCRL

2.1. Advantage-Weighted Behavior Cloning and Dual-Advantage Weighting

The foundational approach is advantage-weighted behavioral cloning, where supervised policy learning is modulated by goal-conditioned advantage weights,

$$L_{\mathrm{AWLL}}(\pi) = -\mathbb{E}_{(s,a,g)\sim\mathcal{D}}\left[w(s,a,g)\,\log\pi(a \mid s,g)\right], \qquad w(s,a,g) = \exp_{\mathrm{clip}}\!\left(A^{\pi_b}(s,a,g)\right).$$

This produces monotonic improvement guarantees for supported (in-distribution) goals. However, single advantage-weighted schemes struggle with multi-modality in long-horizon tasks. To address this, Dual-Advantage Weighted Offline Goal-conditioned RL (DAWOG) introduces a value-based partitioning of the state space and defines target-region advantages that induce an inductive bias toward stepping through easier-to-reach regions. The resulting reweighting,

$$w(s,a,g) = \exp_{\mathrm{clip}}\!\left[ \beta\,A^{\pi_b}(s,a,g) + \tilde\beta\,\widetilde{A}^{\pi_b}(s,a,G(s,g)) \right],$$

guarantees monotonic improvement over the behavior policy and improves convergence in long-horizon, multi-modal settings (Wang et al., 2023).
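A minimal sketch of this weighting, assuming the goal-conditioned and target-region advantages have already been estimated; the temperatures, the clipping bound, and the tensor shapes are illustrative choices rather than the exact DAWOG configuration.

```python
import torch

def dual_advantage_bc_loss(log_probs, adv_goal, adv_region,
                           beta=1.0, beta_tilde=1.0, clip_max=10.0):
    """Dual-advantage weighted behavioral cloning loss (sketch).

    log_probs  : log pi(a | s, g) for dataset actions, shape (B,)
    adv_goal   : goal-conditioned advantage A^{pi_b}(s, a, g), shape (B,)
    adv_region : target-region advantage A~^{pi_b}(s, a, G(s, g)), shape (B,)
    """
    # exp_clip: exponentiate the weighted advantages and clip so that a few
    # transitions cannot dominate the regression target.
    weights = torch.clamp(torch.exp(beta * adv_goal + beta_tilde * adv_region),
                          max=clip_max)
    # Weighted negative log-likelihood; the weights are treated as constants.
    return -(weights.detach() * log_probs).mean()
```

Setting `beta_tilde = 0` recovers the single-advantage weighting described at the start of this subsection.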

2.2. Offline Actor-Critic, Deterministic/AWR-Q-Gradient, Data Augmentation

Methods such as TD3+BC, actor-critic with behavior cloning regularization, and deterministic Q-advantage policy gradient (DQAPG) combine temporal-difference-based value estimation with advantage-weighted policy updates. Goal-swapping augmentation, in which goals are swapped across trajectories to artificially expand the set of state–goal pairs, increases support for value and policy function approximation, with noisy augmentations automatically filtered by advantage-weighting. These methods offer robust improvement, particularly when combined with hindsight experience replay (HER) and other data enrichment schemes (Yang et al., 2023).
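The sketch below illustrates goal-swapping and hindsight relabeling under an assumed data layout (each trajectory is a dict with a `transitions` list of (s, a, s') triples and an `achieved_goals` list); the relabeling ratios and exact data structures vary by implementation.

```python
import random

def goal_swap_augment(trajectories, num_aug):
    """Goal swapping (sketch): pair a transition from one trajectory with a goal
    achieved in another trajectory, expanding state-goal coverage; unreachable
    pairs are later down-weighted by the advantage weights."""
    augmented = []
    for _ in range(num_aug):
        src = random.choice(trajectories)      # transition source
        other = random.choice(trajectories)    # goal source (may differ from src)
        s, a, s_next = random.choice(src["transitions"])
        g = random.choice(other["achieved_goals"])
        augmented.append((s, a, s_next, g))
    return augmented

def her_relabel(transitions, achieved_goals, k=4):
    """Hindsight relabeling (sketch): pair each transition with up to k goals
    actually achieved later in the same trajectory, so that "failed" rollouts
    still supervise goal reaching."""
    relabeled = []
    for t, (s, a, s_next) in enumerate(transitions):
        future = achieved_goals[t:]
        for g in random.sample(future, min(k, len(future))):
            relabeled.append((s, a, s_next, g))
    return relabeled
```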

2.3. Reachability-Weighted Sampling

Techniques such as Reachability-Weighted Sampling (RWS) explicitly modulate the sampling of $(s,a,g)$ tuples for training by leveraging a reachability classifier, trained via positive-unlabeled (PU) learning on Q-values, to upweight transitions likely to reach goals. This improves data efficiency and supports better stitching in manipulation and dexterous hand control tasks, with up to 50% improvement on challenging settings compared to uniform relabeling (Yang et al., 3 Jun 2025).
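A rough sketch of the sampling step, assuming a reachability classifier has already been trained (e.g., via PU learning on Q-values); the classifier signature, the concatenated input, and the softmax temperature are assumptions rather than RWS specifics.

```python
import torch

def reachability_sampling_probs(states, goals, classifier, temperature=1.0):
    """Turn reachability scores into sampling probabilities (sketch).

    states, goals : tensors of shape (N, d_s) and (N, d_g) for N candidate pairs
    classifier    : assumed torch module scoring p(reachable | s, g) as a logit
    """
    with torch.no_grad():
        logits = classifier(torch.cat([states, goals], dim=-1)).squeeze(-1)
    # Higher reachability -> higher probability of being sampled for training.
    return torch.softmax(logits / temperature, dim=0)

# Usage (assumed): draw minibatch indices proportionally to reachability.
# idx = torch.multinomial(probs, num_samples=batch_size, replacement=True)
```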

2.4. Mixture and State-Occupancy Matching

Occupancy-matching perspectives, exemplified by GoFAR and SMORe, recast GCRL as divergence minimization between the induced occupancies of the learned policy and (often implicit) target expert or goal-reachable distributions. Convex dual formulations allow for decoupled value and policy optimization, stable learning, and, in the case of SMORe, discriminator-free unnormalized scoring of action-goal importance, leveraging all suboptimal data through mixture distributions (Sikchi et al., 2023, Ma et al., 2022).

2.5. Model-Based Planning and Diffusion Models

Recent innovations exploit learned models for model-based planning (GOPlan, SSD), where high-advantage diffusion models or conditional GANs generate "imagined" trajectories, expanding support and providing relabeled, high-quality samples for finetuning. Diffusion-based planners in SSD and GODA use value or return-based goal conditioning to stitch sub-trajectories and create high-return augmented data that maximize the impact of limited demonstrations (Wang et al., 2023, Kim et al., 2024, Huang et al., 2024).

3. Hierarchical, Representation, and Planning Approaches

3.1. Hierarchical Decomposition and Subgoal Extraction

Hierarchical methods—such as HIQL, Option-aware Temporally Abstracted (OTA) value learning, and projective quasimetric planning (ProQ)—decompose the goal-reaching task into high-level subgoal selection and low-level skills. HIQL trains an action-free, goal-conditioned value function; subgoals (or their latent representations) are treated as "actions" at the high-level, with policies extracted via advantage-weighted regression. These methods achieve substantial robustness to value noise and compounding error in long-horizon problems, with option-aware TD learning (OTA) reducing the effective horizon by leveraging temporally abstracted transitions (Park et al., 2023, 2505.12737, Kobanda et al., 23 Jun 2025).
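The sketch below shows a two-level advantage-weighted extraction in the spirit of HIQL, assuming subgoal and action advantages derived from a single goal-conditioned value function; the exact advantage definitions, temperatures, and clipping are simplifications, not the paper's precise recipe.

```python
import torch

def hierarchical_awr_losses(high_log_probs, low_log_probs,
                            adv_subgoal, adv_action,
                            beta_hi=1.0, beta_lo=1.0, clip_max=100.0):
    """HIQL-style policy extraction (sketch): both levels are trained by
    advantage-weighted regression against the same goal-conditioned value V.

    high_log_probs : log pi_hi(subgoal | s, g)   -- high level picks subgoals
    low_log_probs  : log pi_lo(a | s, subgoal)   -- low level reaches them
    adv_subgoal    : e.g. V(subgoal, g) - V(s, g)        (subgoal "advantage")
    adv_action     : e.g. V(s', subgoal) - V(s, subgoal) (action advantage)
    """
    w_hi = torch.clamp(torch.exp(beta_hi * adv_subgoal), max=clip_max)
    w_lo = torch.clamp(torch.exp(beta_lo * adv_action), max=clip_max)
    loss_hi = -(w_hi.detach() * high_log_probs).mean()
    loss_lo = -(w_lo.detach() * low_log_probs).mean()
    return loss_hi, loss_lo
```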

3.2. Quasimetric and Successor Feature Representations

Representation learning frameworks based on quasimetrics (triangle-inequality respecting distances) or successor features—sometimes unified into a single model—provide the ability to measure progress toward goals and enable multi-hop planning. Methods such as Temporal Metric Distillation (TMD) and Quasimetric RL (QRL) enforce compositionality and globally consistent distances in embedding space, which in turn allows planning by segmenting long trajectories into reliable "hops" (Myers et al., 24 Sep 2025).
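As one concrete example of a quasimetric parameterization (not the specific architecture used by QRL or TMD), the asymmetric L1 form below satisfies d(x, x) = 0 and the triangle inequality while allowing d(x, y) != d(y, x), matching directed reachability costs; the embedding sizes are arbitrary.

```python
import torch
import torch.nn as nn

class AsymmetricL1Quasimetric(nn.Module):
    """d(x, y) = sum_i max(f_i(x) - f_i(y), 0) over a learned embedding f.
    This is a valid quasimetric: it is zero on the diagonal and obeys the
    triangle inequality, but is not symmetric."""

    def __init__(self, obs_dim: int, embed_dim: int = 64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                               nn.Linear(128, embed_dim))

    def forward(self, x, y):
        # Only the positive part of each coordinate difference contributes,
        # which is what breaks symmetry.
        return torch.relu(self.f(x) - self.f(y)).sum(dim=-1)
```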

3.3. Planning via Graph Search and Value Aggregation

Test-Time Graph Search (TTGS) and value-aggregation methods build graphs over dataset states using distances derived from learned value functions, then perform shortest-path search at inference time to deliver reliable subgoals to a fixed policy. This approach addresses credit assignment by ensuring each "hop" remains within the policy's competence radius, and is especially effective in long-horizon, sparse-reward benchmarks (OGBench), with success rates jumping from 0% to 80%+ on giant stitching mazes (Opryshko et al., 8 Oct 2025).
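A simplified sketch of the idea: build a sparse graph over dataset states with edge costs derived from a learned goal-conditioned value, then run Dijkstra to obtain subgoals for a frozen low-level policy. The cost transform, neighbor pruning, and `value_fn` signature are assumptions rather than TTGS details.

```python
import heapq

def plan_subgoals(dataset_states, value_fn, start, goal, k_neighbors=10):
    """Value-derived graph search for subgoal planning (sketch)."""
    nodes = list(dataset_states) + [start, goal]
    n, src, dst = len(nodes), len(nodes) - 2, len(nodes) - 1

    def hop_cost(u, v):
        # Higher value (easier hop from u to v) -> lower cost; Dijkstra needs
        # non-negative edge weights, hence the floor (an assumed transform).
        return max(1e-3, -float(value_fn(nodes[u], nodes[v])))

    # Keep only the k cheapest outgoing edges per node (O(n^2) value queries;
    # fine for a sketch, real implementations prune more aggressively).
    adj = {u: sorted(((hop_cost(u, v), v) for v in range(n) if v != u))[:k_neighbors]
           for u in range(n)}

    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for c, v in adj[u]:
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))

    path, u = [], dst
    while u in prev:                  # walk back from the goal to the start
        path.append(nodes[u])
        u = prev[u]
    return list(reversed(path))       # subgoal sequence ending at the goal
```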

4. Extensions: Safety-Critical, Geometric/Physics-Informed, and Data Efficiency

4.1. Safety-Constrained GCRL

For safety-critical domains, Recovery-based Supervised Learning (RbSL) decomposes the policy into goal-reaching and recovery components, learning a cost-to-go critic and using a switching mechanism to ensure constraint satisfaction during policy execution. This method yields the lowest cost return and highest success on obstacle-rich real and simulated robotics tasks (Cao et al., 2024).
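A minimal sketch of the switching rule, assuming separately trained goal-reaching and recovery policies and a learned cost-to-go critic; the threshold and all function signatures are illustrative, not the RbSL implementation.

```python
def select_action(state, goal, goal_policy, recovery_policy, cost_critic,
                  cost_threshold=0.1):
    """Recovery-based switching (sketch): follow the goal-reaching policy unless
    the cost critic predicts that its proposed action risks a constraint
    violation, in which case hand control to the recovery policy."""
    a_goal = goal_policy(state, goal)
    if float(cost_critic(state, a_goal)) > cost_threshold:
        return recovery_policy(state)   # steer back toward the safe set first
    return a_goal                        # otherwise pursue the goal directly
```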

4.2. Geometric and Physics-Informed Value Learning

In domains where the cost-to-go exhibits geometric structure, physics-informed value learning regularizes the value function to behave as a distance field by enforcing Eikonal PDE constraints (i.e., unit gradient norm). Applied within the HIQL framework (Pi-HIQL), this inductive bias produces pronounced gains in large, long-horizon navigation tasks and in "stitching" regimes on OGBench (Giammarino et al., 8 Sep 2025).
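A short sketch of such an Eikonal regularizer, assuming a differentiable goal-conditioned value network; the loss weight and how it is combined with the base critic objective are assumptions.

```python
import torch

def eikonal_regularizer(value_fn, states, goals):
    """Penalize deviation of the value gradient norm from 1 so that the value
    behaves like a distance field satisfying the Eikonal constraint
    ||grad_s V(s, g)|| = 1 (up to sign conventions)."""
    states = states.clone().requires_grad_(True)
    v = value_fn(states, goals).sum()
    grad = torch.autograd.grad(v, states, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

# Assumed usage: total critic loss = base critic loss + lambda_eik * eikonal_regularizer(...)
```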

4.3. Data Augmentation

Diffusion-based data augmentation approaches such as GODA introduce return-oriented goal conditioning and adaptive gated neural architectures to generate high-quality trajectories, especially when the dataset is suboptimal or limited in coverage. Controllable scaling enables the generation of higher-return samples, and the augmented data can be consumed by any standard offline RL algorithm (Huang et al., 2024).

5. Benchmarks and Empirical Findings

OGBench provides a comprehensive benchmark suite for offline goal-conditioned RL, with challenging tasks probing stitching, generalization, stochasticity, and high-dimensional observation processing. Major conclusions from recent empirical studies include:

  • Hierarchical and compositional approaches (HIQL, ProQ, Pi-HIQL, OTA) consistently outperform flat policy extraction, particularly under long horizons and in unfamiliar goal regimes (Park et al., 2024, 2505.12737, Kobanda et al., 23 Jun 2025, Giammarino et al., 8 Sep 2025).
  • Graph-based planning (TTGS, graph-aggregated values) and generative modeling (SSD, GOPlan, GODA) excel in sparse or fragmented datasets by enabling efficient "goal stitching."
  • Methods leveraging strong representation regularization (quasimetrics, occupancy-matching) and robust goal relabeling (HER, goal-swapping, reachability-weighting, return-prior selection) demonstrate improved stability and data efficiency.
  • Explicit countermeasures against overestimation, compounding Bellman errors, and support drift are essential for robust offline GCRL performance across all regimes.

6. Theoretical Guarantees and Open Directions

Recent work has established polynomial sample complexity for GCRL under general function approximation and minimal data coverage assumptions (i.e., single-policy concentrability, without minimax optimization) (Zhu et al., 2023). Further, mixture-distribution matching and dual occupancy-matching frameworks admit convex dual objectives, stable optimization, and, in some settings, explicit statistical performance bounds and monotonic policy improvement (Sikchi et al., 2023, Ma et al., 2022).

Outstanding open problems include principled methods for addressing severe data miscoverage, scalable solutions for partial observability or domain shift (e.g., sim2real), integrated planning and value learning in high-dimensional pixel or language-based goals, and compositional hierarchies operating under sparse rewards and high stochasticity.

7. Summary Table: Key Families of Offline GCRL Methods

| Method Family | Core Mechanism | Notable Papers |
| --- | --- | --- |
| Advantage-Weighted (Single/Dual) | AWLL, DAWOG, GEAW | (Wang et al., 2023) |
| Occupancy/Distribution Matching | f-Advantage, Mixture Dual, SMORe | (Sikchi et al., 2023; Ma et al., 2022) |
| Model-Based Planning & Generation | Diffusion, GAN prior, Reanalysis | (Wang et al., 2023; Kim et al., 2024) |
| Hierarchical, Option-Aware | HIQL, OTA, ProQ, graph search | (Park et al., 2023; 2505.12737; Kobanda et al., 23 Jun 2025; Opryshko et al., 8 Oct 2025) |
| Data Augmentation | Goal-swapping, reachability weighting | (Yang et al., 2023; Yang et al., 3 Jun 2025) |
| Safety-Constrained, Physics-Informed | RbSL, Pi-HIQL | (Cao et al., 2024; Giammarino et al., 8 Sep 2025) |

The landscape of goal-conditioned offline RL is characterized by a convergence of value-based, model-based, and generative paradigms, with empirical and theoretical insights increasingly informed by geometric inductive bias, compositional hierarchies, and robust support estimation. Continued progress will depend on further integrating these principles to address the unique challenges of offline, high-dimensional, and generalizable goal-reaching.
