Goal-Conditioned Policies
- Goal-conditioned policies are policies explicitly conditioned on goal inputs, enabling agents to generalize across multiple tasks and overcome sparse reward challenges.
- They leverage various architectures such as fixed-network concatenation, hypernetworks, and diffusion models to enhance control performance and represent goals effectively.
- These policies find practical applications in robotic manipulation, navigation, and hierarchical planning, often utilizing techniques like HER, metric learning, and intrinsic motivation.
A goal-conditioned policy (GCP) is a parameterized policy π(a | s, g) that outputs a distribution over actions given the current state s and a specified goal g. Unlike classical RL policies, which optimize a single, task-implicit reward function, GCPs are explicitly conditioned on goal specifications, enabling a single agent to generalize across many tasks or objectives. This paradigm is central to controlling agents in multi-task, sparse-reward, and hierarchical environments, with application domains ranging from robotic manipulation and navigation to general-purpose skill learning. Goal conditioning spans a wide spectrum of formulations, from state-based or perceptual goal definitions to conditioning on goal distributions, trajectories, or subgoal sequences. The field has over two decades of mathematical, algorithmic, and practical development, spanning model-free, model-based, offline, imitation, and planning settings.
1. Mathematical and Algorithmic Foundations
Goal-conditioned control is formally posed in the goal-conditioned Markov Decision Process (GCMDP), where the reward r(s, a, g) is a function of the desired goal g. The policy is trained to maximize the expected return with respect to the distribution over goals:

J(π) = E_{g∼p(g)} E_{τ∼π(·|·,g)} [ Σ_t γ^t r(s_t, a_t, g) ].

Common reward structures include indicator (sparse) rewards, distances in state or latent space, goal distributions, and transitions specified by logical predicates. Learning procedures employ supervised imitation (behavioral cloning), actor-critic RL, Q-learning, advantage-weighted likelihood, and information-theoretic objectives, often leveraging techniques such as Hindsight Experience Replay (HER) or goal relabeling to mitigate reward sparsity.
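The goal-relabeling idea behind HER can be sketched in a few lines. This is a minimal illustration, assuming a buffer of (state, action, next_state, goal) transitions, a sparse indicator reward, and the "future" relabeling strategy; all names are illustrative, not from any particular codebase.

```python
import random

def sparse_reward(achieved, goal):
    """Indicator reward: 1 only if the achieved state matches the goal."""
    return 1.0 if achieved == goal else 0.0

def relabel_episode(episode, k=4):
    """HER 'future' strategy: for each transition, also store copies whose
    goal is replaced by a state actually achieved later in the episode."""
    relabeled = []
    for t, (s, a, s_next, g) in enumerate(episode):
        # Keep the original transition with its original (usually 0-reward) goal.
        relabeled.append((s, a, s_next, g, sparse_reward(s_next, g)))
        # Sample up to k future achieved states as substitute goals.
        future = [trans[2] for trans in episode[t:]]
        for g_new in random.sample(future, min(k, len(future))):
            relabeled.append((s, a, s_next, g_new, sparse_reward(s_new := s_next, g_new)))
    return relabeled

# Toy 1-D chain episode: the commanded goal 5 is never reached (agent stops at 3),
# so every original transition carries zero reward.
episode = [(i, +1, i + 1, 5) for i in range(3)]
data = relabel_episode(episode, k=2)
# Relabeling manufactures successful transitions from the failed episode.
assert any(r == 1.0 for (*_, r) in data)
```

Because relabeled goals are states the agent did reach, the sparse reward becomes informative even when the commanded goal was never attained, which is precisely how HER mitigates reward sparsity.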
A key distinction is whether goal conditioning is implemented by (1) concatenating the goal g to the state s within a fixed policy network, (2) using a hypernetwork to generate policy parameters from g, or (3) representing goals as latent embeddings or distributions. The interface between goal representation and policy expressivity is nontrivial and has theoretical consequences for value sufficiency, action sufficiency, and transferability (Hyeon et al., 30 Jan 2026, Lawrence et al., 6 Dec 2025).
2. Policy Architectures and Goal Representation
Approaches to GCPs range from classical universal value function approximators to transformer-based and hypernetwork architectures. Notable architectural paradigms include:
- Fixed-network concatenation: Standard MLPs or transformers taking the state–goal pair (s, g) as input, as in GCBC, C-BeT, or DAWOG (Wang et al., 2023).
- Hypernetworks: The policy parameters themselves are generated via a hypernetwork as a function of goal encoding (e.g., a goal image or latent vector), as in Hyper-GoalNet (Zhou et al., 26 Nov 2025) and GoGePo (Faccio et al., 2022). Hyper-GoalNet, for instance, conditions on learned latent representations of current and goal images, with the hypernetwork producing task-specific policy weights, offering improved adaptability and robustness in diverse and high-variability robotic settings.
- Score-based diffusion models: Generative architectures that represent the action distribution with a score-based diffusion model, e.g., BESO, capturing the multi-modality of goal-conditioned play data and allowing flexible policy sampling (Reuss et al., 2023).
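The architectural contrast between paradigms (1) and (2) above can be made concrete with shapes. The following is a minimal numpy sketch, not any published implementation: a fixed network consumes the concatenation [s; g], while a hypernetwork maps the goal encoding to the low-level policy's own weight matrices. All dimensions and initializations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, G, H, A = 4, 3, 8, 2  # state, goal, hidden, action dims

# (1) Fixed-network concatenation: one MLP over [s; g].
W1 = rng.normal(size=(H, S + G))
W2 = rng.normal(size=(A, H))
def concat_policy(s, g):
    h = np.tanh(W1 @ np.concatenate([s, g]))
    return W2 @ h  # pre-squash action

# (2) Hypernetwork: the goal encoding generates the policy's parameters,
# so every goal induces a distinct low-level network.
Wh = rng.normal(size=(H * S + A * H, G)) * 0.1
def hyper_policy(s, g):
    theta = Wh @ g                       # flat parameter vector from the goal
    V1 = theta[:H * S].reshape(H, S)     # goal-specific first layer
    V2 = theta[H * S:].reshape(A, H)     # goal-specific output layer
    return V2 @ np.tanh(V1 @ s)

s, g = rng.normal(size=S), rng.normal(size=G)
assert concat_policy(s, g).shape == (A,)
assert hyper_policy(s, g).shape == (A,)
```

Note the trade-off the shapes expose: the concatenation policy shares all weights across goals, while the hypernetwork's parameter count scales with the size of the generated network, buying per-goal specialization at the cost of a harder learning problem.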
The choice of goal representation critically impacts control performance. Recent theoretical work has shown that learning goal representations solely for value estimation (value sufficiency) can collapse action-relevant distinctions, whereas representations optimized via the policy (log-loss) objective preserve critical action-sufficient information (Hyeon et al., 30 Jan 2026). Distance functions, metric learning, contrastive representations, and learned latent spaces are prominent in both state-based and image-based GCPs (Reichlin et al., 2024, Höftmann et al., 2023).
3. Methods for Goal-Conditioned Policy Learning
Model-Free and Model-Based Learning
- Supervised imitation: Behavioral cloning is frequently utilized when expert demonstrations are available, with goal relabeling for data efficiency.
- Actor-critic RL: Policies and critics are trained with goal as part of the context. Methods such as DAWOG (Wang et al., 2023) use dual-advantage weighting to address distribution-shift and multi-modality.
- Reward shaping and dense rewards: GO-FRESH employs self-supervised latent space learning to construct dense reward signals via latent distances and transition predictability, improving long-horizon and sparse-reward performance (Mezghani et al., 2023).
- Metric learning: Methods such as MetricRL explicitly learn latent spaces satisfying distance-monotonicity; the value function is constructed from these learned metrics, guiding policy improvement without further Bellman backup (Reichlin et al., 2024).
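The metric-learning construction in the last bullet admits a compact sketch: once an embedding φ is distance-monotone, the value is just the negative embedding distance to the goal, and one-step greedy improvement needs no Bellman backup. The embedding below is hand-coded (the identity on grid coordinates) purely to illustrate the construction; MetricRL itself learns φ.

```python
import math

def phi(state):
    """Stand-in for a learned distance-monotone embedding; here simply
    the identity on 2-D grid coordinates, for illustration only."""
    return state

def value(s, g):
    """Value read directly off the metric: negative distance to the goal."""
    (x1, y1), (x2, y2) = phi(s), phi(g)
    return -math.hypot(x1 - x2, y1 - y2)

def greedy_action(s, g, actions):
    """One-step policy improvement: pick the action whose successor has
    the highest metric value -- no further Bellman backup required."""
    def step(s, a):
        return (s[0] + a[0], s[1] + a[1])
    return max(actions, key=lambda a: value(step(s, a), g))

moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
assert greedy_action((0, 0), (3, 0), moves) == (1, 0)  # moves toward the goal
```

The interesting part is what is absent: no TD targets and no critic updates, because the learned metric already encodes reachability.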
Model-Based Methods and Planning
- Dynamic planning with GCPs: Frameworks such as GOPlan (Wang et al., 2023) combine multi-modal policy learning with model-based planning and reanalysis, generating imagined goal-reaching rollouts for fine-tuning.
- Planning with learned GCPs: Hierarchical schemes (HGCPP) build a persistent plan-tree of high-level actions (HLAs), each corresponding to short-horizon GCPs, and employ MCTS for high-level reasoning (Rens, 3 Jan 2025).
- Graph-based planning: Integration of graph planners produces subgoal sequences, which can be distilled into the policy via self-imitation (PIG) (Kim et al., 2023).
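The graph-planner integration in the last bullet can be sketched as follows: a shortest-path search over a graph of previously visited states emits a subgoal sequence, which the low-level GCP is then asked to reach one node at a time (and which PIG-style methods would additionally distill into the policy). The graph and BFS below are illustrative, not the cited method's actual machinery.

```python
from collections import deque

def plan_subgoals(graph, start, goal):
    """BFS over a graph of visited states; returns the node sequence that
    would be handed to a goal-conditioned policy as successive subgoals."""
    parent, frontier = {start: None}, deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1][1:]  # drop the start state itself
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                frontier.append(nxt)
    return None  # goal unreachable from start

# Toy state graph (edges = transitions the agent has experienced).
graph = {"s0": ["s1"], "s1": ["s2", "s3"], "s3": ["g"]}
assert plan_subgoals(graph, "s0", "g") == ["s1", "s3", "g"]
```

Each returned node is a short-horizon target for the GCP, which converts one long-horizon sparse-reward problem into a chain of easier goal-reaching problems.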
Unsupervised and Intrinsic Motivation
- Intrinsic goal discovery: Skill discovery and imitation of latent-goal trajectories using unsupervised intrinsic motivation frameworks (e.g., GPIM), where a discriminator or mutual information encourages coverage of diverse goals (Liu et al., 2021).
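The discriminator-based intrinsic reward above can be caricatured in tabular form: the agent is rewarded when the visited state makes its latent goal easy to infer, i.e., r = log q(z|s) − log p(z). Here simple visit counts with Laplace smoothing stand in for a learned discriminator q(z|s); this is a hedged sketch of the mutual-information idea, not GPIM's exact objective.

```python
import math
from collections import defaultdict

# counts[(state, latent_goal)] plays the role of a learned discriminator.
counts = defaultdict(int)

def update_discriminator(state, z):
    counts[(state, z)] += 1

def intrinsic_reward(state, z, latents):
    """r = log q(z|s) - log p(z): positive when the state reveals the latent."""
    total = sum(counts[(state, zi)] for zi in latents)
    q = (counts[(state, z)] + 1) / (total + len(latents))  # smoothed q(z|s)
    p = 1.0 / len(latents)                                  # uniform prior
    return math.log(q) - math.log(p)

latents = [0, 1]
# Latent 0 reliably leads to state "left", latent 1 to state "right".
for _ in range(10):
    update_discriminator("left", 0)
    update_discriminator("right", 1)

assert intrinsic_reward("left", 0, latents) > 0   # state identifies its latent
assert intrinsic_reward("left", 1, latents) < 0   # mismatched latent is penalized
```

Maximizing this reward pushes distinct latent goals toward distinct regions of state space, which is the coverage-of-diverse-goals effect the bullet describes.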
4. Practical Applications and Benchmarks
GCPs are widely validated in continuous control (MuJoCo), robotic manipulation (Fetch, Sawyer), navigation (2D mazes, AntMaze, underwater), and hierarchical multi-goal settings. Representative experimental scenarios include:
- Simulation: Robosuite manipulation (coffee, assembly, kitchen), grid-worlds, maze navigation, and locomotion (Ant, Humanoid), using varying degrees of sensory noise and goal variability (Zhou et al., 26 Nov 2025, Wang et al., 2023, Mavalankar, 2020).
- Offline RL: Methods such as MetricRL and GOPlan demonstrate that GCPs can achieve near-optimal performance even from severely sub-optimal or limited datasets, leveraging metric learning, model-based planning, and multi-modal policy heads (Reichlin et al., 2024, Wang et al., 2023).
- Real-world robotics: Hyper-GoalNet achieves robust manipulation under sensor and physical uncertainty, outperforming fixed-parameter baselines by large margins (Zhou et al., 26 Nov 2025). Vision-based GCPs have been successfully deployed for autonomous underwater navigation over extended open-ocean routes (Manderson et al., 2020).
- Long-horizon and sequential goals: Conditioning on sequences of goals, rather than only the current or final goal, improves sample efficiency and robustness in hierarchical or planner-driven scenarios (Serris et al., 27 Mar 2025).
5. Theoretical Insights and Representational Limits
Goal-conditioned RL exposes fundamental distinctions from classical "dense reward" approaches. Recent analysis provides:
- Optimality gap: There is a Jensen gap between optimizing the expected log-probability objective (quadratic/dense, classical control) and the goal-conditioned probabilistic reward; this gap can favor GCPs, which avoid vanishing gradients and automatically adapt reward tolerance (Lawrence et al., 6 Dec 2025).
- Dual control structure: In partially observed settings, goal-conditioned rewards inherently yield a dual-control problem, forcing the agent to balance exploration (uncertainty reduction) and exploitation (goal-reaching), even in the absence of explicit shaping (Lawrence et al., 6 Dec 2025).
- Action sufficiency versus value sufficiency: The goal representation required for optimal action selection generally contains more information than that needed for value prediction. Standard log-loss training of low-level policies recovers action-sufficient encodings, while value-based compressions may collapse necessary distinctions and degrade performance (Hyeon et al., 30 Jan 2026).
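The Jensen gap in the first point above can be stated in one line; p(g | τ) below denotes the probability that trajectory τ attains goal g (notation assumed here for illustration, and the cited work's exact objectives may differ in detail):

```latex
\mathbb{E}_{\tau \sim \pi}\!\left[\log p(g \mid \tau)\right]
\;\le\;
\log \mathbb{E}_{\tau \sim \pi}\!\left[p(g \mid \tau)\right]
```

The left-hand side recovers dense quadratic costs when p(g | τ) is Gaussian in the terminal state, while the right-hand side corresponds to maximizing the probability of actually reaching the goal; the inequality follows from Jensen's inequality applied to the concave log, and is strict unless p(g | τ) is constant over trajectories.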
6. Extensions: Hierarchies, Distributions, and Logical Tasks
GCPs provide building blocks for more sophisticated abstractions:
- Hierarchical GCPs: HLAs composed of short-horizon GCPs can be planned and reused for rapid adaptation and transfer (Rens, 3 Jan 2025).
- Distribution-conditioned policies: DisCo RL demonstrates that generalizing from single-state goals to entire goal distributions enhances task expressivity and generalization, with policies parametrized by distributions over goals or reward functions (Nasiriany et al., 2021).
- Compositional logical tasks: LS-GKDP formalizes goal-conditioned options with ordering constraints, enabling compositional reasoning over super-exponential subgoal structures and efficient zero-shot transfer via goal kernel planning (Ringstrom et al., 2020).
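Of the extensions above, distribution-conditioning admits a compact sketch. Assuming (purely for illustration) a diagonal-Gaussian goal distribution whose log-density serves as the reward, the policy would receive the distribution parameters (mu, var) as input in place of a single goal state; this is a hedged caricature of the DisCo RL idea, not its exact parameterization.

```python
import math

def disco_reward(state, mu, var):
    """Distribution-conditioned reward: log-likelihood of the current
    state under a diagonal-Gaussian goal distribution (mu, var)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (s - m) ** 2 / v)
        for s, m, v in zip(state, mu, var)
    )

mu, var = [1.0, 2.0], [0.5, 0.5]
near, far = [1.1, 2.0], [4.0, -1.0]
# States near the distribution's mode earn higher reward.
assert disco_reward(near, mu, var) > disco_reward(far, mu, var)
```

A single-state goal is recovered as the limit of shrinking variance, which is one way to see why distribution-conditioning strictly generalizes state-goal conditioning.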
7. Current Challenges and Research Directions
Despite progress, open questions persist:
- Robustness to OOD goals: Generalization to goals outside the training distribution remains a significant challenge. Model-based planning with learned priors and conditioning on goal distributions can partially address this (Wang et al., 2023, Nasiriany et al., 2021).
- Goal curriculum and discovery: Adaptive goal generation and the discovery of intermediate goals remain critical for enabling agents to self-improve on challenging tasks (Venkattaramanujam et al., 2019).
- Scaling and sample efficiency: End-to-end hypernetworks and generative policy architectures (e.g., diffusion policies) enable fast adaptation and multi-modality but raise concerns about sample efficiency and stability (Zhou et al., 26 Nov 2025, Reuss et al., 2023).
- Representation learning: Ensuring that learned goal encodings do not lose information essential for action—i.e., achieving action sufficiency—continues to shape the design of GCP systems, especially as hierarchical and compositional methods become widespread (Hyeon et al., 30 Jan 2026).
Research in goal-conditioned policies continues to evolve rapidly, integrating representational, algorithmic, and theoretical advances to push the boundaries of task-general, sample-efficient, and robust control across domains and physical platforms.