
Goal-Conditioned Value Functions

Updated 31 August 2025
  • Goal-conditioned value functions predict returns conditioned on achieving a specific goal, extending classical RL to a multitask setting.
  • They leverage dynamics-aware, self-supervised embedding methods to learn action-based distance metrics that reflect true state transitions.
  • Automatic goal generation using action noise creates an evolving online curriculum, validated by superior goal coverage in continuous-control domains.

A goal-conditioned value function generalizes the classical value function in reinforcement learning (RL): instead of predicting the expected return from a state under a single fixed reward function, it predicts the expected return from a state conditioned on the agent achieving a specified goal. This extension turns single-task learning into multitask RL, enabling agents to simultaneously generalize, plan, and evaluate policies across an entire family of objectives, goals, or subgoals. Contemporary approaches have focused on developing dynamics-aware, self-supervised, and representation-based value functions, shifting away from hand-crafted metrics toward metrics learned directly from agent-environment interaction.
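To make the object concrete, the following is a minimal sketch of a goal-conditioned value network in PyTorch. It is an illustrative assumption rather than any paper's reference implementation; the class name, architecture, and dimensions are placeholders, and the only structural point is that the goal enters as an extra input alongside the state.

```python
# Minimal sketch of a goal-conditioned value function V(s, g) in PyTorch.
# Architecture and names are illustrative assumptions, not from the paper.
import torch
import torch.nn as nn

class GoalConditionedValue(nn.Module):
    def __init__(self, state_dim: int, goal_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Expected return from `state`, conditioned on reaching `goal`.
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)
```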

1. Dynamics-Aware Distance to Goal

A foundational contribution to goal-conditioned value functions is the introduction of an action-based, or “action distance,” metric between states. In this formulation, the distance between two states $s_1$ and $s_2$ is defined as the average number of actions required to traverse from one state to the other and vice versa under a given policy $\pi$:

$$d^\pi(s_1, s_2) = \tfrac{1}{2}\, m(s_2 \mid s_1) + \tfrac{1}{2}\, m(s_1 \mid s_2),$$

where $m(s_2 \mid s_1)$ denotes the expected number of steps to reach $s_2$ from $s_1$ for the first time (the first-passage time). This “commute time” is a dynamics-aware analogue of Euclidean distance, inherently capturing the temporal and environmental characteristics of the agent's interactions.

Since direct computation of $d^\pi$ in high-dimensional or online RL settings is intractable, a learnable embedding function $e_\theta$ is introduced. The mapping is trained so that the $p$-norm distance in the embedding space aligns with the number of actions needed between any two states, as enforced by the loss

$$\theta^* = \arg\min_\theta \left( \left\| e_\theta(s_i) - e_\theta(s_j) \right\|_p^q - d^\pi(s_i, s_j) \right)^2.$$

This ensures the learned representation reflects the environment’s true reachability structure, providing accurate, dynamics-aware success criteria for goal achievement (Venkattaramanujam et al., 2019).
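A compact sketch of this regression objective is shown below, assuming a PyTorch embedding module `e_theta`, a batch of state pairs, and precomputed empirical targets `d_pi`; the hyperparameters `p` and `q` are placeholders rather than values prescribed by the paper.

```python
# Sketch of the embedding regression loss: || e(s_i) - e(s_j) ||_p^q should
# match the empirical action distance d^pi(s_i, s_j). `p` and `q` are assumed.
import torch

def embedding_distance_loss(e_theta, s_i, s_j, d_pi, p: float = 2.0, q: float = 1.0):
    emb_dist = torch.norm(e_theta(s_i) - e_theta(s_j), p=p, dim=-1) ** q
    return ((emb_dist - d_pi) ** 2).mean()
```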

2. Self-Supervised Distance Learning and Policy Integration

The distance estimator is trained in a self-supervised fashion, exploiting trajectories collected from the current policy (or exploration policy). First-passage times $m(s_j \mid s_i)$ are empirically estimated from observed rollouts, and the embedding network is trained concurrently with the policy. This approach automatically aligns reward signals with the agent’s evolving behavioral competence, eliminating the need for manually engineered success metrics.
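One way such empirical first-passage estimates could be computed from rollouts is sketched below, under the assumption that states can be discretized or hashed by a `key` function for bookkeeping; the estimator is illustrative and not the paper's exact procedure.

```python
# Rough sketch: estimate first-passage times from trajectories, then average
# the two directions to obtain the symmetrized "commute time" distance.
from collections import defaultdict
import numpy as np

def estimate_commute_distance(trajectories, key=tuple):
    """Return a dict {(a, b): d_hat} of symmetrized action distances."""
    gaps = defaultdict(list)  # (a, b) -> observed first-passage times a -> b
    for traj in trajectories:
        keys = [key(s) for s in traj]
        for t0, a in enumerate(keys):
            seen = set()
            for dt, b in enumerate(keys[t0 + 1:], start=1):
                if b not in seen:          # first visit of b after time t0
                    gaps[(a, b)].append(dt)
                    seen.add(b)
    m = {pair: float(np.mean(v)) for pair, v in gaps.items()}
    return {
        (a, b): 0.5 * m_ab + 0.5 * m[(b, a)]
        for (a, b), m_ab in m.items()
        if (b, a) in m
    }
```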

The learned distance function serves both as a shaping reward and as the termination condition for goal-reaching. Success is declared not by hand-crafted thresholds (e.g., proximity in L₂ norm) but by the agent entering an “$\varepsilon$-sphere” around the goal in the learned embedding space, which more accurately corresponds to feasible transitions and temporal costs.
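As a sketch, the learned embedding can supply both signals in a single step function; `e_theta`, the threshold `eps`, and the negative-distance shaping term below are assumptions made for illustration.

```python
# Sketch: use the learned embedding distance both for reward shaping and for
# the goal-reached termination test. Names and constants are assumptions.
import torch

def learned_distance(e_theta, state, goal, p: float = 2.0):
    return torch.norm(e_theta(state) - e_theta(goal), p=p, dim=-1)

def step_signal(e_theta, state, goal, eps: float = 1.0):
    d = learned_distance(e_theta, state, goal)
    reached = d <= eps          # success: inside the epsilon-sphere in latent space
    reward = -d                 # shaped reward: negative learned distance
    return reward, reached
```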

In practice, this pipeline enables curriculum learning, as the embedding space adapts during training, allowing clear separation between states that are easy to reach and those that require more sophisticated behaviors. Goal-conditioned value functions then act as a unified signal for both reward assignment and goal-attainment verification across tasks where (a) the goal space equals the state space, (b) the goal space lacks a known distance metric, or (c) only a subset of the state space is feasible as a goal (Venkattaramanujam et al., 2019).

3. Dynamics-Informed Automatic Goal Generation

To foster exploration and prevent stagnation around already-mastered goals, the framework introduces a simple, dynamics-respecting goal generation mechanism. Upon successful attainment of a goal, the agent continues to act randomly for several steps. The resulting states are stored and sampled as future candidate goals. This “action-noise-based” approach ensures that new goals are near the boundary of the agent's current reachable set, resulting in a sequence of “Goals of Intermediate Difficulty” (GOID). Automatic goal generation in this manner produces an online curriculum, guaranteeing continual skill development in regions of the state space that are both accessible and incrementally challenging, in stark contrast to GAN-based or heuristic offline strategies.
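A minimal sketch of this mechanism is given below, assuming a Gym-style environment interface; the buffer class, step counts, and helper names are illustrative.

```python
# Sketch of action-noise-based goal generation: after a goal is reached, take
# a few random actions and bank the visited states as future candidate goals.
import random

class GoalBuffer:
    def __init__(self, capacity: int = 10_000):
        self.goals, self.capacity = [], capacity

    def add(self, state):
        if len(self.goals) >= self.capacity:
            self.goals.pop(random.randrange(len(self.goals)))
        self.goals.append(state)

    def sample(self):
        return random.choice(self.goals)

def expand_goals(env, state, goal_buffer, n_noise_steps: int = 10):
    """Called once the current goal has been reached at `state`."""
    for _ in range(n_noise_steps):
        action = env.action_space.sample()      # action noise
        state, _, done, *_ = env.step(action)
        goal_buffer.add(state)                  # candidate goal near the frontier
        if done:
            break
```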

4. Empirical Validation: Applications and Metrics

Empirical results demonstrate the approach in continuous-control domains, such as Point Mass and Ant (including Maze Ant) in MuJoCo. In all settings, the learned distance function provided a robust measure of goal attainment that outperformed or matched hand-engineered L₂ metrics.

A critical evaluation metric was “coverage”: the fraction of possible goals reached by the agent averaged over a goal grid. Agents using the learned value function-based metric achieved higher or comparable coverage to those using naïve L₂-based distances, demonstrating superior generalization and no dependence on specialized domain knowledge.
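A coverage score of this kind could be computed as sketched below, assuming an evaluation helper `rollout_reaches(agent, env, goal)` that runs one episode toward `goal` and reports success; names and episode counts are placeholders.

```python
# Sketch of the "coverage" metric: average success rate over a grid of goals.
import numpy as np

def goal_coverage(agent, env, goal_grid, rollout_reaches, episodes_per_goal: int = 5):
    rates = []
    for goal in goal_grid:
        successes = sum(rollout_reaches(agent, env, goal)
                        for _ in range(episodes_per_goal))
        rates.append(successes / episodes_per_goal)
    return float(np.mean(rates))
```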

Off-policy training of the distance predictor was shown to be particularly beneficial, mitigating the “expanding $\varepsilon$-sphere” issue (where the distance function’s geometry inadvertently enlarges the set of states considered ‘close’) (Venkattaramanujam et al., 2019).

5. Theoretical and Methodological Impact

By constructing a representation in which the $p$-norm between embeddings matches the (policy-dependent) number of required actions, the framework reframes value estimation. Rather than an abstract expected return, the value function becomes a direct proxy for a physically grounded, environment-informed distance to the goal. This perspective enables more effective goal-achievement signals, policy improvement, and reward design, especially in online, reward-free, and domain-agnostic RL settings.

The self-supervised distance learning mechanism also represents a methodological advance: since it does not require externally provided rewards or human-defined metrics, it is broadly applicable to diverse environments where explicit rewards or metric structure may be unavailable or ill-defined.

6. Limitations, Scope, and Extensions

While the method dispenses with prior domain knowledge, it does introduce several challenges. (1) The embedding space geometry must track the continually evolving policy, requiring careful design of off-policy updating and regularization to avoid artifacts. (2) State-visit frequency adjustments are required to appropriately balance frequently and rarely visited states—a challenge addressed via a stationary distribution-weighted variant of the commute time.

The framework is compatible with other goal-conditioned RL extensions, such as hierarchical decomposition, latent variable planning, and curriculum learning. Its principle—dynamics-aware representation aligned with first-passage temporal costs—serves as a foundation for subsequent advances in goal-conditioned value function learning and curriculum design.

7. Summary Table of Core Mechanisms

| Component | Methodological Role | Implementation Insight |
|---|---|---|
| Action distance ($d^\pi$) | Defines an agent-centric, dynamics-based metric | Estimated via first-passage times |
| Embedding ($e_\theta$) | Latent representation aligned with $d^\pi$ | Self-supervised, trained with a $p$-norm regression loss |
| Goal generation | Automatic, online curriculum | Action noise applied after a goal is reached |
| Policy learning | Uses the learned distance as reward and termination signal | RL driven by embedding-based metric signals |

This framework provides a dynamics-informed, self-adaptive, and curriculum-compatible foundation for robust goal-conditioned value function estimation and efficient policy learning. Experimental and theoretical findings indicate that learned distance-to-goal functions match or surpass hand-specified, domain-specific metrics, enable efficient goal exploration, and support practical, scalable goal-conditioned RL.
