
Goal-Conditioned Value Functions

Updated 16 October 2025
  • Goal-conditioned value functions are a reinforcement learning generalization where returns are computed relative to both the current state and a specified goal, enabling simultaneous multi-task learning.
  • They leverage techniques like contrastive representation and distance-based methods to accurately model expected reachability and support curriculum learning and hierarchical policy design.
  • These functions have been applied successfully in robotics, locomotion, and language-based reasoning, demonstrating robust performance in sparse reward environments and complex planning tasks.

A goal-conditioned value function is a generalization of the classical value function in reinforcement learning (RL) in which the expected return (or cost) is computed not only with respect to the current state (and possibly action), but also with respect to a user-specified goal. By explicitly conditioning policies and value functions on goals, RL agents can learn to solve a family of tasks concurrently, reason about subgoal attainment, and generalize behavior across a continuum of objectives. Goal-conditioned value functions therefore lie at the core of goal-conditioned policy architectures and multi-goal RL, and they have enabled rapid progress in unsupervised skill acquisition, curriculum learning, and flexible robotic control.

1. Formal Definitions and Key Properties

In the Markov Decision Process (MDP) framework with goal conditioning, a goal $g \in \mathcal{G}$ is sampled i.i.d. (per episode or per transition) and the agent’s reward function is modified to $r_g(s, a)$ so that it is a function of the goal (often sparse, e.g., $r_g(s, a) = \mathbb{I}[s = g]$). The goal-conditioned state-value function is

$$V^\pi_g(s) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_g(s_t, a_t) \mid s_0 = s \right],$$

and the goal-conditioned action-value function is

$$Q^\pi_g(s, a) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_g(s_t, a_t) \mid s_0 = s, a_0 = a \right].$$

Key properties arising from this definition include:

  • The value function’s dependence on $g$ allows multi-task behavior from a single policy.
  • The value landscape encodes, for each $(s, g)$ pair, the expected reachability or cost-to-go.
  • Under certain reward structures (e.g., a step penalty of $-1$ until the goal is reached), $V^*_g(s)$ is monotonically related to the expected distance (in steps) to $g$ (Opryshko et al., 8 Oct 2025).

In sparse reward settings, the optimal value function can be mapped to an effective “distance” function, for instance via

$$\hat{d}(s, g) = \log_\gamma\left[ 1 + (1 - \gamma) V^*_g(s) \right]$$

when using per-step penalties (Opryshko et al., 8 Oct 2025).
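
To ground this mapping, here is a minimal Python sketch that converts an optimal goal-conditioned value under a $-1$ per-step penalty into an effective step distance; the function names and the numerical guard are illustrative choices, not taken from the cited work.

```python
import numpy as np

def sparse_step_reward(achieved, goal, tol=1e-3):
    """Goal-conditioned step-penalty reward: -1 per step until the goal is reached."""
    return 0.0 if np.linalg.norm(achieved - goal) < tol else -1.0

def value_to_distance(v_star, gamma):
    """Map V*_g(s) under a -1 per-step penalty to the effective distance
    d_hat(s, g) = log_gamma[1 + (1 - gamma) * V*_g(s)]."""
    inner = np.clip(1.0 + (1.0 - gamma) * v_star, 1e-8, 1.0)  # guard against estimation error
    return np.log(inner) / np.log(gamma)

# Sanity check: a state exactly 10 steps from the goal with gamma = 0.99
gamma = 0.99
v_star = -(1.0 - gamma**10) / (1.0 - gamma)  # closed-form optimal value for 10 steps
print(value_to_distance(v_star, gamma))      # ~10.0
```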

2. Approaches to Goal-Conditioned Value Function Learning

Model-Free Temporal Difference and Distance-Based Methods

Early goal-conditioned RL methods applied standard TD learning to $Q(s, a, g)$ or $V(s, g)$ with HER-based relabeling to address reward sparsity (e.g., DDPG+HER, GC-UVFA) (Lawrence et al., 10 Feb 2025); a minimal relabeling sketch follows the list below. More recent works have developed:

  • Self-Supervised Action Distance Estimation: Learning an embedding $e_\theta(\cdot)$ such that the $p$-norm $\|e_\theta(s) - e_\theta(g)\|_p$ approximates the expected number of steps (action distance) between $s$ and $g$. Training uses a metric MDS loss on empirical passage times and typically avoids hand-designed distance functions (Venkattaramanujam et al., 2019).
  • Contrastive Representation Learning: Learning goal-conditioned value functions by training encoders under a contrastive loss where positives are achieved future states and negatives are sampled from unrelated goals. The inner product of learned representations becomes a scaled version of the value function, providing a direct link between success likelihood and representation geometry (Eysenbach et al., 2022).
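
As referenced above, the following is a minimal sketch of HER-style “future” goal relabeling; the dictionary transition format (with achieved_goal / next_achieved_goal fields) and the reward_fn argument are assumptions for illustration, not a specific implementation from the cited papers.

```python
import random

def her_relabel(trajectory, reward_fn, k=4):
    """HER-style 'future' relabeling: for each transition, also store k copies
    whose goal is an achieved state drawn from later in the same trajectory."""
    relabeled = []
    for t, tr in enumerate(trajectory):
        relabeled.append(tr)  # keep the original transition and its goal
        for _ in range(k):
            j = random.randint(t, len(trajectory) - 1)   # sample a future step
            new_goal = trajectory[j]["achieved_goal"]
            relabeled.append({
                "obs": tr["obs"],
                "action": tr["action"],
                "next_obs": tr["next_obs"],
                "goal": new_goal,
                # Recompute the sparse reward against the relabeled goal.
                "reward": reward_fn(tr["next_achieved_goal"], new_goal),
            })
    return relabeled
```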

Density and Metric Learning

  • Universal Value Density Estimation (UVD): Recasting the Q-function as the discounted density over achievable goals, then leveraging density estimation (e.g., normalizing flows) to directly fit $Q(s, a, g)$ as a probability measure (Schroecker et al., 2020).
  • Metric Learning with Distance Monotonicity: Enforcing latent mappings that preserve stepwise distances or maintain the ordering of distances between states and goals (distance monotonicity) enables the greedy policy with respect to the latent metric to be optimal. The value is then parameterized as $V(s, g) = \gamma^{d_Z(\phi(s), \phi(g))} r_g$ (Reichlin et al., 16 Feb 2024); see the sketch after this list.
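
A minimal PyTorch sketch of the latent-metric value parameterization, assuming a simple MLP encoder and Euclidean latent distance; the architecture and names are illustrative:

```python
import torch
import torch.nn as nn

class MetricValue(nn.Module):
    """Value parameterized through a latent metric:
    V(s, g) = gamma ** d_Z(phi(s), phi(g)) * r_g."""

    def __init__(self, obs_dim, latent_dim=32, gamma=0.99):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.gamma = gamma

    def latent_distance(self, s, g):
        # Euclidean distance in the learned latent space, one value per batch row.
        return torch.norm(self.phi(s) - self.phi(g), dim=-1)

    def forward(self, s, g, r_g=1.0):
        return self.gamma ** self.latent_distance(s, g) * r_g
```

Training would additionally enforce the distance-monotonicity constraint on $\phi$, which is omitted from this sketch.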

Physics-Informed and Structured Regularization

  • Physics-Informed Regularizers: Adding a regularization term based on the Eikonal PDE (encouraging $\|\nabla_s V(s, g)\| \approx \text{const}$) shapes learned value functions to resemble distance fields, improving geometric fidelity and long-horizon generalization (Giammarino et al., 8 Sep 2025); a minimal autograd sketch follows.
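
A minimal autograd sketch of an Eikonal-style penalty, assuming a differentiable value network value_fn(states, goals); the target gradient norm and how the penalty is weighted into the total loss are illustrative assumptions:

```python
import torch

def eikonal_penalty(value_fn, states, goals, target_norm=1.0):
    """Eikonal-style regularizer: push ||grad_s V(s, g)|| toward a constant so the
    learned value behaves like a distance field over the state space."""
    states = states.clone().requires_grad_(True)
    v = value_fn(states, goals).sum()
    (grad_s,) = torch.autograd.grad(v, states, create_graph=True)
    return ((grad_s.norm(dim=-1) - target_norm) ** 2).mean()

# Typical usage (illustrative): total_loss = td_loss + lam * eikonal_penalty(V, s, g)
```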

Knowledge Distillation and Robustness

  • Knowledge Distillation: Applying attention-transfer losses between the gradients of the student and teacher Q-functions with respect to the goal provides richer supervision, speeding up learning especially as the dimensionality of the goal increases (Levine et al., 2022); a hedged sketch appears after this list.
  • Scenario-Based Robustness: Incorporating scenario trees at both RL and MPC levels yields value functions robust against uncertain system parameters by averaging costs over sampled disturbance realizations (Lawrence et al., 10 Feb 2025).
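
A hedged sketch of the gradient-based attention-transfer idea, matching normalized magnitudes of the student’s and teacher’s Q-gradients with respect to the goal; this is one plausible instantiation, not the exact loss used in the cited work:

```python
import torch
import torch.nn.functional as F

def goal_gradient_at_loss(student_q, teacher_q, s, a, g):
    """Attention-transfer-style distillation term: align normalized |dQ/dg| maps
    of the student and (frozen) teacher Q-functions."""
    g = g.clone().requires_grad_(True)
    q_student = student_q(s, a, g).sum()
    q_teacher = teacher_q(s, a, g).sum()
    (grad_student,) = torch.autograd.grad(q_student, g, create_graph=True)
    (grad_teacher,) = torch.autograd.grad(q_teacher, g)
    return F.mse_loss(F.normalize(grad_student.abs(), dim=-1),
                      F.normalize(grad_teacher.abs(), dim=-1).detach())
```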

3. Curriculum Learning and Goal Generation

A critical challenge in goal-conditioned RL is ensuring efficient exploration and curriculum construction:

  • Action Noise Goal Generation: Generating new goals by perturbing existing ones with short random-action rollouts ensures feasibility and an adaptive curriculum, circumventing the need for complex domain-informed methods (Venkattaramanujam et al., 2019); a minimal sketch follows this list.
  • Landmark-Guided and Prospect Measures: Incorporating graph-based planning with the goal-conditioned value function enables landmark selection. The “prospect” of a subgoal $g_t$ can be defined via its distance in the latent (or embedding) space to strategic landmarks found by farthest point sampling, guiding high-level exploration effectively (Cui et al., 2023).
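
A minimal sketch of action-noise goal generation, assuming a gym-like environment with hypothetical reset_to(state) and achieved_goal() helpers (these are assumptions for illustration, not a standard API):

```python
import random

def action_noise_goals(env, seed_states, n_new=32, rollout_len=5):
    """Generate new candidate goals by short random-action rollouts that start
    from states associated with already-reachable goals."""
    new_goals = []
    for _ in range(n_new):
        env.reset_to(random.choice(seed_states))   # hypothetical helper: restore a state
        for _ in range(rollout_len):
            env.step(env.action_space.sample())    # short random rollout
        new_goals.append(env.achieved_goal())      # hypothetical helper: read achieved goal
    return new_goals
```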

4. Hierarchical and Temporally Abstracted Value Functions

Goal-conditioned architectures benefit from hierarchical and temporally abstracted methods to address long-horizon challenges:

  • Hierarchical Value Decomposition: Splitting policy learning into high-level (subgoal/pseudo-goal selection) and low-level (primitive action) policies, both grounded in a shared value function $V(s, g)$, enables robust decomposition. High-level policies operate in a latent or embedding space, which may be learned via VAEs or contrastive objectives (Park et al., 2023).
  • Option-Aware TD Learning: Employing option-aware value updates, where bootstrapping happens only upon transitions induced by temporally extended options, contracts the effective horizon and stabilizes value propagation on long-horizon offline datasets (2505.12737); a schematic backup is sketched below.
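
A schematic option-aware backup, written as a single TD-target computation in which bootstrapping occurs only at the option’s termination state; the function is illustrative rather than an implementation of the cited method:

```python
def option_td_target(rewards, gamma, v_terminal, goal_reached=False):
    """Option-aware TD target: accumulate the discounted rewards collected while a
    temporally extended option runs for k steps, then bootstrap once with
    gamma**k * V(s_{t+k}, g) at the option's termination state."""
    k = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    bootstrap = 0.0 if goal_reached else gamma**k * v_terminal
    return discounted + bootstrap

# Example: a 5-step option with -1 per-step rewards and V(s_{t+5}, g) = -12.0
print(option_td_target([-1.0] * 5, 0.99, -12.0))
```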

5. Applications and Empirical Findings

Goal-conditioned value function learning is validated in a variety of domains:

  • Robotics and Locomotion: Sample-efficient curriculum learning, robust transfer to out-of-distribution goals, and reliable zero-shot transfer are demonstrated in point/ant mazes, dexterous hand control, manipulation, and navigation (Hong et al., 2022, Lawrence et al., 10 Feb 2025, Opryshko et al., 8 Oct 2025).
  • Model Predictive Control (MPC): Goal-conditioned terminal value functions, integrated into MPC, enable real-time and multitask control by allowing the terminal value to adapt to varying goals (e.g., tracking on sloped terrain with a biped) (Morita et al., 7 Oct 2024).
  • Language-Based Reasoning: Goal-conditioned value critics, applied at the “thought” or reasoning step level, are used to guide open-vocabulary agents such as LLMs across multi-turn dialogue, tool-use, and social tasks, allowing efficient lightweight planning over reasoning chains (Hong et al., 23 May 2025).

Empirical results highlight substantial improvements in learning efficiency, stable horizon contraction, value function order consistency, trajectory stitching, and robust performance in the presence of sparse or stochastic reward structures.

6. Planning, Test-Time Algorithms, and Limitations

The structure encoded by goal-conditioned value functions provides actionable metrics for test-time planning:

  • Test-Time Graph Search (TTGS): Building a state graph with edge weights derived from value-based distances and performing Dijkstra’s search produces subgoal sequences that enable trajectory stitching, significantly boosting the performance of frozen policies on long-horizon, sparse-reward tasks (Opryshko et al., 8 Oct 2025); a minimal sketch follows this list.
  • Correction of Value Artifacts: Combining model-based planning over the learned value landscape with graph-based value aggregation addresses both local estimation artifacts ($\mathcal{T}$-local optima) and global propagation errors, yielding robust zero-shot goal-reaching behavior (Bagatella et al., 2023).
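
A minimal sketch of the graph-search idea: connect dataset states whose value-derived distance (for instance $\hat{d}$ from Section 1) falls below a threshold, then run Dijkstra’s algorithm to obtain a subgoal sequence for a frozen policy; the threshold, data structures, and names are illustrative, not the published TTGS implementation.

```python
import heapq

def build_value_graph(states, dist_fn, max_edge=5.0):
    """Connect dataset states whose value-derived distance is below a threshold;
    the edge weight is the estimated number of steps between the two states."""
    graph = {i: [] for i in range(len(states))}
    for i, s_i in enumerate(states):
        for j, s_j in enumerate(states):
            if i != j:
                d = dist_fn(s_i, s_j)
                if d <= max_edge:
                    graph[i].append((j, d))
    return graph

def dijkstra_subgoals(graph, start, goal):
    """Shortest path of subgoal indices from start to goal under value distances."""
    dist, prev = {start: 0.0}, {}
    queue = [(0.0, start)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(queue, (nd, v))
    if goal != start and goal not in prev:
        return []                          # goal unreachable under the pruned graph
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return list(reversed(path))            # subgoal indices to feed the frozen policy
```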

A notable limitation is the accumulation of estimation errors over long horizons, which can lead to unreliable subgoal selection or local optima in value estimation. Regularization techniques, horizon contraction (via options or other temporal abstraction), and planning with subgoal graphs or model-predictive rollouts are necessary mitigations.

7. Outlook and Future Directions

Goal-conditioned value function research is actively progressing across several axes:

  • Integration with large-scale unsupervised data and offline RL for scalable skill reuse.
  • Merging RL value functions with model-predictive and symbolic planning frameworks in hybrid closed-loop systems.
  • Advancements in geometric and structural regularization to encode environmental priors (e.g., via PDEs or spectral methods).
  • Extension to high-dimensional, perceptual, or multi-modal goal spaces using contrastive, metric, and density-based objectives.
  • Generalization to reasoning, language, and multi-agent settings via value critics over abstract state/goal representations.

Development in curriculum design, action-free data support, and robust multi-goal learning will further enhance the applicability of goal-conditioned value functions to diverse RL and decision-making problems. The confluence of geometric, hierarchical, and physics-informed perspectives provides numerous fruitful directions for future research.
