Goal-Conditioned Agents in Reinforcement Learning

Updated 4 July 2026

Goal-conditioned agents are reinforcement learning models that condition actions on both current state and explicit goals, enabling versatile multi-task achievement.
They incorporate strategies like goal relabeling, language and visual grounding, and subgoal planning to address sparse rewards and improve exploration.
Recent advancements leverage dense rewards, adversarial training, and open-ended goal generation to enhance autonomy, scalability, and robust policy transfer.

Goal-conditioned agents are reinforcement-learning agents whose behavior depends on both the current state or observation and an explicit goal, typically through a policy of the form $\pi(a \mid s,g)$. In the standard survey formalism, goal-conditioned reinforcement learning (GCRL) augments the underlying MDP with a goal space $\mathcal{G}$, a goal distribution $p_g$, and a mapping $\phi$ from states to achieved-goal representations; the agent then learns a single policy that can solve many tasks or reach many targets rather than optimizing one fixed reward function (Liu et al., 2022). In developmental formulations, a goal is not merely a target state but a pair $g=(z_g,R_g)$ consisting of a compact goal representation and a goal-achievement function, so that the policy and the achievement criterion jointly determine what it means to solve the goal (Colas et al., 2020). Recent work extends this paradigm beyond externally specified reward maximization to reward-free autonomous learning, temporal-goal conditioning, multi-agent inference, and all-goals learning at scale (Åström et al., 6 Nov 2025).

1. Formal structure and core problem formulation

Goal-conditioned reinforcement learning is commonly presented as a generalization of standard RL in which the policy is conditioned on a goal input. The standard contrast is between a conventional policy $\pi(a \mid s)$ trained for one reward function and a goal-conditioned policy $\pi(a \mid s,g)$ trained under a distribution of goals. The survey literature also distinguishes three closely related notions: the desired goal, the achieved goal, and the behavioral goal. A common goal-reaching criterion is sparse and thresholded, for example $\mathds{1}(\|\phi(s_{t+1})-g\|\le\epsilon)$, while a common dense alternative is a distance-shaped reward such as $\tilde r_g(s_t,a_t,g)=-d(\phi(s_{t+1}),g)$ ; the same surveys note that dense shaping can create local optima and can discourage necessary detours (Liu et al., 2022).
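The two reward styles above can be made concrete with a minimal sketch. The names `achieved_goal`, `sparse_reward`, and `dense_reward`, the choice of Euclidean distance for $d$, and the assumption that $\phi$ simply selects the first two state coordinates are all illustrative and not taken from the cited surveys.

```python
import math
from typing import Sequence, Tuple


def achieved_goal(state: Sequence[float]) -> Tuple[float, ...]:
    """Hypothetical phi: treat the first two state coordinates as the achieved goal."""
    return tuple(state[:2])


def sparse_reward(next_state: Sequence[float], goal: Sequence[float],
                  eps: float = 0.05) -> float:
    """Indicator reward: 1 when the achieved goal lies within eps of the desired goal."""
    return 1.0 if math.dist(achieved_goal(next_state), goal) <= eps else 0.0


def dense_reward(next_state: Sequence[float], goal: Sequence[float]) -> float:
    """Distance-shaped reward: negative Euclidean distance to the desired goal."""
    return -math.dist(achieved_goal(next_state), goal)


if __name__ == "__main__":
    next_state = (0.0, 0.0, 9.0)  # the third entry is not part of the achieved goal
    goal = (3.0, 4.0)
    print(sparse_reward(next_state, goal), dense_reward(next_state, goal))
```

The sparse version gives no gradient of progress until the threshold is crossed, while the dense version rewards every step that shortens the straight-line distance, which is exactly what can penalize a necessary detour around an obstacle.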

Developmental and autotelic formulations make the semantics of goals more explicit. In that line of work, a goal-conditioned policy is written as $\Pi:\mathcal{S}\times\mathcal{Z}_G\to\mathcal{A}$, where $\mathcal{Z}_G$ is a goal-embedding space and the associated goal-achievement function determines whether progress has been made. This framing treats goal representation, goal achievement, and goal selection as distinct but coupled problems. The same literature presents RL-based Intrinsically Motivated Goal Exploration Processes as a computational framework in which an agent samples a goal, acts with a goal-conditioned policy, computes internal reward, and updates both the policy and the goal-related modules (Colas et al., 2020).

A closely related formal distinction appears in recent work on open-ended learning problems for goal-conditioned agents. There, a goal-conditioned RL problem is defined as a tuple whose reward function is goal-conditioned, and an open-ended GCRL problem is one in which the goals come from an open-ended generation process. The central property is novelty over an infinite horizon: for any time $t$, there exists a later time $t' > t$ at which the process produces a token that is new from an observer’s perspective. This separates open-endedness from autotelic learning and lifelong learning, which are treated as orthogonal properties rather than synonyms (Sigaud et al., 2023).

2. Goal representations and grounding mechanisms

A large part of the literature concerns how goals are represented. Survey work groups goal representations into vector goals, image goals, language goals, and broader constructions such as abstract binary problems and multi-objective balances. Vector goals are common in control and robotics; image goals place the state and goal in the same high-dimensional observation space; language goals express goals through instructions or predicates; and abstract binary goals specify satisfaction of constraints rather than proximity to a target state (Liu et al., 2022). The developmental survey broadens this typology by treating goals as target features, abstract binary problems, or weighted balances among multiple objectives, and by cataloguing conditioning mechanisms ranging from simple concatenation to FiLM-style modulation and neural module networks (Colas et al., 2020).
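As a rough illustration of the two ends of the conditioning spectrum mentioned above, the sketch below contrasts simple concatenation with FiLM-style modulation, in which the goal produces a per-feature scale and shift applied to hidden features. The helper names, the plain-list representation, and the use of single linear maps to produce the scale and shift are assumptions made for illustration only.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """Plain affine map y = W x + b, written with lists to stay dependency-free."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def concat_condition(state_features: Sequence[float], goal: Sequence[float]) -> List[float]:
    """Concatenation: the network simply sees [state ; goal] as one input vector."""
    return list(state_features) + list(goal)


def film_condition(hidden: Sequence[float], goal: Sequence[float],
                   w_gamma: Matrix, b_gamma: Sequence[float],
                   w_beta: Matrix, b_beta: Sequence[float]) -> List[float]:
    """FiLM-style modulation: the goal yields a scale (gamma) and shift (beta)."""
    gamma = linear(w_gamma, b_gamma, goal)
    beta = linear(w_beta, b_beta, goal)
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]


if __name__ == "__main__":
    hidden, goal = [1.0, 2.0], [1.0]
    print(concat_condition(hidden, goal))
    print(film_condition(hidden, goal, [[2.0], [0.5]], [0.0, 0.0],
                         [[0.0], [1.0]], [0.0, 0.0]))
```

Concatenation leaves it to later layers to discover how the goal should interact with the state, whereas modulation lets the goal rescale state features directly.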

One influential way to separate control from semantics is to use language to generate goals rather than actions. In “Language-Conditioned Goal Generation: a New Approach to Language Grounding for RL” (Colas et al., 2020), a language-conditioned goal generator maps an instruction and the current state into a distribution over language-agnostic goals for a separate goal-conditioned policy. In the paper’s Fetch Manipulate instantiation, semantic configurations are binary predicate-based vectors over relations such as close and above; with three objects, the semantic configuration has size $\mathcal{G}$ 4 in $\mathcal{G}$ 5, and the agent can reach 35 physically valid configurations. The goal generator is a c-VAE trained with a reconstruction-plus-KL objective with $\mathcal{G}$ 6, latent size $\mathcal{G}$ 7, batch size $\mathcal{G}$ 8, Adam at learning rate $\mathcal{G}$ 9, and $p_g$ 0 training epochs. The reported results are CP around $p_g$ 1– $p_g$ 2 and Cov around $p_g$ 3– $p_g$ 4 across five generalization settings, while end-to-end grounding yields Transition SR1 $p_g$ 5, SR5 $p_g$ 6, Expression SR1 $p_g$ 7, SR5 $p_g$ 8, and an average of $p_g$ 9 successful instructions before failure in the sequence setting (Colas et al., 2020).

Visual goal grounding has likewise shifted from privileged coordinates to representations intended to preserve task-relevant structure while improving transfer. “General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks” (Shahriar et al., 6 Oct 2025) represents the goal as a binary mask channel appended to the camera image, with the mask generated from privileged target position in simulation, color/shape segmentation, or open-vocabulary detectors such as Detic and Grounding DINO. The masked image is $\phi$ 0, and three recent augmented frames are stacked. The same work defines a mask-based dense reward from normalized activated mask area and a scaled sigmoid, reports 99.9% average reaching accuracy on both train and unseen test objects, 96% success on in-distribution real UR10e objects and 92% success on both OOD test sets, and shows that a Franka-Detic agent learns an optimal reaching policy in about 60,000 steps (Shahriar et al., 6 Oct 2025). Closely related representation-learning approaches include DR-GRL, which uses a weakly supervised Spatial Transform AutoEncoder to separate shape, color, and position and computes reward from the position latent, and DGRL, which inserts a factorial vector-quantization bottleneck so that goals are specified as discrete combinations of learned factors (Qian et al., 2022, Islam et al., 2022).
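The mask-as-goal idea can be sketched in a few lines. This is a toy under stated assumptions: images are nested lists, the mask is already given, and the reward simply passes the activated-area fraction through a sigmoid with hypothetical `scale` and `offset` parameters. The exact normalization and sigmoid scaling used by Shahriar et al. are not reproduced here.

```python
import math
from typing import List

Image = List[List[List[float]]]  # height x width x channels
Mask = List[List[int]]           # height x width, entries 0 or 1


def append_mask_channel(image: Image, mask: Mask) -> Image:
    """Return a copy of the image with the binary goal mask as an extra channel."""
    return [[pixel + [float(m)] for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def mask_area_fraction(mask: Mask) -> float:
    """Fraction of pixels in which the goal mask is active."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def mask_reward(mask: Mask, scale: float = 10.0, offset: float = 0.5) -> float:
    """Hypothetical dense reward: a sigmoid of the normalized activated mask area."""
    x = scale * (mask_area_fraction(mask) - offset)
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    image = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
             [[0.7, 0.8, 0.9], [0.0, 0.0, 0.0]]]
    mask = [[1, 0], [0, 0]]
    print(append_mask_channel(image, mask)[0][0], mask_reward(mask))
```

Because the policy only ever sees "where the target is" rather than "what the target is", the same network input format applies to objects it has never encountered, which is the property the paper relies on for transfer.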

These developments indicate that the practical efficacy of goal-conditioned agents depends strongly on whether the goal representation aligns with control-relevant invariants. A plausible implication is that the central representation question in GCRL is not only expressivity but also whether goals can be sampled, relabeled, compared, and grounded without introducing nuisance variation or unattainable targets.

3. Goal generation, relabeling, and autonomous curricula

A defining feature of modern goal-conditioned agents is that a single trajectory can be reused for many goals. Survey treatments identify three major intervention points in GCRL: direct optimization of goal-conditioned policies, sub-goal selection or generation, and goal relabeling. Hindsight Experience Replay is the canonical relabeling method: a failed trajectory is relabeled with goals the agent actually achieved, converting failure experience into off-policy supervision. The same surveys place HER alongside curriculum relabeling, density-based relabeling, and learning-based relabeling, and treat intermediate difficulty, novelty, and learning progress as major principles for goal sampling (Liu et al., 2022, Colas et al., 2020).
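A minimal sketch of hindsight relabeling follows, assuming a grid world with tuple-valued states, a sparse reward of 0 on success and -1 otherwise, and the commonly used "future" strategy in which substitute goals are drawn from states reached later in the same episode. The `Transition` type and `reward_fn` are hypothetical stand-ins for whatever the replay buffer and environment provide.

```python
import random
from typing import List, NamedTuple, Optional, Tuple

State = Tuple[int, int]


class Transition(NamedTuple):
    state: State
    action: int
    next_state: State
    goal: State
    reward: float


def reward_fn(next_state: State, goal: State) -> float:
    """Assumed sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if next_state == goal else -1.0


def her_relabel(episode: List[Transition], k: int = 4,
                rng: Optional[random.Random] = None) -> List[Transition]:
    """Keep every original transition and add k copies with hindsight goals."""
    rng = rng or random.Random(0)
    relabeled: List[Transition] = []
    for t, transition in enumerate(episode):
        relabeled.append(transition)
        future = episode[t:]  # states achieved from this step onward
        for _ in range(k):
            new_goal = rng.choice(future).next_state
            relabeled.append(transition._replace(
                goal=new_goal, reward=reward_fn(transition.next_state, new_goal)))
    return relabeled


if __name__ == "__main__":
    failed = [Transition((0, i), 1, (0, i + 1), (9, 9), -1.0) for i in range(3)]
    for tr in her_relabel(failed, k=2):
        print(tr)
```

Every transition in the failed episode carries reward -1 for the original goal, yet the relabeled copies contain successes, which is how relabeling turns failure into usable off-policy supervision.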

“Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning” (Åström et al., 6 Nov 2025) pushes this logic to a reward-free setting by wrapping ordinary RL environments in a goal-conditioning interface and letting the agent select its own goals in an environment-agnostic way. The method uses three example goal-selection strategies: uniform random sampling over the observation space, novelty-based selection with weights

$\phi$ 1

and intermediate success-rate selection with

$\phi$ 2

The supplementary material specifies that goals are successful when a normalized distance between observation and goal is below $\phi$ 3, and that 10% of the time a random goal is selected from the observation space. The method is presented as independent of the underlying off-policy learner; the experiments use Stable-Baselines3 with Deep Q-Networks and also report compatibility with Soft Actor-Critic. Hindsight Experience Replay is used for relabeling. Empirically, the approach achieves training times and performance comparable to an externally reward-guided baseline in the studied environments, reaches the optimal policy faster than the reward-aware DQN baseline on Cliff Walking, and yields a pattern in which average goal success gradually improves and stabilizes even though individual goals can fluctuate substantially during training (Åström et al., 6 Nov 2025).
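The general shape of such goal-selection strategies can be sketched as weighted sampling over candidate goals with an occasional uniformly random goal. The weightings below (inverse visit count for novelty, and $sr(1-sr)$ to favor intermediate success rates) are common illustrative choices and are assumptions, not the formulas used by Åström et al.; only the 10% random-goal rate is taken from the text above.

```python
import random
from typing import Dict, Hashable, List


def novelty_weights(visit_counts: Dict[Hashable, int]) -> Dict[Hashable, float]:
    """Assumed novelty weighting: rarely visited goals receive larger weight."""
    return {goal: 1.0 / (1 + count) for goal, count in visit_counts.items()}


def intermediate_weights(success_rates: Dict[Hashable, float],
                         floor: float = 1e-3) -> Dict[Hashable, float]:
    """Assumed intermediate-difficulty weighting: peaks at a 50% success rate."""
    return {goal: max(sr * (1.0 - sr), floor) for goal, sr in success_rates.items()}


def select_goal(weights: Dict[Hashable, float], all_goals: List[Hashable],
                rng: random.Random, random_goal_prob: float = 0.1) -> Hashable:
    """Sample a goal by weight, falling back to a uniform goal some of the time."""
    if not weights or rng.random() < random_goal_prob:
        return rng.choice(all_goals)
    goals = list(weights)
    return rng.choices(goals, weights=[weights[g] for g in goals], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    rates = {"near": 0.95, "medium": 0.5, "far": 0.02}
    picks = [select_goal(intermediate_weights(rates), list(rates), rng)
             for _ in range(1000)]
    print({g: picks.count(g) for g in rates})
```

The floor in `intermediate_weights` keeps mastered and never-solved goals from dropping out entirely, so their success rates can still be re-estimated.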

Adversarial settings expose a complementary curriculum problem: which relabeled goals remain informative when another agent actively disrupts progress. “Goal-Conditioned Reinforcement Learning in the Presence of an Adversary” (Purves et al., 2022) proposes EHER, which prioritizes hindsight goals with high temporal-difference error, and CHER, which prioritizes them using a novelty signal derived from a fixed random network and a trainable predictor. In the reported experiments, a constant mix-in probability $\phi$ 4 works best, EHER improves over HER, CHER does not outperform EHER, and the benefit of EHER becomes more pronounced in larger or harder state spaces. The same paper introduces IGOAL, a framework in which the adversary is periodically copied from the agent so that adversarial difficulty increases with competence; the strongest results come from combining IGOAL with EHER (Purves et al., 2022).

A more radical reuse strategy appears in “Goal-Conditioned Agents that Learn Everything All at Once” (Matthews et al., 22 May 2026). Instead of relabeling each transition separately for each goal, LEO moves the goal dimension from the input to the output so that one forward pass returns values or actions for every goal at once. In the discrete formulation, the conventional $\phi$ 5 is replaced by $\phi$ 6, enabling vectorized all-goals updates. On CraftaxGC with 512 goals, the paper reports only a 34% slowdown relative to single-goal learning while naive all-goals relabeling is $\phi$ 7 slower than LEO; the abstract summarizes this as a >250x speed-up (Matthews et al., 22 May 2026). This suggests that the distinction between relabeling and direct all-goals prediction is increasingly one of systems design and computational scaling rather than learning principle alone.
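The idea of placing the goal dimension on the output can be illustrated with a tabular toy in which a single lookup `q[s]` yields values for every action and every goal, and one transition updates all goals in a single pass. This is a sketch of the general all-goals principle under simple assumptions (discrete states, goals that are states, reward 1 on reaching the goal); it is not the LEO architecture or its training procedure.

```python
from typing import List, Sequence

QTable = List[List[List[float]]]  # indexed as q[state][action][goal]


def make_q(n_states: int, n_actions: int, n_goals: int) -> QTable:
    """Allocate a zero-initialized table with the goal axis on the output side."""
    return [[[0.0] * n_goals for _ in range(n_actions)] for _ in range(n_states)]


def all_goals_update(q: QTable, s: int, a: int, s_next: int,
                     goal_states: Sequence[int],
                     alpha: float = 0.5, gamma: float = 0.9) -> None:
    """Apply one Q-learning update to every goal using a single transition."""
    n_actions = len(q[s_next])
    for g, goal_state in enumerate(goal_states):
        reached = s_next == goal_state
        reward = 1.0 if reached else 0.0
        # Reaching the goal is treated as terminal for that goal only.
        bootstrap = 0.0 if reached else max(q[s_next][b][g] for b in range(n_actions))
        q[s][a][g] += alpha * (reward + gamma * bootstrap - q[s][a][g])


def greedy_action(q: QTable, s: int, g: int) -> int:
    """Pick the best action for goal index g from the shared table."""
    return max(range(len(q[s])), key=lambda a: q[s][a][g])


if __name__ == "__main__":
    q = make_q(n_states=3, n_actions=2, n_goals=3)
    for s, a, s_next in [(0, 1, 1), (1, 1, 2), (0, 1, 1)]:
        all_goals_update(q, s, a, s_next, goal_states=[0, 1, 2])
    print(q[0][1], greedy_action(q, 0, 2))
```

In a neural version the inner loop over goals would become a single batched tensor operation, which is where the computational savings over per-goal relabeling would come from.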

4. Planning, subgoals, hierarchy, and long-horizon control

Long-horizon tasks often exceed what a flat goal-conditioned policy can learn from sparse reward alone, so many systems interpose planning or hierarchical structure between the commanded goal and low-level action selection. Survey work classifies these methods under sub-goal selection or generation and model-based planning, emphasizing graph search over replay-buffer states, imagined waypoints, and subgoal discovery from trajectories or demonstrations (Liu et al., 2022).

“Imitating Graph-Based Planning with Goal-Conditioned Policies” (Kim et al., 2023) proposes PIG, a generic add-on for graph-based GCRL methods. A weighted directed graph is built over visited states, a planner finds a shortest path of subgoals, and the actor is trained not only with the RL objective but also with a self-imitation loss that distills the subgoal-conditioned behavior into the target-goal-conditioned policy:

$\phi$ 8

The method also introduces stochastic subgoal skipping with jump probability

$\phi$ 9

On the Large U-shaped AntMaze, MSS + PIG reaches 57.41% success at $g=(z_g,R_g)$ 0 environment steps, compared with 19.08% for MSS (Kim et al., 2023).
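The planning step shared by graph-based methods of this kind, finding a shortest path of subgoals through a weighted directed graph over visited states, can be sketched with Dijkstra's algorithm. The graph encoding and the function name `shortest_subgoal_path` are illustrative assumptions; the self-imitation loss and subgoal skipping of PIG are not reproduced here.

```python
import heapq
from typing import Dict, Hashable, List, Optional

Graph = Dict[Hashable, Dict[Hashable, float]]  # node -> {neighbor: edge weight}


def shortest_subgoal_path(graph: Graph, start: Hashable,
                          goal: Hashable) -> Optional[List[Hashable]]:
    """Return the minimum-cost node sequence from start to goal, or None."""
    dist = {start: 0.0}
    prev: Dict[Hashable, Hashable] = {}
    heap = [(0.0, 0, start)]  # (distance, tie-breaker, node)
    counter = 1
    visited = set()
    while heap:
        d, _, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, counter, neighbor))
                counter += 1
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    graph = {"s": {"a": 1.0, "b": 4.0}, "a": {"b": 1.0, "g": 5.0}, "b": {"g": 1.0}}
    print(shortest_subgoal_path(graph, "s", "g"))
```

The intermediate nodes of the returned path serve as subgoals for the low-level policy; the tie-breaking counter only keeps the heap from comparing node objects directly.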

REPlan addresses the same long-horizon problem in vision-based manipulation by coupling planning with a disentangled representation and a learned reachability discriminator. “Goal-Conditioned Reinforcement Learning with Disentanglement-based Reachability Planning” (Qian et al., 2023) learns a compact latent space that separates robot, object, and background, trains a reachability classifier from replay trajectories, and uses CEM to plan intermediate subgoal sequences in latent space. TD3 and HER form the low-level off-policy backbone. With all components present, the reported success rates are 90.0% on UR-Pusher-1, 93.3% on UR-Pusher-2, 83.3% on UR-Pick-Place, and 76.67% on real-world UR5 trials (Qian et al., 2023).

Safe long-horizon navigation introduces an additional planning axis: costs. “Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning” (Feng et al., 25 Feb 2025) trains an unconstrained goal-conditioned agent to estimate distance and cost, fine-tunes it with a Lagrangian actor-critic to obtain a safe goal-conditioned policy $g=(z_g,R_g)$ 1, and builds a replay-buffer graph $g=(z_g,R_g)$ 2 whose edges are pruned if predicted distance exceeds $g=(z_g,R_g)$ 3 or predicted cost exceeds $g=(z_g,R_g)$ 4. In the multi-agent setting, the same graph and low-level policy are reused inside Conflict-Based Search, so the method does not require retraining for each number of agents (Feng et al., 25 Feb 2025).
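The edge-pruning rule described above can be sketched as follows. The callables `predicted_distance` and `predicted_cost` stand in for the learned estimators, and the threshold names `max_distance` and `max_cost` are hypothetical; the toy hazard in the usage example is likewise an assumption for illustration.

```python
from typing import Callable, Dict, Hashable, Sequence

Graph = Dict[Hashable, Dict[Hashable, float]]
Estimator = Callable[[Hashable, Hashable], float]


def build_safe_graph(nodes: Sequence[Hashable],
                     predicted_distance: Estimator,
                     predicted_cost: Estimator,
                     max_distance: float,
                     max_cost: float) -> Graph:
    """Keep a directed edge only if it is both short enough and safe enough."""
    graph: Graph = {node: {} for node in nodes}
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            d = predicted_distance(u, v)
            c = predicted_cost(u, v)
            if d <= max_distance and c <= max_cost:
                graph[u][v] = d  # surviving edges are weighted by distance
    return graph


if __name__ == "__main__":
    # Toy 1-D world with an assumed hazard located between positions 1 and 2.
    def toy_distance(u, v):
        return abs(u - v)

    def toy_cost(u, v):
        return 1.0 if min(u, v) < 1.5 < max(u, v) else 0.0

    print(build_safe_graph([0, 1, 2, 3], toy_distance, toy_cost,
                           max_distance=1, max_cost=0.5))
```

Because safety is enforced when the graph is built, any standard shortest-path or multi-agent search run on the pruned graph inherits the constraint without further changes.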

Hierarchical and bidirectional variants further broaden the architectural repertoire. “A Fully Controllable Agent in the Path Planning using Goal-Conditioned Reinforcement Learning” (Lee, 2022) uses bi-directional memory editing to turn a single forward trajectory into forward- and reverse-edited subgoal data, trains a dedicated sub-goals network separately from the main policy network, and adds reward shaping

$g=(z_g,R_g)$ 5

to bias the agent toward shorter paths. In the reported 20-scenario comparison, the average steps drop from 411.45 without reward shaping to 338.25 with reward shaping, a reduction of about 17.8%; equivalently, the agent without reward shaping takes about 21.6% more steps (Lee, 2022).

5. Extensions beyond single-agent Markovian goal reaching

Goal-conditioned agents are not restricted to “reach this target state” formulations. One direction generalizes from control to inference. “Multi-Agent Goal Recognition with Team- and Goal-Conditioned Reinforcement Learning and Factorized Branch-and-Bound” (Thomas et al., 24 Jun 2026) uses a single shared policy conditioned on team identity and goal as a scoring model inside branch-and-bound search over a combinatorial hypothesis space. In the Blocksworld benchmark, the exhaustive count is

$g=(z_g,R_g)$ 6

complete hypotheses per observed step. The shared policy achieves 98.44% episode success and 99.22% per-team success; every evaluated variant returns the same top-1 hypothesis at every observed step as exhaustive search, while Full MAGR-BB considers 1 partition instead of 6, emits only 10 complete hypotheses instead of 7.15 million at the final observed step, and reports runtime speedups of 2.91× at lower noise and 2.43× at noise $g=(z_g,R_g)$ 7 (Thomas et al., 24 Jun 2026). In this setting, goal conditioning functions as a reusable likelihood model rather than a controller.
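Using a goal-conditioned policy as a likelihood model can be sketched for the single-agent case: each candidate goal is scored by the log-probability that the policy assigns to the observed actions under that goal. The toy policy and function names are assumptions, and the team structure and factorized branch-and-bound search of the paper are deliberately left out.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Policy = Callable[[int, int], Dict[int, float]]  # (state, goal) -> action probabilities


def toy_policy(state: int, goal: int) -> Dict[int, float]:
    """Hypothetical 1-D policy that usually steps toward the goal."""
    if goal > state:
        return {+1: 0.9, -1: 0.1}
    if goal < state:
        return {+1: 0.1, -1: 0.9}
    return {+1: 0.5, -1: 0.5}


def log_likelihood(policy: Policy, observations: Sequence[Tuple[int, int]],
                   goal: int) -> float:
    """Sum of log pi(a | s, g) over the observed (state, action) pairs."""
    return sum(math.log(policy(state, goal)[action]) for state, action in observations)


def rank_goals(policy: Policy, observations: Sequence[Tuple[int, int]],
               goals: Sequence[int]) -> List[int]:
    """Order candidate goals from most to least likely given the observations."""
    return sorted(goals, key=lambda g: log_likelihood(policy, observations, g),
                  reverse=True)


if __name__ == "__main__":
    observed = [(0, +1), (1, +1), (2, +1)]  # the agent keeps moving right
    print(rank_goals(toy_policy, observed, goals=[-3, 1, 3]))
```

Exhaustive ranking like this scales with the number of hypotheses, which is the cost that a pruning search over the hypothesis space is meant to avoid.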

Another direction generalizes the goal itself from Markovian targets to temporal specifications. “Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning” (Yalcinkaya et al., 2024) represents goals as compositions of deterministic finite automata and conditions policies on the resulting cDFA embeddings. The paper defines a DFA-conditioned policy $g=(z_g,R_g)$ 8, interprets each path through a DFA as a sequence of reach-avoid tasks, and pretrains a GATv2 encoder on reach-avoid-derived DFAs. The reported results include approximately 90% accuracy across many unseen task classes in dummy evaluation, near 100% in most cases, generalization to 10 DFAs in a composition even though training compositions were truncated to at most 5, and frozen-encoder training that is about 20% faster in wall-clock time than training through the GNN during downstream RL (Yalcinkaya et al., 2024).
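A temporal goal of this kind can be sketched as a small automaton whose acceptance plays the role of the goal-achievement function. The `DFA` type, the convention that missing transitions leave the state unchanged, and the treatment of a composition as a conjunction in which every automaton must accept are assumptions for illustration; the graph encoder and embedding used for conditioning are not shown.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Sequence, Tuple


class DFA(NamedTuple):
    start: int
    accepting: FrozenSet[int]
    transitions: Dict[Tuple[int, str], int]  # missing entries mean "stay in place"


def run(dfa: DFA, trace: Iterable[str]) -> int:
    """Return the automaton state after reading the whole trace."""
    state = dfa.start
    for symbol in trace:
        state = dfa.transitions.get((state, symbol), state)
    return state


def accepts(dfa: DFA, trace: Iterable[str]) -> bool:
    """Goal-achievement check: the trace must end in an accepting state."""
    return run(dfa, trace) in dfa.accepting


def reach_avoid(reach: str, avoid: str) -> DFA:
    """State 0: in progress, 1: success (absorbing), 2: failure (absorbing)."""
    return DFA(0, frozenset({1}), {(0, reach): 1, (0, avoid): 2})


def conjunction_satisfied(dfas: Sequence[DFA], trace: Iterable[str]) -> bool:
    """Assumed composition semantics: every automaton must accept the trace."""
    symbols = list(trace)
    return all(accepts(dfa, symbols) for dfa in dfas)


if __name__ == "__main__":
    task = [reach_avoid("red", "lava"), reach_avoid("blue", "lava")]
    print(conjunction_satisfied(task, ["empty", "red", "blue"]))
    print(conjunction_satisfied(task, ["red", "lava", "blue"]))
```

Unlike a target state, the automaton remembers what has already happened along the trajectory, which is what makes the goal non-Markovian in the environment state alone.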

Goal-conditioned agents have also been extended to analogy-making. “Bisimulation Makes Analogies in Goal-Conditioned Reinforcement Learning” (Hansen-Estruch et al., 2022) defines a goal-conditioned bisimulation metric over state-goal pairs rather than only over states, learns a paired-state embedding $g=(z_g,R_g)$ 9 and a single-state embedding $\pi(a \mid s)$ 0, and uses latent arithmetic so that an agent can infer a new goal from an analogous task rather than from an exact goal image. In the reported Button-and-Drawer analogy setting, the method achieves about 40% average success, outperforming the baselines in that evaluation (Hansen-Estruch et al., 2022).

These developments broaden the operational meaning of a goal-conditioned agent. A plausible implication is that the “goal” need not be a terminal observation or coordinate at all: it can be a team hypothesis, a temporal automaton, an analogy class, a desired return, or a structured composition of constraints, provided there is a conditioning interface and a corresponding achievement or scoring function.

6. Generalization, autonomy, and recurrent limitations

Generalization across environments is a persistent concern because many goal-conditioned agents are trained from rich observations that vary spuriously across deployment settings. “Learning Domain Invariant Representations in Goal-conditioned Block MDPs” (Han et al., 2021) formalizes this with Goal-conditioned Block MDPs, where observations depend on both a domain-invariant state and an environment-specific nuisance factor. The paper introduces perfect alignment, defined by

$\pi(a \mid s)$ 1

and derives a generalization bound in which better alignment reduces transfer error. Its practical method, PA-SkewFit, combines aligned sampling across environments with reconstruction, KL, MMD, and separation losses; empirically, it improves by 50% over baselines on unseen test environments and reports that only about 15% aligned data was sufficient for strong alignment in practice (Han et al., 2021).

A second recurrent theme is autonomy in the sense of self-generated goals and growing repertoires of skills. Developmental RL literature argues that autotelic agents must learn not only to solve goals but also to represent, generate, and prioritize them (Colas et al., 2020). The open-ended learning literature further distinguishes first-order open-ended GCRL, where all goals are sampled from a fixed goal space, from second-order open-ended GCRL, where the goal representation spaces themselves are generated by an open-ended process (Sigaud et al., 2023). Reward-free autonomous learning in ordinary RL environments gives this idea a more concrete operational form: a wrapped agent can later be instructed to seek any observations made in the environment, effectively yielding a reusable repertoire over observation space rather than a single task-specific policy (Åström et al., 6 Nov 2025).

At the same time, the literature repeatedly identifies limitations. Survey work emphasizes sparse rewards, exploration, credit assignment, generalization to unseen goals, long-horizon planning, and representation learning as central difficulties (Liu et al., 2022). Environment-agnostic self-generated goals can be unstable because the agent does not value some goals more than others; the reported consequence is average competence growth alongside brittle per-goal mastery and potential waste on impossible or poorly aligned goals (Åström et al., 6 Nov 2025). Visual methods based on masks depend on detector quality; in one real-world experiment, Grounding DINO produced false positives and the agent did not learn reliably (Shahriar et al., 6 Oct 2025). Pure LEO is limited to finite, manageable goal sets and can underperform on easy goals because of a late-fusion bottleneck (Matthews et al., 22 May 2026). Open-ended formulations, finally, are explicit that novelty over time does not by itself guarantee competence growth, creative goal discovery, or immunity to forgetting (Sigaud et al., 2023).

Taken together, the literature defines goal-conditioned agents less by a single algorithm than by a recurring architecture of ideas: a goal representation, a goal-achievement criterion, a conditioning mechanism for behavior or scoring, and a strategy for selecting, generating, relabeling, or composing goals. Contemporary work extends this architecture from externally specified target states to language-conditioned goal generation, self-generated reward-free practice, graph-planned and safety-constrained subgoals, automata-defined temporal objectives, all-goals learning, adversarial robustness, domain-invariant transfer, and open-ended skill repertoires (Liu et al., 2022, Colas et al., 2020, Åström et al., 6 Nov 2025).