Chain-of-Goals Hierarchical Policy (CoGHP)

Updated 10 February 2026

Chain-of-Goals Hierarchical Policy (CoGHP) is a paradigm that decomposes long-range objectives into chains of intermediate, semantically meaningful sub-goals.
It employs hierarchical structures in which a high-level policy sets sub-goals and a low-level policy executes actions, with formulations ranging from hierarchical MDPs to autoregressive sequence models and dynamic programming.
Empirical studies show that CoGHP improves sample efficiency and success rates in tasks such as navigation and robotic manipulation compared to flat RL approaches.

A Chain-of-Goals Hierarchical Policy (CoGHP) is a reinforcement learning and planning paradigm that decomposes long-horizon tasks into sequences (chains) of intermediate, semantically meaningful sub-goals, each pursued by lower-level policies or skills. This decomposition supports both sample-efficient exploration and interpretable decision-making. The approach has seen diverse algorithmic realizations, spanning hierarchical deep RL, linearly-solvable MDPs, autoregressive sequence models, and classical campaign planning, with a consistent emphasis on chaining intermediate objectives to bridge long-term dependencies and handle sparse rewards (Ye et al., 2020, Choi et al., 3 Feb 2026, Ringstrom et al., 2020, Lai et al., 2020).

1. Foundational Principles of Chain-of-Goals Hierarchical Policies

At the core of CoGHP frameworks is the explicit factorization of a complex, long-range objective into a sequence or chain of manageable sub-goals. These sub-goals are chosen to be either directly reachable (e.g., "approach the couch currently visible") or latent, embedding-based waypoints in state or configuration space. The higher-level policy, or controller, determines a sequence of such sub-goals based on current state observations and the ultimate objective; the lower-level policy (or skill) executes atomic actions or short-horizon policies to accomplish each sub-goal before the next is issued. This two-level (or multi-level) structure is formalized as a Markov Decision Process (MDP) or partially observable variant, where the high-level agent operates at a slower timescale, setting sub-goals, and the low-level agent delivers primitive control conditioned on those sub-goals (Ye et al., 2020, Zheng et al., 2017). The chain-of-goals principle is realized across diverse domains, including navigation, robotic manipulation, task planning with logical constraints, lifelong skill composition, and multi-goal imitation.

2. Mathematical and Algorithmic Formulations

Hierarchical Markov Decision Processes

A typical CoGHP formalization involves a high-level policy $\pi_h(\cdot \mid s, g)$, which outputs a sub-goal $\mathrm{sg}_t$ conditioned on the current observation $s$ and final goal $g$; the low-level policy $\pi_\ell(a \mid s, g, \mathrm{sg}_t)$ then selects primitive actions $a$ to reach $\mathrm{sg}_t$. Sub-goals may be discrete (e.g., visible semantic objects), continuous (subspace embeddings), or abstract (logical entities) (Ye et al., 2020, Sukhbaatar et al., 2018, Zheng et al., 2017).
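The two-level control loop described above can be illustrated with a minimal sketch. Everything in it is a hypothetical toy and not taken from the cited works: the state is an integer position on a line, `high_level_policy` is a hand-written rule that proposes a sub-goal at most `max_jump` cells away, and `low_level_policy` moves one cell per step toward the current sub-goal. Learned policies would replace both functions.

```python
from typing import List, Tuple

State = int
Goal = int


def high_level_policy(s: State, g: Goal, max_jump: int = 3) -> Goal:
    """Toy pi_h: propose a sub-goal at most `max_jump` cells from s, toward g."""
    delta = g - s
    step = max(-max_jump, min(max_jump, delta))
    return s + step


def low_level_policy(s: State, g: Goal, sg: Goal) -> int:
    """Toy pi_l: return a primitive action in {-1, 0, +1} that moves s toward sg."""
    return (sg > s) - (sg < s)


def rollout(s: State, g: Goal, max_steps: int = 100) -> Tuple[List[Goal], List[State]]:
    """Alternate between sub-goal selection and primitive control until g is reached."""
    subgoals: List[Goal] = []
    states: List[State] = [s]
    steps = 0
    while s != g and steps < max_steps:
        sg = high_level_policy(s, g)          # slow timescale: pick the next sub-goal
        subgoals.append(sg)
        while s != sg and steps < max_steps:  # fast timescale: primitive actions
            s += low_level_policy(s, g, sg)
            states.append(s)
            steps += 1
    return subgoals, states
```

Calling `rollout(0, 7)` produces the sub-goal chain 3, 6, 7 while the low-level policy visits every cell from 0 to 7, which makes the difference in timescales between the two levels visible.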

Autoregressive Sequence Modeling

Recent work reformulates the high-low factorization as a pure autoregressive sequence model. Given $(s, g)$, the policy $p_\theta(z_{1:H}, a \mid s, g)$ emits a chain $z_{1:H}$ of $H$ latent subgoals (each $z_i$ conditioned on $z_{i+1:H}$) and outputs the primitive action $a$. The backbone is often an MLP-Mixer architecture enforcing causal, chain-structured dependencies among state, goal, subgoals, and action tokens (Choi et al., 3 Feb 2026).
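One way to write the factorization implied by this description is via the chain rule, assuming the subgoals are generated from $z_H$ down to $z_1$ and that the action is conditioned on the full chain. The ordering and the conditioning set of the action head are assumptions made for illustration, not details confirmed by the source:

$$
p_\theta(z_{1:H}, a \mid s, g) \;=\; \Bigg[\prod_{i=1}^{H} p_\theta\big(z_i \mid z_{i+1:H}, s, g\big)\Bigg]\; p_\theta\big(a \mid z_{1:H}, s, g\big),
$$

where $z_{H+1:H}$ denotes the empty sequence, so the first emitted subgoal $z_H$ depends only on $(s, g)$.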

Linearly-Solvable Goal-Kernel Dynamic Programming

In the LS-GKDP framework, long-horizon, possibly non-Markovian logical tasks with ordering constraints are decomposed into a set of goal-conditioned options. Each option's first-exit dynamics are encapsulated in a "goal kernel" (jump operator), and the global optimal meta-policy over the goal-chain is recovered as the principal eigenvector of a large, sparse kernel matrix. This enables globally optimal, compositional action selection with zero-shot transfer properties (Ringstrom et al., 2020).
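The idea of recovering a solution as a principal eigenvector can be illustrated with generic power iteration. The sketch below is not the LS-GKDP solver: the matrix `K` is a made-up dense toy rather than a goal kernel, and the function name and parameters are hypothetical. A sparse implementation would be needed for the large kernel matrices mentioned above.

```python
from typing import List


def power_iteration(K: List[List[float]], iters: int = 200, tol: float = 1e-12) -> List[float]:
    """Approximate the principal eigenvector of a square non-negative matrix K.

    The result is normalized so that its absolute entries sum to 1.
    """
    n = len(K)
    v = [1.0 / n] * n
    for _ in range(iters):
        # Matrix-vector product w = K v.
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in w)
        if norm == 0.0:
            raise ValueError("K maps the current iterate to the zero vector")
        w = [x / norm for x in w]
        # Stop once successive iterates agree to within tol.
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v
```

For the toy matrix `[[1.0, 1.0], [1.0, 0.0]]`, the ratio of the two returned entries approaches the golden ratio, since the principal eigenvector of that matrix is proportional to $(\varphi, 1)$.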

Skill Delegation and Lazy Expansion

Another CoGHP flavor employs a collection of policies, each specialized for realizing a particular effect (a sub-goal). The high-level planner dynamically delegates the fulfillment of each sub-goal to the corresponding skill policy, which, in turn, recursively decomposes any unmet preconditions, yielding a plan tree that is only expanded as needed based on immediate state (Lai et al., 2020).
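Lazy expansion can be sketched as a recursive delegation routine in which a sub-goal is expanded only if it does not already hold in the current state. The `Skill` class, the `achieve` function, and the example predicates are hypothetical, and the sketch assumes that skill effects are deterministic and never undo earlier effects. The cited work does not necessarily make these assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Skill:
    effect: str               # the sub-goal this skill realizes
    preconditions: List[str]  # sub-goals that must hold before it can run


def achieve(goal: str, state: Set[str], skills: Dict[str, Skill],
            trace: List[str], depth: int = 0, max_depth: int = 20) -> None:
    """Delegate `goal` to its skill, expanding unmet preconditions only when needed."""
    if goal in state:
        return  # already satisfied, so this branch of the plan tree is never expanded
    if depth > max_depth:
        raise RecursionError(f"precondition chain for {goal!r} is too deep or cyclic")
    skill = skills[goal]
    for pre in skill.preconditions:
        achieve(pre, state, skills, trace, depth + 1, max_depth)
    state.add(skill.effect)     # assumed deterministic effect
    trace.append(skill.effect)  # record the order in which skills were executed


# Hypothetical toy domain.
SKILLS = {
    "has_key": Skill("has_key", []),
    "door_open": Skill("door_open", ["has_key"]),
    "in_room": Skill("in_room", ["door_open"]),
}
```

Starting from an empty state, `achieve("in_room", ...)` executes the skills for `has_key`, `door_open`, and `in_room` in that order. If the state already contains `has_key`, the corresponding subtree is skipped, which is the sense in which the plan is expanded lazily.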

Monte Carlo Tree Search over Goal-Conditioned Skills

CoGHP frameworks may integrate tree search (MCTS) in the space of high-level actions, where each tree node represents a state and outgoing edges represent learned or composed goal-conditioned policies (HLAs). Planning explores chains of reusable sub-policies, maximizing coverage and sample efficiency (Rens, 3 Jan 2025).
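The selection step of such a search can be sketched with the standard UCB1 rule used in UCT-style MCTS. Whether the cited work uses exactly this rule is an assumption. The function name, the statistics layout, and the skill names in the usage note are hypothetical.

```python
import math
from typing import Dict, Tuple


def ucb1_select(stats: Dict[str, Tuple[int, float]], c: float = 1.4) -> str:
    """Pick the outgoing high-level action (skill) with the highest UCB1 score.

    stats maps each skill name to (visit_count, summed_return) at the current node.
    """
    total_visits = sum(n for n, _ in stats.values())
    best_skill, best_score = None, -math.inf
    for skill, (n, value_sum) in stats.items():
        if n == 0:
            return skill  # always try an unvisited skill first
        mean_value = value_sum / n
        exploration = c * math.sqrt(math.log(total_visits) / n)
        score = mean_value + exploration
        if score > best_score:
            best_skill, best_score = skill, score
    if best_skill is None:
        raise ValueError("stats must contain at least one skill")
    return best_skill
```

With equal visit counts the rule reduces to picking the skill with the higher mean return. A rarely tried skill receives a large exploration bonus and can be selected even when its current mean is lower.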

3. Reward Structures, Training Objectives, and Hierarchical Optimization

CoGHPs typically leverage shaped or intrinsic reward signals at the sub-goal level. The low-level controller receives dense (often binary) intrinsic rewards for accomplishing each sub-goal, sharply reducing the sparsity of feedback relative to end-to-end task rewards (Ye et al., 2020, Ye et al., 2021). The high-level policy is optimized to maximize expected extrinsic return, i.e., successful achievement of the final goal, through the effective sequencing of sub-goals.

The most prevalent joint training approach uses option-critic or DQN-style hierarchical value decomposition:

Low-level loss: minimize the DQN Bellman error with respect to intrinsic sub-goal rewards.
High-level loss: minimize the DQN Bellman error with respect to extrinsic rewards, propagating through the low-level Q-functions.
Option termination: trained with a policy-gradient-like update that maximizes the difference between the value of continuing with the current sub-goal and the value of switching (Ye et al., 2020).
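The two Bellman targets in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: the intrinsic reward is binary, the high-level target uses a generic multi-step (SMDP-style) backup over the steps during which a sub-goal was active, and the coupling through the low-level Q-functions and the termination update are omitted. All function names are hypothetical.

```python
from typing import Hashable, Sequence


def intrinsic_reward(state: Hashable, subgoal: Hashable) -> float:
    """Binary intrinsic reward: 1 when the sub-goal is reached, else 0 (assumed form)."""
    return 1.0 if state == subgoal else 0.0


def low_level_target(r_int: float, next_q: Sequence[float],
                     done: bool, gamma: float = 0.99) -> float:
    """One-step DQN target for the low-level Q-function, using the intrinsic reward."""
    return r_int + (0.0 if done else gamma * max(next_q))


def high_level_target(ext_rewards: Sequence[float], next_q: Sequence[float],
                      done: bool, gamma: float = 0.99) -> float:
    """Multi-step target for the high-level Q-function.

    ext_rewards holds the extrinsic rewards collected while one sub-goal was active;
    next_q holds the high-level Q-values over sub-goals at the state where it ended.
    """
    k = len(ext_rewards)
    discounted_return = sum(gamma ** i * r for i, r in enumerate(ext_rewards))
    return discounted_return + (0.0 if done else gamma ** k * max(next_q))
```

Each loss is then the squared difference between the corresponding Q-value and its target, averaged over a replay batch.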

Advanced models exploit Advantage-Weighted Regression for subgoal/action heads, shared value functions (e.g., goal-conditioned IQL), and Bellman operators lifted to augmented (state, subgoal, goal) spaces, often with HER for data augmentation and stability (Choi et al., 3 Feb 2026, Serris et al., 27 Mar 2025).
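Advantage-weighted regression can be sketched as a weighted log-likelihood loss, where each sample is weighted by the exponentiated advantage. The temperature convention (dividing by `temperature` rather than multiplying by an inverse temperature), the clipping value `max_weight`, and the function names are assumptions chosen for illustration, not settings from the cited works.

```python
import math
from typing import List, Sequence


def awr_weights(q_values: Sequence[float], v_values: Sequence[float],
                temperature: float = 1.0, max_weight: float = 100.0) -> List[float]:
    """Return min(exp((Q - V) / temperature), max_weight) for each sample."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    log_cap = math.log(max_weight)
    weights = []
    for q, v in zip(q_values, v_values):
        advantage = q - v
        # Clip in log space so that large advantages cannot overflow math.exp.
        weights.append(math.exp(min(advantage / temperature, log_cap)))
    return weights


def weighted_nll(log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Advantage-weighted negative log-likelihood, averaged over the batch."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

The same weighting can be applied to a subgoal head and an action head, with `log_probs` taken from whichever head is being trained.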

4. Empirical Performance and Comparative Analysis

Experimental validation across domains has shown CoGHP outperforms flat and single-subgoal hierarchical RL baselines, particularly on long-horizon and sparse-reward tasks:

In House3D object search, CoGHP (HIEM) achieves a success rate (SR) of 1.00 and SPL ≈ 0.72, outperforming DQN (SR=0.47, SPL=0.20), h-DQN (SR=0.74, SPL=0.17), and Option-Critic (SR=0.14) (Ye et al., 2020)
In OGBench navigation and manipulation, autoregressive CoGHP achieves, e.g., 79±8% SR on pointmaze-giant-navigate-v0 (vs 46±9% for HIQL), and 54±5% on cube-double-noisy-v0 (vs 2±1% for HIQL) (Choi et al., 3 Feb 2026)
Conditioning policies on sequences of goals (vs a single goal) yields faster convergence, higher stability, and robust success in navigation and pole-balancing (near-100% SR in Dubins and PointMaze for $M_{2G}$, compared to <50% for “final-only” or myopic alternatives) (Serris et al., 27 Mar 2025)
In goal-directed manipulation domains, LS-GKDP achieves zero-shot task transfer and tractability with super-exponential task combinatorics, unattainable by flat value iteration (Ringstrom et al., 2020)
Policy delegation maintains near-optimal plan lengths and 100% success, even under high noise, and vastly outpaces classic planning or flat RL in large task spaces (Lai et al., 2020)
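The SR and SPL figures quoted above can be computed as in the sketch below. It assumes the commonly used definition of SPL in embodied navigation, namely success weighted by the ratio of shortest-path length to the maximum of the shortest and the taken path length, averaged over episodes. Whether each cited work uses exactly this definition is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the final goal was reached."""
    return sum(1 for s in successes if s) / len(successes)


def spl(successes: Sequence[bool], shortest_lengths: Sequence[float],
        taken_lengths: Sequence[float]) -> float:
    """Success weighted by path length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, taken_lengths):
        if not success:
            continue  # failed episodes contribute zero
        longest = max(taken, shortest)
        # A zero-length episode that succeeds counts as perfectly efficient.
        total += 1.0 if longest == 0 else shortest / longest
    return total / len(successes)
```

An agent can therefore have a perfect SR and still score well below 1 on SPL if its successful paths are much longer than the shortest ones.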

5. Interpretability, Transfer, and Reusability

A hallmark of CoGHP approaches is interpretability: the explicit chain $\mathrm{sg}_0 \to \mathrm{sg}_1 \to \cdots$ defines a semantically meaningful route from start to final goal, enabling readout and visual analysis of agent behavior (Ye et al., 2020, Ye et al., 2021). Option-based and hierarchical policies organize knowledge so that high-level compositions (chains) can be transferred or remapped to new tasks, often achieving zero-shot generalization provided task-structural invariances hold (Ringstrom et al., 2020). Monte Carlo tree-based and skill-delegation schemes permit amortized reuse of sub-policies and accumulated value knowledge, leading to orders-of-magnitude improvements in planning and exploration efficiency (Rens, 3 Jan 2025, Lai et al., 2020).

6. Limitations and Theoretical Insights

Single-subgoal or myopic hierarchies often fail on tasks where an intermediate objective can be reached in a way that makes subsequent goals inaccessible. CoGHP addresses this problem by conditioning low-level policies on multiple goals, or on the entire chain (Serris et al., 27 Mar 2025). Combined with hierarchical credit assignment and end-to-end value shaping, explicit chain conditioning lets subgoals be selected and pursued in ways that keep later goals reachable along the chain.
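The failure mode of myopic sub-goal pursuit can be reproduced on a toy directed graph. In the sketch below, the intermediate objective is a region containing two nodes. The nearer one is a dead end, while the farther one still leads to the next goal. The graph, node names, and function names are all hypothetical, and breadth-first search stands in for a learned policy.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]

# Toy directed graph: A_near is closer to start but has no route to B.
TOY_GRAPH: Graph = {
    "start": ["A_near", "x"],
    "x": ["A_far"],
    "A_near": [],
    "A_far": ["B"],
    "B": [],
}


def hop_distances(graph: Graph, source: str) -> Dict[str, int]:
    """Breadth-first hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def myopic_choice(graph: Graph, start: str, region: List[str]) -> str:
    """Pick the nearest node of the sub-goal region, ignoring what comes next."""
    dist = hop_distances(graph, start)
    return min((n for n in region if n in dist), key=lambda n: dist[n])


def chain_choice(graph: Graph, start: str, region: List[str], next_goal: str) -> str:
    """Pick the region node that minimizes total hops to the region and then to next_goal."""
    from_start = hop_distances(graph, start)
    best, best_cost = None, float("inf")
    for node in region:
        onward = hop_distances(graph, node)
        if node in from_start and next_goal in onward:
            cost = from_start[node] + onward[next_goal]
            if cost < best_cost:
                best, best_cost = node, cost
    if best is None:
        raise ValueError("no node in the region keeps next_goal reachable")
    return best
```

On `TOY_GRAPH`, the myopic rule selects `A_near`, from which `B` cannot be reached, while the chain-aware rule selects `A_far` and keeps the remainder of the chain feasible.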

The LS-GKDP approach generalizes classic LMDP theory, enabling linearly-solvable policies for non-Markovian, temporally extended tasks with logical orderings; key properties include sparse computation and invariance under grounding changes (Ringstrom et al., 2020). Delegation-based CoGHP is provably robust and computationally efficient due to lazy plan expansion and local subproblem optimality (Lai et al., 2020).

7. Applications and Future Directions

Chain-of-Goals Hierarchical Policies have been deployed in vision-based robotic object search, long-horizon robot navigation and manipulation, grid-world planning with logical constraints, sports trajectory synthesis, and lifelong multi-goal skill learning (Ye et al., 2020, Choi et al., 3 Feb 2026, Zheng et al., 2017, Sukhbaatar et al., 2018).

Empirical evidence suggests that extending the chain depth—allowing policies to condition on two or more successive sub-goals—yields further improvements in sample efficiency and learning robustness (Serris et al., 27 Mar 2025). Open research directions include scaling such methods to larger state and goal spaces, efficient chain-length selection/adaptation, integration with large-scale sequence models (e.g., Transformers, Mixers), and real-world robotic deployment in unstructured and partially observable environments.