Graph-Guided Sub-Goal RL (G4RL)

Updated 21 November 2025
  • The paper introduces a framework that integrates graph-structured abstractions into hierarchical RL to generate, evaluate, and select effective sub-goals.
  • It employs discrete, relational, and neural graph construction methods to guide exploration and enhance sample efficiency in sparse-reward, long-horizon tasks.
  • Empirical results in robotics and grid-world tasks demonstrate robust planning, self-imitation, and zero-shot generalization through graph-guided sub-goal strategies.

Graph-Guided Sub-Goal Representation Generation RL (G4RL) is a class of hierarchical reinforcement learning (HRL) and goal-conditioned RL (GCRL) frameworks that leverage discrete or learned graph-structured abstractions of the environment, state, or goal space to generate, evaluate, and select sub-goal representations. These sub-goals guide exploration, credit assignment, and policy learning in sparse-reward or long-horizon tasks, often with improved sample efficiency and generalization. G4RL encompasses multiple implementations, including geometric subgoal graphs with classical RL, relational goal graphs with Bayesian estimation, and neural graph encoder-decoder models used as intrinsic reward sources in modern actor-critic RL.

1. Graph Construction and Representation

G4RL methods instantiate graph structures to capture the environment's connectivity, transition reachability, or relational structure among states and goals. Notable constructions include:

  • State/goal graphs: Nodes correspond to environment states or goal descriptors. Edges encode physical, topological, or policy-induced transitions.
  • Graph-based Discretization: In robotics, the accessible goal space $\mathcal{G}_a$ is gridded, excluding points within axis-aligned obstacle bounding boxes. Vertices are regular spatial points; 26-connected adjacency encodes possible transitions, with weights given by Euclidean distance. Edge inclusion is constrained so that no edge cuts through an obstacle. All-pairs shortest-path distances are then precomputed (e.g., with Dijkstra's algorithm) for efficient lookup (Bing et al., 2020). A minimal construction sketch follows this list.
  • Goal Relational Graph (GRG): A complete directed graph with weights $w_{ij} \in [0,1]$ defined as expected discounted first-passage probabilities, learned using a Dirichlet-categorical Bayesian model. Posterior updates are made after each low-level sub-goal attempt, and relational "cost" measures are derived from optimal paths in this graph (Ye et al., 2021).
  • Dynamically Built Graphs: In policy learning, graphs are built online from a replay buffer (by Farthest-Point Sampling of state or goal vectors), computing edge weights based on estimated policy value (e.g., $d(l^1, l^2) \approx -Q_\phi(s^1, \pi_\theta(s^1, l^2), l^2)$). Edges longer than a threshold are pruned (Kim et al., 2023).
  • Neural Graph Encoders: The state graph is constructed during exploration, with nodes storing state features and edge weights reflecting the number of traversals or estimated connectivity. A graph encoder-decoder network is trained to predict the adjacency structure, enabling embedding of unseen states for intrinsic reward shaping (Zhang et al., 14 Nov 2025).
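
As a concrete illustration of the grid-based construction above, the following sketch discretizes a 3-D goal space, drops vertices inside axis-aligned obstacle boxes, connects the remaining vertices with 26-connected, Euclidean-weighted edges, and precomputes all-pairs shortest paths with Dijkstra's algorithm. This is a minimal sketch assuming networkx; the function name, parameters, and obstacle representation are illustrative rather than taken from the cited papers, and a full implementation would also reject edges whose segments pass through an obstacle.

```python
import itertools
import numpy as np
import networkx as nx

def build_subgoal_graph(lo, hi, step, obstacles):
    """Discretize the goal space [lo, hi] and build a 26-connected subgoal graph.

    obstacles: list of (box_lo, box_hi) axis-aligned bounding boxes (3-vectors each).
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)

    def in_obstacle(p):
        return any(np.all(p >= b_lo) and np.all(p <= b_hi) for b_lo, b_hi in obstacles)

    # Regular grid of candidate vertices, dropping points inside obstacle boxes.
    axes = [np.arange(lo[d], hi[d] + 1e-9, step) for d in range(3)]
    vertices = [np.array(p) for p in itertools.product(*axes) if not in_obstacle(np.array(p))]

    G = nx.Graph()
    for i, v in enumerate(vertices):
        G.add_node(i, pos=v)

    # 26-connected adjacency: neighbours within one grid step along every axis,
    # weighted by Euclidean distance.  A full version would additionally discard
    # edges whose straight-line segment intersects an obstacle box.
    index = {tuple(np.round((v - lo) / step).astype(int)): i for i, v in enumerate(vertices)}
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    for key, i in index.items():
        for off in offsets:
            j = index.get(tuple(np.array(key) + off))
            if j is not None:
                G.add_edge(i, j, weight=float(np.linalg.norm(vertices[i] - vertices[j])))

    # All-pairs shortest-path distances, precomputed for O(1) lookup at training time.
    dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
    return G, vertices, dist
```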

2. Hierarchical Learning Architecture and Sub-Goal Integration

A hallmark of G4RL is the use of hierarchical decomposition, commonly in two-layer architectures:

  • High-Level Controller: Proposes sub-goals, typically at a slower timescale, based on partial observations, visible candidate sub-goals, and the graph-derived relational/planning knowledge. In GRG-based HRL, sub-goal selection incorporates path planning over the goal relational graph, with sub-goal embeddings modulated by a relational score $C(\tau^*_{sg,g})$ of the path from candidate sub-goal $sg$ to the final goal $g$ (Ye et al., 2021). A schematic two-level control loop is sketched after this list.
  • Low-Level Controller: Executes primitive actions to reach the given sub-goal, formulating the problem as fully observed (conditioned on sub-goal). It can be trained with intrinsic rewards for sub-goal arrival, typical RL objectives, or shaped with auxiliary losses from the graph encoder or relational predictions (Zhang et al., 14 Nov 2025, Ye et al., 2021).
  • Termination Logic: The low-level controller can optionally return control to the high level early if the agent encounters a preferable sub-goal along the optimal path to the target, avoiding forced detours and improving exploration efficiency (Ye et al., 2021).
  • Policy-Shaping via Self-Imitation: Some G4RL approaches add a loss to the actor that distills sub-goal-mode policies into the target-goal policies, enforcing consistency along the planned sub-goal sequence (Kim et al., 2023).
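
The two-level control flow described in this section can be summarized schematically as below. The sketch assumes generic high_policy, low_policy, and preferable callables and a Gym-style environment that exposes its goal; all names and the re-planning period k are hypothetical placeholders, not an interface from the cited papers.

```python
def run_episode(env, high_policy, low_policy, preferable, max_steps=500, k=10):
    """Schematic two-level rollout: the high level proposes sub-goals, the low level executes.

    high_policy(obs)           -> next sub-goal (graph node or goal vector)
    low_policy(obs, subgoal)   -> primitive action toward the sub-goal
    preferable(sg, obs, goal)  -> True if a sub-goal preferable to sg was reached en route
    """
    obs, goal = env.reset(), env.goal
    subgoal = high_policy(obs)
    for t in range(1, max_steps + 1):
        action = low_policy(obs, subgoal)
        obs, reward, done, info = env.step(action)
        # Re-plan on the slow timescale, on sub-goal arrival, or when the agent
        # stumbles onto a preferable sub-goal on the planned path (early termination).
        if t % k == 0 or info.get("subgoal_reached", False) or preferable(subgoal, obs, goal):
            subgoal = high_policy(obs)
        if done:
            break
    return obs
```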

3. Sub-Goal Generation, Planning, and Selection

Sub-goal representation and selection are graph-guided:

  • Discrete Planning: In geometric or spatial G4RL (e.g., SSG + RL), the high level runs fast A* search over the subgoal graph to extract an optimal sequence of sub-goals from start to goal. Each sub-goal is approached sequentially, with each segment addressed by a learned RL policy for either subgoal-approaching or obstacle avoidance (Zeng et al., 2018).
  • Relational Graph Planning: In GRG-based HRL, path planning is performed on the graph of goal relations, determining high-probability paths and relational "scores." Candidate sub-goals are modulated by the expected ease/probability as measured by the GRG (Ye et al., 2021).
  • Learned/Value-based Planning: Dynamic graph-based methods for long-horizon tasks generate sub-goal sequences from graph-based shortest plans, where edge weights reflect policy-gated value estimates. Stochastic sub-goal skipping is introduced to enhance exploratory coverage by probabilistically jumping ahead in the planned sub-goal sequence when the policy exhibits confidence (low imitation loss) (Kim et al., 2023); a sketch of this skipping rule follows this list.
  • Graph Encoder Inference: In environments with continuous state spaces, a learned encoder projects arbitrary state vectors into the latent graph-embedding space, with decoder dot-products yielding a learned similarity metric. This allows sub-goal selection or intrinsic reward computation even for previously unvisited states (Zhang et al., 14 Nov 2025).
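
To make stochastic sub-goal skipping concrete, the sketch below walks a planned sub-goal sequence and jumps past a sub-goal with a probability that grows as its self-imitation loss shrinks, i.e., when the policy is already confident about it. The skip schedule, loss_scale, and the imitation_loss callable are illustrative assumptions, not the exact rule from the cited paper.

```python
import numpy as np

def select_subgoal(plan, state, imitation_loss, loss_scale=1.0, rng=np.random):
    """Pick a sub-goal from a planned sequence, probabilistically skipping confident steps.

    plan: sub-goals ordered from nearest to the final goal.
    imitation_loss(state, sg): self-imitation loss for sub-goal sg (low = confident).
    """
    idx = 0
    while idx < len(plan) - 1:
        # Confidence in (0, 1]: high when the imitation loss for this sub-goal is low.
        confidence = np.exp(-imitation_loss(state, plan[idx]) / loss_scale)
        if rng.random() < confidence:
            idx += 1   # policy already handles this sub-goal well: skip ahead
        else:
            break      # not confident yet: commit to the current sub-goal
    return plan[idx]
```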

4. Training Objectives, Algorithmic Workflow, and Intrinsic Rewards

Training in G4RL involves integration of graph-based reasoning with standard RL optimization:

  • Extrinsic + Intrinsic Rewards: Intrinsic rewards can be shaped by the dot-product similarity between embedded current state and sub-goal (using the learned encoder), or by negative squared distance in the feature space (plus graph-guided novelty terms). The high-level reward often combines environment reward with a graph-decoder term; the low-level similarly augments its reward with graph-based features (Zhang et al., 14 Nov 2025).
  • Graph Encoder-Decoder Losses: Learning is driven by reconstruction of the adjacency structure from pairs of node embeddings, minimizing a squared loss between the predicted and normalized observed adjacency (Zhang et al., 14 Nov 2025); a sketch of this objective and the associated intrinsic reward follows the algorithmic loop below.
  • Off-policy Updates: G4RL typically uses actor-critic or Q-learning methods at each hierarchy level (e.g. DDPG, TD3, Double DQN), with experience replay augmented as necessary (e.g., with Hindsight Experience Replay) (Bing et al., 2020, Zhang et al., 14 Nov 2025).
  • Sub-goal Distillation/Imitation: In self-imitation architectures, the actor loss is extended by an L2 loss enforcing that target-goal-conditioned policies match those for intermediate sub-goals along the planned sequence (Kim et al., 2023).
  • Algorithmic Loop: A typical cycle (Zhang et al., 14 Nov 2025, Ye et al., 2021) may be summarized as:

    1. Update or expand the graph with new encountered states or goals.
    2. Perform path-planning or relational cost evaluation to select promising sub-goals.
    3. Propose sub-goals and collect episodes via high- and low-level policies.
    4. Store transitions and update graph statistics.
    5. Periodically retrain graph encoders or update GRG weights.
    6. Update policy networks with data, using shaped rewards and/or distillation losses.
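
The following PyTorch sketch illustrates the encoder-decoder objective and the dot-product intrinsic reward described above: node features are embedded, the decoder is an inner product between embeddings, training minimizes a squared error against the normalized observed adjacency, and the same embedding is reused for reward shaping. Network sizes and the normalization are illustrative assumptions rather than published hyperparameters.

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Embed state features so that embedding dot products approximate graph adjacency."""
    def __init__(self, state_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )

    def forward(self, x):
        return self.net(x)

def reconstruction_loss(encoder, node_feats, adjacency):
    """Squared error between the inner-product decoder and the normalized observed adjacency."""
    z = encoder(node_feats)                        # (N, d) node embeddings
    decoded = z @ z.t()                            # inner-product decoder
    target = adjacency / (adjacency.max() + 1e-8)  # normalize observed traversal counts
    return ((decoded - target) ** 2).mean()

def intrinsic_reward(encoder, state, subgoal):
    """Dot-product similarity between embedded state and sub-goal, used for reward shaping."""
    with torch.no_grad():
        zs, zg = encoder(state), encoder(subgoal)
        return (zs * zg).sum(dim=-1)
```

In use, the shaped term would typically be combined with the environment reward, e.g., r = r_env + beta * intrinsic_reward(encoder, s, sg) with a tunable coefficient beta, at both hierarchy levels.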

5. Empirical Results and Generalization Properties

Extensive validation of G4RL mechanisms across domains has been reported:

  • Robotic Manipulation: In FetchPushLabyrinth and similar OpenAI Gym robotics domains with obstacles, G4RL frameworks using graph-based sub-goal generation (e.g., G-HGG) demonstrated marked gains in sample efficiency and success rate, reaching >90% success by 300k training iterations in tasks where HER and Euclidean HGG failed to make progress. Sub-goal trajectories generated by graph guidance correctly avoid obstacles, unlike naively Euclidean approaches (Bing et al., 2020).

  • Partial Observability and Zero-Shot Generalization: The GRG-based variant yields superior generalization to novel goals or environment layouts. In 16×16 grid-worlds (16 goals per map, 12 trained / 4 held out), G4RL achieved an overall success rate (SR) of 0.74 and SPL (success weighted by path length) of 0.46, compared to SR = 0.45 for canonical hierarchical DQN, and was robust to unseen goals and environments (AI2-THOR, House3D) (Ye et al., 2021).
  • Long-Horizon Control: Graph-guided self-imitation markedly improved long-horizon learning. For example, in the large U-shaped AntMaze, success at 1 million steps improved from ≈19% (MSS baseline) to ≈57% with graph-guided self-imitation. Policy performance persisted even with the planner removed at test time, indicating strong internalization of the structural knowledge (Kim et al., 2023).
  • Path Planning and Robotics: The two-level SG-RL achieves sub-millisecond planning for abstract sub-goal paths, short and smooth trajectories in large-scale real maps, and robust recovery from unexpected obstacles via local RL policies, with action-switching frequency consistently below 10% (Zeng et al., 2018).
  • Sample Efficiency and Policy Diversity: Empirical results indicate that G4RL increases sample diversity, state-entropy, and robustness to stochasticity in transitions, broadening the agent's experience space (Kim et al., 2023, Zhang et al., 14 Nov 2025).

6. Practical Implementation and Limitations

Implementing G4RL requires careful attention to graph construction, computational tradeoffs, and environment representation:

  • Graph Construction Overhead: A finer grid or larger graph improves distance/proximity modeling but increases pre-computation, storage, and update cost. Sparse graphs trade off accuracy and speed; dynamic construction or sampling schemes (e.g., FPS; a generic routine is sketched after this list) can manage this (Zhang et al., 14 Nov 2025, Kim et al., 2023).
  • Policy-Conditioned Connectivity: Dynamic or learned graphs must robustly encode transitions feasible for the current policy. Value-based edge weights or Dirichlet posterior updates tie planning efficacy to policy learning (Ye et al., 2021, Kim et al., 2023).
  • Low-Dimensional State Constraints: Most G4RL evaluations to date involve compact, spatial or semantic state representations. Applying neural graph encoders to visual state-spaces or high-D observations would require additional metric learning or state embedding techniques (Zhang et al., 14 Nov 2025, Kim et al., 2023).
  • Environment Symmetry/Transition Regularity: Some G4RL instantiations assume symmetric or reversible transitions for maximal benefit. Performance in highly asymmetric or irreversible settings may require modifications (Zhang et al., 14 Nov 2025).
  • Exploration vs. Exploitation Tradeoffs: Stochastic sub-goal skipping, graph-governed early termination, and reward shaping substantially affect the learning progression, requiring calibrated hyperparameters (e.g., $\lambda$, $\alpha$, stopping thresholds). Performance is empirically robust to these values within broad ranges (Kim et al., 2023, Ye et al., 2021).
  • Computational Cost: Graph learning and encoder training approximately double computational time relative to base GCHRL in some settings, but performance improvements can offset this via faster learning or higher policy quality (Zhang et al., 14 Nov 2025).
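
As one concrete way to keep graph size manageable, the sketch below applies farthest-point sampling to a buffer of goal or state vectors to select a sparse set of landmark nodes. It is a generic greedy FPS routine under the assumption of Euclidean distances in the goal space; the function name and buffer format are illustrative.

```python
import numpy as np

def farthest_point_sampling(points, n_landmarks, rng=np.random):
    """Greedy FPS: choose n_landmarks points that are maximally spread out.

    points: (N, d) array of goal or state vectors drawn from the replay buffer.
    Returns the indices of the selected landmark nodes.
    """
    selected = [rng.randint(len(points))]                 # arbitrary first landmark
    min_dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(n_landmarks - 1):
        idx = int(np.argmax(min_dist))                    # farthest point from the current set
        selected.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected)
```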

Key References and Variants (core idea: paper; environment/domain):

  • Spatial subgoal graph + LSPI: "Combining Subgoal Graphs..." (Zeng et al., 2018); robotics, path planning
  • Graph-based hindsight goal generation: "Complex Robotic Manipulation..." (Bing et al., 2020); Fetch robotics
  • Dirichlet-categorical relational GRG: "Hierarchical...with GRG" (Ye et al., 2021); grid-world, AI2-THOR, House3D
  • Graph-based self-imitation/distillation: "Imitating Graph-Based Planning..." (Kim et al., 2023); AntMaze, Reacher, etc.
  • Neural graph encoder-decoder for GCHRL: "Incorporating Spatial Information..." (Zhang et al., 14 Nov 2025); diverse GCHRL benchmarks

G4RL offers a principled methodology for integrating graph-structured abstractions into hierarchical and goal-conditioned RL, enabling richer sub-goal representations, robust planning, and substantial empirical gains in complex and partially observable domains.
