Option Graph for Temporal Abstraction

Updated 13 December 2025

Option Graph for Temporal Abstraction is a formal structure that organizes temporally extended actions (options) into graphs, with nodes representing options and edges reflecting initiation, termination, or compositional transitions.
It employs dynamic, recursive constructions and hierarchical nesting to facilitate model-based compositional planning, thereby enhancing sample efficiency and planning depth.
Empirical implementations of Option Graphs, through algorithms like Hierarchical Option-Critic, OOMI, and variational methods, demonstrate robust performance in both discrete and continuous domains.

An Option Graph for temporal abstraction is a formal, algorithmic, and representational structure used in hierarchical reinforcement learning and planning. It organizes and composes temporally extended actions, called options, into graphs whose nodes correspond to options or option-induced abstract states and whose edges capture option initiation, termination, or compositional planning relationships. This structure underpins methods that learn and exploit temporal abstractions to improve sample efficiency, planning depth, and the representation of hierarchical structure in complex sequential decision-making domains. Recent methods construct, optimize, and use such graphs for compositional planning, deep option hierarchies, and latent-reasoning abstractions in both discrete and continuous spaces.

1. Formal Definitions and Structural Foundations

The option framework generalizes actions in Markov Decision Processes (MDPs) to temporally extended macro-actions. An option is defined as a triple $o = (\mathcal{I}_o, \pi_o, \beta_o)$, where $\mathcal{I}_o \subseteq S$ is the initiation set, $\pi_o$ is the intra-option policy, and $\beta_o$ is the termination condition (Silver et al., 2012, Riemer et al., 2018, Young et al., 2023).
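As a minimal sketch, the triple can be written as a small data structure. The integer state coding, the deterministic intra-option policy, and the `go_right` corridor option are illustrative assumptions, not part of the cited formulations.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

State = int   # assumption: states are integer-coded
Action = int  # assumption: actions are integer-coded


@dataclass(frozen=True)
class Option:
    """An option o = (I_o, pi_o, beta_o)."""
    initiation_set: FrozenSet[State]       # I_o: states where the option may start
    policy: Callable[[State], Action]      # pi_o: intra-option policy (deterministic here)
    termination: Callable[[State], float]  # beta_o: probability of terminating in a state

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set


# Hypothetical corridor option: from cells 0..3, step right (+1) until cell 4.
go_right = Option(
    initiation_set=frozenset(range(4)),
    policy=lambda s: +1,
    termination=lambda s: 1.0 if s == 4 else 0.0,
)
```

Here `go_right.can_start(4)` is false because cell 4 lies outside the initiation set, while `go_right.termination(4)` returns 1.0, so the option always hands control back once it reaches that cell.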

An Option Graph is a directed multigraph or tree structure that encodes relationships between these options. Nodes can represent option models, high-level option states, or abstract option-augmented state tuples. Edges indicate one-step compositional transitions, control handoff via option termination, or option-to-option policy selection (Riemer et al., 2018, Silver et al., 2012, Young et al., 2023, Li et al., 22 Jul 2025).

Formally, in the model-compositional perspective (Silver et al., 2012), each option corresponds to a model $M_o = \left( \begin{smallmatrix} 1 & R^o \\ 0 & P^o \end{smallmatrix} \right)$, capturing expected returns and transitions until termination. Compositional planning recursively builds hierarchies of such options, forming nodes and edges in an emergent Option Graph. Abstract Markov Decision Processes (HiT-MDPs) further generalize this to continuous and embedding-based settings, defining nodes as state-option pairs $(s, o_{prev})$ and edges by option-conditioned transitions (Li et al., 22 Jul 2025).
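The following sketch illustrates how two option models compose into the model of their sequence. The pair `(R, P)` holds the same two blocks as $M_o$. The indexing convention (`R[s]` by start state, `P[s][t]` as the discounted weight of terminating in `t` when started in `s`), the three-state chain, and the discount of 0.5 are assumptions of this example rather than the paper's notation.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Model = Tuple[Vector, Matrix]  # (R, P) blocks of an option model


def compose(first: Model, second: Model) -> Model:
    """Model of running `first` to termination, then `second` to termination."""
    r1, p1 = first
    r2, p2 = second
    n = len(r1)
    # Return of `first`, plus the return of `second` from wherever `first` ends.
    r = [r1[s] + sum(p1[s][k] * r2[k] for k in range(n)) for s in range(n)]
    # Chain the two discounted termination distributions.
    p = [[sum(p1[s][k] * p2[k][t] for k in range(n)) for t in range(n)]
         for s in range(n)]
    return r, p


# Hypothetical chain 0 -> 1 -> 2 with discount 0.5; each step costs 1,
# and state 2 is absorbing with zero reward.
step: Model = (
    [-1.0, -1.0, 0.0],
    [[0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.5]],
)
two_steps = compose(step, step)
```

The composed pair has the same shape as its inputs, so it can be fed back into `compose`, which is what allows a hierarchy of models to be built recursively.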

2. Construction and Recursion: Dynamic Building of Option Graphs

Dynamic construction of Option Graphs arises from recursive composition and joint optimization procedures:

Model-based composition: In compositional planning (Silver et al., 2012), the Option Graph grows through the recursive application of the generalized Bellman equation:

$M^*_G = \arg\max_{(\pi, \beta)}( E_\pi(O) \circ E_\beta(I,M^*_G) )$

where $E_\pi(O)$ averages over base option models, and $E_\beta$ encodes stochastic termination. Each iteration extends nodes (option models) and edges (composition relations).

Hierarchical option nesting: In deep hierarchical methods (e.g., Hierarchical Option-Critic, HOC), options are organized into levels, each with its own intra-option policy $\pi^\ell$, and the graph enforces strict nesting via “call-and-return” semantics. Each node at level $\ell$ points to child options at level $\ell+1$, creating a directed tree structure (Riemer et al., 2018).
Option-induced state graphs: In Markovian latent abstraction settings, the HiT-MDP (Li et al., 22 Jul 2025) is itself a graph whose nodes are all possible (state, previous-option) tuples, and whose edges encode possible transitions under jointly chosen action-option pairs. The adjacency between pure option indices can be represented by a learned matrix $A_{ij} = \mathbb{E}_{s\sim\mu}\, [\pi^O(o_{new}=j|s,i)]$.
Learning by planning: In Option Iteration (Young et al., 2023), options are discovered and refined by matching multi-step search rollouts, implicitly inducing a directed graph over learned temporal abstractions.
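The option-to-option adjacency $A_{ij}$ from the option-induced state graph can be approximated by averaging the option-selection policy over sampled states. In this sketch the fixed list of states stands in for samples from $\mu$, and `toy_policy` is a hypothetical hand-written policy, not a learned one.

```python
from typing import Callable, List, Sequence


def option_adjacency(
    states: Sequence[int],
    option_policy: Callable[[int, int], List[float]],
    num_options: int,
) -> List[List[float]]:
    """Estimate A[i][j] = E_s[pi^O(o_new = j | s, i)] as a sample average."""
    adjacency = [[0.0] * num_options for _ in range(num_options)]
    for s in states:
        for i in range(num_options):
            probs = option_policy(s, i)  # distribution over the next option
            for j in range(num_options):
                adjacency[i][j] += probs[j] / len(states)
    return adjacency


def toy_policy(state: int, prev_option: int) -> List[float]:
    """Hypothetical rule: keep the previous option in even states, switch in odd ones."""
    keep = 1.0 if state % 2 == 0 else 0.0
    probs = [0.0, 0.0]
    probs[prev_option] = keep
    probs[1 - prev_option] = 1.0 - keep
    return probs


A = option_adjacency(states=[0, 1, 2, 3], option_policy=toy_policy, num_options=2)
```

Each row of `A` is an average of probability distributions and therefore sums to one, so the matrix can be read as a Markov chain over option indices.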

3. Algorithms and Policy Optimization in Option Graphs

Option Graphs serve as substrates for a diverse range of learning and planning algorithms:

Hierarchical Option-Critic (HOC): Supports an arbitrarily deep hierarchy of option sets $\Omega^1, \ldots, \Omega^{L-1}$ and a set of primitive actions $\Omega^L = \mathcal{A}$, with distinct intra-option policies and termination functions per level. Policy and termination gradients operate hierarchically across the graph’s levels, maximizing the return via discounted occupancy and advantage terms (Riemer et al., 2018).
Compositional Planning (OOMI): The Optimal Option Model Iteration algorithm iteratively constructs new option models and updates existing ones according to the model-optimality equation. Each sweep evaluates and selects among candidate option compositions, updating the Option Graph’s structure and enabling compositional value iteration with complexity $O(m n (K+m))$ per sweep (Silver et al., 2012).
Variational Markovian Option Critic (VMOC): Learns stochastic option embeddings and policies via a variational ELBO, maximal-entropy objective, and soft Bellman backups. The Option Graph emerges from the induced latent space and is utilized both for reasoning and control (Li et al., 22 Jul 2025).
Option Iteration: Maintains and trains a set of options such that for each encountered state, one matches the optimal trajectory for a random horizon, with gating policy $\rho(n|s)$ . The OptIt loss enforces probabilistic mixture matching between the learned set and search rollout distributions over fixed horizons, and the Option Graph captures the inter-option transition structure (Young et al., 2023).
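To illustrate why planning over option models can deepen each sweep, the sketch below runs plain value iteration over a fixed set of option models. It is not OOMI itself, which also constructs and updates the models. The three-state chain, the discount of 0.5, and the `step` and `jump` models are hypothetical.

```python
from typing import Dict, List, Tuple

Model = Tuple[List[float], List[List[float]]]  # (R, P) as expected return and discounted transitions


def plan_over_options(
    models: Dict[str, Model], num_states: int, sweeps: int
) -> Tuple[List[float], List[str]]:
    """Value iteration where each backup maximizes over whole option models."""
    values = [0.0] * num_states
    greedy = [""] * num_states
    for _ in range(sweeps):
        new_values = []
        for s in range(num_states):
            best_name, best_q = "", float("-inf")
            for name, (r, p) in models.items():
                q = r[s] + sum(p[s][t] * values[t] for t in range(num_states))
                if q > best_q:
                    best_name, best_q = name, q
            new_values.append(best_q)
            greedy[s] = best_name
        values = new_values
    return values, greedy


# Chain 0 -> 1 -> 2 with reward 1 on entering terminal state 2, discount 0.5.
step: Model = ([0.0, 1.0, 0.0],
               [[0.0, 0.5, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]])
# "jump" is the model of taking `step` twice in a row.
jump: Model = ([0.5, 1.0, 0.0],
               [[0.0, 0.0, 0.25], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

flat_values, _ = plan_over_options({"step": step}, num_states=3, sweeps=1)
values, greedy = plan_over_options({"step": step, "jump": jump}, num_states=3, sweeps=1)
```

After a single sweep the flat planner still assigns state 0 a value of zero, whereas the planner that also has the two-step model already assigns it the correct value of 0.5 and selects `jump` there.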

4. Empirical Performance and Domain-specific Applications

Option Graph-based algorithms have empirically demonstrated pronounced sample efficiency, robust planning, and improved policy generalization:

Method	Key Empirical Results	Domains Tested
HOC (Riemer et al., 2018)	30% reduction in sample complexity over 2-level OC; higher performance in Atari multi-task (+1.5 A2HOC vs. A3C)	Four-Rooms, Stochastic Chain, Multi-story Navigation, Atari
OOMI (Silver et al., 2012)	Solves Tower of Hanoi in $O(N)$ sweeps, vs. $O(2^N)$ in flat VI	Tower of Hanoi, Nine Rooms
VMOC (Li et al., 22 Jul 2025)	Outperforms PPO, OC, and baselines in MuJoCo control; best OOD logical reasoning	MuJoCo, GSM8k, SVAMP, CSQA
OptIt (Young et al., 2023)	Near-optimal returns in Compass/ElectricProcMaze; closes sample efficiency gap over Expert Iteration	Gridworld, Maze, Hierarchical ElectricProcMaze

This table summarizes domain-specific efficacy, with Option Graph frameworks exhibiting substantial improvements in structured, compositional, or high-level reasoning environments.

5. Representational and Theoretical Properties

The graph-based abstraction of options is foundationally connected to value-preserving homomorphisms, compositional theory, and multi-level temporal reasoning:

Homomorphisms and value equivalence: Continuous HiT-MDP homomorphisms guarantee that value functions and optimal policies learned in the Option Graph are preserved when lifted back to the original MDP (Li et al., 22 Jul 2025). This preserves optimality of learned abstractions.
Recursive closure: Option Graphs constructed by model composition or hierarchical selection are closed under repeated application, enabling the representation of arbitrarily long temporal abstractions or macro-operators.
Directed multigraph structure: In compositional planning, both options and option models are nodes; compositional relationships define edges and support macro-level planning over the graph, dramatically reducing computational cost in structured tasks (Silver et al., 2012).
Graph-embedding perspective: In option-embedding architectures, nodes are mapped to continuous vectors, and adjacency (option-to-option transition probabilities) is learned, providing a basis for unsupervised option discovery and reasoning (Li et al., 22 Jul 2025).

6. Extensions: Planning, Reasoning, and Deep Option Graphs

Option Graphs admit further extensions and generalizations:

Deep option hierarchies: HOC and related work enable learning at multiple temporal resolutions, with options at each level specializing to particular temporal scales observed in the environment (Riemer et al., 2018).
Multi-subgoal and compositional planning: The OOMI algorithm rapidly constructs hierarchical options for multiple subgoals simultaneously, enabling efficient solution of problems with exponential procedural depth (e.g., Tower of Hanoi) in polynomial time (Silver et al., 2012).
Latent-option reasoning in abstraction: The VMOC framework applies the Option Graph principle to abstract CoT-style reasoning, with latent options capturing implicit reasoning steps and enabling cold-start transfer from human chain-of-thought data (Li et al., 22 Jul 2025).
Dynamic option discovery: Option Iteration links search (planning) and learning, using a dynamic mixture of temporally-extended options discovered by matching value estimates and rollouts, and constructing a fluid option graph that captures experience-dependent structure (Young et al., 2023).
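The call-and-return semantics behind deep option hierarchies can be sketched as a recursive executor in which a child option runs until it terminates and then hands control back to its parent. The `Node` structure, the deterministic termination tests, and the one-dimensional corridor are illustrative assumptions and do not reproduce HOC's learned policies.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Node:
    """One option in a nested hierarchy (hypothetical toy structure)."""
    name: str
    choose: Callable[[int], Union["Node", int]]  # picks a child option or a primitive action
    done: Callable[[int], bool]                  # deterministic termination test


def run(option: Node, state: int, trace: List[str]) -> int:
    """Execute `option` until it terminates, recursing into child options."""
    while not option.done(state):
        child = option.choose(state)
        if isinstance(child, Node):
            trace.append(f"{option.name} calls {child.name} at {state}")
            state = run(child, state, trace)  # control returns here afterwards
        else:
            state += child  # primitive action: move along the corridor
    return state


# Two low-level options that step right until a landmark cell is reached.
to_3 = Node("to_3", choose=lambda s: 1, done=lambda s: s >= 3)
to_6 = Node("to_6", choose=lambda s: 1, done=lambda s: s >= 6)
# A top-level option that sequences them.
root = Node("root", choose=lambda s: to_3 if s < 3 else to_6, done=lambda s: s >= 6)

trace: List[str] = []
final_state = run(root, 0, trace)
```

The trace records one call per child option, so the top level makes two decisions while the primitive level makes six, which is the sense in which each level operates at its own temporal resolution.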

Collectively, the Option Graph for temporal abstraction provides a unifying structure for organizing, learning, and exploiting hierarchies of temporally extended actions across planning, reinforcement learning, and abstract reasoning, supported by empirical evidence of sample-efficient learning and by theoretical results on value preservation and compositional generalization (Riemer et al., 2018, Silver et al., 2012, Young et al., 2023, Li et al., 22 Jul 2025).