Scalable Option Learning (SOL)

Updated 2 July 2026

SOL is a hierarchical reinforcement learning framework that treats options as temporally extended actions and uses abstraction and parallel computation to make their discovery and transfer scalable.
It integrates methods like Option Iteration, multi-timescale linear options, and meta-learned retrieval to reduce computational complexity and enhance policy performance in diverse settings.
Reported results include up to 240× higher training throughput and 10–100× better sample efficiency across domains such as NetHack, CraftWorld, and multi-agent control.

Scalable Option Learning (SOL) encompasses a class of frameworks and algorithms in hierarchical reinforcement learning (HRL) for the efficient, robust, and principled discovery, representation, learning, and execution of temporally extended actions, known as options. The aim is for the learning process, computational cost, decision complexity, and policy transferability to scale tractably with the dimensionality of the environment and the number of tasks or agents. SOL advances the classical options framework with statistical, algorithmic, and systems innovations that support large option libraries, heterogeneous abstraction levels, and multi-agent skill compositions, while retaining or improving theoretical and empirical guarantees relative to flat or naïve HRL approaches.

1. Formal Foundations of Scalable Option Learning

The foundational unit of HRL is an option, specified as a tuple $(\mathcal{I}_o, \pi_o, \beta_o)$, where $\mathcal{I}_o$ is the initiation set, $\pi_o(a \mid s)$ is the intra-option policy, and $\beta_o(s)$ is the termination condition. An option is temporally extended: once started in a state in $\mathcal{I}_o$, it follows $\pi_o$ until $\beta_o$ triggers termination. SOL generalizes across environments, tasks, and agent populations by operating over primitive MDPs, option-augmented SMDPs, or abstracted, symbolic state spaces. State abstraction, as in Conditional Abstraction Trees (CATs) or data-driven Voronoi decompositions, provides a scalable substrate for defining option endpoints and planning graphs (Nayyar et al., 2024, Shah et al., 2022).
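
As a minimal sketch of the option tuple, the Python below models the initiation set, intra-option policy, and termination condition as plain fields and runs one option to termination. The names `Option`, `run_option`, and `corridor_step`, the deterministic policy, and the six-state corridor are illustrative assumptions, not part of any cited SOL implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass
class Option:
    """An option (I_o, pi_o, beta_o) over integer states (illustrative only)."""
    initiation_set: Set[int]                 # I_o: states where the option may start
    policy: Callable[[int], int]             # pi_o: deterministic here for simplicity
    termination: Callable[[int], float]      # beta_o(s): probability of stopping in s


def run_option(option: Option,
               state: int,
               step: Callable[[int, int], Tuple[int, float]],
               rng: random.Random,
               max_steps: int = 100) -> Tuple[int, float, int]:
    """Follow the option until beta_o fires; return (state, total reward, duration)."""
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    total_reward, duration = 0.0, 0
    while duration < max_steps:
        action = option.policy(state)
        state, reward = step(state, action)
        total_reward += reward
        duration += 1
        if rng.random() < option.termination(state):
            break
    return state, total_reward, duration


def corridor_step(state: int, action: int) -> Tuple[int, float]:
    """Toy corridor with states 0..5 and a cost of 1 per move (an assumption)."""
    return min(max(state + action, 0), 5), -1.0


go_right = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: +1,
    termination=lambda s: 1.0 if s == 5 else 0.0,
)

final_state, total_reward, duration = run_option(
    go_right, 0, corridor_step, random.Random(0))
```

From the high-level controller's point of view, the whole call behaves like one SMDP transition: a start state, an accumulated reward, and a variable duration.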

A critical SOL property is the decoupling of option discovery and low-level policy learning from combinatorial task specification. In Option Iteration, the objective is a $K$-step cross-entropy distillation loss (a LogSumExp over option likelihoods) that encourages the learned set $\{\pi_n\}$ to cover the behaviors induced by multi-step planning search (Young et al., 2023). In multi-task and multi-agent instantiations, option libraries can scale to hundreds or thousands of options, with retrieval or selection efficiently mediated by embedding-based affinity functions or spectral subgoal discovery (Chauhan et al., 2022, Chen et al., 2023).
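
One way to read the LogSumExp distillation objective is sketched below: each option scores a planner-generated sequence of $K$ actions by its summed log-likelihood, and the loss is the negative LogSumExp of those scores across options. This is an interpretation of the description above, not the exact loss of Young et al.; the function name `k_step_option_loss` and the use of sampled actions as targets are assumptions.

```python
import math
from typing import Sequence


def k_step_option_loss(log_probs: Sequence[Sequence[float]]) -> float:
    """Negative LogSumExp over options of K-step sequence log-likelihoods.

    log_probs[n][k] is log pi_n(a_k | s_k) for option n at step k of a
    planner-generated sequence (hypothetical input layout).
    """
    seq_ll = [sum(row) for row in log_probs]   # one K-step log-likelihood per option
    m = max(seq_ll)                            # subtract the max for numerical stability
    return -(m + math.log(sum(math.exp(v - m) for v in seq_ll)))


# Two options, K = 3: option 0 explains the sequence well, option 1 does not.
log_probs = [
    [math.log(0.9)] * 3,
    [math.log(0.1)] * 3,
]
loss = k_step_option_loss(log_probs)
best_single = -max(sum(row) for row in log_probs)   # loss if only the best option counted
```

Because LogSumExp is bounded below by the maximum, the loss is small whenever at least one option explains the sequence, so the set only has to cover the planner's behavior collectively and individual options are free to specialize.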

2. Algorithmic Instantiations and Architectural Design

SOL is realized via diverse algorithmic constructions, including:

Option Iteration (OptIt): Iterative connection of planning and option distillation, using Monte Carlo Search over (action, option) pairs and K-step sequence loss to maintain a small set of options providing local coverage of the planning expert’s distributions. Unused options naturally drift out of use, mitigating overfitting and redundancy (Young et al., 2023).
Linear SOL with Multi-timescale Options: Lifting stochastic gradient TD learning to the options/SMDP setting, with randomly generated linear options over multiple time scales. The resulting updates retain $O(n)$ per-step computational cost (where $n$ is the feature dimension) irrespective of the number of options, ensuring practical tractability in high-dimensional or continuous spaces (Kumar et al., 2017).
Multi-agent Scalable Covering Option Discovery: Utilizing Kronecker decomposition of the joint state-space graph Laplacian to factorize Fiedler vector computations, enabling option discovery and composition in multi-agent settings with complexity linear in the number of agents, rather than exponential in the joint state space (Chen et al., 2023).
High-throughput Hierarchical Actor-Critic: Unified network architectures using multi-head output and policy-index embeddings for both controller and option policies, with fully batched, parallel return computation (e.g., V-trace on GPU). This supports throughput at industrial scale, 25× to 240× higher than prior methods on NetHack, MiniHack, and MuJoCo environments (Henaff et al., 2025).
Task-indexed or Affinity-based Option Retrieval: Meta-learned embedding/query architectures (e.g., Option-Indexed HRL) restrict the active option set for new tasks, making learning in very large option libraries tractable. The QGN computes embeddings from environment/task features to select relevant options in $O(kd)$ time, reducing action space dimensionality and decision complexity by up to 100× in practice (Chauhan et al., 2022).
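
The retrieval idea in the last item can be sketched as a top-$k$ selection over option embeddings. The dot-product affinity, the function name `retrieve_options`, and the toy embeddings are assumptions made for illustration; the meta-learned query network itself is not reproduced here. This brute-force version scores every option in the library, so it costs $O(Nd)$ for $N$ options of dimension $d$ and does not reproduce the $O(kd)$ figure quoted above.

```python
import numpy as np


def retrieve_options(task_embedding: np.ndarray,
                     option_embeddings: np.ndarray,
                     k: int) -> np.ndarray:
    """Return indices of the k options with the highest affinity, best first.

    task_embedding: shape (d,); option_embeddings: shape (N, d).
    Affinity is a plain dot product (an assumption for this sketch).
    """
    n_options = option_embeddings.shape[0]
    if not 1 <= k <= n_options:
        raise ValueError("k must be between 1 and the number of options")
    scores = option_embeddings @ task_embedding          # (N,) affinities
    top = np.argpartition(-scores, k - 1)[:k]            # unordered top-k indices
    return top[np.argsort(-scores[top])]                 # sort the k winners by score


# Toy library of 4 options with one-hot embeddings.
option_embeddings = np.eye(4)
task_embedding = np.array([0.1, 0.9, 0.0, 0.5])
active = retrieve_options(task_embedding, option_embeddings, k=2)
```

The high-level policy then chooses only among the `active` indices, so its action space has $k$ entries rather than one per option in the library.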

The following table compares representative SOL instantiations:

Reference	Key Principle	Scaling Mechanism	Domain
(Young et al., 2023)	Option Iteration, planning	K-step distillation, option pruning	Grid, maze, 2-level
(Kumar et al., 2017)	Linear SMDP-TDC, time scales	$O(n)$ cost, random linear options	Grid, RBF-continuous
(Henaff et al., 2025)	Batched actor-critic	Multi-head net, parallel returns	NetHack, MuJoCo
(Chen et al., 2023)	Multi-agent Kronecker graph	Factorized Laplacians, eigenvectors	Multi-agent control
(Chauhan et al., 2022)	Option-indexing, retrieval	Affinity embeddings, HRL restriction	CraftWorld, AI2THOR
(Nayyar et al., 2024)	Continual invention, CAT	Symbolic option extraction, planning	Hybrid MDPs

3. Theoretical Guarantees and Convergence Properties

SOL inherits, extends, or adapts several types of theoretical assurance:

Approximate Policy Improvement: In Option Iteration, the mixture-of-options “expert” satisfies a generalized policy improvement condition; selecting among $N$ option-induced rollouts is never worse than any single constituent option (Young et al., 2023).
Convergence with Linear Function Approximation: For SMDP-TDC-based SOL, under standard boundedness and stochastic approximation assumptions, the two-timescale updates converge almost surely to the TD fixed point for both the weights and the auxiliary variables, as established via Borkar–Meyn theory (Kumar et al., 2017).
Task Composability and Downward Refinability: In high-level planning, the construction of abstract state graphs and option chaining ensures that abstract plans map to executable low-level trajectories with composable, refinable guarantees, provided each option achieves its subgoal with high probability (Shah et al., 2022).
Complexity Reduction: In multi-agent or large-option domains, Kronecker-sum and embedding-indexed methods reduce the cost of option discovery from exponential in the number of agents (the joint state space has $m^n$ states) to linear in $n$ (for $n$ agents, $m$ states per agent), and the number of options considered at test time from $N$ to $k$ ($N$ total options, $k$ relevant per task) (Chen et al., 2023, Chauhan et al., 2022).
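
The factorization behind the multi-agent complexity reduction can be illustrated with a standard spectral fact. If the joint transition graph is the Cartesian product of per-agent graphs, its Laplacian is the Kronecker sum of the per-agent Laplacians, and its eigenpairs are sums of eigenvalues and Kronecker products of eigenvectors. The sketch below assumes that product structure and connected per-agent graphs; the function names and the path-graph example are illustrative and are not the exact procedure of Chen et al.

```python
import numpy as np


def path_laplacian(m: int) -> np.ndarray:
    """Graph Laplacian of a path with m nodes (toy per-agent state graph)."""
    adj = np.zeros((m, m))
    idx = np.arange(m - 1)
    adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj


def factored_fiedler(laplacians):
    """Joint Fiedler value and vector from per-agent eigendecompositions only."""
    pairs = []
    for lap in laplacians:
        vals, vecs = np.linalg.eigh(lap)        # eigenvalues in ascending order
        pairs.append((vals[1], vecs[:, 1]))     # second-smallest pair per agent
    agent = int(np.argmin([val for val, _ in pairs]))
    value, vec = pairs[agent]
    # Lift to the joint space: constant on every other agent's coordinates.
    lifted = np.ones(1)
    for j, lap in enumerate(laplacians):
        lifted = np.kron(lifted, vec if j == agent else np.ones(len(lap)))
    return value, agent, lifted / np.linalg.norm(lifted)


def joint_laplacian(laplacians) -> np.ndarray:
    """Explicit Kronecker sum; built here only to check the factored result."""
    sizes = [len(lap) for lap in laplacians]
    total = np.zeros((int(np.prod(sizes)),) * 2)
    for i, lap in enumerate(laplacians):
        term = np.eye(1)
        for j, size in enumerate(sizes):
            term = np.kron(term, lap if j == i else np.eye(size))
        total += term
    return total


laplacians = [path_laplacian(3), path_laplacian(5)]
value, agent, lifted = factored_fiedler(laplacians)
joint = joint_laplacian(laplacians)             # 15 x 15; grows as m**n in general
```

The factored route needs one small eigendecomposition per agent, while the explicit route must decompose a matrix whose side length is the product of the per-agent state counts.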

4. Empirical Domains and Performance Results

SOL frameworks demonstrate substantial performance gains across a variety of benchmarks and domains:

Planning and Search: OptIt accelerates convergence on Compass, ElectricProcMaze, and hierarchical variants, achieving up to 0.98 average return in 100k steps vs. 0.2 for single-policy methods (Young et al., 2023).
Continuous and Stochastic Path Planning: SOL with region-based abstraction and pseudo-reward-guided option learning achieves ≥80% task success and 2×–5× speedups over flat SAC or replanning RRT, with robust option reuse across multiple start–goal pairs (Shah et al., 2022).
Multi-agent Tasks: Kronecker-graph SOL outperforms MAPPO (without options) by 2–5× in sample efficiency, and scales to 6 agents without exponential blow-up, leveraging multi-agent options (Chen et al., 2023).
High-throughput RL: In NetHack-scale training, SOL achieves 25× to 240× higher throughput in environment steps per second, together with sustained improvement over flat and prior hierarchical baselines in both NetHack and MiniHack, attaining higher scores and demonstrably improved credit assignment for temporally extended rewards (Henaff et al., 2025).
Option Library and Task Generalization: OI-HRL achieves near-oracle retrieval of relevant options, reducing sample complexity by 10–100× and enabling zero-shot performance on test variants in CraftWorld and AI2THOR (Chauhan et al., 2022).
Continual/Long-horizon Tasks: Symbolic abstraction-based CHaPRL solves new instances in large MDPs (Maze, Office, Taxi, Minecraft) in 5–20× fewer steps than PPO or Option-Critic, due to automatic invention and reuse of high-level options (Nayyar et al., 2024).

5. Option Discovery, Abstraction, and Transferability

SOL frameworks frequently combine data-driven state abstraction, hierarchical planning, and transfer learning. Approaches such as conditional abstraction trees, region-based Voronoi diagrams, and spectral embedding (Fiedler) vectors enable principled subgoal and endpoint selection for option definition, yielding options that are composable, reusable, and, in recent work, mutually independent for minimal interference (Nayyar et al., 2024, Shah et al., 2022, Chen et al., 2023).

Transfer mechanisms include:

Meta-learned Affinity-based Indexing: Mapping environment/task descriptors to relevant subsets of the option pool enables effective zero-shot and few-shot transfer, even as library size increases or environmental complexity grows (Chauhan et al., 2022).
Soft Option Priors and Posterior Regularization: In Multi-task Soft Option Learning (MSOL), a shared prior is regularized across many task-specific posteriors, supporting option reuse and specialization without catastrophic forgetting or local-optima collapse (Igl et al., 2019).
Symbolic Invention and Lookahead Planning: In continual RL, options with symbolic endpoints permit direct transfer, refinement, and planning in novel environments without retraining low-level policies (Nayyar et al., 2024).

6. Systems-Level and Computational Scalability

SOL’s tractability at scale is achieved through several complementary strategies:

Network-sharing and Batched Computation: All controller and option policies are parameterized within a single neural network with policy-index multiplexing and multi-head outputs, maximizing hardware efficiency and minimizing context-switching overhead (Henaff et al., 2025).
Parallel and Off-policy Rollouts: Option policies are trained via experience aggregated in parallel actor pools or distributed environments, with off-policy RL algorithms (e.g., SAC) ensuring sample efficiency and parallel data collection (Shah et al., 2022, Henaff et al., 2025).
Low-variance, Masked Return Estimation: Vectorized computation of returns/advantages for multiple policies in a single backward pass, using masking indices, enables orders-of-magnitude speedup in training step throughput (Henaff et al., 2025).
Efficient Option Discovery: Kronecker-graph eigen-decomposition reduces skill selection cost from exponential to linear in agent or state dimensions, while a meta-learned retrieval index reduces test-time computational demand from hundreds/thousands of options to a small relevant subset (Chen et al., 2023, Chauhan et al., 2022).
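
The masked return estimation described in the list above can be sketched as a single backward pass over a batch of trajectories in which each step is tagged with the index of the policy that acted. The version below is a simplified bootstrapped return, assuming that accumulation continues while the same policy stays active and bootstraps from a value estimate when the active policy changes. It omits V-trace importance weights, and the name `masked_returns` and the array layout are hypothetical.

```python
import numpy as np


def masked_returns(rewards: np.ndarray,
                   values: np.ndarray,
                   policy_ids: np.ndarray,
                   gamma: float = 0.99) -> np.ndarray:
    """Per-step return targets for many policies in one backward pass.

    rewards:    (T, B) rewards for B parallel trajectories.
    values:     (T + 1, B) value estimates used for bootstrapping.
    policy_ids: (T, B) integer index of the policy acting at each step.
    """
    horizon, batch = rewards.shape
    returns = np.empty((horizon, batch))
    nxt = values[horizon]                       # bootstrap beyond the last step
    for t in range(horizon - 1, -1, -1):
        returns[t] = rewards[t] + gamma * nxt   # vectorized over the batch
        if t > 0:
            same = policy_ids[t - 1] == policy_ids[t]
            # Keep accumulating within a policy's segment, else cut and bootstrap.
            nxt = np.where(same, returns[t], values[t])
    return returns


rewards = np.ones((4, 1))
values = np.zeros((5, 1))
policy_ids = np.array([[0], [0], [1], [1]])
targets = masked_returns(rewards, values, policy_ids, gamma=0.5)

# A per-policy loss can then select its own steps with a boolean mask.
mask_policy_1 = policy_ids == 1
mean_target_policy_1 = targets[mask_policy_1].mean()
```

The only Python-level loop runs over time; every policy and every trajectory in the batch is handled by the same array operations, which is what makes this style of computation suitable for a GPU.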

7. Open Challenges and Limitations

Although SOL methods address major computational and transferability bottlenecks, certain limitations remain. Intrinsic rewards may need to be hand-specified; automatic synthesis (via novelty, diversity, or LLM feedback) is a potential extension. Option termination mechanisms are often based on fixed durations or simplistic rules, so interruptibility and more flexible control may further improve agility. SOL is optimized for regimes where compute throughput is the main limitation, and combining it with model-based or advanced off-policy learning could yield further sample efficiency gains (Henaff et al., 2025).

In summary, Scalable Option Learning constitutes a rigorously developed family of methodologies for principled, efficient, and generalizable hierarchical reinforcement learning. By integrating abstraction, learning, planning, and large-scale computation, SOL frameworks provide the foundation for advancing RL into complex, unstructured, or continual environments at scale (Young et al., 2023, Kumar et al., 2017, Shah et al., 2022, Chen et al., 2023, Igl et al., 2019, Henaff et al., 2025, Chauhan et al., 2022, Nayyar et al., 2024).