AgentOWL: Hierarchical Options & World Modeling

Updated 6 February 2026
  • AgentOWL is a hierarchical reinforcement learning framework that integrates the learning of temporally extended skills (options) with an abstract world model for efficient planning.
  • The framework interleaves model-based and model-free learning, employing value-preserving abstractions and LLM-aided subgoal discovery to construct deep, compositional skill hierarchies.
  • Empirical results demonstrate that AgentOWL outperforms baseline methods in object-centric Atari and high-dimensional navigation tasks with significantly reduced environment interactions.

AgentOWL (Option and World Model Learning Agent) is a framework for hierarchical reinforcement learning (HRL) that interleaves the learning of temporally extended skills (“options”) with the construction of an abstract world model. The central principle is to achieve sample-efficient and autonomous learning of deep skill hierarchies through the composition of options and value-preserving, temporally abstract modeling. AgentOWL enables planning with options, model-based skill composition, and efficient generalization to new goals, and it provides theoretical guarantees of bounded value loss in the resulting policies. Key empirical studies demonstrate that AgentOWL masters more options more quickly and with fewer environment interactions than baseline methods, particularly in object-centric Atari domains and high-dimensional navigation environments (Rodriguez-Sanchez et al., 2024; Piriyakulkij et al., 2 Feb 2026).

1. Formalization and High-Level Architecture

AgentOWL is defined over a discounted Markov Decision Process (MDP) $M = (S, A, P, R, \gamma)$, with a potentially continuous state space $S$ and action space $A$. A collection of options $O = \{o_1, \dots, o_K\}$, each described as $o = (I_o, \pi_o, \beta_o)$ (initiation set, intra-option policy, termination function), produces a temporally abstracted action space. By treating options as actions, AgentOWL induces a grounded semi-MDP (SMDP) $M_O = (S, O, P_O, R_O, \gamma)$, whose transitions and rewards integrate over option durations:

  • $P_O(s' \mid s, o) = \sum_{\tau=1}^{\infty} \gamma^{\tau} \Pr(S_\tau = s', \text{terminate at } \tau \mid s, o)$
  • $R_O(s, o) = \mathbb{E}\big[\sum_{t=1}^{\tau-1} \gamma^{t} R(S_t, A_t) \mid s, o\big]$
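As a concrete illustration of the $R_O$ and $P_O$ definitions above, the discounted return and the effective discount weight for one option execution can be accumulated from a recorded trajectory. This is a minimal sketch with illustrative names, not code from the AgentOWL papers:

```python
# Sketch: accumulate the SMDP reward and discount factor for one option
# execution, following the R_O / P_O definitions above.
def option_outcome(rewards, gamma):
    """rewards: primitive rewards r_1 .. r_{tau-1} observed while the
    option ran; returns (discounted return, effective discount gamma^tau)."""
    r_total = sum(gamma**t * r for t, r in enumerate(rewards, start=1))
    tau = len(rewards) + 1           # the option terminated at step tau
    return r_total, gamma**tau       # gamma^tau weights the successor state
```

Here `gamma**tau` is exactly the factor that appears inside the $P_O$ sum: the successor state reached at termination is discounted by the option's duration.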

AgentOWL interleaves two core algorithmic components:

  • Abstract World Model (AWM): predicts distributions of future abstract features $p_o(f' \mid s)$ for each option.
  • Hierarchy of Neural Options: each $o_i = (\pi_i, g_i)$ includes a policy $\pi_i$ (which may invoke lower-level options) and a goal predicate $g_i$ (option termination).

The architecture supports planning in the abstract space via AWM to generate high-level plans, which are refined and executed in the environment by recursively unrolling option policies, allowing data-efficient skill composition and deep hierarchy acquisition (Piriyakulkij et al., 2 Feb 2026).

2. Abstraction Mechanisms: State and Time

State abstraction in AgentOWL denotes mapping ground states $s$ to a low-dimensional latent $z = \varphi(s)$ or abstract feature vector $f(s)$, typically encoding achieved goal predicates. The abstraction must be “dynamics-preserving” (or “value-preserving”): for every $s$ and $o$,

$$P_O(s' \mid s, o) \approx \widetilde{P}(s' \mid \varphi(s), o), \qquad I_o(s) = I_o(\varphi(s))$$

This property ensures that planning with the abstract SMDP $\overline{M} = (\overline{Z}, O, \overline{P}, \overline{R}, \gamma)$ (where $\overline{Z}$ is the abstract state space and $\overline{P}, \overline{R}$ marginalize over $\varphi^{-1}(z)$) can produce near-optimal policies for the original environment (Rodriguez-Sanchez et al., 2024).

Time abstraction is intrinsic—the abstract model predicts transitions not at each primitive time step but after arbitrary option execution durations, compressing long sequences into a single prediction step.

Feature learning and abstraction are driven by maximizing mutual-information objectives with an InfoNCE contrastive loss, encouraging $z$ to be maximally predictive of successor states and option initiability (Rodriguez-Sanchez et al., 2024).
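To make the contrastive objective concrete, the following is a minimal batch-level sketch of an InfoNCE loss in which each latent $z_t$ must identify its own successor among the other successors in the batch. It is a hypothetical illustration (function name, temperature, and normalization are assumptions), not the papers' exact implementation:

```python
import numpy as np

# Sketch of an InfoNCE objective: z_t should be more similar to its own
# successor z_{t+1} than to other successors in the batch.
def info_nce_loss(z, z_next, temperature=0.1):
    # Normalize so similarity is cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_next = z_next / np.linalg.norm(z_next, axis=1, keepdims=True)
    logits = z @ z_next.T / temperature           # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal
```

Minimizing this loss drives the encoder $\varphi$ to retain exactly the features that disambiguate successors, which is the sense in which $z$ becomes “maximally predictive.”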

3. Abstract World Model Learning and Joint Optimization

AgentOWL alternates between model-based and model-free learning:

  • AWM Learning: The world model $T = \{p_o\}_o$ parameterizes transition dynamics via a product-of-experts (“PoE-World”) approach:

$$p_o(f' \mid s) = \prod_j p_o(f'_j \mid s), \qquad p_o(f'_j \mid s) \propto \prod_{i:\, \text{expert}_i \to \text{feature}_j} p_i(f'_j \mid s)^{\theta_i}$$

Programmatic experts $p_i$ (e.g., Python-like rules proposed by LLMs) provide inductive structure, and weights $\theta_i$ are fit by maximum a posteriori estimation with regularization (a “frame axiom” prior). A replay buffer $D$ accumulates transitions $(s, o, f', r^\gamma, \tau, I)$ for supervised and maximum-likelihood training.
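The weighted product in the PoE-World equation can be sketched for a single abstract feature as follows. This is a minimal, hypothetical illustration: each expert supplies a probability vector over the feature's values, and the product is taken in log space with the learned weights $\theta_i$ as exponents:

```python
import numpy as np

# Sketch of the product-of-experts combination for one abstract feature,
# following p_o(f'_j|s) ∝ prod_i p_i(f'_j|s)^theta_i from the text.
def poe_feature_dist(expert_probs, thetas, eps=1e-12):
    """expert_probs: list of (K,) probability arrays; thetas: weights."""
    log_p = sum(th * np.log(p + eps) for p, th in zip(expert_probs, thetas))
    log_p -= log_p.max()             # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum()               # renormalize the product of experts
```

Setting an expert's weight $\theta_i$ to zero removes its influence entirely, which is how MAP fitting with a sparsity-style prior can prune unhelpful programmatic rules.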

  • Option Policy Learning: Each option maintains both a real-world policy $\pi_o^{\text{real}}$ and an imagined policy $\pi_o^{\text{wm}}$ (trained purely in the AWM). The mixture policy

$$\pi_o(a \mid s) = (1 - \epsilon)\,\pi_o^{\text{real}}(a \mid s) + \epsilon\,\pi_o^{\text{wm}}(a \mid s)$$

is used during data collection, where $\epsilon$ anneals from $1$ to $0$ to reduce model bias as real data accumulate. The policy losses are DQN-style temporal-difference objectives. In model-based planning, task-specific reward functions can be imposed directly in the abstract space, and value iteration, DQN rollouts, or MCTS can be applied.
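The annealed mixture above can be sketched in a few lines. The function names and the linear schedule are illustrative assumptions; the papers do not specify the schedule's exact shape:

```python
import random

# Sketch of the annealed mixture policy used during data collection:
# with probability epsilon act from the world-model policy pi_wm,
# otherwise from the real-environment policy pi_real.
def mixed_action(state, pi_real, pi_wm, epsilon, rng=random):
    return pi_wm(state) if rng.random() < epsilon else pi_real(state)

def anneal_epsilon(step, total_steps):
    # Linear decay 1 -> 0 as real experience accumulates (assumed schedule).
    return max(0.0, 1.0 - step / total_steps)
```

Early in training ($\epsilon \approx 1$) the agent exploits imagined experience; late in training ($\epsilon \approx 0$) the real-data policy dominates, limiting compounding model bias.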

  • Subgoal Discovery and Skill Hierarchies: When no combination of existing options achieves a new goal, an LLM can hypothesize novel subgoals; new options $o_{h \to g}$ are instantiated and added to the hierarchy and world model (Piriyakulkij et al., 2 Feb 2026).
  • Stable Learning: To guarantee training stability, only episodes where all invoked sub-options are “stable” (i.e., sufficiently trained or successful) are permitted in the replay buffer for higher-level option updates.

4. Value-Preserving Planning and Theoretical Guarantees

The key theoretical property of AgentOWL is bounded value loss under model approximation errors. Theorem 4.2 (Rodriguez-Sanchez et al., 2024) states:

  • If the abstractions $(\varphi, \overline{P}, \overline{R})$ satisfy

$$\|P_O(\cdot \mid s, o) - \widetilde{P}(\cdot \mid \varphi(s), o)\|_1 \leq \varepsilon_T, \qquad |R_O(s, o) - \overline{R}(\varphi(s), o)| \leq \varepsilon_R$$

for all $(s, o)$, then for any policy $\pi$:

$$|Q^\pi(s, o) - \overline{Q}^\pi(z, o)| \leq \frac{\sqrt{\varepsilon_R} + \gamma V_{\max}\sqrt{\varepsilon_T}}{1 - \gamma}$$

Thus, optimal abstract planning yields a grounded policy value within $O((\sqrt{\varepsilon_R} + \sqrt{\varepsilon_T})/(1 - \gamma))$ of optimal, provided the errors are controlled.

The proof unrolls the Bellman recursion for SMDPs and relates one-step model errors to overall value loss, via total variation and value bounds over the planning horizon.
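The bound itself is a direct plug-in formula, which can be evaluated for concrete error levels. A trivial sketch (not part of any released code):

```python
# Sketch: evaluate the Theorem 4.2 value-loss bound for given model errors.
def value_loss_bound(eps_R, eps_T, gamma, v_max):
    """Bound on |Q^pi(s,o) - Qbar^pi(z,o)| from the inequality above."""
    return (eps_R**0.5 + gamma * v_max * eps_T**0.5) / (1.0 - gamma)
```

Note the $1/(1-\gamma)$ factor: for long effective horizons (high $\gamma$), even small per-step model errors translate into a loose value guarantee, which is why AgentOWL invests in driving $\varepsilon_T$ and $\varepsilon_R$ down via the abstraction objectives.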

5. Algorithmic Workflow and Modules

AgentOWL can be instantiated as the following modular workflow (Rodriguez-Sanchez et al., 2024, Piriyakulkij et al., 2 Feb 2026):

  1. Option Module: Supply or discover a small library $O$ of skills; options may be hand-designed or learned, and can include intra-option policy recursion.
  2. Abstraction Module: Learn an abstract mapping $\varphi: S \to Z$ via temporal predictive coding and mutual information, ensuring relevance for predicting option termination and effect.
  3. Abstract Model Module: Fit $\overline{P}(z' \mid z, o)$, $\overline{R}(z, o)$, $\overline{\tau}(z, o)$ in latent space via maximum likelihood and regression.
  4. Planner Module: Plan in the abstract SMDP $\overline{M}$ via RL or search to yield an abstract policy $\overline{\pi}$.
  5. Execution / Data Collection: Unroll the abstract policy in the ground MDP by invoking options, collect new data, refine all models, and (optionally) discover additional sub-options as needed based on coverage or subgoal heuristics.
  6. Stability Module: Filter updates to higher-level options using only “stable” lower-level option trajectories, mitigating non-stationarity in deeply nested hierarchies.
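The six modules above compose into a training loop. The following is a deliberately abstract sketch in which each module is a stand-in callable; all names are hypothetical:

```python
# Sketch of the modular workflow as a loop. Each argument is a callable
# standing in for one module from the list above; returns the replay buffer.
def agent_owl_loop(plan, rollout, stable, fit_abstraction, fit_model, n_iters):
    buffer = []
    for _ in range(n_iters):
        policy = plan()                 # 4. plan in the abstract SMDP
        episode = rollout(policy)       # 5. unroll options in the ground MDP
        if stable(episode):             # 6. keep only stable sub-option episodes
            buffer.extend(episode)
        fit_abstraction(buffer)         # 2. refine phi: S -> Z
        fit_model(buffer)               # 3. refit abstract model P, R, tau
    return buffer
```

The stability filter in step 6 is what keeps higher-level option updates from training on trajectories whose sub-options were still changing, mitigating non-stationarity in deep hierarchies.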

6. Empirical Results and Comparative Performance

Extensive evaluation demonstrates that AgentOWL is substantially more sample-efficient and skill-competent than baseline methods.

Table: Empirical Comparison Across Domains

| Domain | AgentOWL Success / Skills | Baseline (Best) | Noteworthy Observations |
|---|---|---|---|
| Pinball, AntMaze | ≥90% with 2–5× fewer steps | DDQN | Only abstract planning achieves high performance |
| OCAtari (Montezuma, etc.) | 5–6 skills in 5M frames | 1–2 skills | Model-free baselines plateau; AgentOWL is the only agent mastering hard goals |
| Private Eye (zero-shot) | 100% after extra navigation | ~0% | Option composition enables transfer |
  • Visualizations of the state abstraction ($\varphi(s)$ via MDS) reveal retention of only task-relevant continuous coordinates (e.g., $(x, y)$ positions), with mutual-information matrices confirming abstraction away from irrelevant features (e.g., joint angles) (Rodriguez-Sanchez et al., 2024).
  • Subgoal hypothesis and stabilization mechanisms are critical, as ablation experiments show substantial performance drops when either is omitted (Piriyakulkij et al., 2 Feb 2026).
  • In zero-shot generalization, AgentOWL composes options learned in prior goals to achieve new objectives with no additional training.
  • Buffer filtering prevents unstable sub-option episodes from entering higher-level policy updates, which stabilizes training of deep option hierarchies.

7. Distinctives and Significance in Hierarchical Reinforcement Learning

AgentOWL distinguishes itself via several principled advances:

  • World-model-guided exploration: Using a learned abstract model to direct data collection towards promising behaviors and away from random low-level exploration.
  • Temporally abstract planning: Collapsing long action sequences into brief abstract rollouts simplifies both search and learning.
  • LLM-aided subgoal acquisition: Automated option discovery via LLMs efficiently builds complex hierarchies without exhaustive search.
  • Theory-grounded value preservation: Explicit quantification of performance loss due to abstraction/model errors, providing provable efficiency and policy reliability.
  • Stable hierarchical learning: Filtering destabilizing episodes in deeply nested option hierarchies ensures robust policy acquisition.

These properties collectively enable AgentOWL to acquire deep and compositional skill hierarchies with an order-of-magnitude reduction in sample complexity relative to model-free or naïve hierarchical RL frameworks (Rodriguez-Sanchez et al., 2024, Piriyakulkij et al., 2 Feb 2026).
