Hierarchical Entity-Centric Offline GCRL

Updated 4 February 2026

The paper introduces a hierarchical, entity-centric framework that factorizes states into distinct entities to enable modular subgoal generation under sparse reward conditions.
It employs a two-level architecture combining a low-level goal-reaching policy with a high-level subgoal generator using diffusion or VAE models for temporally abstract planning.
Empirical evaluations demonstrate robust performance in robotic manipulation and navigation, achieving improved credit assignment and compositional generalization in complex multi-entity tasks.

A hierarchical entity-centric framework for offline goal-conditioned reinforcement learning (GCRL) targets long-horizon tasks in multi-entity domains where rewards are sparse and only a fixed offline dataset is available. By factorizing states over entities and adding a temporally abstracted hierarchy, such frameworks aim for more efficient credit assignment, improved reachability, and compositional generalization in robotic manipulation and navigation environments. The core ingredients are a modular combination of value-based RL with factored subgoal generation (often instantiated with diffusion models or other conditional generative models) and value-based selection and filtering mechanisms that guide high-level planning.

1. Formal Problem Setting and Entity Factorization

The overall setting considers an offline goal-conditioned Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, \gamma, p_0, p)$ augmented with a goal space $\mathcal{G} = \mathcal{S}$. The agent receives a sparse goal-conditioned reward, typically $r(s, g) = 0$ if $s = g$ and $-1$ otherwise. States are structured as factored tuples $s = \{s_m\}_{m=1}^M$, each $s_m$ encoding a specific object or the robot, and goals are similarly factored as $g = \{g_m\}_{m=1}^M$. The task is to learn a policy $\pi(a|s,g)$ maximizing expected cumulative reward, using only an offline trajectory dataset $\mathcal{D}$ collected under unknown policies.

This factored representation underlies entity-centric design, as each entity $e \in \mathcal{E}$ (object or agent) becomes a planning primitive. The state and goal factorization is critical for scalable abstraction and tractability in high-dimensional settings with many objects or agents (Haramati et al., 2 Feb 2026).
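
The factored state and sparse reward described above can be made concrete with a small sketch. It is only an illustration: the type aliases `Entity` and `State`, the function name `sparse_reward`, and the tolerance `tol` are assumptions introduced here (the text uses exact equality $s = g$), and the toy positions are made up.

```python
from typing import Tuple

Entity = Tuple[float, ...]   # features of one entity s_m, e.g. an (x, y) position
State = Tuple[Entity, ...]   # factored state s = {s_m}, one entry per entity


def sparse_reward(state: State, goal: State, tol: float = 1e-6) -> float:
    """Return 0.0 if every entity matches its goal entity, else -1.0.

    `tol` is an assumption: the text uses exact equality (s = g), which is
    rarely usable with continuous features.
    """
    if len(state) != len(goal):
        raise ValueError("state and goal must contain the same number of entities")
    for s_m, g_m in zip(state, goal):
        if len(s_m) != len(g_m):
            raise ValueError("entity feature dimensions differ")
        if any(abs(a - b) > tol for a, b in zip(s_m, g_m)):
            return -1.0
    return 0.0


# Toy example: a gripper and two cubes, each described by an (x, y) position.
state = ((0.0, 0.0), (1.0, 2.0), (3.0, 1.0))
goal = ((0.0, 0.0), (1.0, 2.0), (3.0, 4.0))   # only the second cube must move
print(sparse_reward(state, goal))  # -1.0, since one entity is not at its goal
print(sparse_reward(goal, goal))   # 0.0
```

Because the reward is $-1$ until every entity is in place, it carries no information about which entity is still misplaced, which is part of why the factored representation matters for planning.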

2. Hierarchical Architecture with Entity-Centric Abstraction

Hierarchical entity-centric frameworks decompose decision-making into at least two levels:

Low-Level GCRL Agent: A goal-conditioned RL policy $\pi^\ell(a|s,g)$ trained (offline) to reach short-horizon subgoals within the agent's competence radius. The associated value function $V^\ell(s,g)$ and Q-function $Q^\ell(s,a,g)$ are trained with algorithms such as Implicit Q-Learning (IQL), Conservative Q-Learning (CQL), or advantage-weighted regression (AWR) variants, with value networks leveraging per-entity transformer parameterization to reflect object interactions.
High-Level Subgoal Generator: An entity-factored stochastic generator that produces intermediate subgoals in the state (or entity-state) space. Advanced instantiations employ conditional diffusion models or conditional VAEs, mapping $(s,g)$ pairs to plausible subgoals $\tilde{g}$, with the factorization ensuring that each subgoal typically modifies only a sparse subset of entities, thus supporting compositionality and simplifying low-level policy adaptation (Haramati et al., 2 Feb 2026, Li et al., 2022).
Option-Aware Variant: As elaborated in Option-aware Temporally Abstracted (OTA) frameworks, the high-level policy $\pi^h$ selects temporally extended “options” $\omega \in \Omega_\mathcal{E}$, each corresponding to an entity-centric macro-action (e.g., “move block $e$ to $x$”), and the corresponding OTA Bellman operator integrates option durations to contract the planning horizon, reducing value estimation noise (2505.12737).
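
A back-of-the-envelope illustration of the horizon contraction mentioned for the option-aware variant follows. It is not the OTA Bellman operator itself, which the text does not spell out; the symbols $T$ (primitive steps to the goal), $n$ (a fixed option duration), and $H_{\text{high}}$ (number of high-level decisions) are assumptions introduced only for this example.

```latex
H_{\text{high}} = \left\lceil \frac{T}{n} \right\rceil,
\qquad \text{e.g. } T = 200,\; n = 10 \;\Longrightarrow\; H_{\text{high}} = 20 .
```

Under these assumptions, high-level value estimates are propagated over 20 abstract steps rather than 200 primitive ones.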

3. Subgoal Generation, Selection, and Value-Based Filtering

Subgoal generation and selection are central mechanisms for bridging long-horizon credit assignment:

Subgoal Proposals: Candidate subgoals are generated from the high-level diffuser (diffusion model) or VAE decoder, conditioned on current state and desired goal, often for a window of $K$ steps. Each candidate acts as a potential intermediate target for the low-level policy.
Value-Based Filtering: Not all candidate subgoals are achievable. The filtering process retains only those subgoals $\tilde{g}_i$ satisfying $V^\ell(s, \tilde{g}_i) > \rho$ (for a threshold $\rho$), ensuring reachability within the low-level policy’s competence radius. From the filtered set, the subgoal with maximal value to the final goal, i.e., $\tilde{g}^\star = \arg\max_{\tilde{g}} V^\ell(\tilde{g}, g)$, is selected (Haramati et al., 2 Feb 2026).
Advantage-Weighted Regression: In OTA and related hierarchical frameworks, advantage signals $A^h(s, \omega, g)$, computed via temporally abstracted value differences, guide subgoal selection and the training of $\pi^h$ through weighted regression or log-probability maximization, imparting robustness to value estimation errors at large horizons (2505.12737).
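
The filter-then-select rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `select_subgoal`, `value_fn` (standing in for $V^\ell$), and the toy one-dimensional value function are assumptions, and returning `None` when nothing passes the filter is a placeholder, since the text does not say what happens in that case.

```python
from typing import Callable, Optional, Sequence, TypeVar

G = TypeVar("G")  # state / subgoal type (hypothetical placeholder)


def select_subgoal(
    state: G,
    goal: G,
    candidates: Sequence[G],
    value_fn: Callable[[G, G], float],
    rho: float,
) -> Optional[G]:
    """Keep candidates with V(state, candidate) > rho, then return the one
    maximizing V(candidate, goal). Returns None if no candidate passes."""
    reachable = [c for c in candidates if value_fn(state, c) > rho]
    if not reachable:
        return None  # fallback behavior is left to the caller (not specified in the text)
    return max(reachable, key=lambda c: value_fn(c, goal))


def toy_value(a: float, b: float) -> float:
    """Toy stand-in for a learned value: negative distance on a line."""
    return -abs(a - b)


# State 0, goal 10, four proposed subgoals, reachability threshold rho = -5.
chosen = select_subgoal(0.0, 10.0, [2.0, 4.0, 7.0, -1.0], toy_value, rho=-5.0)
print(chosen)  # 4.0: candidate 7.0 is filtered out, and 4.0 is the closest to the goal among the rest
```

The two uses of the value function play different roles: the first argument pair checks that the subgoal is within reach of the current state, while the second ranks the surviving subgoals by progress toward the final goal.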

4. Training Paradigm and Modular Composition

Offline training proceeds in a modular fashion, with no parameter sharing between the low-level GCRL agent and the high-level subgoal generator:

Low-Level Policy: Trained for short-horizon goal-reaching via batch RL, e.g., IQL or CQL, on tuples $(s,a,s',g)$ , augmented through hindsight relabeling and, for robustness, out-of-distribution goal perturbation using latent variable models (such as a CVAE) (Li et al., 2022).
High-Level Generator: Trained on trajectory data to learn conditional subgoal distributions, e.g., minimizing denoising diffusion loss or VAE ELBO, with explicit entity-wise factorization.
Composition at Inference: At test time, the modules are composed by alternately generating and selecting subgoals through the high-level mechanism, then executing the low-level policy for a fixed number of environment steps toward the selected subgoal. This receding-horizon approach ensures robustness and modularity (Haramati et al., 2 Feb 2026).
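
The receding-horizon composition at inference can be sketched as a simple control loop. Everything here is a hypothetical interface chosen for illustration: `propose`, `select`, `low_level_policy`, `env_step`, and `reached` stand in for the trained modules and the environment, and falling back to the final goal when no subgoal is selected is an assumption, not something the text specifies.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

S = TypeVar("S")  # state / goal type
A = TypeVar("A")  # action type


def run_hierarchical_episode(
    state: S,
    goal: S,
    propose: Callable[[S, S], Sequence[S]],
    select: Callable[[S, S, Sequence[S]], Optional[S]],
    low_level_policy: Callable[[S, S], A],
    env_step: Callable[[S, A], S],
    reached: Callable[[S, S], bool],
    steps_per_subgoal: int,
    max_steps: int,
) -> Tuple[S, int]:
    """Alternate subgoal selection with a fixed number of low-level steps."""
    if steps_per_subgoal < 1:
        raise ValueError("steps_per_subgoal must be at least 1")
    steps = 0
    while steps < max_steps and not reached(state, goal):
        subgoal = select(state, goal, propose(state, goal))
        if subgoal is None:
            subgoal = goal  # assumed fallback when no candidate is selected
        for _ in range(steps_per_subgoal):
            if steps >= max_steps or reached(state, goal):
                break
            state = env_step(state, low_level_policy(state, subgoal))
            steps += 1
    return state, steps


# Toy 1-D world: integer positions, unit moves, subgoals at most 3 steps away.
def propose(s: int, g: int) -> Sequence[int]:
    return [min(s + k, g) for k in (1, 2, 3, 6)]


def select(s: int, g: int, cands: Sequence[int]) -> Optional[int]:
    near = [c for c in cands if abs(c - s) <= 3]
    return max(near, key=lambda c: -abs(g - c)) if near else None


def policy(s: int, subgoal: int) -> int:
    return (subgoal > s) - (subgoal < s)  # step of -1, 0, or +1 toward the subgoal


final_state, n_steps = run_hierarchical_episode(
    0, 10, propose, select, policy,
    env_step=lambda s, a: s + a,
    reached=lambda s, g: s == g,
    steps_per_subgoal=3,
    max_steps=50,
)
print(final_state, n_steps)  # 10 10
```

Re-planning after a fixed step budget, rather than waiting for the subgoal to be reached exactly, means a poor subgoal only costs a few steps before a new one is proposed.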

5. Theoretical Properties and Ablative Analysis

Entity-centric hierarchical frameworks offer several theoretical advantages:

Horizon Contraction: By decomposing the final goal into a sequence of reachable subgoals (each within $K$ steps), TD-errors and function approximation noise are reduced. Temporally abstracted value updates, as in OTA, further contract the planning horizon by a factor of the option duration, increasing monotonicity and sign correctness of the advantage signal (2505.12737).
Entity Sparsity and Decomposability: Factored subgoals typically alter only a small subset of the entities per intervention, reducing the variance of the low-level controller and enabling efficient learning and generalization. Empirically, HECRL’s diffuser modifies on average 1.36 of 3 cubes per subgoal in 3-cube environments, compared to near-complete scene changes in unfactored alternatives (Haramati et al., 2 Feb 2026).
Critic Lower Bounds and OOD Robustness: Conservative critics (CQL) and goal perturbation penalties are designed to keep the value function a lower bound on the true value away from the data support, which discourages selection of unreachable subgoals and improves safety in hierarchical planning (Li et al., 2022).
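
One plausible way to measure the entity sparsity discussed above is to count how many entities differ between the current state and a proposed subgoal. The sketch below is an assumption about how such a statistic could be computed, not the procedure used to obtain the reported 1.36 figure; the function names, the tolerance `tol`, and the toy cube positions are all hypothetical.

```python
from typing import Iterable, Tuple

Entity = Tuple[float, ...]
State = Tuple[Entity, ...]


def num_entities_changed(state: State, subgoal: State, tol: float = 1e-3) -> int:
    """Count entities whose features differ from the state by more than tol."""
    if len(state) != len(subgoal):
        raise ValueError("state and subgoal must contain the same number of entities")
    return sum(
        1
        for s_m, g_m in zip(state, subgoal)
        if any(abs(a - b) > tol for a, b in zip(s_m, g_m))
    )


def mean_entities_changed(pairs: Iterable[Tuple[State, State]], tol: float = 1e-3) -> float:
    """Average number of modified entities over (state, subgoal) pairs."""
    counts = [num_entities_changed(s, g, tol) for s, g in pairs]
    if not counts:
        raise ValueError("need at least one (state, subgoal) pair")
    return sum(counts) / len(counts)


# Toy 3-cube scene with two proposed subgoals.
state = ((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
subgoal_a = ((0.0, 0.0), (1.0, 0.5), (2.0, 0.0))  # moves one cube
subgoal_b = ((0.3, 0.0), (1.0, 0.0), (2.0, 0.4))  # moves two cubes
print(mean_entities_changed([(state, subgoal_a), (state, subgoal_b)]))  # 1.5
```

A value close to 1 indicates subgoals that move one object at a time, while a value close to the number of entities indicates near-complete scene changes.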

6. Experimental Evaluation and Empirical Performance

Comprehensive benchmarks across robotic manipulation and navigation tasks validate the hierarchical entity-centric approach:

Environment	EC-SGIQL (Haramati et al., 2 Feb 2026)	EC-IQL	HIQL	IQL
PPP-Cube (state)	82.5 ± 3.1	51.5	48.3	34.3
Stack-Cube (state)	43.5 ± 1.9	29.0	0.0	19.3
PPP-Cube (image)	64.3 ± 4.9	25.0	0.0	0.0
Scene (image)	61.5 ± 5.9	53.0	8.3	17.5

Relative Gains: On the hardest image-based task, PPP-Cube (image), EC-SGIQL reaches 64.3 versus 25.0 for EC-IQL, the best non-hierarchical baseline, a relative improvement of over 150%.
Order-Consistency and Monotonicity: OTA-based methods yield higher order-consistency measures ($r^c = 0.94$–$0.98$) compared to HIQL ($0.70$–$0.90$), indicating more stable value estimates (2505.12737).
Generalization: The hierarchical, modular approach supports meaningful zero-shot generalization to more entities (e.g., 4–6 cubes) and increasing task horizon, with performance remaining robust under hyperparameter changes (Haramati et al., 2 Feb 2026).
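
The relative gains can be recomputed directly from the table. The snippet below is just arithmetic on the reported means; the dictionary layout and the helper names are assumptions, and the comparison is against the best of the other three columns in each row.

```python
from typing import Dict, Tuple

# Mean scores copied from the table above (the ± terms are omitted).
results: Dict[str, Dict[str, float]] = {
    "PPP-Cube (state)": {"EC-SGIQL": 82.5, "EC-IQL": 51.5, "HIQL": 48.3, "IQL": 34.3},
    "Stack-Cube (state)": {"EC-SGIQL": 43.5, "EC-IQL": 29.0, "HIQL": 0.0, "IQL": 19.3},
    "PPP-Cube (image)": {"EC-SGIQL": 64.3, "EC-IQL": 25.0, "HIQL": 0.0, "IQL": 0.0},
    "Scene (image)": {"EC-SGIQL": 61.5, "EC-IQL": 53.0, "HIQL": 8.3, "IQL": 17.5},
}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, e.g. 0.5 means +50%."""
    if base <= 0:
        raise ValueError("relative gain is undefined for a non-positive baseline")
    return (new - base) / base


def gain_over_best_baseline(row: Dict[str, float], method: str = "EC-SGIQL") -> Tuple[str, float]:
    """Return (best other method, relative gain of `method` over it)."""
    baselines = {name: score for name, score in row.items() if name != method}
    best = max(baselines, key=baselines.get)
    return best, relative_gain(row[method], baselines[best])


gains = {env: gain_over_best_baseline(row) for env, row in results.items()}
for env, (best, gain) in gains.items():
    print(f"{env}: {100 * gain:+.1f}% over {best}")
```

For PPP-Cube (image) this evaluates to (64.3 − 25.0) / 25.0 ≈ 1.57, which matches the "over 150%" figure; the gains on the other rows are smaller.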

7. Extensions and Significance for Multi-Level Hierarchies

Entity-centric abstraction naturally extends to multi-level hierarchies:

Task-Level: Select which object or sub-environment to manipulate.
Entity-Level: Choose geometric or semantic subgoals for each object.
Action-Level: Execute primitive controls to realize subgoal completion.

At each level, temporally abstracted value learning (as in OTA) and factored subgoal generation contract the horizon and enhance scalability. These designs are directly compatible with graph neural network embeddings and are agnostic to the choice of base GCRL algorithm, allowing integration with IQL, CQL, AWR, and diffusion-based generative models (2505.12737, Haramati et al., 2 Feb 2026, Li et al., 2022). Together, these properties make the approach a basis for compositional, scalable, and robust policy synthesis in offline goal-conditioned RL across diverse multi-entity environments.

References (3)

Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion (2026)

Hierarchical Planning Through Goal-Conditioned Offline Reinforcement Learning (2022)

Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning (2025)
