HOWM: Homomorphic Object-Oriented Model
- HOWM is a compositional world model that employs a differentiable MDP homomorphism to capture object-oriented environments efficiently.
- It integrates Slot Attention, learned action binding, and equivariant GNNs to enable scalable and soft compositional generalization.
- Empirical results in object library environments demonstrate competitive generalization with significantly reduced computational resources compared to exact models.
A Homomorphic Object-oriented World Model (HOWM) is a world modeling approach for object-oriented environments that emphasizes compositional generalization through a differentiable approximation of Markov Decision Process (MDP) homomorphism. HOWM is motivated by the algebraic formalization of compositional generalization and is specifically designed to provide efficient, scalable modeling and prediction in settings where scenes are composed of variable subsets of objects drawn from a larger object library. By combining Slot Attention, learned action binding, and equivariant graph neural networks (GNNs) within an end-to-end architecture, HOWM enables "soft" compositional generalization that approaches equivariant performance at substantially reduced computational cost compared to exact implementations (Zhao et al., 2022).
1. Algebraic Foundations of Compositional Generalization
Compositional generalization in object-oriented environments is formalized using an algebraic framework. The environment is modeled as an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T)$.
State and action spaces, $\mathcal{S}$ and $\mathcal{A}$, are factorized over a library $\mathbb{O} = \{o_1, \dots, o_N\}$ of $N$ objects:

$$\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_N, \qquad \mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N.$$

At most $K$ objects are present in any scene, with $K \ll N$. Transition dynamics are governed by a transition kernel $T(s' \mid s, a)$ over the factorized state space.
A reduced "slot" MDP, $\overline{\mathcal{M}} = (\overline{\mathcal{S}}, \overline{\mathcal{A}}, \overline{T})$, is introduced to compactly represent only the $K$ present objects.
A homomorphism mapping

$$h = \big(\sigma, \{\alpha_s\}_{s \in \mathcal{S}}\big) : \mathcal{M} \to \overline{\mathcal{M}}$$

is defined with $\sigma : \mathcal{S} \to \overline{\mathcal{S}}$ as a state projection and $\alpha_s : \mathcal{A} \to \overline{\mathcal{A}}$ as a (state-dependent) action projection.
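As a concrete illustration, the state projection $\sigma$ can be sketched as selecting the factors of the present objects from the full $N$-object state. The following minimal numpy sketch is illustrative only; the `present` index set and the array encoding of states are assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of the state projection sigma: a full state over an N-object
# library, of which only K objects are present, is reduced to a K-slot
# state by selecting the present objects' factors. "present" is an
# illustrative encoding of which library objects appear in the scene.

def sigma(full_state, present):
    # full_state: (N, d) array of per-object state factors;
    # present: indices of the K objects present in this scene.
    return full_state[present]

N, K, d = 8, 3, 4
rng = np.random.default_rng(1)
full_state = rng.normal(size=(N, d))
present = np.array([1, 4, 6])           # K = 3 objects drawn from the library

slot_state = sigma(full_state, present)
print(slot_state.shape)                  # (3, 4): K slots instead of N factors
```

This makes the compression explicit: the slot MDP's state size scales with $K$, not with the library size $N$.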
2. Homomorphism and Equivariance in Dynamics
The core property of the homomorphism is that, for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and any $\bar{s}' \in \overline{\mathcal{S}}$,

$$\overline{T}\big(\bar{s}' \mid \sigma(s), \alpha_s(a)\big) = \sum_{s' \in \sigma^{-1}(\bar{s}')} T(s' \mid s, a).$$
This ensures that the reduced model's transitions aggregate (via summation) all full-model state transitions mapping to a given reduced state, preserving consistency of dynamics.
Equivalently, by interpreting object replacement as a permutation $g \in S_N$ acting on both states and actions, exact compositional generalization requires the transition function to be $S_N$-equivariant:

$$T(g \cdot s' \mid g \cdot s, g \cdot a) = T(s' \mid s, a) \quad \text{for all } g \in S_N.$$
This encodes exact invariance to object identity permutations, guaranteeing model predictions are unaffected by object ordering or labeling.
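The equivariance property can be checked numerically on a toy deterministic transition. The sketch below is not HOWM's model; it just shows that a transition built from shared per-object dynamics plus a symmetric (sum-pooled) interaction commutes with any permutation of object indices.

```python
import numpy as np

# Toy check of S_N-equivariance: shared per-object dynamics plus a
# permutation-invariant (sum-pooled) interaction term commute with any
# permutation of the object axis. All names and coefficients are
# illustrative, not from the paper.

def transition(states, actions):
    # states, actions: (N, d) arrays of per-object factors.
    interaction = states.sum(axis=0, keepdims=True) - states  # sum over others
    return states + 0.1 * actions + 0.01 * interaction

rng = np.random.default_rng(0)
N, d = 5, 3
s, a = rng.normal(size=(N, d)), rng.normal(size=(N, d))
perm = rng.permutation(N)

lhs = transition(s[perm], a[perm])   # permute, then transition
rhs = transition(s, a)[perm]         # transition, then permute
print(np.allclose(lhs, rhs))         # equivariance holds exactly here
```

Any architecture that breaks this symmetry (e.g. slot-index-specific weights) would fail this check and, correspondingly, fail to generalize across object relabelings.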
3. HOWM Architecture: Slot Attention, Action Binding, and Equivariant GNN
HOWM implements a soft homomorphism in a structured three-stage architecture, trained with an aligned contrastive loss:
- (a) Object Extraction:
Slot Attention (Locatello et al., 2020) encodes each image $o_t$ into $K$ object slots $\hat{s}_t = (\hat{s}_t^1, \dots, \hat{s}_t^K)$, along with a background slot.
- (b) Action Attention:
A learned binding matrix $B_t \in \mathbb{R}^{K \times N}$ matches object slots to the $N$ object-indexed action channels:

$$B_t = \operatorname{softmax}\!\big(q(\hat{s}_t)\, k(I_N)^\top\big),$$

where $I_N$ is the identity matrix supplying one-hot identifiers for the $N$ objects, and $q$, $k$ are learned projections. The action for each slot is then $\bar{a}_t = B_t a_t$.
- (c) Equivariant Transition Model:
An $S_K$-equivariant GNN, $\overline{T}_\theta$, predicts next-step slots:

$$\hat{s}_{t+1} = \hat{s}_t + \overline{T}_\theta(\hat{s}_t, \bar{a}_t).$$
- (d) Aligned Contrastive Loss:
Since slots are unordered, predicted and true slots are "lifted" into $N$-slot space using the pseudoinverse $B_t^{+}$:

$$\tilde{s}_{t+1} = B_t^{+} \hat{s}_{t+1}.$$

An aligned, contrastive-structured loss evaluates the prediction error between the lifted predicted slots and the lifted slots encoded from the next observation, with $\mathrm{sg}(\cdot)$ denoting stop-gradient to restrict influence to the attention module.
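The action-binding and lifting steps (stages b and d) can be sketched together in numpy. This is a minimal illustration, not the trained model: `W_q` and `W_k` stand in for the learned projections, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Sketch of action attention: queries come from the K slots, keys from
# one-hot object identifiers (rows of I_N), and the resulting K x N
# binding matrix routes object-indexed action channels to slots.
# W_q, W_k are illustrative stand-ins for the learned projections.

rng = np.random.default_rng(2)
N, K, d_slot, d_act, d_key = 6, 3, 8, 4, 8

slots = rng.normal(size=(K, d_slot))      # from Slot Attention
actions = rng.normal(size=(N, d_act))     # object-indexed action channels
W_q = rng.normal(size=(d_slot, d_key))
W_k = rng.normal(size=(N, d_key))         # acts as I_N @ W_k on one-hot IDs

B = softmax((slots @ W_q) @ W_k.T / np.sqrt(d_key), axis=1)  # (K, N) binding
slot_actions = B @ actions                                    # (K, d_act)

# The aligned loss lifts K-slot quantities back to N-slot space via the
# pseudoinverse of the binding matrix:
lifted = np.linalg.pinv(B) @ slot_actions                     # (N, d_act)
print(B.shape, slot_actions.shape, lifted.shape)
```

Each row of `B` is a distribution over the $N$ library identities, so a well-trained binding matrix approaches a (partial) permutation, which makes the pseudoinverse lifting approximately invert the projection.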
4. Object Library Environments and Evaluation Metrics
HOWM is evaluated in a family of object-oriented RL environments termed "Object Library," parameterized by the library size $N$ and the number of present objects $K$. Each episode samples $K$ objects from the library $\mathbb{O}$; an image $o_t$ is observed, and a factorized action $a_t$ controls exactly one object. Two main instances are:
- Basic Shapes:
Objects differ in color, shape, and size, sharing cardinal-direction actions.
- Rush Hour:
Objects are oriented cars; actions are relative to each car's heading.
Train and test sets are disjoint in scene composition (no object combination used in training reappears at test time), but every individual object is seen during training.
Performance is measured via multi-step Prediction as Ranking using Hits@1 (H@1), Mean Reciprocal Rank (MRR), and a generalization gap (train MRR minus test MRR).
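The ranking metrics can be computed by scoring a predicted latent against all candidate states in an evaluation batch by distance; the rank of the true next state yields Hits@1 and MRR. The sketch below uses illustrative random data, not the paper's evaluation code.

```python
import numpy as np

# Sketch of prediction-as-ranking evaluation: the true next state's rank
# among all candidates (by distance to the prediction) gives Hits@1 and
# the reciprocal rank averaged into MRR. Data here is illustrative.

def rank_of_truth(pred, candidates, true_idx):
    dists = np.linalg.norm(candidates - pred, axis=1)
    order = np.argsort(dists)                          # closest first
    return int(np.where(order == true_idx)[0][0]) + 1  # 1-based rank

rng = np.random.default_rng(3)
candidates = rng.normal(size=(10, 5))                  # evaluation batch
true_idx = 4
pred = candidates[true_idx] + 0.01 * rng.normal(size=5)  # near-perfect model

rank = rank_of_truth(pred, candidates, true_idx)
hits_at_1 = float(rank == 1)
mrr = 1.0 / rank
print(hits_at_1, mrr)
```

The generalization gap is then simply MRR on training compositions minus MRR on held-out compositions.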
5. Comparative Experimental Results
Empirical comparisons highlight the trade-off between generalization, complexity, and scalability:
| Model | Complexity | One-step Test MRR (Shapes) | Five-step Test MRR (Shapes) | GPU Memory |
|---|---|---|---|---|
| $N$-CSWM (exact CG) | $O(N^2)$ | ~100% | Near 100% | 8.1 GB |
| HOWM (soft CG) | $O(K^2)$ | 98.5–99.7% | 75.1–81.8% | 3.7 GB |
| No-CG baselines | Varies | Poor, with gaps growing as $N$ increases | Poor, large gap | Varies |
On "Rush Hour," HOWM maintains substantially better generalization across the three library sizes tested (five-step test MRR 84.3%, 63.2%, 65.3%; gap 11.2%, 31.1%, 31.3%) than $K$-slot-only or non-equivariant baselines. $N$-CSWM achieves near-perfect generalization but exhibits prohibitive resource usage for large $N$ (running out of memory at the largest library sizes), while HOWM supports larger object libraries efficiently by keeping complexity in $O(K^2)$.
6. Theoretical Analysis: Exact vs. Soft Homomorphic Generalization
A key theoretical result (Proposition: scaled equivariance error) establishes that if a full-MDP model has equivariance error $\epsilon$ under $S_N$, and the homomorphism $h$ gives a valid projection, then the induced slot-MDP model has a correspondingly scaled equivariance error. In the case of $\epsilon = 0$, perfect generalization is achieved in both representations.
A further corollary demonstrates that perfect $S_K$-equivariance in the $K$-slot model plus a valid $h$ implies perfect $S_N$-equivariance in the full MDP, supporting the lifting of exact compositional generalization.
There is a fundamental efficiency trade-off: exact $S_N$-equivariant models require $O(N^2)$ edges and $N$ slots; HOWM, using $K+1$ slots and $O(K^2)$ edges, is scalable for $N \gg K$ while only incurring "soft" compositional generalization (a nonzero but moderate generalization gap). A plausible implication is that learned attention-based binding in the latent space yields practical generalization at a fraction of the resource cost (Zhao et al., 2022).
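The edge-count arithmetic behind this trade-off is simple to spell out. Assuming a fully connected message-passing graph over the slots (a common GNN choice, used here for illustration):

```python
# Sketch of the efficiency trade-off: an exact S_N-equivariant model keeps
# N slots and a fully connected message-passing graph (O(N^2) edges),
# while HOWM keeps only K+1 slots (K objects + background), so its graph
# size is independent of the library size N.

def gnn_edges(num_slots):
    # Directed edges in a fully connected graph without self-loops.
    return num_slots * (num_slots - 1)

N, K = 30, 5  # illustrative library and scene sizes
print(gnn_edges(N))       # exact model: 870 edges, grows quadratically in N
print(gnn_edges(K + 1))   # HOWM: 30 edges, independent of N
```

Since memory and compute per transition scale with the edge count, growing the library from $N = 30$ to $N = 100$ roughly multiplies the exact model's cost by ten while leaving HOWM's cost unchanged.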
7. Significance and Outlook
HOWM provides an instantiation of a differentiable, approximate MDP homomorphism for object-oriented modeling in compositional RL settings. By combining Slot Attention, a learned action binding mechanism, and an $S_K$-equivariant GNN with an alignment-based contrastive loss, HOWM achieves strong compositional generalization on held-out object combinations while supporting scalability to object libraries far larger than are tractable for exact $S_N$-equivariant baselines. This suggests utility for future RL agents seeking sample-efficient transfer across combinatorially diverse object compositions (Zhao et al., 2022).