
HOWM: Homomorphic Object-Oriented Model

Updated 14 January 2026
  • HOWM is a compositional world model that employs a differentiable MDP homomorphism to capture object-oriented environments efficiently.
  • It integrates Slot Attention, learned action binding, and equivariant GNNs to enable scalable and soft compositional generalization.
  • Empirical results in object library environments demonstrate competitive generalization with significantly reduced computational resources compared to exact models.

A Homomorphic Object-oriented World Model (HOWM) is a world modeling approach for object-oriented environments that emphasizes compositional generalization through a differentiable approximation of Markov Decision Process (MDP) homomorphism. HOWM is motivated by the algebraic formalization of compositional generalization, and is specifically designed to provide efficient, scalable modeling and prediction in settings where scenes are composed of variable subsets of objects drawn from a larger object library. By combining Slot Attention, learned action binding, and equivariant graph neural networks (GNNs) within an end-to-end architecture, HOWM enables "soft" compositional generalization that approaches equivariant performance at substantially reduced computational cost compared to exact implementations (Zhao et al., 2022).

1. Algebraic Foundations of Compositional Generalization

Compositional generalization in object-oriented environments is formalized using an algebraic framework. The environment is modeled as an MDP:

\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \gamma)

State and action spaces, \mathcal{S} and \mathcal{A}, are factorized over a library of N objects:

\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_N,\quad \mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N

At most K objects are present in any scene, O \subseteq \{1, \ldots, N\} with |O| = K. Transition dynamics are governed by:

T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}_{\geq 0}

A reduced "slot" MDP, \overline{\mathcal{M}} = (\overline{\mathcal{S}}, \overline{\mathcal{A}}, \overline{T}, \overline{R}, \gamma), is introduced to compactly represent only the K present objects via

\overline{\mathcal{S}} = \mathcal{S}_{i_1} \times \cdots \times \mathcal{S}_{i_K},\quad \overline{\mathcal{A}} = \mathcal{A}_{i_1} \times \cdots \times \mathcal{A}_{i_K}

A homomorphism mapping

h = \left(\phi,\, \{\alpha_s \mid s \in \mathcal{S}\}\right) : \mathcal{M} \rightarrow \overline{\mathcal{M}}

is defined, with \phi : \mathcal{S} \rightarrow \overline{\mathcal{S}} a state projection and \alpha_s : \mathcal{A} \rightarrow \overline{\mathcal{A}} a (state-dependent) action projection.
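As a concrete illustration, the factorization and the state projection \phi can be sketched in a few lines (a minimal numpy sketch; the library size, feature dimension, and index set below are illustrative, not taken from the paper):

```python
import numpy as np

# Hypothetical sketch: a library of N objects, each with a 2-D state factor;
# a scene contains only the K objects indexed by `present` (O in the text).
N, K = 6, 3
rng = np.random.default_rng(0)

s = rng.normal(size=(N, 2))       # full factorized state in S_1 x ... x S_N
present = np.array([1, 3, 4])     # O subset of {0, ..., N-1}, |O| = K

def phi(s, present):
    """State projection onto the K-slot state space S_bar."""
    return s[present]             # (K, 2) slot state

s_bar = phi(s, present)
assert s_bar.shape == (K, 2)
```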

2. Homomorphism and Equivariance in Dynamics

The core property of the homomorphism is that, for all s, a, and any \overline{s}' \in \overline{\mathcal{S}},

\overline{T}(\overline{s}' \mid \phi(s), \alpha_s(a)) = \sum_{s'' \in \phi^{-1}(\overline{s}')} T(s'' \mid s, a)

This ensures that the reduced model's transitions aggregate (via summation) all full-model state transitions mapping to a given reduced state, preserving consistency of dynamics.

Equivalently, by interpreting object replacement as a permutation \sigma \in \Sigma_N acting on both states and actions, exact compositional generalization requires the transition function to be \Sigma_N-equivariant:

T(\sigma.s' \mid \sigma.s,\, \sigma.a) = T(s' \mid s, a), \quad \forall \sigma \in \Sigma_N

This encodes exact invariance to object identity permutations, guaranteeing model predictions are unaffected by object ordering or labeling.
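The equivariance condition can be checked numerically on a toy deterministic transition built from shared per-object weights plus a permutation-invariant aggregate (a hedged sketch; `transition` is a stand-in for illustration, not the paper's model):

```python
import numpy as np

# Toy Sigma_N-equivariance check: per-object update with shared parameters
# plus a mean-pooled interaction term is equivariant by construction.
def transition(s, a):
    # s, a: (N, d) factorized states and actions.
    return s + a + s.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
N, d = 5, 3
s, a = rng.normal(size=(N, d)), rng.normal(size=(N, d))

sigma = rng.permutation(N)             # a permutation sigma in Sigma_N
lhs = transition(s[sigma], a[sigma])   # T(sigma.s, sigma.a)
rhs = transition(s, a)[sigma]          # sigma.T(s, a)
assert np.allclose(lhs, rhs)           # equivariance holds
```

Permuting object identities before or after the transition gives the same result, which is exactly the invariance to object labeling described above.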

3. HOWM Architecture: Slot Attention, Action Binding, and Equivariant GNN

HOWM implements a soft homomorphism in a structured three-stage architecture:

  • (a) Object Extraction:

Slot Attention (Locatello et al.) encodes each image s_t into K object slots,

\bar{s}_t = (s_t^{(1)}, \ldots, s_t^{(K)})

along with a background slot.

  • (b) Action Attention:

A learned binding matrix M_t \in \mathbb{R}^{K \times N} matches object slots to object-indexed action channels:

[M_t]_{k,i} = \mathrm{softmax}_k\left( \frac{1}{\sqrt{D}}\, k(\mathrm{Id}_N)\, q(s_t^{(k)})^T \right)_i

where \mathrm{Id}_N is the identity over object indices and (k, q) are learned key and query projections. The action for each slot is then

\bar{a}_t = M_t\, a_t \in \mathbb{R}^K

  • (c) Equivariant Transition Model:

A \Sigma_K-equivariant GNN, T_\theta, predicts next-step slots:

\hat{\bar{s}}_{t+1} = T_\theta(\bar{s}_t, \bar{a}_t)

  • (d) Aligned Contrastive Loss:

Since slots are unordered, predicted and true slots are "lifted" into N-slot space using the pseudoinverse M_t^+:

s_t^\uparrow = M_t^+ \bar{s}_t,\quad s_{t+1}^\uparrow = M_{t+1}^+ \bar{s}_{t+1}

An aligned, contrastive-structured loss evaluates prediction error:

\mathcal{L}^+(s_t, s_{t+1}) = \left\|\, \mathtt{NG}(M_{t+1}^+)\, \bar{s}_{t+1} - \mathtt{NG}(M_t^+)\, T_\theta(\bar{s}_t, \bar{a}_t) \,\right\|^2

with \mathtt{NG}(\cdot) denoting stop-gradient, used to restrict the loss's influence to the attention module.
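The action-binding and lifting steps, (b) and (d), can be sketched with random weights standing in for trained ones (a minimal numpy sketch; all shapes, weight names, and the softmax layout are illustrative assumptions, not a verbatim implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, K, D = 8, 3, 4                    # library size, slot count, key/query dim

slots = rng.normal(size=(K, D))      # object slots s_t^(k) (from Slot Attention)
Wq = rng.normal(size=(D, D))         # hypothetical learned query projection q(.)
Wk = rng.normal(size=(N, D))         # hypothetical learned keys for k(Id_N)

# Binding matrix M_t: each slot attends over the N object-id action channels.
M = softmax(slots @ Wq @ Wk.T / np.sqrt(D), axis=1)   # (K, N), rows sum to 1

a = np.zeros(N); a[5] = 1.0          # factorized action targeting object 5
a_bar = M @ a                        # (K,) slot-bound action

# Lifting: map K-slot representations back to N-slot space with M^+.
s_up = np.linalg.pinv(M) @ slots     # (N, D)
assert a_bar.shape == (K,) and s_up.shape == (N, D)
```

In the full model these operations sit between Slot Attention and the equivariant GNN, with the stop-gradient applied to the pseudoinverse terms in the loss.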

4. Object Library Environments and Evaluation Metrics

HOWM is evaluated in a family of object-oriented RL environments termed "Object Library," parameterized by the library size N and the number of present objects K. Each episode samples K objects from the N-object library; an image s is observed, and a factorized action a controls exactly one object. Two main instances are:

  • Basic Shapes:

Objects differ in color, shape, and size, sharing cardinal-direction actions.

  • Rush Hour:

Objects are oriented cars; actions are relative to each car's heading.

Train and test sets are disjoint in scene composition (O_\mathrm{train} \cap O_\mathrm{test} = \emptyset), but every individual object is seen during training.
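A disjoint-composition split of this kind can be generated with a few lines of stdlib Python (illustrative sizes; in practice one also verifies that every object occurs in at least one training combination):

```python
import itertools
import random

# Enumerate all K-subsets of the N-object library, then split the
# *combinations* (not the objects) into disjoint train and test sets.
N, K = 6, 3
combos = list(itertools.combinations(range(N), K))   # 20 scene compositions
random.Random(0).shuffle(combos)

half = len(combos) // 2
train, test = set(combos[:half]), set(combos[half:])

assert train.isdisjoint(test)   # no scene composition appears in both splits
```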

Performance is measured via multi-step prediction-as-ranking, using Hits@1 (H@1), Mean Reciprocal Rank (MRR), and the generalization gap (train MRR minus test MRR).
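Under the common convention of ranking each true next-state latent against the other latents in the batch by distance to the prediction, these metrics can be computed as follows (a sketch following standard CSWM-style evaluation; not a verbatim implementation from the paper):

```python
import numpy as np

def ranking_metrics(pred, candidates, true_idx):
    """Hits@1 and MRR for prediction-as-ranking.

    pred:       (B, d) predicted next-state latents
    candidates: (B, d) pool of true latents (each row is one candidate)
    true_idx:   (B,)   index of the correct candidate for each prediction
    """
    dists = np.linalg.norm(pred[:, None] - candidates[None, :], axis=-1)  # (B, B)
    order = np.argsort(dists, axis=1)                  # candidates by closeness
    ranks = np.array([int(np.where(order[i] == true_idx[i])[0][0]) + 1
                      for i in range(len(pred))])      # 1-based rank of truth
    return (ranks == 1).mean(), (1.0 / ranks).mean()   # Hits@1, MRR

pred = np.array([[0.0, 0.0], [1.0, 1.0]])
cand = np.array([[0.1, 0.0], [1.0, 0.9]])
h1, mrr = ranking_metrics(pred, cand, true_idx=np.arange(2))
assert h1 == 1.0 and mrr == 1.0   # both truths ranked first in this toy batch
```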

5. Comparative Experimental Results

Empirical comparisons highlight the trade-off between generalization, complexity, and scalability:

| Model | Complexity | One-step Test MRR (Shapes, K=5) | Five-step Test MRR (Shapes, N=20) | GPU Mem (N=20) |
|---|---|---|---|---|
| \Sigma_N-CSWM (exact CG) | O(N^2) | ≈100% | Near 100% | 8.1 GB |
| HOWM (soft CG) | O(K^2) | 98.5–99.7% | 75.1–81.8% | 3.7 GB |
| No-CG baselines | Varies | Poor, with large gaps as N grows | Poor, large gap | Varies |

On "Rush Hour" (N = 5, 10, 20), HOWM maintains substantially better generalization (five-step test MRR ≈ 84.3%, 63.2%, 65.3%; gap ≈ 11.2%, 31.1%, 31.3%) than K-slot-only or non-equivariant baselines. \Sigma_N-CSWM achieves near-perfect generalization but exhibits prohibitive resource usage for large N (out of memory for N \gtrsim 20), while HOWM supports larger object libraries efficiently by keeping complexity in K.

6. Theoretical Analysis: Exact vs. Soft Homomorphic Generalization

A key theoretical result (Proposition: scaled equivariance error) establishes that if a full-MDP model \hat{T} has equivariance error \lambda_L under \Sigma_N, and the homomorphism h gives a valid projection, then the induced slot-MDP model \hat{\overline{T}} has scaled equivariance error \lambda_{[K]} = \binom{N}{K} \lambda_L. In the case \lambda_L = 0, perfect generalization is achieved in both representations.

A further corollary demonstrates that perfect \Sigma_K-equivariance in the K-slot model, plus a valid h, implies perfect \Sigma_N-equivariance in the full MDP, supporting the lifting of exact compositional generalization.

There is a fundamental efficiency trade-off: exact \Sigma_N-equivariant models require O(N^2) edges and N slots; HOWM, using K slots and O(K^2) edges, is scalable for N \gg K while incurring only "soft" compositional generalization (a nonzero but moderate generalization gap). A plausible implication is that learned attention-based binding in the latent space yields practical generalization at a fraction of the resource cost (Zhao et al., 2022).
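The trade-off is easy to quantify with back-of-envelope arithmetic (the error value \lambda_L below is hypothetical, chosen only to show the binomial scaling):

```python
import math

# Edge counts for the two model families and the binomial factor from the
# proposition relating full-model and slot-model equivariance error.
N, K = 20, 5
lambda_L = 1e-6                            # hypothetical full-model error

edges_exact = N * N                        # O(N^2) Sigma_N-equivariant model
edges_howm = K * K                         # O(K^2) slot model
lambda_slot = math.comb(N, K) * lambda_L   # lambda_[K] = binom(N, K) * lambda_L

assert edges_exact // edges_howm == 16     # 16x fewer edges at N=20, K=5
assert math.comb(20, 5) == 15504           # the error scaling factor
```

Even a tiny full-model error is amplified by \binom{N}{K}, which is why the \lambda_L = 0 case (exact equivariance) is the one that transfers perfectly, while the slot model buys its O(K^2) cost with a soft, nonzero gap.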

7. Significance and Outlook

HOWM provides an instantiation of a differentiable, approximate MDP homomorphism for object-oriented modeling in compositional RL settings. By combining Slot Attention, a learned action binding mechanism, and a \Sigma_K-equivariant GNN with an alignment-based contrastive loss, HOWM achieves strong compositional generalization on held-out object combinations while supporting scalability to object libraries far larger than are tractable for exact \Sigma_N-equivariant baselines. This suggests utility for future RL agents seeking sample-efficient transfer across combinatorially diverse object compositions (Zhao et al., 2022).
