HOWM: Homomorphic Object-Oriented Model
- HOWM is a compositional world model that employs a differentiable MDP homomorphism to capture object-oriented environments efficiently.
- It integrates Slot Attention, learned action binding, and equivariant GNNs to enable scalable and soft compositional generalization.
- Empirical results in object library environments demonstrate competitive generalization with significantly reduced computational resources compared to exact models.
A Homomorphic Object-oriented World Model (HOWM) is a world modeling approach for object-oriented environments that emphasizes compositional generalization through a differentiable approximation of Markov Decision Process (MDP) homomorphism. HOWM is motivated by the algebraic formalization of compositional generalization and is specifically designed to provide efficient, scalable modeling and prediction in settings where scenes are composed of variable subsets of objects drawn from a larger object library. By combining Slot Attention, learned action binding, and equivariant graph neural networks (GNNs) within an end-to-end architecture, HOWM enables "soft" compositional generalization that approaches equivariant performance at substantially reduced computational cost compared to exact implementations (Zhao et al., 2022).
1. Algebraic Foundations of Compositional Generalization
Compositional generalization in object-oriented environments is formalized using an algebraic framework. The environment is modeled as an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T)$.
State and action spaces, $\mathcal{S}$ and $\mathcal{A}$, are factorized over a library $\mathbb{O} = \{o_1, \dots, o_N\}$ of $N$ objects:

$$\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_N, \qquad \mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N.$$

At most $K$ objects are present in any scene, with $K \ll N$. Transition dynamics are governed by a transition kernel $T(s' \mid s, a)$ over the factorized state space.
A reduced "slot" MDP, $\overline{\mathcal{M}} = (\overline{\mathcal{S}}, \overline{\mathcal{A}}, \overline{T})$, is introduced to compactly represent only the $K$ present objects.
A homomorphism mapping

$$h = \big(\sigma, \{\alpha_s\}_{s \in \mathcal{S}}\big) : \mathcal{M} \to \overline{\mathcal{M}}$$

is defined with $\sigma : \mathcal{S} \to \overline{\mathcal{S}}$ as a state projection and $\alpha_s : \mathcal{A} \to \overline{\mathcal{A}}$ as a (state-dependent) action projection.
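As a concrete illustration, the state projection $\sigma$ can be sketched as selecting the factors of the present objects from the full $N$-object state. The following minimal numpy sketch is illustrative only; the `present` index set and the array encoding of states are assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of the state projection sigma: a full state over an N-object
# library, of which only K objects are present, is reduced to a K-slot
# state by selecting the present objects' factors. "present" is an
# illustrative encoding of which library objects appear in the scene.

def sigma(full_state, present):
    # full_state: (N, d) array of per-object state factors;
    # present: indices of the K objects present in this scene.
    return full_state[present]

N, K, d = 8, 3, 4
rng = np.random.default_rng(1)
full_state = rng.normal(size=(N, d))
present = np.array([1, 4, 6])           # K = 3 objects drawn from the library

slot_state = sigma(full_state, present)
print(slot_state.shape)                  # (3, 4): K slots instead of N factors
```

This makes the compression explicit: the slot MDP's state size scales with $K$, not with the library size $N$.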
2. Homomorphism and Equivariance in Dynamics
The core property of the homomorphism is that, for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and any $\bar{s}' \in \overline{\mathcal{S}}$,

$$\overline{T}\big(\bar{s}' \mid \sigma(s), \alpha_s(a)\big) = \sum_{s' \in \sigma^{-1}(\bar{s}')} T(s' \mid s, a).$$
This ensures that the reduced model's transitions aggregate (via summation) all full-model state transitions mapping to a given reduced state, preserving consistency of dynamics.
Equivalently, by interpreting object replacement as a permutation $g \in S_N$ acting on both states and actions, exact compositional generalization requires the transition function to be $S_N$-equivariant:

$$T(g \cdot s' \mid g \cdot s, g \cdot a) = T(s' \mid s, a) \quad \text{for all } g \in S_N.$$
This encodes exact invariance to object identity permutations, guaranteeing model predictions are unaffected by object ordering or labeling.
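The equivariance property can be checked numerically on a toy deterministic transition. The sketch below is not HOWM's model; it just shows that a transition built from shared per-object dynamics plus a symmetric (sum-pooled) interaction commutes with any permutation of object indices.

```python
import numpy as np

# Toy check of S_N-equivariance: shared per-object dynamics plus a
# permutation-invariant (sum-pooled) interaction term commute with any
# permutation of the object axis. All names and coefficients are
# illustrative, not from the paper.

def transition(states, actions):
    # states, actions: (N, d) arrays of per-object factors.
    interaction = states.sum(axis=0, keepdims=True) - states  # sum over others
    return states + 0.1 * actions + 0.01 * interaction

rng = np.random.default_rng(0)
N, d = 5, 3
s, a = rng.normal(size=(N, d)), rng.normal(size=(N, d))
perm = rng.permutation(N)

lhs = transition(s[perm], a[perm])   # permute, then transition
rhs = transition(s, a)[perm]         # transition, then permute
print(np.allclose(lhs, rhs))         # equivariance holds exactly here
```

Any architecture that breaks this symmetry (e.g. slot-index-specific weights) would fail this check and, correspondingly, fail to generalize across object relabelings.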
3. HOWM Architecture: Slot Attention, Action Binding, and Equivariant GNN
HOWM implements a soft homomorphism in a structured three-stage architecture, trained with an aligned contrastive loss:
- (a) Object Extraction:
Slot Attention (Locatello et al., 2020) encodes each image $o_t$ into $K$ object slots $\hat{s}_t = (\hat{s}_t^1, \dots, \hat{s}_t^K)$, along with a background slot.
- (b) Action Attention:
A learned binding matrix $B_t \in \mathbb{R}^{K \times N}$ matches object slots to the $N$ object-indexed action channels:

$$B_t = \operatorname{softmax}\!\big(q(\hat{s}_t)\, k(I_N)^\top\big),$$

where $I_N$ is the identity matrix supplying one-hot identifiers for the $N$ objects, and $q$, $k$ are learned projections. The action for each slot is then $\bar{a}_t = B_t a_t$.
- (c) Equivariant Transition Model:
An $S_K$-equivariant GNN, $\overline{T}_\theta$, predicts next-step slots:

$$\hat{s}_{t+1} = \hat{s}_t + \overline{T}_\theta(\hat{s}_t, \bar{a}_t).$$
- (d) Aligned Contrastive Loss:
Since slots are unordered, predicted and true slots are "lifted" into $N$-slot space using the pseudoinverse $B_t^{+}$:

$$\tilde{s}_{t+1} = B_t^{+} \hat{s}_{t+1}.$$

An aligned, contrastive-structured loss evaluates the prediction error between the lifted predicted slots and the lifted slots encoded from the next observation, with $\mathrm{sg}(\cdot)$ denoting stop-gradient to restrict influence to the attention module.
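The action-binding and lifting steps (stages b and d) can be sketched together in numpy. This is a minimal illustration, not the trained model: `W_q` and `W_k` stand in for the learned projections, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Sketch of action attention: queries come from the K slots, keys from
# one-hot object identifiers (rows of I_N), and the resulting K x N
# binding matrix routes object-indexed action channels to slots.
# W_q, W_k are illustrative stand-ins for the learned projections.

rng = np.random.default_rng(2)
N, K, d_slot, d_act, d_key = 6, 3, 8, 4, 8

slots = rng.normal(size=(K, d_slot))      # from Slot Attention
actions = rng.normal(size=(N, d_act))     # object-indexed action channels
W_q = rng.normal(size=(d_slot, d_key))
W_k = rng.normal(size=(N, d_key))         # acts as I_N @ W_k on one-hot IDs

B = softmax((slots @ W_q) @ W_k.T / np.sqrt(d_key), axis=1)  # (K, N) binding
slot_actions = B @ actions                                    # (K, d_act)

# The aligned loss lifts K-slot quantities back to N-slot space via the
# pseudoinverse of the binding matrix:
lifted = np.linalg.pinv(B) @ slot_actions                     # (N, d_act)
print(B.shape, slot_actions.shape, lifted.shape)
```

Each row of `B` is a distribution over the $N$ library identities, so a well-trained binding matrix approaches a (partial) permutation, which makes the pseudoinverse lifting approximately invert the projection.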
4. Object Library Environments and Evaluation Metrics
HOWM is evaluated in a family of object-oriented RL environments termed "Object Library," parameterized by the library size $N$ and the number of present objects $K$. Each episode samples $K$ objects from the library $\mathbb{O}$; an image $o_t$ is observed, and a factorized action $a_t$ controls exactly one object. Two main instances are:
- Basic Shapes:
Objects differ in color, shape, and size, sharing cardinal-direction actions.
- Rush Hour:
Objects are oriented cars; actions are relative to each car's heading.
Train and test sets are disjoint in scene composition (no object combination used in training reappears at test time), but every individual object is seen during training.
Performance is measured via multi-step Prediction as Ranking using Hits@1 (H@1), Mean Reciprocal Rank (MRR), and a generalization gap (train MRR minus test MRR).
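The ranking metrics can be computed by scoring a predicted latent against all candidate states in an evaluation batch by distance; the rank of the true next state yields Hits@1 and MRR. The sketch below uses illustrative random data, not the paper's evaluation code.

```python
import numpy as np

# Sketch of prediction-as-ranking evaluation: the true next state's rank
# among all candidates (by distance to the prediction) gives Hits@1 and
# the reciprocal rank averaged into MRR. Data here is illustrative.

def rank_of_truth(pred, candidates, true_idx):
    dists = np.linalg.norm(candidates - pred, axis=1)
    order = np.argsort(dists)                          # closest first
    return int(np.where(order == true_idx)[0][0]) + 1  # 1-based rank

rng = np.random.default_rng(3)
candidates = rng.normal(size=(10, 5))                  # evaluation batch
true_idx = 4
pred = candidates[true_idx] + 0.01 * rng.normal(size=5)  # near-perfect model

rank = rank_of_truth(pred, candidates, true_idx)
hits_at_1 = float(rank == 1)
mrr = 1.0 / rank
print(hits_at_1, mrr)
```

The generalization gap is then simply MRR on training compositions minus MRR on held-out compositions.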
5. Comparative Experimental Results
Empirical comparisons highlight the trade-off between generalization, complexity, and scalability:
| Model | Complexity | One-step Test MRR (Shapes) | Five-step Test MRR (Shapes) | GPU Memory |
|---|---|---|---|---|
| $N$-CSWM (exact CG) | $O(N^2)$ | ~100% | Near 100% | 8.1 GB |
| HOWM (soft CG) | $O(K^2)$ | 98.5–99.7% | 75.1–81.8% | 3.7 GB |
| No-CG baselines | Varies | Poor, with gaps growing as $N$ increases | Poor, large gap | Varies |
On "Rush Hour," HOWM maintains substantially better generalization across the three library sizes tested (five-step test MRR 84.3%, 63.2%, 65.3%; gap 11.2%, 31.1%, 31.3%) than $K$-slot-only or non-equivariant baselines. $N$-CSWM achieves near-perfect generalization but exhibits prohibitive resource usage for large $N$ (running out of memory at the largest library sizes), while HOWM supports larger object libraries efficiently by keeping complexity in $O(K^2)$.
6. Theoretical Analysis: Exact vs. Soft Homomorphic Generalization
A key theoretical result (Proposition: scaled equivariance error) establishes that if a full-MDP model has equivariance error $\epsilon$ under $S_N$, and the homomorphism $h$ gives a valid projection, then the induced slot-MDP model has a correspondingly scaled equivariance error. In the case of $\epsilon = 0$, perfect generalization is achieved in both representations.
A further corollary demonstrates that perfect $S_K$-equivariance in the $K$-slot model plus a valid $h$ implies perfect $S_N$-equivariance in the full MDP, supporting the lifting of exact compositional generalization.
There is a fundamental efficiency trade-off: exact $S_N$-equivariant models require $O(N^2)$ edges and $N$ slots; HOWM, using $K+1$ slots and $O(K^2)$ edges, is scalable for $N \gg K$ while only incurring "soft" compositional generalization (a nonzero but moderate generalization gap). A plausible implication is that learned attention-based binding in the latent space yields practical generalization at a fraction of the resource cost (Zhao et al., 2022).
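The edge-count arithmetic behind this trade-off is simple to spell out. Assuming a fully connected message-passing graph over the slots (a common GNN choice, used here for illustration):

```python
# Sketch of the efficiency trade-off: an exact S_N-equivariant model keeps
# N slots and a fully connected message-passing graph (O(N^2) edges),
# while HOWM keeps only K+1 slots (K objects + background), so its graph
# size is independent of the library size N.

def gnn_edges(num_slots):
    # Directed edges in a fully connected graph without self-loops.
    return num_slots * (num_slots - 1)

N, K = 30, 5  # illustrative library and scene sizes
print(gnn_edges(N))       # exact model: 870 edges, grows quadratically in N
print(gnn_edges(K + 1))   # HOWM: 30 edges, independent of N
```

Since memory and compute per transition scale with the edge count, growing the library from $N = 30$ to $N = 100$ roughly multiplies the exact model's cost by ten while leaving HOWM's cost unchanged.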
7. Significance and Outlook
HOWM provides an instantiation of a differentiable, approximate MDP homomorphism for object-oriented modeling in compositional RL settings. By combining Slot Attention, a learned action binding mechanism, and an $S_K$-equivariant GNN with an alignment-based contrastive loss, HOWM achieves strong compositional generalization on held-out object combinations while supporting scalability to object libraries far larger than are tractable for exact $S_N$-equivariant baselines. This suggests utility for future RL agents seeking sample-efficient transfer across combinatorially diverse object compositions (Zhao et al., 2022).