Group-Structured World Models
- Group-structured world models are representation learning frameworks that embed symmetry group actions in latent spaces to ensure equivariance, disentanglement, and interpretability.
- They combine encoders, group representation modules, and decoders, trained with reconstruction, equivariance, and contrastive objectives, to model state transitions accurately.
- Applications span reinforcement learning, object-centric modeling, and fast adaptation in simulators, demonstrating improved predictive fidelity and efficient exploration.
Group-structured world models are a class of representation learning frameworks in which the dynamics of an agent’s internal model are explicitly structured according to the action of a symmetry group. These models encode world states into latent spaces where group actions (such as translations, rotations, or compositional transformations) govern state transitions, promoting invariance, equivariance, and disentanglement in downstream predictive tasks. The fundamental premise is that many physical, geometric, and compositional regularities in observed environments can be formalized as group actions, and embedding this structure directly into world models yields both improved predictive fidelity and interpretability.
1. Theoretical Foundations: Groups and Their Latent Actions
At the core of group-structured world models is the assumption that the environment is endowed with a group $G$ of symmetry transformations acting on its state or observation space $X$ via $(g, x) \mapsto g \cdot x$ for $g \in G$ and $x \in X$. The modeling goal is to learn a latent encoding $h: X \to Z$ (typically $Z \subseteq \mathbb{R}^n$) and a group representation $\rho: G \to \mathrm{GL}(Z)$ such that equivariance is satisfied: $h(g \cdot x) = \rho(g)\, h(x)$. This structure allows the effect of physical actions or abstract interventions to be modeled as linear, often low-dimensional, transformations in the latent space, with the structure of $G$ (e.g., composition, inversion) mirrored algebraically by $\rho$. For compositional symmetries, such as $G = G_1 \times \cdots \times G_K$, disentangled representations decompose the latent space as $Z = Z_1 \times \cdots \times Z_K$, with each group factor $G_i$ acting non-trivially only on $Z_i$ (Quessard et al., 2020).
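To make the equivariance and disentanglement conditions concrete, the following numpy sketch hand-constructs an encoder and a block-diagonal representation for a toy $\mathbb{Z}_N \times \mathbb{Z}_M$ world and checks $h(g \cdot x) = \rho(g)\, h(x)$ numerically; all names, sizes, and the analytic encoder are illustrative assumptions, not code from the cited papers.

```python
# Minimal numerical check of latent equivariance for a compositional group
# G = Z_N x Z_M acting on a toy state (ring position, color index).
# All names and sizes here are illustrative, not taken from any cited work.
import numpy as np

N, M = 8, 5  # sizes of the two cyclic factors

def encode(pos, col):
    """Hand-designed encoder: each factor maps to a point on a circle,
    so the latent is z = (z_pos, z_col) in R^2 x R^2."""
    a, b = 2 * np.pi * pos / N, 2 * np.pi * col / M
    return np.array([np.cos(a), np.sin(a), np.cos(b), np.sin(b)])

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def rho(k, m):
    """Group representation: block-diagonal rotations, one block per factor,
    so each factor acts non-trivially only on its own latent subspace."""
    R = np.zeros((4, 4))
    R[:2, :2] = rot(2 * np.pi * k / N)
    R[2:, 2:] = rot(2 * np.pi * m / M)
    return R

# Check h(g . x) == rho(g) h(x) for every state and group element.
err = max(
    np.abs(encode((p + k) % N, (c + m) % M) - rho(k, m) @ encode(p, c)).max()
    for p in range(N) for c in range(M) for k in range(N) for m in range(M)
)
print(f"max equivariance error: {err:.2e}")  # ~1e-15: exact up to float error
```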
For continuous control and smooth latent evolution, Lie groups and their associated Lie algebras are used: the latent dynamics are then parameterized by exponentials of block-diagonal algebra elements, yielding interpretable scaling and rotation blocks (e.g., commuting $2 \times 2$ rotation blocks for abelian actions) (Hayashi et al., 13 Mar 2025). This machinery provides compositionality (sequential actions correspond to matrix multiplication), ensures invertibility, and allows the learning of continuous, compositional controls.
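A minimal sketch of this parameterization, assuming one $SO(2)$ rotation block and one scaling coordinate in the generator (the block layout is an illustrative choice rather than the construction of any specific cited model):

```python
# Sketch of action-conditioned latent transitions parameterized in a Lie
# algebra: each action maps to a block-diagonal generator A, the group
# element is expm(A), and sequential actions compose by matrix product.
# The block layout and numeric values are illustrative assumptions.
import numpy as np
from scipy.linalg import expm

def generator(theta, s):
    """Block-diagonal algebra element: one 2x2 rotation block (angle theta)
    and one 1x1 scaling block (log-scale s)."""
    A = np.zeros((3, 3))
    A[0, 1], A[1, 0] = -theta, theta   # so(2) rotation generator
    A[2, 2] = s                        # scaling generator
    return A

A1, A2 = generator(0.3, 0.1), generator(0.5, -0.2)

# Compositionality: these generators commute, so
# expm(A1) @ expm(A2) == expm(A1 + A2) (abelian case).
print("composition error:", np.abs(expm(A1) @ expm(A2) - expm(A1 + A2)).max())

# Invertibility: undoing an action is exponentiating the negated generator.
print("inverse error:", np.abs(expm(A1) @ expm(-A1) - np.eye(3)).max())

# Applying the action to a latent state: rotate the first block, rescale
# the last coordinate.
z = np.array([1.0, 0.0, 2.0])
print("z_next:", expm(A1) @ z)
```

Because the two blocks commute here, sequential actions add in the algebra and multiply in the group, which is exactly the compositionality property described above.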
2. Model Architectures: Encoders, Latent Transitions, and Decoders
A general group-structured world model is constructed using the following workflow:
- Encoder: Maps observations (images, states, sequences) into unit-norm latent vectors or partitioned object-centric slots (Quessard et al., 2020, Hayashi et al., 13 Mar 2025, Kipf et al., 2019).
- Action/Group Representation Module: For each action $a$, a group generator is parameterized (via a lookup table, a small MLP, or a learnable exponential map applied to an efference copy signal), producing a group matrix $\rho(a)$ or a block-exponential matrix $\exp(A(a))$ (Keurti et al., 2022, Hayashi et al., 13 Mar 2025).
- Latent Transition: The next latent state is computed via the group action, $z_{t+1} = \rho(a_t)\, z_t$ (linear), or in object-centric approaches, via message-passing GNNs for inter-object relations (Kipf et al., 2019). In abstract MDPs, $z_{t+1} = z_t \oplus \delta(a_t)$, with $\oplus$ respecting modular arithmetic for cyclic coordinates and ordinary Euclidean increments otherwise (Delliaux et al., 2 Jun 2025).
- Decoder (if present): Reconstructs the next observation, $\hat{x}_{t+1} = d(z_{t+1})$, or uses predictive contrastive losses in latent space if no pixel reconstruction is performed.
This general structure supports both pixel-level models and fully abstract models that operate only in low-dimensional latent manifolds, and allows seamless integration of prior geometric knowledge by appropriate choice of group and latent space topology (Delliaux et al., 2 Jun 2025).
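The workflow above can be condensed into a schematic module; the layer sizes, the per-action generator table, and the unit-norm projection in the following PyTorch sketch are illustrative assumptions, not the architecture of any single cited model.

```python
# Schematic encoder / group-action / decoder world model. All module
# names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class GroupStructuredWorldModel(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=4, num_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # One learnable Lie-algebra generator per discrete action; the group
        # matrix is obtained by the matrix exponential, so it is invertible.
        self.generators = nn.Parameter(
            0.01 * torch.randn(num_actions, latent_dim, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

    def encode(self, obs):
        z = self.encoder(obs)
        return z / z.norm(dim=-1, keepdim=True)   # unit-norm latents

    def transition(self, z, action):
        rho = torch.matrix_exp(self.generators[action])  # (B, d, d)
        return torch.einsum('bij,bj->bi', rho, z)        # z' = rho(a) z

    def forward(self, obs, action):
        z = self.encode(obs)
        z_next = self.transition(z, action)
        return z_next, self.decoder(z_next)

# Single prediction step on random data:
model = GroupStructuredWorldModel()
obs = torch.randn(16, 64)
actions = torch.randint(0, 4, (16,))
z_next, obs_next_pred = model(obs, actions)
```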
3. Objective Functions: Equivariance, Disentanglement, and Predictive Losses
Training objective functions explicitly encode the requirement of equivariance and, where desirable, encourage disentanglement. Key losses include:
- Reconstruction Loss: Enforces reconstruction accuracy over one or multiple steps; for example, $\mathcal{L}_{\mathrm{rec}} = \sum_{k} \big\| d\big(\rho(a_{t+k-1}) \cdots \rho(a_t)\, h(x_t)\big) - x_{t+k} \big\|^2$ (Quessard et al., 2020).
- Equivariance Losses: Penalize deviations from latent equivariance, e.g., $\| \rho(a_t)\, h(x_t) - h(x_{t+1}) \|^2$, and higher-order (two-step) compositions to enforce the homomorphism property, $\| \rho(a_{t+1})\, \rho(a_t)\, h(x_t) - h(x_{t+2}) \|^2$ (Keurti et al., 2022).
- Disentanglement Regularizers: Encourage each group generator to act primarily on a minimal latent subspace, e.g., penalizing rotations outside the dominant plane, $\sum_{(i,j) \neq (i^\ast, j^\ast)} |\theta_{ij}|$, or inducing block-diagonal structure via sparsity (Quessard et al., 2020, Keurti et al., 2022).
- Contrastive Losses: For settings without pixel reconstruction, InfoNCE or hinge losses are used in latent space to align predicted and true latents and repel negative samples, e.g., $d\big(\rho(a_t)\, z_t,\, z_{t+1}\big) + \max\big(0,\ \gamma - d(\tilde{z},\, z_{t+1})\big)$ with negative samples $\tilde{z}$, as in (Kipf et al., 2019, Delliaux et al., 2 Jun 2025).
- Auxiliary Losses: Additional regularizers constrain the latent transition magnitude, reward prediction error, or enforce group compositionality and inversion for learned Lie actions (Hayashi et al., 13 Mar 2025, Delliaux et al., 2 Jun 2025).
The overall losses allow end-to-end training of encoder, group action representation, and decoder components, while inducing geometric and algebraic structure in the latent predictive process.
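The listed terms admit compact generic implementations; the PyTorch sketches below use assumed forms, weightings, and margins rather than the exact objectives of the cited papers.

```python
# Hedged sketches of the loss terms listed above, written as standalone
# functions. Margins, weightings, and negative-sampling schemes are
# assumptions for illustration.
import torch
import torch.nn.functional as F

def reconstruction_loss(x_next_pred, x_next):
    return F.mse_loss(x_next_pred, x_next)

def equivariance_loss(z_next_pred, z_next_enc):
    # || rho(a) h(x_t) - h(x_{t+1}) ||^2 : predicted vs. re-encoded latent.
    return ((z_next_pred - z_next_enc) ** 2).sum(-1).mean()

def composition_loss(rho_a, rho_b, rho_ab):
    # Two-step homomorphism penalty: rho(b) rho(a) should match rho(b o a).
    return ((rho_b @ rho_a - rho_ab) ** 2).mean()

def disentanglement_loss(rho, block_mask):
    # Penalize action influence outside each generator's designated block;
    # block_mask is 1 on allowed entries, 0 elsewhere (an assumed scheme).
    return ((rho * (1 - block_mask)) ** 2).mean()

def hinge_contrastive_loss(z_next_pred, z_next_enc, z_negative, margin=1.0):
    # Pull the predicted latent toward the true next latent, push it away
    # from negatives sampled from other transitions.
    pos = ((z_next_pred - z_next_enc) ** 2).sum(-1)
    neg = ((z_next_pred - z_negative) ** 2).sum(-1)
    return (pos + F.relu(margin - neg)).mean()
```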
4. Exemplars and Empirical Results: Recovering Symmetry and Generalization
Experimental evaluation has consistently demonstrated the utility of group-structured world models in environments with explicit and implicit symmetry:
- Flatland Torus (Quessard et al., 2020): Pixel or one-hot states on a toroidal grid with cyclic translation symmetry. The model recovers a latent torus structure, achieving long-horizon predictive accuracy not matched by β-VAE or unstructured models.
- 3D Object and Color Cycles (Quessard et al., 2020): In scenarios combining $SO(3)$ rotations with a cyclic color group, the representation cleanly decomposes into spatial and color subspaces.
- Curiosity-Driven Exploration (Sergeant-Perthuis et al., 2023): Structuring the internal state space with either Euclidean ($SE(3)$) or projective ($PGL$) group actions fundamentally changes epistemic value gradients. Only projective-structured models induce approach behaviors under a curiosity objective, due to local contraction of uncertainty volume, confirming the computational effect of the choice of geometric group.
- Object-Centric Models (Kipf et al., 2019, Hayashi et al., 13 Mar 2025): In C-SWMs and WLA, object slots and compositional Lie actions enable modular dynamics, smooth interpolation, and compositionality—e.g., summing group elements to combine action effects. Multi-environment training with shared group structure enables rapid adaptation to novel action sets.
- Abstract MDPs and RL (Delliaux et al., 2 Jun 2025): Incorporating known cyclic or compact symmetries into the latent space (e.g., ring or torus coordinates for translations/rotations) enables drastic improvements (Hits@1 exceeding 85% compared to 15% for unstructured models) and faster RL convergence in model-based DQN, even in very low-sample regimes.
These outcomes demonstrate the empirical gains in predictive accuracy, sample efficiency, interpretability, and generalization when world models are structured by group actions.
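For reference, the Hits@1 metric quoted above is typically computed by ranking all encoded next states in an evaluation batch against each prediction; a generic implementation of this protocol (not the evaluation code of any cited paper) is sketched below.

```python
# Generic Hits@1 computation for latent-space prediction: a predicted
# next-state latent scores a hit when its nearest neighbor among all
# encoded next states in the batch is the matching ground-truth latent.
import numpy as np

def hits_at_1(z_pred, z_true):
    """z_pred, z_true: (B, d) arrays of predicted and encoded next latents,
    aligned so that row i of z_pred corresponds to row i of z_true."""
    # Pairwise squared Euclidean distances between predictions and targets.
    d2 = ((z_pred[:, None, :] - z_true[None, :, :]) ** 2).sum(-1)  # (B, B)
    return float((d2.argmin(axis=1) == np.arange(len(z_pred))).mean())

# Example: near-perfect predictions give Hits@1 close to 1.0.
rng = np.random.default_rng(0)
z_true = rng.normal(size=(128, 4))
z_pred = z_true + 0.01 * rng.normal(size=(128, 4))
print(hits_at_1(z_pred, z_true))  # -> 1.0
```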
5. Methods for Learning Group Structure: Parametric and Unsupervised Approaches
Several principal methods are used to learn group-structured world models without explicit supervision of the group :
- Parametric Group Representations (Quessard et al., 2020, Keurti et al., 2022): Learnable MLPs map control or efference signals to Lie algebra elements, from which group matrices are generated via exponentiation (the matrix exponential). Homomorphism is enforced by multi-step prediction and commutativity constraints in latent space.
- Contrastive and Predictive Schemes (Kipf et al., 2019, Delliaux et al., 2 Jun 2025): Models trained solely with latent contrastive or predictive losses allow group structure to emerge via pressure to minimize transition uncertainty and maximize consistency across observed sequences.
- Object-Centric Slot Disentanglement (Hayashi et al., 13 Mar 2025): The latent space is partitioned into slots, each acted on (independently or jointly) by corresponding group action blocks; group structure is enforced via block-diagonal constraints in the latent action mappings.
- Equivariant Autoencoders and Regularization (Keurti et al., 2022, Hayashi et al., 13 Mar 2025): Explicit equivariance loss, regularization for block-diagonal or sparse group action, and compositionality penalties for sequence action effect matching are used to guarantee faithful recovery of latent group algebra representations.
- Geometric Priors and Hybrid Latents (Delliaux et al., 2 Jun 2025): Priors about the environment's symmetry (e.g., cyclic or toroidal topology) are encoded by constructing latent coordinates as periodic or Euclidean, with the appropriate group action realized by modular addition (a minimal sketch follows this list).
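A minimal sketch of such a geometric prior, assuming a toroidal latent with two periodic coordinates and fixed per-action increments (both assumptions made purely for illustration):

```python
# Sketch of a geometric latent prior: latent coordinates live on a torus
# (two periodic angles), and each action adds an increment modulo 2*pi,
# so the cyclic structure of the environment is built in. The increments
# and the distance function are illustrative assumptions.
import numpy as np

def transition(z, delta):
    """Group action by modular addition on toroidal coordinates."""
    return np.mod(z + delta, 2 * np.pi)

def toroidal_distance(z1, z2):
    """Shortest distance per angle, respecting wrap-around."""
    d = np.abs(z1 - z2)
    return np.minimum(d, 2 * np.pi - d).sum(-1)

# Four discrete actions: +/- one grid step along each torus coordinate.
step = 2 * np.pi / 8                          # an 8x8 toroidal grid
deltas = {0: np.array([ step, 0.0]), 1: np.array([-step, 0.0]),
          2: np.array([0.0,  step]), 3: np.array([0.0, -step])}

z = np.array([0.0, 0.0])
for a in [0, 0, 2, 1]:                        # a short action sequence
    z = transition(z, deltas[a])
print(z)                                      # one grid step along each axis
print(toroidal_distance(z, np.array([step, step])))  # -> 0.0
```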
6. Applications, Impact, and Open Directions
Group-structured world models demonstrate powerful advantages and have been leveraged across domains requiring compositional reasoning, long-horizon prediction, interpretability, and efficient exploration:
- Curiosity-Based RL and Active Inference: Embedding non-unimodular (e.g., projective) groups enables intrinsic motivation for approach-like behaviors, converting undirected curiosity into purposeful exploration (Sergeant-Perthuis et al., 2023).
- Multi-environment Simulators and Fast Adaptation: Shared group-latent structure allows rapid adaptation to novel environments via lightweight controller networks, supporting sample-efficient sim-to-real and zero-shot adaptation (Hayashi et al., 13 Mar 2025).
- Disentangled Object Dynamics: Object-centric factorization, combined with group action composition, yields interpretable modular world models suitable for planning and hybrid symbolic-connectionist architectures (Kipf et al., 2019).
- Model-Based RL: Latent MDPs structured by known symmetry groups, even partial, exhibit improved sample efficiency, better generalization to held-out transitions, and are well-suited to hybrid model-free/model-based policy learning (Delliaux et al., 2 Jun 2025).
Notable limitations include the cost of representing general groups (e.g., the number of rotation planes required for a general $SO(n)$ representation), assumptions of invertibility and abelian (commutative) structure, and the challenge of scaling to non-compact groups or stochastic dynamics. Future research directions include learning group structure from unstructured experience, extending to non-abelian and stochastic environments, and integrating group-structured latents with high-capacity multimodal generative models.
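For context on the first of these limitations, the number of independent generators (rotation planes) of $SO(n)$ grows quadratically with the latent dimension, a standard group-theoretic fact independent of any particular cited model:

$$\dim SO(n) = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad \text{e.g. } \dim SO(10) = 45, \quad \dim SO(100) = 4950.$$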
7. Summary Table: Representative Approaches
| Model/Ref | Latent Structure | Group Action Mechanism |
|---|---|---|
| (Quessard et al., 2020) | $S^{n-1}$ (unit sphere) | $SO(n)$ block rotations, disentanglement loss |
| (Keurti et al., 2022) Homomorphism AE | Euclidean latent vectors | Learnable $\rho(g)$ from efference signals, equivariance loss |
| (Hayashi et al., 13 Mar 2025) WLA | Slot-wise object latents | Lie block-exponential actions, object-centric |
| (Kipf et al., 2019) C-SWM | Factored object slots | GNN message passing, contrastive loss |
| (Delliaux et al., 2 Jun 2025) Latent MDP | Ring / torus coordinates | Modular addition in latent, InfoNCE |
| (Sergeant-Perthuis et al., 2023) Projective agent model | Euclidean or projective state space | $SE(3)$ or $PGL$, curiosity-based action |
Each approach demonstrates the theoretical and empirical strengths of imposing group structure on latent world models, enabling more transparent, compositional, and generalizable representations of dynamical environments.