
Flow Equivariant World Models (FloWM)

Updated 9 January 2026
  • FloWM is a neural model that embeds continuous Lie group symmetries into its latent dynamics to ensure stable and accurate predictions.
  • It integrates equivariant encoder–process–decoder architectures and recurrent memory modules to enhance sample efficiency and long-horizon performance.
  • The design explicitly enforces invariances like translation and rotation, yielding superior performance in fluid simulations and embodied world modeling.

Flow Equivariant World Models (FloWM) are a class of neural architectures that hard-code continuous symmetry principles into their latent state transition and memory mechanisms, enabling efficient and stable learning of dynamics in high-dimensional, partially observed, and symmetrically structured environments. Grounded in the theory of Lie groups and geometric deep learning, FloWM replaces ad hoc or data-driven symmetry learning with architectures explicitly equivariant to known transformations inherent to the environment, such as translations and rotations. These models have been shown to deliver substantial improvements in sample efficiency, long-horizon stability, and generalization compared to non-equivariant baselines and diffusion-based generative models (Shankar et al., 2023, Lillemark et al., 3 Jan 2026, Keller, 20 Jul 2025).

1. Mathematical Foundations: Lie Groups, Flows, and Equivariance

FloWM exploits the mathematical structure of continuous groups, specifically one-parameter Lie group flows, to impose physically and observationally meaningful equivariances. Let $G$ be a Lie group (e.g., $E(2)$ for planar motions), and let $g(t) = \exp(tA)$ denote a smooth flow parameterized by an infinitesimal generator $A \in \mathfrak{g}$, the Lie algebra of $G$. These flows model dynamical invariances such as constant-velocity translations ($\mathbb{R}^2$) or rotations ($SO(2)$), with the group acting on signal spaces $X$ as $g \cdot x$ via a representation $\rho(g)$.
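As a worked illustration of the flow $g(t) = \exp(tA)$, the minimal sketch below (assuming only numpy and scipy; the angular rate is an arbitrary illustrative value) exponentiates an $\mathfrak{so}(2)$ generator and checks that the result is the familiar rotation matrix satisfying the one-parameter group property $g(s+t) = g(s)\,g(t)$.

```python
import numpy as np
from scipy.linalg import expm

# One-parameter flow g(t) = exp(tA) for SO(2): A is an infinitesimal
# generator in the Lie algebra so(2); its matrix exponential is the
# familiar rotation matrix. omega is an arbitrary illustrative rate.
omega = 0.3
A = np.array([[0.0, -omega],
              [omega, 0.0]])          # generator of so(2)

def g(t):
    return expm(t * A)                # the flow g(t) = exp(tA)

t, s = 1.2, 0.7
closed_form = np.array([[np.cos(omega * t), -np.sin(omega * t)],
                        [np.sin(omega * t),  np.cos(omega * t)]])

assert np.allclose(g(t), closed_form)      # exp(tA) is rotation by omega * t
assert np.allclose(g(s + t), g(s) @ g(t))  # one-parameter group property
```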

For a signal $f$, the fundamental equivariance property with respect to a flow demands that

$$\Phi(g_t \cdot f) = g_t \cdot \Phi(f)$$

for all $t \in \mathbb{R}$, where $\Phi$ denotes the model mapping (e.g., encoder, decoder, transition). This ensures that the model’s predictions undergo the same symmetry transformation as the inputs.
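As a concrete numerical check of this property, the sketch below (assuming numpy; the signal and kernel are arbitrary choices) takes $\Phi$ to be circular convolution with a fixed kernel and $g_t$ to be a circular shift by $t$ samples; equivariance then holds to machine precision.

```python
import numpy as np

# Check Phi(g_t . f) = g_t . Phi(f) for the discrete translation flow
# (g_t = circular shift by t samples) and Phi = circular convolution
# with a fixed kernel k. Signal f and kernel k are arbitrary examples.

def circ_conv(f, k):
    """Circular convolution via the FFT (exactly shift-equivariant)."""
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(k)))

rng = np.random.default_rng(0)
f = rng.standard_normal(64)           # signal
k = rng.standard_normal(64)           # convolution kernel
t = 5                                 # flow parameter (shift in samples)

lhs = circ_conv(np.roll(f, t), k)     # Phi(g_t . f)
rhs = np.roll(circ_conv(f, k), t)     # g_t . Phi(f)
assert np.allclose(lhs, rhs)          # equivariance to machine precision
```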

In latent memory models, flow equivariance requires both the latent map and its update rules to commute with group actions induced by both external object motion and agent self-motion. In general, this leads to explicit indexing of latent memory by generator parameters (“velocity channels”) and engineered update operators that consist of group actions (shifts, rolls, permutations) matching the modeled physical symmetries (Lillemark et al., 3 Jan 2026, Keller, 20 Jul 2025).

2. Architectural Design and Symmetry Implementation

2.1 Encode–Process–Decode GNNs for Fluid Flows

For mesh-based fluid environments, FloWM instantiates a multi-scale graph neural network strictly equivariant to $E(2)$. The architecture follows an Encode–Process–Decode paradigm:

  • Encoder: Constructs a geometric graph on point clouds, initializing node and edge features as irreducible representations (irreps) of $E(2)$. Edge features use low-order spherical harmonics.
  • Processor: Multiple message-passing layers act at increasing scales. Each layer’s update on node features $h_i^{(\ell)}$ is performed via equivariant linear combinations and tensor products, mixing irreps strictly by allowed coupling rules. Nonlinearities are gated by the norms $\|h^{(\ell)}\|$ to preserve equivariance (see the sketch at the end of this subsection).
  • Multi-Scale Pooling/Unpooling: A U-Net-style coarse-to-fine path enables efficient long-range interaction, with features propagated across graph resolutions.
  • Decoder: An equivariant linear layer maps processed features to the output (e.g., velocity increment $\delta u$), followed by an explicit Euler step.

This design hard-wires translation and rotation equivariance throughout the architecture, enforced at every layer (Shankar et al., 2023).
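As a minimal illustration of the norm-gated nonlinearity mentioned in the Processor step, the sketch below (assuming numpy; the sigmoid gate and the single 2-D feature are illustrative choices rather than the paper’s exact parameterization) scales a vector feature by a function of its rotation-invariant norm and verifies that the operation commutes with rotations.

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix: the SO(2) action on a vector (type-1) feature."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def gated_nonlinearity(h):
    """Scale the feature by a scalar function of its rotation-invariant norm."""
    gate = 1.0 / (1.0 + np.exp(-np.linalg.norm(h)))   # sigmoid of ||h||
    return gate * h

rng = np.random.default_rng(1)
h = rng.standard_normal(2)            # a single vector node feature
R = rot(0.7)                          # arbitrary rotation

# Gating commutes with the rotation because ||R h|| = ||h||.
assert np.allclose(gated_nonlinearity(R @ h), R @ gated_nonlinearity(h))
```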

2.2 Recurrent Latent Memory for Partially Observed Worlds

For embodied and video domains, FloWM uses a spatially structured latent memory indexed over spatial coordinates and discretized velocity channels. The three key modules are:

  • Encoder $E_e$: Injects new partial observations into the world memory, via translation-equivariant convolutions or Vision Transformers (ViT), merged with memory by spatial gating/masking.
  • Memory Update: The hidden state $h_t(v)$ for velocity $v$ is updated by:

$$h_{t+1}(v) = T_{a_t}^{-1}\, \Phi(v)\left[\, U_e\big(h_t(v),\ E_e(f_t, h_t)(v)\big) \right]$$

where $T_{a_t}^{-1}$ is the action-induced group transformation (e.g., translation, rotation), and $\Phi(v)$ shifts each velocity channel in accordance with $v$. This formulation ensures that both self-motion and external object flows are respected.

  • Decoder $D_e$: Reads the reconstructed observation from the core memory, leveraging cross-attention in ViT-based implementations.

All convolutional operations are translation-equivariant; shift/roll operations directly effect the Lie group action. Velocity-channel indexing enables the network to stably track information across time in the presence of underlying symmetry flows (Lillemark et al., 3 Jan 2026, Keller, 20 Jul 2025).
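The following is a minimal sketch of the velocity-indexed memory update above, assuming numpy, a 2-D grid memory, integer velocity channels, and integer agent translations; the blend-based $U_e$, the mask, and the `VELOCITIES` set are illustrative stand-ins rather than the published architecture.

```python
import numpy as np

# Velocity-indexed latent memory update: Phi(v) advects each channel by its
# own velocity (np.roll), and the inverse action-induced translation
# T_{a_t}^{-1} counter-shifts the whole memory by the agent's motion.
# U_e is stubbed out as a convex blend of memory and the masked observation.

VELOCITIES = [(-1, 0), (0, 0), (1, 0)]          # one channel per generator

def memory_update(h, obs, mask, action, alpha=0.5):
    """h: (V, H, W) memory; obs, mask: (H, W); action: (dy, dx) agent shift."""
    h_new = np.empty_like(h)
    for i, (vy, vx) in enumerate(VELOCITIES):
        merged = np.where(mask, alpha * h[i] + (1 - alpha) * obs, h[i])  # U_e
        advected = np.roll(merged, shift=(vy, vx), axis=(0, 1))          # Phi(v)
        h_new[i] = np.roll(advected, shift=(-action[0], -action[1]),
                           axis=(0, 1))                                  # T_{a_t}^{-1}
    return h_new

h = np.zeros((len(VELOCITIES), 16, 16))
obs = np.zeros((16, 16)); obs[8, 8] = 1.0       # a single observed "object"
mask = obs > 0                                  # only observed pixels update memory
h = memory_update(h, obs, mask, action=(0, 1))  # agent moved one cell to the right
```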

3. Flow-Equivariant Recurrent Neural Networks

Conventional RNNs are not flow-equivariant and suffer from “lagging” hidden states under continuous motion. Flow Equivariant Recurrent Neural Networks (FERNNs) address this deficit by:

  • Lifting the hidden state to a joint space $(\nu, g)$, where $\nu$ indexes velocity generators and $g$ is the group state.
  • Updating the hidden state with group convolutions and explicit group actions:

$$h_{t+1}(\nu, g) = \sigma\!\left(\psi_1(\nu) \cdot [h_t \star_{V \times G} W](\nu, g) + [f_t \,\hat{\star}_{V \times G}\, U](\nu, g)\right)$$

  • Implementing self-motion as a cyclic shift or permutation, and external flow as an independent roll along velocity channels.
  • Decoding by pooling across velocity channels for invariance when needed.

Plugging FERNNs into the transition kernel of FloWM yields roll-out models with strictly equivariant latent evolution and marked improvements in sample efficiency and rollout stability (Keller, 20 Jul 2025).
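A minimal 1-D sketch in the spirit of this FERNN transition is given below, assuming numpy, with $G$ taken as the cyclic translation group (so group convolution reduces to circular convolution), three velocity channels, and random placeholder kernels `W` and `U` in place of learned weights; $\psi_1(\nu)$ becomes a per-channel roll by $\nu$, and an invariant readout pools over velocity channels.

```python
import numpy as np

# FERNN-style flow-equivariant recurrence on a 1-D periodic signal.
# psi_1(nu) acts as a roll of the recurrent term by nu each time step;
# the input term is shared across velocity channels.

def circ_conv(x, w):
    """Circular (group) convolution over the cyclic translation group."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)))

N, VELS = 32, [-1, 0, 1]
rng = np.random.default_rng(2)
W = 0.1 * rng.standard_normal(N)                 # recurrent kernel (placeholder)
U = 0.1 * rng.standard_normal(N)                 # input kernel (placeholder)

def fernn_step(h, f):
    """h: (len(VELS), N) hidden state over (nu, g); f: (N,) observation."""
    inp = circ_conv(f, U)                        # input lifting, shared over nu
    return np.array([
        np.tanh(np.roll(circ_conv(h[i], W), nu) + inp)   # psi_1(nu).[h * W] + input
        for i, nu in enumerate(VELS)
    ])

h = np.zeros((len(VELS), N))
f = rng.standard_normal(N)                       # one observation frame
h = fernn_step(h, f)
invariant_readout = h.max(axis=0)                # pool over velocity channels
```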

4. Loss Functions, Training, and Evaluation Protocols

FloWM adopts losses matched to the deterministic or generative setting:

  • Reconstruction Loss: For single-step rollout or prediction, mean squared error (MSE) between predicted and ground-truth observations. For the GNN instantiation of FloWM:

$$L_{\mathrm{sup}} = \frac{1}{N} \sum_{i=1}^{N} \left\| u_i(t+\Delta t) - \hat{u}_i(t+\Delta t) \right\|^2$$

  • Equivariance Consistency Loss: Explicitly measures the model’s commutativity with group actions (see the sketch at the end of this section):

$$L_{\mathrm{eq}} = \mathrm{MSE}\big[\, R\, f(u_0, x, f),\ f(R u_0, R x, R f)\,\big]$$

  • Diffusion-based Losses: For baseline models in video prediction, optimize denoising prediction error as usual for DFoT/DFoT-SSM baselines.

Training employs standard schemes (e.g., Adam optimizer, data augmentation for baselines, stochastic weight averaging) but with key architectural gains arising from the model’s built-in symmetry, not regularization (Shankar et al., 2023, Lillemark et al., 3 Jan 2026).
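The sketch below, assuming numpy, illustrates the two FloWM-specific losses above in a simplified setting: the model is a stand-in one-step map on a scalar field over a periodic grid (an isotropic five-point stencil, chosen only so the code runs; it is not the FloWM architecture), $R$ is a 90° rotation, and the coordinate/forcing arguments of the full equivariance loss are folded into the field.

```python
import numpy as np

def model(u):
    """Placeholder one-step predictor u(t) -> u(t + dt): an isotropic stencil.
    This particular stencil happens to commute with 90-degree rotations,
    so the equivariance loss below is ~0."""
    return u + 0.1 * (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0)
                      + np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) - 4 * u)

def mse(a, b):
    return np.mean((a - b) ** 2)

rng = np.random.default_rng(3)
u0 = rng.standard_normal((32, 32))                        # current field
u1 = model(u0) + 0.01 * rng.standard_normal((32, 32))     # noisy "ground truth"

L_sup = mse(model(u0), u1)                 # reconstruction loss (single step)

R = lambda u: np.rot90(u)                  # group action: 90-degree rotation
L_eq = mse(R(model(u0)), model(R(u0)))     # equivariance consistency loss
```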

5. Empirical Results and Benchmarking

FloWM has been empirically validated on both physics-based and video world-modeling tasks:

| Scenario | FloWM Main Results | Key Baselines | Summary Insight |
|---|---|---|---|
| 2D Cylinder Flow | $R^2 \approx 0.997$ (eq), $0.992$ (eq_scl), $0.963$ (neq) | neq, neq_aug | Equivariant variants outperform others |
| Marsigli Flow | $R^2 \approx 0.997$ (eq_scl), $0.97$ (eq), $0.98$ (neq) | neq, neq_aug | Invariants give best tradeoff |
| MNIST World (2D) | MSE $\approx 0.0005$ (FloWM), $\gg 0.1$ (DFoT) | DFoT, DFoT-SSM | FloWM stable out to 150 steps |
| 3D Block World | MSE $6 \times 10^{-4}$ (FloWM), $> 1.2 \times 10^{-2}$ (DFoT) | DFoT, DFoT-SSM | Outperforms baselines, longer rollout |

FloWM consistently achieves lower error metrics, higher $R^2$ for long-horizon rollouts, and greater qualitative stability. In ablations, removing velocity channels or self-motion equivariance degrades long-term memory coherence, and non-equivariant models hallucinate or lose track after a fraction of the tested rollout horizon. Incorporating invariant scalar descriptors as the primary latent representation can substantially reduce computational overhead without significant loss in forecast accuracy (Shankar et al., 2023, Lillemark et al., 3 Jan 2026, Keller, 20 Jul 2025).

6. Limitations and Future Directions

Current FloWM instantiations primarily handle rigid motions (constant-velocity translations, 90° discrete rotations) but not non-rigid, semantic, or stochastic transitions. The velocity channel quantization is a discretization of truly continuous flows. In complex 3D environments, full analytic equivariance is not guaranteed in the current ViT encoders, which must learn such mappings from data. Scaling to larger or more open worlds will require hierarchical or dynamic latent maps and potentially more efficient representations of the Lie algebra (e.g., learning continuous generators and employing matrix exponentials for latent updates).

Potential directions include integration with hierarchical memory, continuous Lie-algebra learning, sparse and compressed updates, stochastic latent extensions for modeling uncertainty, and tighter connection to planning modules or analytically equivariant 3D encoders (Lillemark et al., 3 Jan 2026).

FloWM represents the convergence of geometric deep learning, tensor field neural networks, and structured memory models. It operationalizes recent results showing that flow-equivariant sequence models, including recurrent neural networks and multi-scale GNNs, can dramatically outperform unstructured or static-equivariant networks in both sample efficiency and generalization to unseen dynamical regimes (Keller, 20 Jul 2025). Its architecture codifies group-theoretic priors directly at the memory and transition level, leading to latent representations that remain stable and physically interpretable over long time horizons. This supports efficient learning in embodied and partially observed environments, with demonstrated applications in fluid simulation forecasting and embodied world modeling (Shankar et al., 2023, Lillemark et al., 3 Jan 2026, Keller, 20 Jul 2025).

A plausible implication is that as group-theoretic methods mature for more complex actions and broader classes of symmetry, such approaches may become foundational tools for robust data-driven modeling of high-dimensional continuous environments.
