
Inverse-Forward Dynamics with Vector Quantization

Updated 26 May 2026
  • The paper introduces GCQ, a unified framework that integrates inverse and forward dynamics using grid-like vector quantization guided by continuous attractor neural networks.
  • It employs analytic bump-shift operations for forward predictions and a greedy search mechanism for inverse dynamics, enhancing efficiency in action-conditioned planning.
  • Empirical results demonstrate GCQ’s superior long-horizon performance and robust generalization across diverse benchmarks compared to traditional two-stage world models.

Inverse-Forward Dynamics with Vector Quantization refers to the unification of forward modeling (predicting future states given current state and action) and inverse modeling (inferring actions given transitions between states) in a latent space segmented by discrete codes. Grid-like Code Quantization (GCQ) operationalizes this paradigm by leveraging continuous attractor neural networks (CANNs) to construct grid-like discretized cognitive maps, compressing high-dimensional observation-action sequences into a set of codewords that reflect structured neural dynamics. GCQ supports analytic forward and inverse modeling, yielding efficient planning and robust long-horizon prediction by exploiting action-conditioned lattice structure within the code space (Peng et al., 16 Oct 2025).

1. Theoretical Foundations and Problem Formulation

GCQ addresses the challenge of modeling agent-environment interaction sequences $(o_1, a_1, o_2, \ldots, o_n)$ by encoding observations $o_t \in \mathcal{O}$ into a latent space $s_t = f_\phi(o_t) \in \mathcal{S} \subseteq \mathbb{R}^d$, conditioned on actions $a_t \in \mathcal{A}$. The principal aim is to establish a framework that achieves:

  • Compression of observation–action trajectories into a discrete, grid-like latent code.
  • Analytic prediction of future codes (forward model) given a current code and action.
  • Recovery of generative actions (inverse model) from sequential code transitions.

The overarching objective is to minimize reconstruction error over $o_{1:n}$, maintain “commitment” to the discrete lattice codes, and optionally regularize with forward and inverse model losses. This unified perspective enables integrated world modeling in a way that avoids the decoupling of spatial and temporal abstraction seen in two-stage approaches.

2. Grid-like Codebook Construction with Continuous Attractor Neural Networks

Central to GCQ is the CANN, whose architecture comprises $N^2$ neurons indexed by $(\theta_i, \phi_j)$ on a topological torus. The neural dynamics are governed by:

$$\tau\, \partial_t U_{\theta,\phi} = -U_{\theta,\phi} + \rho \sum_{\theta',\phi'} W_{\theta,\phi}(\theta',\phi')\, r_{\theta',\phi'}(t) + I_{\theta,\phi}(t)$$

with interactions defined by a Gaussian kernel:

$$W_{\theta,\phi}(\theta',\phi') = \frac{J}{2\pi a^2} \exp\left( -\frac{ \lVert \theta-\theta' \rVert_S^2 + \lVert \phi-\phi' \rVert_S^2 }{2a^2} \right)$$

and divisive normalization specified by:

$$r_{\theta,\phi} = \frac{U_{\theta,\phi}^2}{1 + k\rho \sum W U^2}$$

The attractor landscape of the CANN forms a set of bump-like centers, each corresponding to a unique codeword on the grid. Selecting centers as codewords yields a codebook in which each codeword is realized as a “flattened bump” attractor. For increased capacity, several independent CANNs are run in parallel, and their codes jointly form the latent state.
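The three equations above can be integrated numerically with a simple Euler scheme. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the grid size, and every parameter value (`J`, `a`, `k`, `rho`, `tau`, `dt`) are illustrative assumptions, and the abbreviated sum in the normalization is read here as a kernel-weighted sum over neurons, which is also an assumption. The parameters were not tuned to reproduce the paper's attractor landscape.

```python
import numpy as np

def circ_dist(x, y, period=2 * np.pi):
    """Shortest distance between two angles on a circle."""
    d = np.abs(x - y) % period
    return np.minimum(d, period - d)

def make_kernel(n, J=1.0, a=0.5):
    """Gaussian kernel W[i, j, k, l] coupling neuron (i, j) to neuron (k, l)."""
    angles = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    g = np.exp(-circ_dist(angles[:, None], angles[None, :]) ** 2 / (2 * a**2))
    # exp(-(d_theta^2 + d_phi^2) / 2a^2) factorizes into two 1-D Gaussians.
    return J / (2 * np.pi * a**2) * np.einsum("ik,jl->ijkl", g, g)

def firing_rate(U, W, k=0.1, rho=1.0):
    """Divisive normalization; the kernel-weighted denominator is an assumption."""
    U2 = U**2
    return U2 / (1.0 + k * rho * np.einsum("ijkl,kl->ij", W, U2))

def euler_step(U, W, I, tau=1.0, dt=0.1, k=0.1, rho=1.0):
    """One explicit Euler step of tau * dU/dt = -U + rho * sum(W r) + I."""
    r = firing_rate(U, W, k, rho)
    dU = -U + rho * np.einsum("ijkl,kl->ij", W, r) + I
    return U + (dt / tau) * dU

n = 16
W = make_kernel(n)
angles = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
# External input: a Gaussian bump centered on neuron (4, 10) (arbitrary choice).
I = np.exp(-(circ_dist(angles, angles[4])[:, None] ** 2
             + circ_dist(angles, angles[10])[None, :] ** 2) / (2 * 0.5**2))
U = np.zeros((n, n))
for _ in range(50):
    U = euler_step(U, W, I)
```

Storing the kernel as a dense four-index array keeps the sketch close to the notation; for larger grids a convolution on the torus (for example via FFT) would be the more economical choice.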

3. Action-Conditioned Sequence Quantization and Model Losses

Given an action–observation sequence and the corresponding encoder outputs, GCQ assembles quantized latent trajectories by evaluating, for each bump, all possible codeword rollouts under the observed action subsequence. Each candidate sequence is constructed as:

[equation lost in extraction]

with the optimal codeword index chosen by minimum distance between the candidate sequence and the encoder outputs:

[equation lost in extraction]

The final quantized trajectory stacks each bump’s best-matching template, so that spatiotemporal coherence is imposed by the action-conditioned matching. Quantization remains differentiable during backpropagation via the straight-through estimator. The overall loss is

[equation lost in extraction]

where sg denotes stop-gradient and a scalar coefficient scales the code commitment penalty. Optional forward and inverse dynamics losses can be incorporated as weighted terms in the joint objective.
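The action-conditioned matching described in this section can be sketched as follows. This is a hedged illustration for a single bump, under several assumptions that are not taken from the paper: codewords are Gaussian bump templates on an `n × n` torus, actions are integer displacements of the bump center, and the matching distance is squared Euclidean. All function names are hypothetical, and the straight-through estimator is omitted because plain NumPy has no gradients.

```python
import numpy as np

def bump_template(n, center, width=1.0):
    """Flattened Gaussian bump on an n x n torus, centered at `center`."""
    idx = np.arange(n)
    di = np.minimum(np.abs(idx - center[0]), n - np.abs(idx - center[0]))
    dj = np.minimum(np.abs(idx - center[1]), n - np.abs(idx - center[1]))
    bump = np.exp(-(di[:, None] ** 2 + dj[None, :] ** 2) / (2 * width**2))
    return bump.ravel()

def rollout_centers(start, actions, n):
    """Bump centers visited from `start` under integer displacement actions."""
    centers = [tuple(start)]
    for da, db in actions:
        i, j = centers[-1]
        centers.append(((i + da) % n, (j + db) % n))
    return centers

def quantize_sequence(latents, actions, n, width=1.0):
    """Choose the start codeword whose action-conditioned rollout best matches
    the encoder outputs `latents` (shape: len(actions) + 1 by n * n)."""
    best_start, best_cost = None, np.inf
    for i in range(n):
        for j in range(n):
            centers = rollout_centers((i, j), actions, n)
            templates = np.stack([bump_template(n, c, width) for c in centers])
            cost = np.sum((latents - templates) ** 2)
            if cost < best_cost:
                best_start, best_cost = (i, j), cost
    centers = rollout_centers(best_start, actions, n)
    quantized = np.stack([bump_template(n, c, width) for c in centers])
    return best_start, quantized

# Usage: noisy encoder outputs generated from a known rollout.
n = 8
actions = [(1, 0), (0, -1), (1, 1)]
rng = np.random.default_rng(0)
clean = np.stack([bump_template(n, c) for c in rollout_centers((2, 5), actions, n)])
latents = clean + 0.01 * rng.normal(size=clean.shape)
start, quantized = quantize_sequence(latents, actions, n)
```

Because the whole rollout is scored at once, a single noisy frame cannot pull the code off the trajectory implied by the actions, which is the coherence property the text attributes to action-conditioned matching.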

4. Analytic Forward and Inverse Dynamics in Code Space

GCQ enables an exact latent forward step via analytic bump-shift operations: for a given code and action,

[equation lost in extraction]

This structure obviates the need for a learned forward-prediction head, though in principle an auxiliary forward network can also be trained. Inverse dynamics proceed by direct search:

[equation lost in extraction]

or by training a compact inverse predictor. Because the action set is discrete and finite, “greedy” search is computationally tractable for action recovery.
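To make the two operations concrete, here is a minimal sketch for a single bump, assuming the code is the bump center on an `n × n` torus and each action is a unit displacement. The action names, the `ACTIONS` table, and the function names are hypothetical; the paper's exact shift operator and distance are not reproduced here.

```python
import numpy as np

# Hypothetical discrete action set: unit shifts of the bump center.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def forward_shift(code, action, n=8):
    """Analytic forward step: shift the bump center, wrapping on the torus."""
    di, dj = ACTIONS[action]
    return ((code[0] + di) % n, (code[1] + dj) % n)

def torus_sq_dist(c1, c2, n=8):
    """Squared distance between two centers, respecting wrap-around."""
    d = np.abs(np.array(c1) - np.array(c2))
    d = np.minimum(d, n - d)
    return int(np.sum(d**2))

def inverse_greedy(code_t, code_next, n=8):
    """Inverse dynamics by direct search: the action whose forward shift
    lands closest to the observed next code."""
    return min(
        ACTIONS,
        key=lambda a: torus_sq_dist(forward_shift(code_t, a, n), code_next, n),
    )

next_code = forward_shift((0, 0), "up")        # wraps around to row 7
recovered = inverse_greedy((3, 3), (3, 4))     # the shift that explains the move
```

The search costs one forward shift per candidate action, so it stays cheap as long as the action set is small.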

5. Planning and Trajectory Optimization

Planning within GCQ’s representation is driven by the analytic bump-shift operator on discrete codes. For any start and goal observation, planning consists of:

  1. Quantizing both the start and goal observations into codes.
  2. Iteratively inferring the best action from the current code until the code matches the goal.

Each planning step runs in constant time over the discrete action set, which keeps planning efficient. Planning thus reduces to repeated greedy minimization in code space, with no search over raw observations or pixel predictions, and supports robust long-horizon planning through an analytically tractable representation.
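The planning loop described here can be sketched as repeated greedy action selection in code space. The sketch assumes the start and goal have already been quantized to bump centers on an `n × n` torus and reuses a hypothetical four-action set of unit shifts; none of these names come from the paper.

```python
import numpy as np

# Hypothetical discrete action set: unit shifts of the bump center.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def shift(code, action, n):
    """Analytic forward step on the torus."""
    di, dj = ACTIONS[action]
    return ((code[0] + di) % n, (code[1] + dj) % n)

def sq_dist(c1, c2, n):
    """Squared wrap-around distance between two codes."""
    d = np.abs(np.array(c1) - np.array(c2))
    d = np.minimum(d, n - d)
    return int(np.sum(d**2))

def plan(start_code, goal_code, n=8, max_steps=64):
    """Greedy planning: at each step take the action whose predicted next
    code is closest to the goal code, until the goal code is reached."""
    code, path = tuple(start_code), []
    for _ in range(max_steps):
        if code == tuple(goal_code):
            break
        best = min(ACTIONS, key=lambda a: sq_dist(shift(code, a, n), goal_code, n))
        code = shift(code, best, n)
        path.append(best)
    return path, code

path, final_code = plan((0, 0), (2, 7), n=8)
```

With unit shifts and this distance, every greedy step strictly reduces the distance to the goal, so the loop terminates after as many steps as the wrap-around Manhattan distance between start and goal. In a learned model the codes would come from the encoder and quantizer rather than being given directly.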

6. Empirical Results, Ablations, and Benchmarks

GCQ was validated on diverse environments:

  • 2DMaze (virtual maze navigation),
  • Google Street View (GSV; real scene sequences with translation and rotation),
  • MPI3D and 3DShapes (feature-space action environments).

Performance was compared to two-stage world models (VQ-VAE + UNet, or VQ-VAE + Transformer) using FID-based and PSNR-based reconstruction and prediction metrics (the subscripts r and p below presumably denote reconstruction and prediction). On GSV, GCQ (112M parameters, ViT backbone) achieved FIDr ≈ 42.6, FIDp ≈ 43.4, PSNRr ≈ 27.8, and PSNRp ≈ 27.8, outperforming VQ+UNet (96M) and VQ+Transformer (121M), particularly at long prediction horizons. GCQ’s prediction fidelity remains nearly invariant with horizon length, while the baselines degrade rapidly.
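For reference, PSNR is computed from the mean squared error between a reference image and an estimate as 10·log10(MAX² / MSE). The helper below is a generic sketch of that standard definition, not the paper's evaluation code; the function name and the assumption that pixel values lie in [0, `max_value`] are illustrative.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for arrays with values in [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value**2 / mse))
```

Under this definition, a reconstruction score would compare a decoded frame with its own input, and a prediction score would compare a decoded predicted frame with the true future frame.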

Ablation studies show the following:

Encoder Architecture    FIDp (lower is better)    Stability
ResNet                  Not specified             Lower
Hybrid                  Not specified             Lower
ViT                     43.4                      Highest

Fixed predefined codebook centers (as CANN attractors) outperformed learnable codebooks (FIDp 43.4 vs 47.8). Representation capacity and action-set cardinality are controlled by the code size, the number of bumps, and the CANN dimensionality.

GCQ additionally displays robust zero-shot transfer for long-range prediction in novel mazes and 3D environments, reflecting both the inductive bias and generalization conferred by grid-like codes.

7. Significance and Theoretical Perspective

GCQ reinterprets vector quantization by embedding it within the neural coding paradigm of grid cells and continuous attractor dynamics. Conventional vector quantization treats inputs as independent and static; GCQ instead uses stateful, action-conditioned codeword selection, yielding a unified cognitive mapping framework capable of exact forward and inverse modeling. The resulting world model is fully end-to-end differentiable and integrates spatial, temporal, and goal-directed reasoning. This suggests links between efficient sequence modeling in artificial agents and the emergence of structured neural codes in biological navigation systems (Peng et al., 16 Oct 2025).
