Inverse-Forward Dynamics with Vector Quantization
- The paper introduces GCQ, a unified framework that integrates inverse and forward dynamics using grid-like vector quantization guided by continuous attractor neural networks.
- It employs analytic bump-shift operations for forward predictions and a greedy search mechanism for inverse dynamics, enhancing efficiency in action-conditioned planning.
- Empirical results demonstrate GCQ’s superior long-horizon performance and robust generalization across diverse benchmarks compared to traditional two-stage world models.
Inverse-Forward Dynamics with Vector Quantization refers to the unification of forward modeling (predicting future states given current state and action) and inverse modeling (inferring actions given transitions between states) in a latent space segmented by discrete codes. Grid-like Code Quantization (GCQ) operationalizes this paradigm by leveraging continuous attractor neural networks (CANNs) to construct grid-like discretized cognitive maps, compressing high-dimensional observation-action sequences into a set of codewords that reflect structured neural dynamics. GCQ supports analytic forward and inverse modeling, yielding efficient planning and robust long-horizon prediction by exploiting action-conditioned lattice structure within the code space (Peng et al., 16 Oct 2025).
1. Theoretical Foundations and Problem Formulation
GCQ addresses the challenge of modeling agent-environment interaction sequences by encoding observations into a latent space, conditioned on the agent's actions. The principal aim is to establish a framework that achieves:
- Compression of observation–action trajectories into a discrete, grid-like latent code.
- Analytic prediction of future codes (forward model) given a current code and action.
- Recovery of generative actions (inverse model) from sequential code transitions.
The overarching objective is to minimize reconstruction error on the observations, maintain “commitment” to the discrete lattice codes, and optionally regularize with forward and inverse model losses. This unified perspective enables integrated world modeling in a way that avoids the decoupling of spatial and temporal abstraction seen in two-stage approaches.
2. Grid-like Codebook Construction with Continuous Attractor Neural Networks
Central to GCQ is the CANN, whose neurons are indexed by their position on a topological torus. The neural dynamics combine recurrent interactions defined by a Gaussian kernel with divisive normalization of the firing rates.
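For orientation, one widely used CANN formulation from the literature that combines these three ingredients is sketched below. The symbols ($U$ for synaptic input, $r$ for firing rate, $J_0$, $a$, $k$, $\rho$, $\tau$, $I^{\mathrm{ext}}$) are assumptions taken from that standard formulation and are not necessarily the notation or exact form used in the GCQ paper. On a torus, $x - x'$ is understood as the periodic (wrapped) displacement.

$$\tau \frac{\partial U(x,t)}{\partial t} = -U(x,t) + \rho \int J(x,x')\, r(x',t)\, dx' + I^{\mathrm{ext}}(x,t)$$

$$J(x,x') = \frac{J_0}{\sqrt{2\pi}\,a} \exp\!\left(-\frac{(x-x')^2}{2a^2}\right)$$

$$r(x,t) = \frac{U(x,t)^2}{1 + k\rho \int U(x',t)^2\, dx'}$$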
The attractor landscape of the CANN forms a set of bump-like centers, each corresponding to a unique codeword on the grid. Selecting a finite set of these centers as codewords yields the codebook, with each codeword realized as a “flattened bump” attractor. For increased capacity, several independent CANNs are run in parallel, and their states together form the latent state.
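The codebook construction can be illustrated with a minimal sketch. It assumes a one-dimensional ring instead of a full torus and uses ideal Gaussian bump templates in place of simulated attractor states; the helper names (`bump`, `build_codebook`, `latent_state`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def ring_distance(i, j, n):
    """Shortest distance between positions i and j on a ring of n neurons."""
    d = abs(i - j) % n
    return min(d, n - d)

def bump(center, n_neurons, width):
    """Gaussian bump template centered at `center` (assumed attractor shape)."""
    return [math.exp(-ring_distance(x, center, n_neurons) ** 2 / (2 * width ** 2))
            for x in range(n_neurons)]

def build_codebook(n_neurons, n_codes, width):
    """Place n_codes evenly spaced bump templates on a ring of n_neurons."""
    step = n_neurons // n_codes
    return [bump(k * step, n_neurons, width) for k in range(n_codes)]

def latent_state(codebook, indices):
    """Concatenate one flattened codeword per parallel ring into one vector."""
    state = []
    for k in indices:
        state.extend(codebook[k])
    return state

codebook = build_codebook(n_neurons=32, n_codes=8, width=2.0)
z = latent_state(codebook, indices=[0, 3, 5])  # three parallel rings
```

With 8 codewords per ring and 3 rings, this toy latent state can take 8^3 = 512 distinct joint values, which shows how running CANNs in parallel increases capacity.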
3. Action-Conditioned Sequence Quantization and Model Losses
Given an action-observation sequence and the corresponding encoder outputs, GCQ assembles quantized latent trajectories by evaluating, for each bump, all possible codeword rollouts under the observed action subsequence. Each candidate sequence starts from one candidate codeword and follows the codewords reached by applying the observed actions in order. The optimal starting codeword index is the one whose rollout has minimum distance to the encoder outputs.
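The action-conditioned matching described here can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a single bump on a one-dimensional ring, integer actions that shift the code index, Gaussian bump templates, and a summed squared distance as the matching cost. All function names are hypothetical.

```python
import math

def bump(center, n, width=1.5):
    """Gaussian bump template on a ring of n units (assumed codeword shape)."""
    out = []
    for x in range(n):
        d = abs(x - center) % n
        d = min(d, n - d)
        out.append(math.exp(-d * d / (2 * width * width)))
    return out

def rollout(start, actions, n_codes):
    """Code indices visited from `start` when each action shifts the bump."""
    idx = [start]
    for a in actions:
        idx.append((idx[-1] + a) % n_codes)
    return idx

def sq_dist(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))

def quantize_sequence(encoder_outputs, actions, codebook):
    """Try every starting codeword and keep the rollout closest to the encoder outputs."""
    n_codes = len(codebook)
    best_start, best_cost = None, float("inf")
    for start in range(n_codes):
        path = rollout(start, actions, n_codes)
        cost = sum(sq_dist(codebook[k], z) for k, z in zip(path, encoder_outputs))
        if cost < best_cost:
            best_start, best_cost = start, cost
    return rollout(best_start, actions, n_codes)

codebook = [bump(k, 8) for k in range(8)]
actions = [1, 1, -2]
true_path = rollout(5, actions, 8)
# Stand-in encoder outputs: true templates plus a small deterministic perturbation.
noisy = [[v + 0.02 * (((3 * x + t) % 5) - 2) for x, v in enumerate(codebook[k])]
         for t, k in enumerate(true_path)]
quantized = quantize_sequence(noisy, actions, codebook)
print(quantized)  # [5, 6, 7, 5]
```

Because the whole rollout is matched at once, a single noisy frame cannot pull the quantized trajectory off the path that the actions dictate.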
The final quantized trajectory stacks each bump’s best-matching template, so that spatiotemporal coherence is imposed by the action-conditioned matching. Because the quantization step itself is not differentiable, gradients are passed through it with the straight-through estimator during backpropagation. The overall loss combines the reconstruction error with a code commitment penalty, which uses a stop-gradient (sg) operator and is scaled by a weighting coefficient. Optional forward and inverse dynamics losses can be incorporated as weighted terms in the joint objective.
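As a rough guide, an objective of this kind is often written in the VQ-VAE style shown below. This is a sketch under stated assumptions and not the paper's exact formula: $o$ and $\hat{o}$ denote an observation and its reconstruction, $z_e$ the encoder output, $z_q$ the selected codeword, $\beta$ the commitment weight, and $\lambda_f$, $\lambda_i$ hypothetical weights for the optional dynamics terms. With a fixed, predefined codebook there is no separate codebook-update term.

$$\mathcal{L} = \lVert o - \hat{o} \rVert_2^2 + \beta\, \lVert z_e - \mathrm{sg}[z_q] \rVert_2^2 + \lambda_f\, \mathcal{L}_{\mathrm{fwd}} + \lambda_i\, \mathcal{L}_{\mathrm{inv}}$$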
4. Analytic Forward and Inverse Dynamics in Code Space
GCQ enables an exact latent forward step via analytic bump-shift operations: for a given code and action, the next code is obtained by shifting the bump on the grid according to the action. This structure obviates the need for a learned forward-prediction head, though in principle an auxiliary forward network can also be trained. Inverse dynamics proceed by direct search over the discrete action set, selecting the action whose shifted code best matches the observed next code, or by training a compact inverse predictor. Because the action set is discrete and small (its cardinality depends on the number of bumps), this “greedy” search is computationally tractable for action recovery.
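A minimal sketch of the two operations is given below. It assumes codes are tuples of integer grid indices, actions are integer shift vectors, and the match criterion for the inverse search is a wrapped grid distance; these choices, the 8x8 grid, and the function names are illustrative assumptions.

```python
def forward_step(code, action, grid_size):
    """Analytic forward model: shift each index by the action, wrapping on the torus."""
    return tuple((c + a) % g for c, a, g in zip(code, action, grid_size))

def torus_distance(code_a, code_b, grid_size):
    """Sum of wrapped per-dimension distances between two codes."""
    total = 0
    for a, b, g in zip(code_a, code_b, grid_size):
        d = abs(a - b) % g
        total += min(d, g - d)
    return total

def inverse_step(code, next_code, actions, grid_size):
    """Greedy inverse model: the action whose predicted code is closest to next_code."""
    return min(
        actions,
        key=lambda a: torus_distance(forward_step(code, a, grid_size), next_code, grid_size),
    )

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
GRID = (8, 8)
nxt = forward_step((7, 2), (1, 0), GRID)         # wraps around to (0, 2)
act = inverse_step((7, 2), nxt, ACTIONS, GRID)   # recovers (1, 0)
```

The inverse search costs one forward shift per candidate action, so it stays cheap as long as the action set is small.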
5. Planning and Trajectory Optimization
Planning within GCQ’s representation is driven by the analytic shift operator on discrete codes. For any start and goal observation, planning consists of:
- Quantizing both the start and goal observations into codes.
- Iteratively inferring the best next action until the current code matches the goal code.
- Each planning step runs in constant time over the discrete action set, yielding high efficiency.
This enables repeated greedy minimization in code space with no need for search over raw observations or pixel predictions, supporting robust long-horizon planning through an analytically tractable representation.
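The planning loop above can be sketched as repeated greedy minimization of a code-space distance. The example assumes an obstacle-free toroidal grid, integer shift actions, and a wrapped grid distance as the objective; `plan`, `shift`, and the `max_steps` cap are hypothetical names introduced for illustration.

```python
def shift(code, action, grid):
    """Analytic shift of a code by an action, wrapping on the torus."""
    return tuple((c + a) % g for c, a, g in zip(code, action, grid))

def torus_distance(a, b, grid):
    """Sum of wrapped per-dimension distances between two codes."""
    total = 0
    for x, y, g in zip(a, b, grid):
        d = abs(x - y) % g
        total += min(d, g - d)
    return total

def plan(start, goal, actions, grid, max_steps=100):
    """Greedily pick the action that brings the code closest to the goal code."""
    code, path = start, []
    while code != goal and len(path) < max_steps:
        best = min(actions, key=lambda a: torus_distance(shift(code, a, grid), goal, grid))
        code = shift(code, best, grid)
        path.append(best)
    return path, code

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
path, final = plan(start=(1, 1), goal=(6, 3), actions=ACTIONS, grid=(8, 8))
print(len(path), final)  # 5 (6, 3)
```

In this toy setting the planner takes the wrap-around route along the first axis (three steps instead of five), which is only possible because distances are measured on the torus.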
6. Empirical Results, Ablations, and Benchmarks
GCQ was validated on diverse environments:
- 2DMaze (virtual maze navigation),
- Google Street View (GSV; real scene sequences with translation and rotation),
- MPI3D and 3DShapes (feature-space action environments).
Performance was compared to two-stage world models (VQ-VAE + UNet, or VQ-VAE + Transformer) using FID-based and PSNR-based reconstruction and prediction metrics. On GSV, GCQ (112M parameters, ViT backbone) achieved FIDr ≈ 42.6, FIDp ≈ 43.4, PSNRr ≈ 27.8, and PSNRp ≈ 27.8, outperforming VQ+UNet (96M) and VQ+Transformer (121M), particularly at long prediction horizons. GCQ’s prediction fidelity remains nearly invariant with horizon length, while baselines degrade rapidly.
Ablation studies show that:
| Encoder Architecture | FIDp (lower better) | Stability |
|---|---|---|
| ResNet | Not specified | Lower |
| Hybrid | Not specified | Lower |
| ViT | 43.4 | Highest |
Fixed predefined codebook centers (as CANN attractors) outperformed learnable codebooks (FIDp 43.4 vs 47.8). Representation capacity and action set cardinality are controlled by the code size, the number of bumps, and the CANN dimensionality.
GCQ additionally displays robust zero-shot transfer for long-range prediction in novel mazes and 3D environments, reflecting both the inductive bias and generalization conferred by grid-like codes.
7. Significance and Theoretical Perspective
GCQ reinterprets vector quantization by embedding it within the neural coding paradigm of grid cells and continuous attractor dynamics. The approach departs from conventional vector quantization—which treats inputs as independent and static—by leveraging stateful, action-conditioned codeword selection, yielding a unified cognitive mapping framework capable of exact forward and inverse modeling. The resulting world model is fully end-to-end differentiable and integrates spatial, temporal, and goal-directed reasoning. This suggests links between efficient sequence modeling in artificial agents and the emergence of structured neural codes in biological navigation systems (Peng et al., 16 Oct 2025).