ZipMap: Linear-Time 3D Reconstruction
- The paper introduces ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction by zipping hundreds of image frames into a compact scene state using TTT fast-weight layers.
- It overcomes the quadratic scalability bottleneck of transformer-based approaches by performing each TTT fast-weight update once per layer rather than per token pair, ensuring efficient processing and high-fidelity geometry reconstruction.
- Empirical results demonstrate competitive accuracy and significantly reduced runtime on benchmarks like ScanNetV2, KITTI, and DTU, enabling real-time novel-view querying and effective streaming reconstruction.
ZipMap is a stateful feed-forward model for 3D reconstruction from image collections that achieves linear-time, bidirectional scene processing via test-time training (TTT) fast-weight layers. Designed to overcome the quadratic scaling bottleneck of transformer-based approaches, ZipMap zips hundreds of input frames into a compact hidden scene state in a single forward pass, enabling high-fidelity geometry reconstruction and real-time novel-view querying with accuracy competitive with or superior to state-of-the-art quadratic-time models such as VGGT (Jin et al., 4 Mar 2026).
1. Motivation and Background
Transformer-based models have driven advances in multi-view 3D vision but suffer from the computational cost of global self-attention, which scales quadratically with the number of frames $N$. Methods such as VGGT exhibit $O(N^2)$ complexity, rendering them impractical for large-scale image collections. Sequential and streaming approaches exist but often degrade reconstruction quality due to their reduced capacity for bidirectional information aggregation. ZipMap addresses these limitations by introducing a feed-forward stateful architecture that achieves both linear runtime and bidirectional scene reasoning through rapid scene aggregation (Jin et al., 4 Mar 2026).
2. Model Architecture and Scene State Aggregation
ZipMap's architecture consists of a stack of identical Transformer-style blocks, each interleaving two operations:
- Local Window Attention: Within each input image (or ray-map), a self-attention mechanism with rotary positional embeddings aggregates spatially local patch tokens (a fixed-size patch array per view). This attention operates per-frame, so its cost is constant per image and independent of the number of frames $N$.
- Global Large-Chunk TTT Layer: Inspired by LaCT, ZipMap utilizes a SwiGLU-MLP as a "fast-weight" memory function. Parameters are updated using a global in-context associative memory built from all image tokens:
- Tokens are projected to keys $K$, values $V$, and queries $Q$.
- A virtual TTT reconstruction loss $\mathcal{L}(W; K, V)$, measuring how well the fast-weight function $f_W$ maps keys to values, is minimized via a single fast-weight update.
- The gradient $\nabla_W \mathcal{L}$ is normalized using a Newton–Schulz orthonormalization, and $W$ is renormalized (including regularization) before its application.
- The updated weights are applied to produce global context outputs $O = f_W(Q)$, which are gated and RMS-normalized.
Because the TTT update is performed once per layer (rather than once per token pair), ZipMap's hidden-state aggregation costs $O(N)$, compared to $O(N^2)$ for global self-attention layers.
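As a minimal sketch of such a large-chunk TTT step, assume the paper's SwiGLU-MLP fast weight is simplified to a single linear map $f_W(k) = Wk$ with reconstruction loss $\lVert WK - V\rVert^2$; the `newton_schulz` coefficients follow the widely used quintic iteration, and all names and hyperparameters here are illustrative, not the paper's:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthonormalize G with the quintic Newton-Schulz
    iteration (coefficients commonly used in fast-weight/optimizer work)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # scale so singular values <= 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def ttt_update(W, K, V, lr=0.01):
    """One large-chunk TTT step: a single gradient of the reconstruction
    loss over ALL tokens at once, orthonormalized, then applied."""
    G = (W @ K - V) @ K.T                  # gradient of ||W K - V||^2 (up to a constant)
    return W - lr * newton_schulz(G)

rng = np.random.default_rng(0)
d, n_tok = 16, 256                         # feature dim, total tokens across all frames
K = rng.standard_normal((d, n_tok))
V = rng.standard_normal((d, n_tok))
W = np.zeros((d, d))                       # fast-weight scene state

loss_before = np.linalg.norm(W @ K - V) ** 2
W = ttt_update(W, K, V)
loss_after = np.linalg.norm(W @ K - V) ** 2
```

The key point is that the update is issued once for the whole token chunk, so cost grows linearly with the number of tokens rather than quadratically.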
3. Hidden Scene State Representation
ZipMap "zips" images into a compact scene representation stored in the fast weights of each layer. Letting $X^{(\ell)}$ denote the per-frame token embeddings after local attention at layer $\ell$, hidden-state aggregation produces

$$O^{(\ell)} = f_{W^{(\ell)}}\big(Q^{(\ell)}\big),$$

where $W^{(\ell)}$ is the MLP fast-weight state after the TTT update at layer $\ell$. Each block executes:
- $W^{(\ell)}$ is updated from the pairs $(K^{(\ell)}, V^{(\ell)})$, where $K^{(\ell)}$ and $V^{(\ell)}$ are projected from $X^{(\ell)}$,
- The gradient is aggregated over all tokens, orthonormalized, and applied,
- At the final block, the sequence $\{W^{(\ell)}\}$ across layers forms a scene state whose size is fixed, independent of the number of input frames $N$.
Novel-view query rays are processed efficiently by applying $f_{W^{(\ell)}}$ to tokenized queries, without further gradient updates.
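A sketch of this apply-only querying, assuming a SwiGLU fast-weight state (the shapes, hidden size, and names below are illustrative, not the paper's):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_apply(state, q):
    """Forward-apply a SwiGLU-MLP fast-weight state to query tokens.
    No gradient step is taken, so the scene state is left untouched."""
    W1, W2, W3 = state
    return W3 @ (silu(W1 @ q) * (W2 @ q))

rng = np.random.default_rng(1)
d, h = 16, 64                               # token dim, fast-weight hidden dim
state = (0.1 * rng.standard_normal((h, d)),  # W1: gate projection
         0.1 * rng.standard_normal((h, d)),  # W2: value projection
         0.1 * rng.standard_normal((d, h)))  # W3: output projection

q = rng.standard_normal((d, 32))            # 32 tokenized novel-view query rays
out = swiglu_apply(state, q)
```

Because querying is a pure forward pass through the frozen state, its cost depends only on the number of query rays, not on how many frames were zipped into the state.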
4. Computational Complexity and Training Workflow
ZipMap's design delivers strictly linear runtime in the number of frames $N$:
- Prior approaches: $O(N^2)$,
- ZipMap: $O(N)$, assuming the per-frame token count and model width are constant with respect to $N$.
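A back-of-envelope cost comparison makes the gap concrete; the per-frame token count `T`, width `d`, and fast-weight hidden size `h` below are illustrative constants, not the paper's:

```python
def global_attention_cost(N, T=196, d=768):
    """Pairwise attention over all N*T tokens: quadratic in N."""
    return (N * T) ** 2 * d

def zipmap_cost(N, T=196, d=768, h=3072):
    """Per-frame local attention (O(T^2) each) plus one chunked
    fast-weight update/apply over all N*T tokens: linear in N."""
    return N * (T ** 2 * d) + N * T * d * h

for N in (10, 100, 1000):
    ratio = global_attention_cost(N) / zipmap_cost(N)
    print(f"N={N:5d}  quadratic/linear cost ratio = {ratio:.1f}")
```

Under these constants the ratio itself grows linearly in $N$: scaling the collection 100x widens the quadratic model's disadvantage by 100x.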
The training loss combines several geometric, photometric, and camera-alignment objectives: mean point error, depth error, camera pose error, RGB/depth supervision for novel views (MSE and LPIPS), and normal and depth-gradient matching penalties.
A single forward pass executes tokenization, local window attention, TTT gradient updates, and prediction heads (pose, depth, confidence, points), outputting the final fast-weight state and the scene predictions.
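The single pass can be sketched as follows, with stub components standing in for the real tokenizer, attention, TTT update (orthonormalization omitted for brevity), and heads; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, d = 8, 32, 16                              # frames, tokens per frame, dim

def local_window_attention(x):
    """Stub: per-frame token mixing, O(T^2) per frame."""
    return x + 0.01 * rng.standard_normal(x.shape)

def ttt_update(W, K, V, lr=0.01):
    """Stub: one chunked fast-weight step on a linear fast weight."""
    return W - lr * (W @ K - V) @ K.T / K.shape[1]

def heads(tokens):
    """Stub prediction heads reading off the final tokens."""
    return {"points": tokens[:3], "depth": tokens[3], "conf": tokens[4]}

tokens = rng.standard_normal((d, N * T))         # tokenized frames (+ ray maps)
states = [np.zeros((d, d)) for _ in range(4)]    # one fast-weight state per block
for l in range(4):                               # identical blocks
    tokens = local_window_attention(tokens)      # local, per-frame mixing
    K = V = Q = tokens                           # illustrative identity projections
    states[l] = ttt_update(states[l], K, V)      # zip all frames into the state
    tokens = tokens + states[l] @ Q              # global context via fast weights
preds = heads(tokens)                            # pose/depth/confidence/points
```

The final `states` list is the compact scene representation; `preds` are the per-token scene outputs.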
5. Empirical Performance
ZipMap attains competitive accuracy while significantly reducing runtime on large image sets. Runtime comparison (NVIDIA H100, same input frame set for all models):
| Model | Complexity | Runtime (s) |
|---|---|---|
| VGGT | $O(N^2)$ | 200.36 |
|  |  | 151.16 |
| CUT3R | $O(N)$ | 31.25 |
| TTT3R | $O(N)$ | 31.20 |
| ZipMap | $O(N)$ | 9.999 |
Additional benchmark results demonstrate:
- ScanNetV2 absolute trajectory error (ATE): ZipMap 0.034 m vs. VGGT 0.035 m, with a competing baseline at 0.030 m,
- KITTI video depth: AbsRel 0.057 (ZipMap) vs. 0.073 (VGGT), with a competing baseline at 0.038,
- RealEstate10K camera AUC@5°: 53.34% (ZipMap) vs. 38.71% (VGGT), with a competing baseline at 63.10%,
- DTU point-map: accuracy 1.228 mm, completeness 1.649 mm, normal consistency 0.675, matching or surpassing prior methods (Jin et al., 4 Mar 2026).
6. Real-Time Querying, Streaming, and Extensions
Because the aggregated scene state is stored in fast weights, novel-view querying after initial aggregation consists solely of forward apply steps, enabling constant-time querying (100 FPS) independent of the size of the input collection. Streaming reconstruction is supported by processing each new frame with a single fast-weight update, assigning pose and local geometry without recurrent error accumulation and in constant per-frame time. This suggests improved scalability and robustness for online scene reconstruction scenarios.
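A minimal streaming sketch under these assumptions (a linear fast weight standing in for the SwiGLU state; projections, initialization, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
W = 0.1 * rng.standard_normal((d, d))        # scene state from frames so far

def query(W, q):
    """Apply-only read of the scene state: cost independent of history."""
    return W @ q

def absorb_frame(W, frame_tokens, lr=0.01):
    """Streaming step: read pose/geometry against the CURRENT state,
    then fold the new frame in with one fast-weight update."""
    pred = query(W, frame_tokens)            # localize the frame first
    K = V = frame_tokens                     # illustrative identity projections
    G = (W @ K - V) @ K.T / K.shape[1]       # chunked reconstruction gradient
    return W - lr * G, pred

for _ in range(5):                           # five incoming frames
    frame = rng.standard_normal((d, 32))     # 32 tokens for the new frame
    W, pred = absorb_frame(W, frame)
```

Each incoming frame costs the same fixed amount of work regardless of how many frames the state has already absorbed, which is what makes the per-frame time constant.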
Empirically, streaming ZipMap outperforms previous linear-time sequential models (CUT3R, TTT3R) on camera, depth, and point-map benchmarks, even with shorter context lengths (Jin et al., 4 Mar 2026).
7. Significance and Future Implications
By leveraging TTT-based fast-weight state aggregation, ZipMap establishes a scalable paradigm for large-scale, high-fidelity 3D reconstruction and scene understanding. Its bidirectional, linear-time aggregation, efficient scene state representation, and applicability to both batch and streaming settings open new directions for real-time and large-scale 3D perception pipelines using feed-forward architectures (Jin et al., 4 Mar 2026). A plausible implication is the potential for extending ZipMap-like architectures to even larger temporal or spatial scales, or to other modalities that benefit from rapid global context aggregation.