
ZipMap: Linear-Time 3D Reconstruction

Updated 9 March 2026
  • The paper introduces ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction by zipping hundreds of image frames into a compact scene state using TTT fast-weight layers.
  • It overcomes the quadratic scalability bottleneck of transformer-based approaches by performing a single fast-weight update per layer over all tokens, ensuring efficient processing and high-fidelity geometry reconstruction.
  • Empirical results demonstrate competitive accuracy and significantly reduced runtime on benchmarks like ScanNetV2, KITTI, and DTU, enabling real-time novel-view querying and effective streaming reconstruction.

ZipMap is a stateful feed-forward model for 3D reconstruction from image collections that achieves linear-time, bidirectional scene processing via test-time training (TTT) fast-weight layers. Designed to overcome the quadratic scaling bottleneck of transformer-based approaches, ZipMap efficiently zips hundreds of input frames into a compact hidden scene state in a single forward pass, enabling high-fidelity geometry reconstruction and real-time novel-view querying with competitive or superior accuracy compared to state-of-the-art quadratic-time models such as VGGT and $\pi^3$ (Jin et al., 4 Mar 2026).

1. Motivation and Background

Transformer-based models have driven advances in multi-view 3D vision but suffer from the computational cost of global self-attention, which scales quadratically with the number of frames $N$. Methods like VGGT and $\pi^3$ exhibit $O(N^2)$ complexity, rendering them impractical for large-scale image collections. Sequential and streaming approaches exist but often degrade reconstruction quality due to their reduced capacity for bidirectional information aggregation. ZipMap addresses these limitations by introducing a feed-forward stateful architecture capable of both linear runtime and bidirectional scene reasoning through rapid scene aggregation (Jin et al., 4 Mar 2026).

2. Model Architecture and Scene State Aggregation

ZipMap's architecture consists of $L = 24$ identical Transformer-style blocks, each interleaving two operations:

  • Local Window Attention: Within each input image (or ray-map), a self-attention mechanism with rotary positional embeddings aggregates spatially local patch tokens (a $p \times d$ array per view). This attention operates per-frame, leading to complexity $O(p^2)$ per image, independent of $N$.
  • Global Large-Chunk TTT Layer: Inspired by LaCT, ZipMap utilizes a SwiGLU MLP $f_W(x) = W_2(\mathrm{SiLU}(W_1 x) \odot (W_3 x))$ as a "fast-weight" memory function. Parameters $W = \{W_1, W_2, W_3\}$ are updated using a global in-context associative memory built from all image tokens:
    • Tokens $x_i$ are projected to keys $k_i$, values $v_i$, and queries $q_i$.
    • A virtual TTT reconstruction loss, $\mathcal{L}_{\text{ttt}}(W) = -\sum_{i=1}^{N \cdot p} f_W(k_i)^\top v_i$, is minimized via a single fast-weight update.
    • The gradient is normalized using a Newton–Schulz orthonormalization, and $W$ is renormalized (including $L_2$ regularization) before its application.
    • Updated weights $\hat{W}$ are applied to produce global context outputs $o'_i = f_{\hat{W}}(q_i)$, with outputs gated and RMS-normalized.

Because the TTT update is performed once per layer (rather than per token pair), ZipMap achieves a hidden-state aggregation cost of $O(N p d^2)$, compared to $O((N p)^2 d)$ for global self-attention layers.
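The large-chunk update described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the hidden width, step size, number of Newton–Schulz iterations, and the omission of the final weight renormalization are all simplifications.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def newton_schulz(m, iters=5):
    """Approximately orthonormalize a gradient matrix via an odd polynomial iteration."""
    x = m / (np.linalg.norm(m) + 1e-8)  # scale so singular values lie in (0, 1]
    for _ in range(iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def ttt_loss(w1, w2, w3, keys, values):
    """Virtual TTT loss -sum_i f_W(k_i)^T v_i for the SwiGLU fast-weight MLP."""
    gate = silu(keys @ w1.T) * (keys @ w3.T)   # (n, h)
    out = gate @ w2.T                          # (n, d)
    return -np.sum(out * values)

def ttt_update(w1, w2, w3, keys, values, lr=1e-3):
    """One large-chunk fast-weight update consuming all tokens at once."""
    a = keys @ w1.T                            # pre-activations (n, h)
    b = keys @ w3.T
    s = silu(a)
    gate = s * b
    d_out = -values                            # dL/d f_W(k)
    g_w2 = d_out.T @ gate                      # gradient w.r.t. W2, shape (d, h)
    d_gate = d_out @ w2                        # back-propagate through W2
    sig = 1.0 / (1.0 + np.exp(-a))
    d_a = d_gate * b * sig * (1.0 + a * (1.0 - sig))  # SiLU derivative
    g_w1 = d_a.T @ keys                        # (h, d)
    g_w3 = (d_gate * s).T @ keys               # (h, d)
    # Normalize each gradient with Newton-Schulz, then take one descent step.
    w1 = w1 - lr * newton_schulz(g_w1)
    w2 = w2 - lr * newton_schulz(g_w2)
    w3 = w3 - lr * newton_schulz(g_w3)
    return w1, w2, w3
```

A single call processes every token's key/value pair in one chunk, which is what keeps the per-layer aggregation cost at $O(N p d^2)$ rather than pairwise in the token count.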

3. Hidden Scene State Representation

ZipMap "zips" $N$ images into a compact scene representation stored in the fast weights of each layer. Letting $x_1, \dots, x_N$ be per-frame token embeddings after local attention, hidden-state aggregation is defined as:

$s = \text{Zip}(\{x_1, \dots, x_N\}) \triangleq \{W^{(1)}, \dots, W^{(L)}\}$

where $W^{(\ell)}$ is the MLP fast-weight state after the TTT update at layer $\ell$. Each block executes:

  • $W^{(\ell)} \leftarrow \text{TTTUpdate}(W^{(\ell-1)}; \{k_{i,j}, v_{i,j}\})$, where $k_{i,j}$ and $v_{i,j}$ are projected from $x_i[j]$,
  • The gradient $\nabla_W \mathcal{L}$ is aggregated over all tokens, orthonormalized, and applied,
  • At the final block, the sequence $\{W^{(\ell)}\}$ forms a scene state of size $O(L d^2)$.

Novel-view query rays are processed efficiently by applying $\hat{W}^{(\ell)}$ to tokenized queries without further gradient updates.
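Querying a novel view then reduces to forward applications of the frozen fast weights. A minimal sketch, assuming a simple residual stack and omitting the gating and RMS-norm details (all names here are illustrative, not the paper's API):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def apply_scene_state(scene_state, query_tokens):
    """Run query ray tokens through the zipped per-layer fast weights.

    scene_state: list of (w1, w2, w3) tuples, one per layer -- the state s.
    query_tokens: (m, d) array of tokenized novel-view rays.
    No gradient updates occur: the scene state is read-only at query time.
    """
    x = query_tokens
    for w1, w2, w3 in scene_state:
        out = (silu(x @ w1.T) * (x @ w3.T)) @ w2.T   # o'_i = f_W(q_i)
        x = x + out                                  # residual; gating omitted
    return x
```

Because no fast-weight update occurs, the cost of a query is $O(L m d^2)$ for $m$ query tokens, regardless of how many input frames were zipped.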

4. Computational Complexity and Training Workflow

ZipMap's design delivers strictly linear runtime in $N$:

  • Prior approaches: $T_{\text{quad}}(N) = O((N p)^2 d)$,
  • ZipMap: $T_{\text{zip}}(N) = O(N p^2 d + L N p d^2 + L p d^2)$, assuming $p$ and $L$ constant with respect to $N$.
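As a quick arithmetic check of these two expressions (the token count $p$, width $d$, and depth $L$ below are illustrative values, not the paper's):

```python
def t_quad(n, p=196, d=768):
    """Global self-attention cost model: O((N p)^2 d)."""
    return (n * p) ** 2 * d

def t_zip(n, p=196, d=768, layers=24):
    """ZipMap cost model: local attention + per-layer TTT update + apply."""
    return n * p * p * d + layers * n * p * d * d + layers * p * d * d

# Doubling N quadruples the quadratic cost but roughly doubles ZipMap's.
for n in (100, 200, 400):
    print(n, t_quad(n) / t_quad(100), round(t_zip(n) / t_zip(100), 2))
```

The constant $L p d^2$ apply term is negligible next to the per-frame terms, so the ZipMap ratio stays just under the ideal linear factor.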

Training loss combines several geometric, photometric, and camera alignment objectives:

$\mathcal{L} = \mathcal{L}_{\text{point}} + \mathcal{L}_{\text{depth}} + w_c \mathcal{L}_{\text{cam}} + \mathcal{L}_{\text{color}}^t + \mathcal{L}_{\text{depth}}^t + \mathcal{L}_{\text{point-normal}} + \mathcal{L}_{\text{depth-grad}}$

Components include mean point error, depth error, camera pose error, and RGB/depth supervision for novel views (MSE and LPIPS), along with normal and depth gradient matching penalties.

In a single forward pass, the model executes tokenization, local window attention, TTT fast-weight updates, and prediction heads (pose, depth, confidence, points), outputting the final fast-weight state together with the scene predictions.
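That single-pass flow can be sketched end to end. Everything below is illustrative scaffolding under stated assumptions, not the paper's code: the local attention is a plain per-frame softmax attention (rotary embeddings omitted), the TTT step is a simplified plain gradient step on $W_2$ only with values set to keys, and the prediction heads are left as a comment.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def local_attention(frame_tokens):
    """Per-frame softmax self-attention (rotary embeddings omitted)."""
    scores = frame_tokens @ frame_tokens.T / np.sqrt(frame_tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return frame_tokens + attn @ frame_tokens

def zip_forward(frames, n_layers=4, hidden=32, lr=1e-3, seed=0):
    """frames: list of (p, d) token arrays. Returns (scene_state, tokens)."""
    rng = np.random.default_rng(seed)
    d = frames[0].shape[1]
    xs = [local_attention(f) for f in frames]   # step 1: local window attention
    tokens = np.concatenate(xs)                 # all N*p tokens in one chunk
    scene_state = []
    for _ in range(n_layers):
        w1 = rng.standard_normal((hidden, d)) * 0.1
        w2 = rng.standard_normal((d, hidden)) * 0.1
        w3 = rng.standard_normal((hidden, d)) * 0.1
        # step 2: one large-chunk TTT step (plain gradient step on W2 only,
        # standing in for the full Newton-Schulz-normalized update; v = k here)
        gate = silu(tokens @ w1.T) * (tokens @ w3.T)
        w2 = w2 + lr * tokens.T @ gate          # descent on -sum f_W(k)^T v
        scene_state.append((w1, w2, w3))
        # step 3: apply the updated fast weights for global context
        tokens = tokens + (silu(tokens @ w1.T) * (tokens @ w3.T)) @ w2.T
    # Prediction heads (pose, depth, confidence, points) would read `tokens` here.
    return scene_state, tokens
```

The returned `scene_state` plays the role of the zipped representation $\{W^{(1)}, \dots, W^{(L)}\}$, and `tokens` is what the heads would consume.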

5. Empirical Performance

ZipMap attains state-of-the-art accuracy while significantly reducing runtime on large image sets. Performance comparison (NVIDIA H100, $N = 750$ frames):

| Model | Complexity | Runtime (s) |
|---|---|---|
| VGGT | $O(N^2)$ | 200.36 |
| $\pi^3$ | $O(N^2)$ | 151.16 |
| CUT3R | $O(N)$ | 31.25 |
| TTT3R | $O(N)$ | 31.20 |
| ZipMap | $O(N)$ | 9.999 |

Additional benchmark results demonstrate:

  • ScanNetV2 absolute trajectory error (ATE): ZipMap 0.034 m vs. $\pi^3$ 0.030 m vs. VGGT 0.035 m,
  • KITTI video depth: AbsRel = 0.057 (ZipMap) vs. 0.038 ($\pi^3$) vs. 0.073 (VGGT),
  • RealEstate10K camera AUC@5°: 53.34% (ZipMap) vs. 63.10% ($\pi^3$) vs. 38.71% (VGGT),
  • DTU point-map: accuracy 1.228 mm, completeness 1.649 mm, normal consistency 0.675, matching or surpassing prior methods (Jin et al., 4 Mar 2026).

6. Real-Time Querying, Streaming, and Extensions

Because the aggregated scene state is stored in fast weights, novel-view querying after initial aggregation consists solely of apply steps, enabling constant-time ($\sim$100 FPS) querying independent of the size of the input collection. Streaming reconstruction is supported by processing each new frame $I_t$ via $W^{(t)} \leftarrow \text{TTTUpdate}(W^{(t-1)}; \{k_{t,j}, v_{t,j}\})$, assigning pose and local geometry without recurrent error accumulation and in $O(1)$ per-frame time. This suggests improved scalability and robustness for online scene reconstruction scenarios.
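Streaming use amounts to a left fold over incoming frames, with one fast-weight update per frame. The sketch below uses a deliberately simplified stand-in for TTTUpdate (a plain gradient step on $W_2$, with a single fast-weight layer); the point it illustrates is that each step touches only the current frame's $p$ tokens, so per-frame cost is constant in $t$.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def stream_update(state, frame_keys, frame_values, lr=1e-3):
    """W^(t) <- TTTUpdate(W^(t-1); {k_{t,j}, v_{t,j}}): O(1) work per frame."""
    w1, w2, w3 = state
    gate = silu(frame_keys @ w1.T) * (frame_keys @ w3.T)
    w2 = w2 + lr * frame_values.T @ gate   # gradient step on -sum f_W(k)^T v
    return (w1, w2, w3)

rng = np.random.default_rng(0)
d, h, p = 16, 32, 8
state = (rng.standard_normal((h, d)) * 0.1,
         rng.standard_normal((d, h)) * 0.1,
         rng.standard_normal((h, d)) * 0.1)
for t in range(5):                         # frames arrive one at a time
    frame_tokens = rng.standard_normal((p, d))
    state = stream_update(state, frame_tokens, frame_tokens)
```

Pose and geometry for each frame would be read out from the same apply path used for novel-view queries, so the rolled-up state never needs to revisit earlier frames.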

Empirically, streaming ZipMap outperforms previous linear-time sequential models (CUT3R, TTT3R) on camera, depth, and point-map benchmarks, even with shorter context lengths (Jin et al., 4 Mar 2026).

7. Significance and Future Implications

By leveraging TTT-based fast-weight state aggregation, ZipMap establishes a scalable paradigm for large-scale, high-fidelity 3D reconstruction and scene understanding. Its bidirectional, linear-time aggregation, efficient scene state representation, and applicability to both batch and streaming settings open new directions for real-time and large-scale 3D perception pipelines using feed-forward architectures (Jin et al., 4 Mar 2026). A plausible implication is the potential for extending ZipMap-like architectures to even larger temporal or spatial scales, or to other modalities that benefit from rapid global context aggregation.
