ZipMap: Linear-Time 3D Reconstruction
- The paper introduces ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction by zipping hundreds of image frames into a compact scene state using TTT fast-weight layers.
- It overcomes the quadratic scalability bottleneck of transformer-based approaches by performing each TTT fast-weight update once per layer rather than per token pair, ensuring efficient processing and high-fidelity geometry reconstruction.
- Empirical results demonstrate competitive accuracy and significantly reduced runtime on benchmarks like ScanNetV2, KITTI, and DTU, enabling real-time novel-view querying and effective streaming reconstruction.
ZipMap is a stateful feed-forward model for 3D reconstruction from image collections that achieves linear-time, bidirectional scene processing via test-time training (TTT) fast-weight layers. Designed to overcome the quadratic scaling bottleneck of transformer-based approaches, ZipMap zips hundreds of input frames into a compact hidden scene state in a single forward pass, enabling high-fidelity geometry reconstruction and real-time novel-view querying with accuracy competitive with or superior to state-of-the-art quadratic-time models such as VGGT (Jin et al., 4 Mar 2026).
1. Motivation and Background
Transformer-based models have driven advances in multi-view 3D vision but suffer from the computational cost of global self-attention, which scales quadratically with the number of frames $N$. Methods such as VGGT exhibit $O(N^2)$ complexity, rendering them impractical for large-scale image collections. Sequential and streaming approaches exist but often degrade reconstruction quality due to their reduced capacity for bidirectional information aggregation. ZipMap addresses these limitations by introducing a feed-forward stateful architecture that achieves both linear runtime and bidirectional scene reasoning through rapid scene aggregation (Jin et al., 4 Mar 2026).
2. Model Architecture and Scene State Aggregation
ZipMap's architecture consists of a stack of identical Transformer-style blocks, each interleaving two operations:
- Local Window Attention: Within each input image (or ray-map), a self-attention mechanism with rotary positional embeddings aggregates spatially local patch tokens (a fixed-size patch array per view). This attention operates per-frame, so its cost is constant per image and independent of the number of frames $N$.
- Global Large-Chunk TTT Layer: Inspired by LaCT, ZipMap utilizes a SwiGLU-MLP as a "fast-weight" memory function. Parameters are updated using a global in-context associative memory built from all image tokens:
- Tokens are projected to keys $K$, values $V$, and queries $Q$.
- A virtual TTT reconstruction loss $\mathcal{L}(W; K, V)$, measuring how well the fast-weight function $f_W$ maps keys to values, is minimized via a single fast-weight update.
- The gradient $\nabla_W \mathcal{L}$ is normalized using a Newton–Schulz orthonormalization, and $W$ is renormalized (including regularization) before its application.
- The updated weights are applied to produce global context outputs $O = f_W(Q)$, which are gated and RMS-normalized.
Because the TTT update is performed once per layer (rather than once per token pair), ZipMap's hidden-state aggregation costs $O(N)$, compared to $O(N^2)$ for global self-attention layers.
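As a minimal sketch of such a large-chunk TTT step, assume the paper's SwiGLU-MLP fast weight is simplified to a single linear map $f_W(k) = Wk$ with reconstruction loss $\lVert WK - V\rVert^2$; the `newton_schulz` coefficients follow the widely used quintic iteration, and all names and hyperparameters here are illustrative, not the paper's:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthonormalize G with the quintic Newton-Schulz
    iteration (coefficients commonly used in fast-weight/optimizer work)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # scale so singular values <= 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def ttt_update(W, K, V, lr=0.01):
    """One large-chunk TTT step: a single gradient of the reconstruction
    loss over ALL tokens at once, orthonormalized, then applied."""
    G = (W @ K - V) @ K.T                  # gradient of ||W K - V||^2 (up to a constant)
    return W - lr * newton_schulz(G)

rng = np.random.default_rng(0)
d, n_tok = 16, 256                         # feature dim, total tokens across all frames
K = rng.standard_normal((d, n_tok))
V = rng.standard_normal((d, n_tok))
W = np.zeros((d, d))                       # fast-weight scene state

loss_before = np.linalg.norm(W @ K - V) ** 2
W = ttt_update(W, K, V)
loss_after = np.linalg.norm(W @ K - V) ** 2
```

The key point is that the update is issued once for the whole token chunk, so cost grows linearly with the number of tokens rather than quadratically.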
3. Hidden Scene State Representation
ZipMap "zips" images into a compact scene representation stored in the fast weights of each layer. Letting $X^{(\ell)}$ denote the per-frame token embeddings after local attention at layer $\ell$, hidden-state aggregation produces

$$O^{(\ell)} = f_{W^{(\ell)}}\big(Q^{(\ell)}\big),$$

where $W^{(\ell)}$ is the MLP fast-weight state after the TTT update at layer $\ell$. Each block executes:
- $W^{(\ell)}$ is updated from the pairs $(K^{(\ell)}, V^{(\ell)})$, where $K^{(\ell)}$ and $V^{(\ell)}$ are projected from $X^{(\ell)}$,
- The gradient is aggregated over all tokens, orthonormalized, and applied,
- At the final block, the sequence $\{W^{(\ell)}\}$ across layers forms a scene state whose size is fixed, independent of the number of input frames $N$.
Novel-view query rays are processed efficiently by applying $f_{W^{(\ell)}}$ to tokenized queries, without further gradient updates.
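A sketch of this apply-only querying, assuming a SwiGLU fast-weight state (the shapes, hidden size, and names below are illustrative, not the paper's):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_apply(state, q):
    """Forward-apply a SwiGLU-MLP fast-weight state to query tokens.
    No gradient step is taken, so the scene state is left untouched."""
    W1, W2, W3 = state
    return W3 @ (silu(W1 @ q) * (W2 @ q))

rng = np.random.default_rng(1)
d, h = 16, 64                               # token dim, fast-weight hidden dim
state = (0.1 * rng.standard_normal((h, d)),  # W1: gate projection
         0.1 * rng.standard_normal((h, d)),  # W2: value projection
         0.1 * rng.standard_normal((d, h)))  # W3: output projection

q = rng.standard_normal((d, 32))            # 32 tokenized novel-view query rays
out = swiglu_apply(state, q)
```

Because querying is a pure forward pass through the frozen state, its cost depends only on the number of query rays, not on how many frames were zipped into the state.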
4. Computational Complexity and Training Workflow
ZipMap's design delivers strictly linear runtime in the number of frames $N$:
- Prior approaches: $O(N^2)$,
- ZipMap: $O(N)$, assuming the per-frame token count and model width are constant with respect to $N$.
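A back-of-envelope cost comparison makes the gap concrete; the per-frame token count `T`, width `d`, and fast-weight hidden size `h` below are illustrative constants, not the paper's:

```python
def global_attention_cost(N, T=196, d=768):
    """Pairwise attention over all N*T tokens: quadratic in N."""
    return (N * T) ** 2 * d

def zipmap_cost(N, T=196, d=768, h=3072):
    """Per-frame local attention (O(T^2) each) plus one chunked
    fast-weight update/apply over all N*T tokens: linear in N."""
    return N * (T ** 2 * d) + N * T * d * h

for N in (10, 100, 1000):
    ratio = global_attention_cost(N) / zipmap_cost(N)
    print(f"N={N:5d}  quadratic/linear cost ratio = {ratio:.1f}")
```

Under these constants the ratio itself grows linearly in $N$: scaling the collection 100x widens the quadratic model's disadvantage by 100x.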
The training loss combines several geometric, photometric, and camera-alignment objectives: mean point error, depth error, camera pose error, RGB/depth supervision for novel views (MSE and LPIPS), and normal and depth-gradient matching penalties.
A single forward pass executes tokenization, local window attention, TTT gradient updates, and prediction heads (pose, depth, confidence, points), outputting the final fast-weight state and the scene predictions.
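The single pass can be sketched as follows, with stub components standing in for the real tokenizer, attention, TTT update (orthonormalization omitted for brevity), and heads; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, d = 8, 32, 16                              # frames, tokens per frame, dim

def local_window_attention(x):
    """Stub: per-frame token mixing, O(T^2) per frame."""
    return x + 0.01 * rng.standard_normal(x.shape)

def ttt_update(W, K, V, lr=0.01):
    """Stub: one chunked fast-weight step on a linear fast weight."""
    return W - lr * (W @ K - V) @ K.T / K.shape[1]

def heads(tokens):
    """Stub prediction heads reading off the final tokens."""
    return {"points": tokens[:3], "depth": tokens[3], "conf": tokens[4]}

tokens = rng.standard_normal((d, N * T))         # tokenized frames (+ ray maps)
states = [np.zeros((d, d)) for _ in range(4)]    # one fast-weight state per block
for l in range(4):                               # identical blocks
    tokens = local_window_attention(tokens)      # local, per-frame mixing
    K = V = Q = tokens                           # illustrative identity projections
    states[l] = ttt_update(states[l], K, V)      # zip all frames into the state
    tokens = tokens + states[l] @ Q              # global context via fast weights
preds = heads(tokens)                            # pose/depth/confidence/points
```

The final `states` list is the compact scene representation; `preds` are the per-token scene outputs.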
5. Empirical Performance
ZipMap attains competitive accuracy while significantly reducing runtime on large image sets. Runtime comparison (NVIDIA H100, same input frame set for all models):
| Model | Complexity | Runtime (s) |
|---|---|---|
| VGGT | $O(N^2)$ | 200.36 |
|  |  | 151.16 |
| CUT3R | $O(N)$ | 31.25 |
| TTT3R | $O(N)$ | 31.20 |
| ZipMap | $O(N)$ | 9.999 |
Additional benchmark results demonstrate:
- ScanNetV2 absolute trajectory error (ATE): ZipMap 0.034 m vs. VGGT 0.035 m, with a competing baseline at 0.030 m,
- KITTI video depth: AbsRel 0.057 (ZipMap) vs. 0.073 (VGGT), with a competing baseline at 0.038,
- RealEstate10K camera AUC@5°: 53.34% (ZipMap) vs. 38.71% (VGGT), with a competing baseline at 63.10%,
- DTU point-map: accuracy 1.228 mm, completeness 1.649 mm, normal consistency 0.675, matching or surpassing prior methods (Jin et al., 4 Mar 2026).
6. Real-Time Querying, Streaming, and Extensions
Because the aggregated scene state is stored in fast weights, novel-view querying after initial aggregation consists solely of forward apply steps, enabling constant-time querying (100 FPS) independent of the size of the input collection. Streaming reconstruction is supported by processing each new frame with a single fast-weight update, assigning pose and local geometry without recurrent error accumulation and in constant per-frame time. This suggests improved scalability and robustness for online scene reconstruction scenarios.
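A minimal streaming sketch under these assumptions (a linear fast weight standing in for the SwiGLU state; projections, initialization, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
W = 0.1 * rng.standard_normal((d, d))        # scene state from frames so far

def query(W, q):
    """Apply-only read of the scene state: cost independent of history."""
    return W @ q

def absorb_frame(W, frame_tokens, lr=0.01):
    """Streaming step: read pose/geometry against the CURRENT state,
    then fold the new frame in with one fast-weight update."""
    pred = query(W, frame_tokens)            # localize the frame first
    K = V = frame_tokens                     # illustrative identity projections
    G = (W @ K - V) @ K.T / K.shape[1]       # chunked reconstruction gradient
    return W - lr * G, pred

for _ in range(5):                           # five incoming frames
    frame = rng.standard_normal((d, 32))     # 32 tokens for the new frame
    W, pred = absorb_frame(W, frame)
```

Each incoming frame costs the same fixed amount of work regardless of how many frames the state has already absorbed, which is what makes the per-frame time constant.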
Empirically, streaming ZipMap outperforms previous linear-time sequential models (CUT3R, TTT3R) on camera, depth, and point-map benchmarks, even with shorter context lengths (Jin et al., 4 Mar 2026).
7. Significance and Future Implications
By leveraging TTT-based fast-weight state aggregation, ZipMap establishes a scalable paradigm for large-scale, high-fidelity 3D reconstruction and scene understanding. Its bidirectional, linear-time aggregation, efficient scene state representation, and applicability to both batch and streaming settings open new directions for real-time and large-scale 3D perception pipelines using feed-forward architectures (Jin et al., 4 Mar 2026). A plausible implication is the potential for extending ZipMap-like architectures to even larger temporal or spatial scales, or to other modalities that benefit from rapid global context aggregation.