InfiniteVGGT: Infinite-Horizon 3D Scene Transformer

Updated 8 January 2026
  • InfiniteVGGT is a causal visual geometry transformer that uses a fixed-size, diversity-pruned key-value cache to achieve infinite-horizon, real-time 3D scene understanding.
  • It incorporates an adaptive rolling memory mechanism that maintains an immutable anchor set and prunes redundant information, ensuring stable and efficient streaming inference.
  • Empirical evaluations on long-term 3D reconstruction benchmarks demonstrate InfiniteVGGT’s ability to maintain bounded error and minimal drift over unbounded frame sequences.

InfiniteVGGT is a causal visual geometry transformer architecture that achieves infinite-horizon, real-time 3D scene understanding by introducing a rolling, bounded memory for multi-head self-attention. It addresses the fundamental scalability barrier of the original Visual Geometry Grounded Transformer (VGGT), which, while effective in batch mode, is unsuitable for persistent, live operation due to unbounded growth in memory and compute. InfiniteVGGT integrates a fixed-size, diversity-pruned key-value (KV) cache into the causal transformer framework, enabling efficient streaming inference without catastrophic drift while remaining compatible with hardware-efficient attention implementations such as FlashAttention. Its efficacy is demonstrated on the Long3D benchmark, comprising sequences of up to 10,000 frames, setting a new standard for long-term stability in large-scale 3D geometry estimation (Dinya et al., 20 Nov 2025, Yuan et al., 5 Jan 2026).

1. Background and Model Evolution

The original VGGT operates in a batch regime, alternating frame-level and global multi-head self-attention across deep stacks (24 layers) and decoding outputs for all frames in a batch simultaneously. For input images $\{I_1,\dots,I_N\}$, the outputs $(\mathbf{g}_i, D_i, P_i, T_i)_{i=1}^N$ (camera pose, depth, 3D points, feature tracks) are produced via:

$$(\mathbf{g}_i, D_i, P_i, T_i)_{i=1}^N = \phi\big(\mathcal{F}_\theta(\{I_i\}), \mathcal{G}_\theta(\{I_i\})\big)$$

Batch models do not scale to continuous streams; storing all past keys and values leads to $O(t)$ growth in causal streaming mode. Naive streaming architectures, using a growing KV cache,

$$\mathcal{C}_t = \{(K^{(l)}, V^{(l)})\}_{l=1}^L, \quad K^{(l)} = [K^{(l)}_1, \dots, K^{(l)}_t], \quad V^{(l)} = [V^{(l)}_1, \dots, V^{(l)}_t]$$

ultimately exhaust available memory. Previous streaming approaches also experience significant long-term alignment drift, preventing reliable 3D reconstruction over extended sequences (Yuan et al., 5 Jan 2026).
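
To make the scalability problem concrete, the following is a minimal sketch (not from the papers) of such a naive per-layer streaming KV cache, whose footprint grows linearly with the number of processed frames; the class and method names and the tensor shapes are illustrative assumptions.

```python
import torch

class NaiveStreamingKVCache:
    """Per-layer KV cache that concatenates every new frame's keys/values.

    Memory grows as O(t) with the number of processed frames t, which is
    exactly what InfiniteVGGT's bounded rolling memory is designed to avoid.
    """

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers    # one growing tensor per layer
        self.values = [None] * num_layers

    def append(self, layer: int, k_t: torch.Tensor, v_t: torch.Tensor) -> None:
        # k_t, v_t: (tokens_per_frame, head_dim) for the newest frame
        if self.keys[layer] is None:
            self.keys[layer], self.values[layer] = k_t, v_t
        else:
            self.keys[layer] = torch.cat([self.keys[layer], k_t], dim=0)
            self.values[layer] = torch.cat([self.values[layer], v_t], dim=0)

    def size(self, layer: int) -> int:
        return 0 if self.keys[layer] is None else self.keys[layer].shape[0]
```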

2. Bounded Rolling Memory and Diversity-Driven Pruning

InfiniteVGGT introduces a rolling memory via a fixed-budget, adaptive KV cache,

$$\mathcal{C}_t^{(l,h)} = \mathcal{C}_{\text{anc}}^{(l,h)} \cup \mathcal{C}_{t,\text{cand}}^{(l,h)}$$

where each layer $l$ and head $h$ maintains:

  • Immutable anchor set ($\mathcal{C}_{\text{anc}}$): tokens from the initial frames, guaranteeing a global frame of reference.
  • Mutable candidate set ($\mathcal{C}_{t,\text{cand}}$): tokens from frames $2$ to $t$.

When the candidate set exceeds its budget $B^{(l,h)}$, tokens are pruned by key-space diversity:

  • Normalize keys: $\hat{k}_i = k_i / \|k_i\|$
  • Mean key: $\mu^{(l,h)} = \frac{1}{|\hat{\mathcal{K}}_{t,\text{cand}}^{(l,h)}|} \sum_{\hat{k} \in \hat{\mathcal{K}}_{t,\text{cand}}^{(l,h)}} \hat{k}$
  • Diversity score per key: $s_{\mathrm{div}}^{(l,h)}(\hat{k}_i) = -\cos(\mu^{(l,h)}, \hat{k}_i)$

The most informative keys, i.e., those farthest from the mean direction, are retained. Token budgets $B^{(l,h)}$ are dynamically allocated across layers and heads via a softmax over layer-averaged diversity scores, so information-rich layers receive greater capacity.

High-level pseudocode for the cache update:

for each layer l = 1...L, head h = 1...H:
    C_cand ← C_cand^{(l,h)} ∪ {(K_t^{(l,h)}, V_t^{(l,h)})}
    if |C_cand| > B^{(l,h)}:
        compute s_div^{(l,h)}(k) for each key k in C_cand
        keep the top B^{(l,h)} entries by s_div
    C_cand^{(l,h)} ← C_cand
end for
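
For concreteness, the following is a minimal PyTorch sketch of the diversity-based pruning and budget allocation described above; the function names, tensor shapes, and the exact form of the budget allocation are assumptions for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def diversity_scores(keys: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity of each key to the mean normalized key.

    keys: (n, d) candidate keys for one (layer, head). A higher score means
    the key is farther from the mean direction, i.e. more diverse.
    """
    k_hat = F.normalize(keys, dim=-1)            # k_i / ||k_i||
    mu = F.normalize(k_hat.mean(dim=0), dim=-1)  # mean key direction
    return -(k_hat @ mu)                         # s_div = -cos(mu, k_i)

def prune_candidates(keys, values, budget):
    """Keep the `budget` most diverse (key, value) pairs."""
    if keys.shape[0] <= budget:
        return keys, values
    idx = torch.topk(diversity_scores(keys), k=budget).indices
    return keys[idx], values[idx]

def allocate_budgets(avg_diversity, total_budget):
    """Split a global token budget across (layer, head) slots by a softmax
    over their average diversity scores -- one plausible reading of the
    paper's softmax over layer-averaged diversity."""
    weights = torch.softmax(avg_diversity, dim=0)     # flat (L*H,) vector
    return (weights * total_budget).round().long()
```

In a streaming loop, one would append the newest frame's keys/values to each candidate set, then call prune_candidates per (layer, head) with the budget returned by allocate_budgets; the anchor set is never touched.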

3. Causal Attention and FlashAttention Compatibility

Each causal temporal layer implements standard MHSA with causal masking. For queries $Q \in \mathbb{R}^{n \times d}$, keys $K \in \mathbb{R}^{m \times d}$, and values $V \in \mathbb{R}^{m \times d}$,

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{Q K^\top}{\sqrt{d}} + M_{\text{causal}} \right) V$$

The rolling memory keeps the per-layer cache size fixed and small, so FlashAttention (blockwise attention with IO-aware kernels) operates efficiently in $O(B^2)$ time and $O(B)$ memory. Because pruning is attention-agnostic (it does not require computing the attention matrix), computational and memory overhead is kept minimal (Yuan et al., 5 Jan 2026).
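
As an illustration, here is a minimal sketch of streaming attention over the bounded cache using PyTorch's scaled_dot_product_attention, which dispatches to fused attention kernels (e.g. FlashAttention or memory-efficient attention) when supported; the tensor shapes and the assumption that all cached tokens precede the current frame's queries are mine, not taken from the papers.

```python
import torch
import torch.nn.functional as F

def streaming_attention(q_new, k_cache, v_cache, k_new, v_new):
    """Attend the newest frame's queries to the bounded KV cache plus the
    frame's own keys/values. Cached tokens all come from earlier frames, so
    they are visible to every new query; causality only constrains new tokens.

    q_new:            (batch, heads, n_new, d)
    k_cache, v_cache: (batch, heads, B, d)   -- fixed-size rolling memory
    k_new, v_new:     (batch, heads, n_new, d)
    """
    k = torch.cat([k_cache, k_new], dim=2)
    v = torch.cat([v_cache, v_new], dim=2)

    # Boolean mask: True = attend. New queries see all B cached tokens and
    # the causal prefix of the new tokens.
    n_new, B = q_new.shape[2], k_cache.shape[2]
    mask = torch.ones(n_new, B + n_new, dtype=torch.bool, device=q_new.device)
    mask[:, B:] = torch.tril(torch.ones(n_new, n_new, dtype=torch.bool,
                                        device=q_new.device))

    # Fused kernel is selected automatically when the backend supports it.
    return F.scaled_dot_product_attention(q_new, k, v, attn_mask=mask)
```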

4. InfiniteVGGT for Semantic SLAM: Practical Pipeline

Downstream, the rolling-memory InfiniteVGGT architecture enables memory-efficient semantic SLAM pipelines (Dinya et al., 20 Nov 2025):

  • Sliding window: Incoming frames are processed in overlapping blocks of $W$ frames, with all attention and computation restricted to this window. Earlier blocks are condensed to fixed-size submaps. The memory footprint is $M(W) = O(d \cdot W) + M_0$; latency scales linearly in $W$.
  • Submap alignment: Each window/block yields a 3D submap $S_i$. Submap $S_i$ is aligned to $S_{i-1}$ via a similarity transform (estimated in closed form with the Umeyama least-squares fit; optionally refined by photometric and geometric reprojection error) and fused into a global map (see the sketch after this list).
  • Instance tracking: 2D instance masks (from YOLOv9e) are linked into tracklets and aggregated to form 3D objects. Cross-block re-identification compares centroid sequences using Chamfer distance to maintain consistent object IDs.
  • Change detection: Each object has a confidence score and last-seen timestamp; confidence decays if an expected object is not detected, supporting dynamic world modeling.
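
As a concrete reference for the submap-alignment step, the following is a minimal NumPy sketch of the closed-form Umeyama similarity fit between corresponding 3D points of adjacent submaps; correspondence extraction and the optional photometric/geometric refinement are omitted, and the function name is illustrative.

```python
import numpy as np

def umeyama_similarity(src: np.ndarray, dst: np.ndarray):
    """Closed-form similarity transform (scale s, rotation R, translation t)
    minimizing ||dst - (s * R @ src + t)||^2 over corresponding points.

    src, dst: (N, 3) corresponding points from submaps S_i and S_{i-1}.
    Returns (s, R, t) such that dst ≈ s * (R @ src.T).T + t.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    x, y = src - mu_src, dst - mu_dst

    cov = y.T @ x / src.shape[0]                  # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)

    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid reflections
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (x ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Applied block by block, this chains each new submap into the global map frame; any residual misalignment would then be handled by the optional reprojection-based refinement mentioned above.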

5. Empirical Evaluation: Long-Term 3D Reconstruction

The Long3D benchmark supports rigorous, long-horizon evaluation: five scenes, 2,000–10,000 frames each, annotated with dense global point clouds (Yuan et al., 5 Jan 2026).

| Scene | #Frames | Method | Acc. (↓) | Comp. (↓) | NC (↑) | CD (↓) |
|---|---|---|---|---|---|---|
| Classroom | 2,128 | CUT3R | 0.496/0.374 | 0.085/0.036 | 0.520/0.525 | 0.291 |
| | | TTT3R | 0.396/0.319 | 0.081/0.035 | 0.530/0.540 | 0.239 |
| | | InfiniteVGGT | 0.357/0.298 | 0.057/0.033 | 0.576/0.612 | 0.207 |
| Dormitory | 4,208 | CUT3R | 1.800/1.372 | 0.404/0.090 | 0.501/0.495 | 1.102 |
| | | TTT3R | 1.965/1.749 | 0.329/0.100 | 0.515/0.509 | 1.147 |
| | | InfiniteVGGT | 1.438/1.159 | 0.575/0.089 | 0.526/0.538 | 1.007 |

InfiniteVGGT consistently reduces Chamfer Distance (CD) and increases normal consistency (NC) versus streaming and offline baselines. On sequences longer than several thousand frames, baselines suffer significant drift or run out of memory, while InfiniteVGGT maintains bounded error.

On public datasets (7-Scenes, NRGBD), InfiniteVGGT holds near-constant error (Acc. $\sim 0.04$, NC $\sim 0.65$ at 500 frames), whereas all unpruned online baselines exhaust memory (Yuan et al., 5 Jan 2026).

6. Design Implications and Limitations

InfiniteVGGT achieves stability by retaining long-range, key-diversity-maximizing tokens and an immutable anchor set, which establishes a fixed global coordinate frame and suppresses drift. Adaptive budget allocation concentrates representation on temporally dynamic layers, maintaining expressivity despite bounded size.

Nonetheless, in extremely large and geometrically diverse scenes, completeness (Comp.) can fall behind that of some baselines. This suggests that the most recent, rapidly changing tokens dominate the fixed budget, possibly at the expense of rarely observed regions. Directions for improvement include semantic redundancy measures, learnable eviction policies, and dynamic budgets responsive to scene-change rates. Extending the approach to heterogeneous sensor fusion (e.g., RGB+LiDAR or omnidirectional cameras) is a further area for investigation (Yuan et al., 5 Jan 2026).


In summary, InfiniteVGGT converts the batch-centric VGGT architecture into a truly infinite-horizon, streaming visual geometry transformer, achieving $O(1)$ memory and bounded latency over unbounded frame sequences via a rolling, key-diversity-pruned KV cache. This innovation establishes a new operational baseline for persistent 3D perception and mapping in real-time, long-duration applications (Dinya et al., 20 Nov 2025, Yuan et al., 5 Jan 2026).
