Geometry-aware Cached Token Merging

Updated 1 June 2026

Geometry-aware cached token merging exploits spatial redundancy to achieve up to 10× speedup in vision transformers.
It combines inter-layer and temporal caching with hierarchical merging, guided by geometric and semantic criteria, to cut the cost of quadratic self-attention.
The approach maintains downstream task performance while enabling efficient 3D reconstruction, robotics, and multi-view processing under limited resources.

Geometry-aware cached token merging is a class of algorithmic strategies that accelerate and scale vision transformers by exploiting geometric redundancy in visual data, particularly in multi-view, video, or depth-augmented contexts, with minimal impact on downstream task fidelity. Guided by spatial or geometric priors, these methods merge redundant tokens according to geometric structure (e.g., local similarities, depth regions, spatial consistency) and reuse merge decisions or token representations across layers and/or frames, which significantly reduces computational cost and GPU memory. Such approaches are core to models like LiteVGGT and DepthCache, with speedups of up to 10× that enable efficient 3D reconstruction or closed-loop robotic control on large-scale visual inputs with thousands of frames (Shu et al., 4 Dec 2025, Li et al., 11 Mar 2026). The family combines region selection, hierarchical merging, inter-layer or temporal caching, and task-specific gating, each tailored to the spatial and semantic structure of vision-language, geometric, or robotics models.

1. Rationale and Geometric Redundancy Exploitation

Self-attention models for 3D perception and multi-frame vision tasks (e.g., VGGT, VLA, navigation) operate on token sequences whose length scales as the number of frames $N$ times the tokens per frame $T$, so compute and memory costs grow quadratically as $O((NT)^2)$. For large $N$ (hundreds to thousands of frames), this creates severe bottlenecks: runtimes of minutes per forward pass and GPU memory demands far beyond commodity accelerators (e.g., >70 GiB) (Shu et al., 4 Dec 2025). Such inputs are nevertheless highly redundant, because:

3D geometry induces spatial correlation: Tokens corresponding to overlapping 3D surface regions (across or within frames) have highly similar embeddings due to geometric alignment.
Inter-layer similarity stability: The pairwise similarity structure between tokens varies slowly across transformer layers, implying that merge/grouping decisions remain valid for multiple layers or timesteps.
Task-driven attention locality: Certain semantic or geometric regions (e.g., sharp edges, near-field workspace, salient objects) demand higher token resolution; others (textureless regions, distant background, low-confidence or low-attention areas) tolerate aggressive merging.

This motivates the class of geometry-aware cached merging algorithms, which are parameterized by a geometric or semantic criterion for determining token importance, a mechanism for spatially/temporally consistent merging, and a strategy for caching and reuse.

2. Formal Algorithmic Components and Notation

Geometry-aware cached token merging modules typically operate as follows:

Token set $\mathcal{X} = \{x_1,\ldots,x_{NT}\}$: Input embeddings from $N$ frames, $T$ tokens each, dimension $d$.
Cosine similarity $S(i,j)$: $S(i,j) = \frac{x_i^\top x_j}{\|x_i\|\|x_j\|}$, measuring semantic or geometric alignment.
Importance score $\psi_i$: A blend of, e.g., local image gradient ($\Psi_g$) and token variance ($\Psi_v$); it can also be derived from attention, depth, or confidence (Shu et al., 4 Dec 2025, Chen et al., 18 Nov 2025).
Region partitioning and anchoring: Tokens are divided into:
- Critical geometry set $\mathcal{G}$ (the top-ranked tokens per frame by $\psi_i$) to preserve,
- Anchor set $\mathcal{A}$ for merging,
- Source set $\mathcal{S}$ as merge candidates.
Spatial selection and grid anchoring strategies vary by method.
Merge operation: Each source token $x_s$, $s \in \mathcal{S}$, is assigned to an anchor $x_a$, $a \in \mathcal{A}$ (typically the most similar one under $S$), updating $x_a$ to the mean of itself and its assigned sources. Only $\mathcal{G}$ and $\mathcal{A}$ tokens propagate.

Merging steps are interleaved with attention and transformer blocks; unmerge operations replicate anchor outputs to the merged set prior to layers needing the original layout (e.g., fine-level frame attention or dense prediction).
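The partition, merge, and unmerge steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the procedure of any cited method: the function names, the stride-based anchor selection, the nearest-anchor assignment, and the toy scores are all assumptions made for the example, and a practical implementation would use batched tensor operations.

```python
import math

def cosine(u, v):
    """Cosine similarity S(i, j) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def partition(scores, n_critical, anchor_stride):
    """Split token indices into critical, anchor, and source sets.

    Assumption: the n_critical highest scores are protected and every
    anchor_stride-th remaining token becomes an anchor.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    critical = sorted(ranked[:n_critical])
    protected = set(critical)
    rest = [i for i in range(len(scores)) if i not in protected]
    anchors = rest[::anchor_stride]
    anchor_set = set(anchors)
    sources = [i for i in rest if i not in anchor_set]
    return critical, anchors, sources

def merge(tokens, anchors, sources):
    """Assign each source to its most similar anchor; anchors become group means."""
    assign = {s: max(anchors, key=lambda a: cosine(tokens[s], tokens[a]))
              for s in sources}
    groups = {a: [a] for a in anchors}
    for s, a in assign.items():
        groups[a].append(s)
    merged = {
        a: [sum(tokens[m][k] for m in members) / len(members)
            for k in range(len(tokens[a]))]
        for a, members in groups.items()
    }
    return merged, assign

def unmerge(outputs, assign):
    """Replicate each anchor's output to the sources that were merged into it."""
    restored = dict(outputs)
    for s, a in assign.items():
        restored[s] = list(outputs[a])
    return restored

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
scores = [0.1, 0.2, 0.1, 0.3, 0.9]  # toy importance scores (hypothetical values)
critical, anchors, sources = partition(scores, n_critical=1, anchor_stride=2)
merged, assign = merge(tokens, anchors, sources)
kept = {i: tokens[i] for i in critical}
kept.update(merged)                  # only critical and anchor tokens propagate
restored = unmerge(kept, assign)     # original layout for dense prediction
print(len(tokens), "->", len(kept), "->", len(restored))  # 5 -> 3 -> 5
```

In this toy run the attention blocks would see three tokens instead of five, and `assign` is exactly the piece of state that a caching scheme can store and reuse.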

3. Caching and Reuse Strategies

A central advantage of geometry-aware merging is the ability to cache grouping or merge assignments across space, layers, or time:

Inter-layer caching: In LiteVGGT, transformer layers are grouped into intervals of fixed size $K$; merge indices and token partitions are computed once per interval and reused for all $K$ layers in it, leveraging the slow evolution of token similarity. For $L$ layers, only $\lceil L/K \rceil$ recomputations are needed (Shu et al., 4 Dec 2025).
Temporal caching: DepthCache and VLN-Cache extend this, reusing merged token groupings across consecutive frames. Only a fraction of merges is applied at each time step, ensuring smooth token set evolution and temporal consistency (Li et al., 11 Mar 2026, Zheng et al., 7 Mar 2026).
Dynamic reuse gating: Semantic and visual gates determine whether to reuse or refresh tokens. For VLN, view-aligned remapping uses camera intrinsics and pose to align tokens across frames. A semantic saliency filter uses cross-attention weights to prohibit reuse for tokens whose task relevance is high or rapidly changing (Zheng et al., 7 Mar 2026).
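The inter-layer caching schedule can be illustrated with a short loop that recomputes merge assignments only at the first layer of each interval. This is a sketch under stated assumptions: `run_with_cached_merges`, `pair_neighbours`, and `identity_layer` are hypothetical names, the two helpers are toy stand-ins for similarity matching and a transformer block, and the layer count and interval size are illustrative rather than taken from any cited model.

```python
def run_with_cached_merges(tokens, num_layers, interval,
                           compute_assignment, apply_layer):
    """Recompute merge assignments once per interval and reuse them in between."""
    recomputations = 0
    assignment = None
    for layer in range(num_layers):
        if layer % interval == 0:        # first layer of each interval
            assignment = compute_assignment(tokens)
            recomputations += 1
        tokens = apply_layer(tokens, assignment)  # cached assignment reused here
    return tokens, recomputations

def pair_neighbours(tokens):
    """Toy stand-in for similarity matching: group tokens in adjacent pairs."""
    return [i // 2 for i in range(len(tokens))]

def identity_layer(tokens, assignment):
    """Toy stand-in for a transformer block operating on merged groups."""
    return tokens

_, n = run_with_cached_merges(list(range(8)), num_layers=24, interval=6,
                              compute_assignment=pair_neighbours,
                              apply_layer=identity_layer)
print(n)  # 4 recomputations for 24 layers with an interval of 6
```

The matching step runs 4 times instead of 24 in this configuration, which is where the amortization comes from.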

Caching not only saves compute but also avoids abrupt changes in latent representation, maintaining stability for downstream decoders.
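A reuse gate of the kind described above can be sketched as a per-token decision that combines a visual stability check with a saliency check. This is a simplified illustration, not the gating rule of VLN-Cache: it assumes tokens from the previous frame have already been view-aligned to the current frame, it checks only whether saliency is high (not whether it is changing rapidly), and the function name and both thresholds are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gate_reuse(prev_tokens, curr_tokens, saliency,
               sim_threshold=0.95, saliency_threshold=0.5):
    """Return one boolean per token: True means the cached token may be reused.

    Assumption: prev_tokens[i] and curr_tokens[i] refer to the same scene
    location, i.e. view-aligned remapping has already been applied.
    """
    decisions = []
    for prev, curr, sal in zip(prev_tokens, curr_tokens, saliency):
        visually_stable = cosine(prev, curr) >= sim_threshold
        task_critical = sal > saliency_threshold
        decisions.append(visually_stable and not task_critical)
    return decisions

prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
saliency = [0.1, 0.1, 0.9]   # toy cross-attention relevance (hypothetical values)
print(gate_reuse(prev, curr, saliency))  # [True, False, False]
```

The first token is stable and unimportant, so it is reused; the second has changed visually; the third is unchanged but task-relevant, so it is refreshed anyway.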

4. Geometry-Aware Merging Modalities

Different implementations instantiate geometry-aware merging according to application-specific structure:

Gradient/variance-based (LiteVGGT): The importance score $\psi_i$ draws on pixel-level edge strength and feature variance, promoting preservation of geometrically informative tokens and smoothing in redundant regions (Shu et al., 4 Dec 2025).
Depth-guided region merging (DepthCache): Tokens are partitioned by depth into three regions, each with its own merge ratio; the near field is preserved, while the background is merged more aggressively. Temporal caching is applied by carrying region groupings across frames and merging incrementally (Li et al., 11 Mar 2026).
Spatially-structured (CubistMerge): Merges preserve 2D grid layout and spatial integrity; reduction proceeds row-then-column via adjacency-constrained bipartite matching. Embeddings are merged using max-magnitude-per-dimension rather than averages (Gong et al., 26 Sep 2025).
Attention/confidence-guided (Co-Me): Merging is conditional on predicted per-token confidence/distillation scores, favoring the averaging of low-importance tokens, with attention bias correction to maintain global attention mass consistency. Merge/split indices are cached for efficient repeated use (Chen et al., 18 Nov 2025).
Dynamic multi-modal gating (VLN-Cache): Combines geometric remapping (e.g., based on depth, pose, or optical flow) with cross-layer semantic gating and layer-adaptive entropy policies, constraining reuse budgets according to task and visual dynamics (Zheng et al., 7 Mar 2026).
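The difference between mean merging and the max-magnitude-per-dimension merging mentioned for CubistMerge can be shown on a pair of toy embeddings. This is a minimal sketch: the function names are hypothetical, the tie-breaking toward the first argument is an assumption, and the adjacency-constrained matching that chooses which tokens to pair is not shown.

```python
def merge_mean(a, b):
    """Average two embeddings dimension by dimension."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def merge_max_magnitude(a, b):
    """Per dimension, keep whichever value has the larger absolute value.

    Assumption: ties are resolved in favour of the first argument.
    """
    return [x if abs(x) >= abs(y) else y for x, y in zip(a, b)]

a = [0.9, -0.1, 0.2]
b = [-0.2, -0.8, 0.3]
print(merge_max_magnitude(a, b))  # [0.9, -0.8, 0.3]
print(merge_mean(a, b))
```

Averaging shrinks strong activations toward zero when the two tokens disagree, whereas the max-magnitude rule keeps the strongest response in each dimension.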

5. Computational Complexity and Empirical Outcomes

Geometry-aware cached merging produces dramatic reductions in computational burden:

Quadratic to sub-quadratic reduction: Reducing the token count from $NT$ to $M < NT$ lowers per-layer attention complexity from $O((NT)^2)$ to $O(M^2)$, with merge recomputation amortized across each caching interval of $K$ layers or across multiple frames.
Empirical speed/memory gains: In LiteVGGT, peak memory drops from >70 GiB (OOM) to 45 GiB, with end-to-end latency reduced by up to 10× (ScanNet-50, 1k frames) and Chamfer Distance remaining within 5–10% of the baseline (Shu et al., 4 Dec 2025). DepthCache executes up to 1.28× faster with <1% success rate drop for closed-loop robotic control, outperforming uniform merging or pruning (Li et al., 11 Mar 2026). Co-Me achieves up to 11.3× speedup on 3D tasks with minimal error increase (Chen et al., 18 Nov 2025).
Restoration mechanisms: Fine-tuning aggregator/prediction heads and quantization to FP8 (for compatible backbones) enable further gains with negligible degradation (Shu et al., 4 Dec 2025).
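The effect of token reduction on attention cost can be checked with simple arithmetic. The sketch below counts only pairwise attention interactions, so it is an idealized estimate rather than a prediction of the measured end-to-end speedups reported above; the function names, frame count, tokens per frame, and keep fraction are hypothetical.

```python
def attention_pairs(num_tokens):
    """Number of query-key interactions in one full self-attention layer."""
    return num_tokens * num_tokens

def attention_cost_ratio(frames, tokens_per_frame, keep_fraction):
    """Ratio of attention pairs before and after merging to keep_fraction of tokens."""
    full = frames * tokens_per_frame
    kept = int(full * keep_fraction)
    return attention_pairs(full) / attention_pairs(kept)

# Hypothetical setting: 1000 frames, 1000 tokens each, a quarter of tokens kept.
print(attention_cost_ratio(frames=1000, tokens_per_frame=1000, keep_fraction=0.25))  # 16.0
```

Keeping a quarter of the tokens cuts attention pairs by a factor of 16, since the saving is the square of the reduction factor; non-attention layers and the merge bookkeeping itself scale differently, which is one reason measured speedups are smaller than this bound.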

Method (setting)	Speedup	Peak memory	Quality impact
LiteVGGT (ScanNet-50)	up to 10×	45 GiB	$\leq$10% CD delta
DepthCache (LIBERO, VLA)	up to 1.28×	--	$<$1% SR drop
Co-Me (VGGT 512-frame)	up to 11.3×	--	0.044 cm depth L1 increase

6. Practical Integration and Model Compatibility

Geometry-aware cached token merging can be deployed with minimal modification to existing transformer pipelines:

Insertion points: Merge/caching modules are placed before global-attention blocks; “unmerge” restores token layouts before spatial/decoder attention or dense output.
Preservation of spatial modules: Approaches such as CubistMerge guarantee compatibility with 2D positional layouts and window-attention backbones, such as SAM, DINOv3, and Mask2Former (Gong et al., 26 Sep 2025).
Task-specific extensions: Depth+velocity-aware merging (DepthCache), dynamic semantic gating (VLN-Cache), and streaming multi-view operation (Co-Me) extend applicability to robotics, navigation, SLAM, and 3D reconstruction. Fine-tuning and FP8 quantization provide accuracy recovery and further acceleration (Shu et al., 4 Dec 2025, Chen et al., 18 Nov 2025, Zheng et al., 7 Mar 2026).

These methods can be incorporated as post-hoc, training-free drop-in modules or as part of end-to-end fine-tuned pipelines, provided the geometric priors and gating logic are compatible with the underlying task and model.

7. Evaluation, Limitations, and Future Directions

Empirical benchmarks across large-scale 3D and robotics datasets demonstrate that geometry-aware cached token merging preserves task performance and spatial fidelity across a range of backbones and architectures. However, effectiveness depends on the rate of geometric/semantic change, the selection of gating and region thresholds, and architecture-specific compatibility. Integrating additional dynamic criteria such as optical flow, richer spatial priors, or more adaptive semantic gating may further enhance the robustness of cached merging in open-world scenarios (Zheng et al., 7 Mar 2026). Exploring hybrid schemes that unify region-based, flow-based, and attention-aware gates remains an active area of research.

Geometry-aware cached token merging has become a leading technique for enabling real-time, large-scale vision-and-language systems under strict compute and memory budgets across 3D perception, robotics, and sequential vision tasks (Shu et al., 4 Dec 2025, Li et al., 11 Mar 2026, Gong et al., 26 Sep 2025, Chen et al., 18 Nov 2025, Zheng et al., 7 Mar 2026).