Geometry-Aware Cached Token Merging
- The paper introduces methodologies that leverage geometric priors to guide token merging, significantly reducing computation while preserving essential spatial details.
- It details formal algorithms for 3D point clouds, multi-view images, and video data, using cached merge decisions and token similarity metrics to optimize transformer performance.
- Empirical results demonstrate up to 95% token reduction and substantial speedups across benchmarks with minimal accuracy loss.
Geometry-Aware Cached Token Merging refers to a set of methodologies in transformer-based vision and multimodal architectures that reduce computational and memory costs by selectively merging redundant tokens, guided by geometric priors, and leveraging cached computation for efficiency. The core principle is to identify and fuse tokens that are geometrically and semantically similar, while preserving essential spatial or geometric detail for downstream prediction. These strategies are critical for scaling attention mechanisms over high-dimensional inputs such as 3D point clouds, large-scale multi-view image sequences, and spatio-temporal video data.
1. Conceptual Foundations and Rationale
Transformer architectures for visual data typically employ dense token grids for self-attention, resulting in quadratic complexity with respect to the number of tokens. Empirical analysis in 3D vision models and video LLMs has demonstrated that many tokens—especially those representing smooth, redundant regions—contribute minimally to model output and can be merged without significant accuracy loss (Tran et al., 7 Nov 2025, Shu et al., 4 Dec 2025, Hyun et al., 10 Jul 2025). Geometry-aware token merging leverages local or global geometric correlations (e.g., spatial proximity, feature similarity, boundary information) to guide which tokens are merged, focusing reduction where redundancy is highest while allocating more capacity to areas of geometric or semantic interest, such as object boundaries, contours, or rare structures.
A key insight is that token similarity and local geometric structure persist stably across adjacent layers and frames, enabling the use of cached merge decisions and intermediate representations for further acceleration (Shu et al., 4 Dec 2025, Hyun et al., 10 Jul 2025, Gong et al., 26 Sep 2025).
2. Formal Methods and Algorithms
2.1 Point Cloud Transformers (gitmerge3D)
In 3D point cloud models, each scene comprises $N$ tokens organized into $P$ patches of points. Each token $i$ encodes both its Euclidean coordinate $x_i \in \mathbb{R}^3$ and a learned feature $f_i \in \mathbb{R}^d$, with concatenation $t_i = [x_i \,\|\, f_i]$.
A global, geometry-aware bipartite graph is constructed with vertices $V = \{c_p\}_{p=1}^{P}$, where $c_p$ is the centroid of patch $\mathcal{P}_p$:

$$c_p = \frac{1}{|\mathcal{P}_p|} \sum_{i \in \mathcal{P}_p} x_i.$$
The redundancy metric (energy score) for each token $i$ in patch $\mathcal{P}_p$ measures how far the token's feature deviates from the projected patch centroid,

$$E_i = \lVert f_i - \phi(c_p) \rVert_2,$$

where $\phi(\cdot)$ projects the centroid to feature space. Patches with mean energy above a threshold $\tau$ are merged moderately, while others are merged aggressively. Merging within each patch subdivides tokens into spatial bins (e.g., Morton code ordering), merges local tokens into a destination token by feature and spatial averaging, and returns the reduced token set (Tran et al., 7 Nov 2025).
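The following PyTorch-style sketch illustrates this two-level scheme under stated assumptions: the exact energy formula, the bin counts, the threshold `tau`, and the crude coordinate ordering (a stand-in for Morton codes) are illustrative choices, and names such as `geometry_aware_merge` and `W_proj` are hypothetical rather than taken from the gitmerge3D implementation.

```python
import torch

def energy_scores(feats, patch_ids, centroids, W_proj):
    """Per-token redundancy (energy): distance between a token's feature and
    its patch centroid projected into feature space (illustrative form)."""
    proj = centroids[patch_ids] @ W_proj              # (N, d): centroid -> feature space
    return (feats - proj).norm(dim=-1)                # higher energy = less redundant

def merge_patch(feats, coords, n_bins):
    """Merge one patch's tokens into at most n_bins destination tokens: sort along
    a simple space-filling order (a crude stand-in for Morton/Z-order codes) and
    average features and coordinates within each bin."""
    order = torch.argsort(coords[:, 0] + 1e3 * coords[:, 1] + 1e6 * coords[:, 2])
    bins = torch.chunk(order, n_bins)
    merged_f = torch.stack([feats[b].mean(0) for b in bins if len(b)])
    merged_x = torch.stack([coords[b].mean(0) for b in bins if len(b)])
    return merged_x, merged_f

def geometry_aware_merge(feats, coords, patch_ids, centroids, W_proj,
                         tau=0.5, bins_detail=8, bins_coarse=2):
    """High-mean-energy patches are merged moderately (more bins kept);
    low-energy, redundant patches are merged aggressively (few bins)."""
    E = energy_scores(feats, patch_ids, centroids, W_proj)
    out_x, out_f = [], []
    for p in patch_ids.unique():
        mask = patch_ids == p
        n_bins = bins_detail if E[mask].mean() > tau else bins_coarse
        x, f = merge_patch(feats[mask], coords[mask], n_bins)
        out_x.append(x); out_f.append(f)
    return torch.cat(out_x), torch.cat(out_f)
```

Averaging both features and coordinates per bin keeps a spatial position attached to every merged token, so downstream layers can still consume coordinate-feature pairs.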
2.2 Multi-View/Multiframe Images (LiteVGGT)
For multi-frame image transformers, each of the $F$ frames yields $N$ patch tokens. All $FN$ tokens are concatenated for global self-attention, at $O((FN)^2)$ cost. Geometry-aware merging proceeds by ranking tokens according to a fused importance score that combines each token's Sobel edge magnitude $e_i$ and feature variance $v_i$:

$$s_i = \lambda\, e_i + (1 - \lambda)\, v_i,$$

with fusion weight $\lambda \in [0, 1]$. Tokens are partitioned into:
- GA-Tokens: the top 10% of tokens by $s_i$ (never merged)
- Dst-Tokens: anchor tokens with the lowest $s_i$ per patch, selected frame by frame (all tokens of the first frame serve as destinations)
- Src-Tokens: the remaining tokens, each merged into its nearest Dst-Token by cosine similarity
Merge indices (src-to-dst assignments) are cached and reused across subsequent layers, reducing overhead by approximately 25% with negligible accuracy impact (Shu et al., 4 Dec 2025).
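A minimal PyTorch sketch of this partition-and-merge step, assuming an equal-weight score fusion, a fixed number of anchors per frame, and simplified handling of the first frame; `build_merge_plan`, `apply_merge`, and all parameter values are illustrative, with the returned plan standing in for the cached src-to-dst indices.

```python
import torch
import torch.nn.functional as F

def importance_scores(patch_feats, edge_mag, alpha=0.5):
    """Fuse Sobel edge magnitude with per-token feature variance.
    The equal-weight fusion (alpha) is an illustrative choice."""
    var = patch_feats.var(dim=-1)                 # (num_frames, num_tokens)
    return alpha * edge_mag + (1 - alpha) * var   # fused importance per token

def build_merge_plan(feats, scores, ga_ratio=0.10, dst_per_frame=64):
    """Partition tokens into GA (kept), Dst (anchors), and Src (merged), and
    assign each Src token to its most similar Dst token by cosine similarity.
    The returned plan is what gets cached and reused across layers."""
    Fr, N, d = feats.shape
    flat_f = feats.reshape(Fr * N, d)
    flat_s = scores.flatten()

    n_ga = int(ga_ratio * Fr * N)
    ga_idx = flat_s.topk(n_ga).indices            # most important tokens: never merged
    is_ga = torch.zeros(Fr * N, dtype=torch.bool)
    is_ga[ga_idx] = True

    # lowest-score tokens of each frame serve as destination anchors
    # (overlaps between GA and Dst are ignored in this sketch)
    dst_local = scores.argsort(dim=1)[:, :dst_per_frame]
    dst_idx = (dst_local + torch.arange(Fr).unsqueeze(1) * N).flatten()
    is_dst = torch.zeros(Fr * N, dtype=torch.bool)
    is_dst[dst_idx] = True

    src_idx = torch.where(~is_ga & ~is_dst)[0]
    sim = F.normalize(flat_f[src_idx], dim=-1) @ F.normalize(flat_f[dst_idx], dim=-1).T
    assign = sim.argmax(dim=-1)                   # each Src -> its nearest Dst
    return ga_idx, dst_idx, src_idx, assign

def apply_merge(feats, plan):
    """Average each Src token into its assigned Dst anchor; keep GA and Dst tokens."""
    ga_idx, dst_idx, src_idx, assign = plan
    flat = feats.reshape(-1, feats.shape[-1])
    dst = flat[dst_idx].clone()
    counts = torch.ones(len(dst_idx), 1)
    dst.index_add_(0, assign, flat[src_idx])
    counts.index_add_(0, assign, torch.ones(len(src_idx), 1))
    return torch.cat([flat[ga_idx], dst / counts], dim=0)
```

Because the plan depends only on the token features and scores at the layer where it is built, it can be stored and re-applied at later layers, which is the source of the cross-layer savings noted above.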
2.3 Spatio-Temporal Video Data (STTM)
STTM employs geometry-aware, multi-granular token merging for video LLMs:
- Spatial: Quadtree partitioning averages regions and stops splitting when all child-to-parent cosine similarities exceed a threshold $\tau_{\text{s}}$; regions with low variation collapse into coarse tokens.
- Temporal: Tokens in consecutive frames are merged based on region overlap and a cosine similarity threshold $\tau_{\text{t}}$; a union-find algorithm clusters the temporal token graph, and merged tokens inherit the earliest parent's identity.
Merge masks and resultant token sequences are query-agnostic and can be reused for all questions on the same video, allowing full KV-cache reuse and avoiding recomputation (Hyun et al., 10 Jul 2025).
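A recursive sketch of the spatial (quadtree) half of this procedure; the grid shape, the threshold name `tau`, and the early-stop rule on quadrant means are assumptions made for illustration rather than STTM's exact criteria.

```python
import torch
import torch.nn.functional as F

def quadtree_merge(grid, tau=0.9):
    """Recursively merge an (H, W, d) grid of tokens: if every child quadrant's
    mean token is cosine-similar to the parent mean above tau, the whole region
    collapses into one coarse token; otherwise recurse into each quadrant."""
    H, W, d = grid.shape
    parent = grid.reshape(-1, d).mean(0)
    if H <= 1 or W <= 1:
        return [parent]
    hh, hw = H // 2, W // 2
    quads = [grid[:hh, :hw], grid[:hh, hw:], grid[hh:, :hw], grid[hh:, hw:]]
    sims = [F.cosine_similarity(q.reshape(-1, d).mean(0), parent, dim=0) for q in quads]
    if all(s >= tau for s in sims):
        return [parent]                            # low-variation region -> coarse token
    tokens = []
    for q in quads:
        tokens.extend(quadtree_merge(q, tau))      # keep finer tokens where detail remains
    return tokens

# usage: merged = torch.stack(quadtree_merge(frame_tokens.reshape(H, W, d)))
```

A temporal pass in the same spirit would then link spatially overlapping regions of consecutive frames whose merged tokens clear the temporal similarity threshold, with union-find collapsing each cluster onto its earliest token.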
2.4 2D Grid-Based Methods (CubistMerge)
CubistMerge operates on 2D grids (ViTs, windowed attention), merging adjacent tokens within rows and columns using a max-magnitude-per-dimension rule: for a neighboring pair $(a, b)$ merged into $\tilde{t}$, each dimension $d$ keeps the entry of larger magnitude,

$$\tilde{t}_d = \begin{cases} a_d, & |a_d| \ge |b_d|, \\ b_d, & \text{otherwise.} \end{cases}$$
Spatial structure is preserved, as merges occur only between neighbors; merge patterns are deterministic and cache-friendly (Gong et al., 26 Sep 2025).
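A compact sketch of this merge rule applied along one grid axis; the 2:1 reduction, the even-width assumption, and the function names are illustrative.

```python
import torch

def max_magnitude_merge(a, b):
    """Per feature dimension, keep whichever neighbor has the larger magnitude."""
    return torch.where(a.abs() >= b.abs(), a, b)

def merge_columns(tokens):
    """Merge adjacent token pairs along the width axis of an (H, W, d) grid,
    halving W while keeping the 2D layout expected by windowed attention and
    relative positional embeddings. Assumes W is even."""
    H, W, d = tokens.shape
    assert W % 2 == 0, "illustrative sketch: pad or crop odd widths first"
    a, b = tokens[:, 0::2], tokens[:, 1::2]   # neighboring columns
    return max_magnitude_merge(a, b)          # (H, W // 2, d)
```

Because the output is again a regular (H, W/2, d) grid, window partitioning and positional embeddings in subsequent layers need no modification.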
3. Caching Mechanisms and Cross-Layer/Frame Reuse
Geometry-aware cached token merging schemes exploit the stable token similarity and spatial configuration across layers, frames, and even queries to cache intermediate computations:
- Patch/region centroids and their feature-space projections ($\phi(c_p)$) are cached for 3D point clouds.
- Merge assignments (src-to-dst pairs) are computed once every few layers in LiteVGGT, cached, and reused for inference acceleration.
- For video LLMs with STTM, merged token sequences and their KV-cache states are reused across questions, given that spatial and temporal merging are query-agnostic.
- CubistMerge's merge masks and indices are precomputable and reused for identical resolutions/layers, enabling zero-overhead application in production pipelines.
Cache keys are typically indexed by geometric configuration (e.g., patch or region identity), threshold parameters, and model-specific feature projections.
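A small illustration of such a cache, assuming a plan keyed by a layer reuse window, the token-grid resolution, and the merge threshold; the key fields and the `reuse_interval` parameter are hypothetical, not taken from any of the cited systems.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

@dataclass
class MergePlanCache:
    """Stores merge decisions (e.g., src-to-dst assignments or merge masks)
    keyed by the geometric configuration that produced them, so later layers,
    frames, or queries can reuse them instead of recomputing similarities."""
    store: Dict[Tuple, Any] = field(default_factory=dict)

    def key(self, layer: int, grid_shape: Tuple[int, ...], tau: float,
            reuse_interval: int = 4) -> Tuple:
        # Illustrative key: layers in the same reuse window, the token-grid
        # resolution, and the merge threshold share one cached plan.
        return (layer // reuse_interval, grid_shape, round(tau, 4))

    def get_or_compute(self, layer, grid_shape, tau, compute_fn: Callable[[], Any]):
        k = self.key(layer, grid_shape, tau)
        if k not in self.store:
            self.store[k] = compute_fn()      # e.g., build_merge_plan(...)
        return self.store[k]
```

A caller would wrap plan construction, e.g. `cache.get_or_compute(layer, (H, W), tau, lambda: build_merge_plan(feats, scores))`, so that identical configurations hit the cache instead of recomputing similarities.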
4. Computational and Memory Efficiency
Geometry-aware cached token merging delivers substantial improvements in computational and memory efficiency:
| Model | Speedup | Token Reduction | Accuracy Drop | Memory (GB) |
|---|---|---|---|---|
| gitmerge3D | ×5.3 (ScanNet PTv3) | 90–95% | <1 mIoU | 10.12 → 1.6 |
| LiteVGGT | ×10 (1000 images) | ~65% | <1% CD | >96 → ~45 |
| STTM | ×2–3 (video QA) | 50–70% | <2% | — |
| CubistMerge | ×1.25 (SAM-H) | ~20% | 0.7% mIoU | — |
These methods maintain performance within 1 absolute metric point across standard segmentation, reconstruction, and QA benchmarks; fine-tuning is often sufficient to recover or even slightly surpass baseline results (Tran et al., 7 Nov 2025, Shu et al., 4 Dec 2025, Hyun et al., 10 Jul 2025, Gong et al., 26 Sep 2025).
5. Integration with Transformer Architectures
All geometry-aware cached merging strategies are compatible with transformers employing window attention, decomposed relative positional embeddings, and rotary positional embeddings. These methods preserve spatial indices and grid structure, so subsequent layers remain unaffected—no need for special interpolation or lookup strategies. Merge masks and cache protocols are parameter-free, deterministic, and applicable across both inference and training phases.
Notably, these approaches outperform previous non-spatial or query-dependent token reduction methods in accuracy-retention trade-offs and compatibility with production workloads (Gong et al., 26 Sep 2025, Hyun et al., 10 Jul 2025).
6. Experimental Validation and Practical Impact
Empirical validation is extensive:
- gitmerge3D establishes that 3D point cloud transformers are over-tokenized, with up to 95% token reduction feasible without significant performance drop. Semantic segmentation, reconstruction, and detection all benefit, even with minimal fine-tuning (Tran et al., 7 Nov 2025).
- LiteVGGT processes 1000-image, large-scale scenes with <10% of the original FLOPs and a much smaller memory footprint, while preserving geometric detail via anchor selection (Shu et al., 4 Dec 2025). Caching merge indices and applying FP8 quantization further reduce latency and peak memory by 33% and 25%, respectively, without catastrophic geometry degradation.
- STTM offers state-of-the-art tradeoffs for video LLMs: 2–3× speed-up, <2% QA accuracy reduction at 30–50% token budgets, with cache reuse enabling efficient multi-query pipelines (Hyun et al., 10 Jul 2025).
- CubistMerge achieves consistent speedups (up to ×1.25) with minimal accuracy drops, preserving grid-structured token maps required by advanced ViT and segmentation backbones. Its cache-friendliness supports multi-exit and prompt-driven networks (Gong et al., 26 Sep 2025).
7. Limitations and Future Research Directions
Geometry-aware cached token merging relies on the stability and redundancy of geometric features. While adaptively merging tokens minimizes information loss in practice, pathological cases—such as scenes with high-frequency structure or rapid nonrigid motion—may challenge cache validity and merging strategies, necessitating refinement or fallback to dense representations. Key open areas include optimal fusion rules for multimodal attention, dynamic cache invalidation criteria, and integration with self-supervised learned merging policies. This suggests further exploration of cross-modal geometric priors and cache-aware optimization in high-scale transformer inference.
Primary sources: "How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?" (Tran et al., 7 Nov 2025), "LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging" (Shu et al., 4 Dec 2025), "Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs" (Hyun et al., 10 Jul 2025), "CubistMerge: Spatial-Preserving Token Merging For Diverse ViT Backbones" (Gong et al., 26 Sep 2025).