Token Merging for 3D Vision

Updated 5 September 2025
  • Token merging for 3D vision is a method that aggregates redundant tokens based on semantic, spatial, and geometric cues to reduce computational load while preserving essential scene structure.
  • These approaches extend 2D transformer strategies by incorporating motion, multi-view, and multimodal fusion to handle the unique challenges of 3D data.
  • Empirical studies show significant efficiency gains—up to 2x runtime reduction and minimal accuracy drop—with applications in 3D detection, segmentation, and reconstruction.

Token merging for 3D vision encompasses a family of techniques that aggressively reduce the computational and memory burden of transformer-based models in three-dimensional spatial or spatiotemporal tasks by aggregating or fusing tokens (i.e., structured representations of spatial patches, voxels, or points) based on semantic or geometric similarity. These methods preserve essential scene information by designing merging criteria that are spatially, semantically, or dynamically informed, thereby enabling efficient scaling of transformers to high-resolution, multi-view, or volumetric 3D vision benchmarks. As detailed in numerous primary sources, most contemporary strategies for 3D token merging build upon advances from the 2D transformer literature but introduce additional constraints and innovations to handle spatial consistency, geometric fidelity, motion cues, and multimodal (e.g., image, point cloud, or depth) integration.

1. Motivation and Fundamental Principles

The quadratic complexity of self-attention with respect to token count is the dominant computational bottleneck in Vision Transformers (ViTs), especially acute in 3D vision tasks, which often require high-resolution spatial grids, dense point clouds, or aggregating multi-view observations. Token merging reduces the number of tokens as data propagates through the model, drawing inspiration from the pyramidal spatial reduction of convolutional neural networks. In 3D, redundancy is further exacerbated by overlapping multi-view projections, volumetric similarity in point clouds or voxels, and temporal correlations in video or dynamic scenes (Renggli et al., 2022, Bolya et al., 2022, Zhang et al., 1 Sep 2024).

Key theoretical underpinnings are:

  • Identification of redundant, low-salience, or similar tokens, typically by semantic similarity, spatial proximity, or geometric/temporal cues.
  • Careful design of the merging criterion to avoid collapsing critical geometric or semantic content.
  • Emphasis on spatial and structural consistency, often requiring explicit 3D or temporal awareness in the merging process.

2. Main Token Merging Methodologies for 3D Vision

2.1. Token Similarity and Structured Merging

Core merging mechanisms extend bipartite soft matching (Bolya et al., 2022) to multi-modal, dynamic, and spatially-aware 3D data:

  • PatchMerger (Renggli et al., 2022): Inserts a module between transformer layers, reducing $N$ input tokens to $M$ through a softmax-weighted linear combination, $Y = \operatorname{softmax}((XW)^T)\,X$. This is conceptually similar to fixed-query bottom-up attention and supports downstream usage at varying input granularity.
  • ToMe (Bolya et al., 2022): Employs bipartite soft matching, partitioning tokens into two sets, matching similar pairs based on key (self-attention) similarity (often cosine), and merging via averaging. Proportional attention ensures that merged tokens correctly influence subsequent softmaxes by adjusting $A = \operatorname{softmax}\left(QK^T/\sqrt{d} + \log s\right)$, where $s$ reflects token sizes (a minimal sketch follows this list).
  • Energy-based and Spectral Approaches (Tran et al., 25 May 2024): Constructs a token graph using cosine similarities, computes a per-token “energy score” (reflecting redundancy), and merges only high-energy (redundant) clusters, rigorously preserving the intrinsic spectrum (eigen-structure) of the original token graph (see spectrum consistency theorem in PiToMe).
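
To make the core mechanism concrete, the following is a minimal PyTorch sketch of bipartite soft matching with size-weighted merging and proportional attention in the spirit of ToMe. The function names, the fixed merge budget `r`, and the simplifying assumption that each destination token receives at most one merge partner are illustrative, not the reference implementation.

```python
import torch

def bipartite_soft_matching(keys: torch.Tensor, r: int):
    """keys: (N, d) attention keys; r: number of token pairs to merge."""
    keys = keys / keys.norm(dim=-1, keepdim=True)      # cosine similarity
    a, b = keys[0::2], keys[1::2]                       # alternate split into sets A and B
    scores = a @ b.T                                    # (|A|, |B|) pairwise similarity
    best_val, best_idx = scores.max(dim=-1)             # best partner in B for each A token
    order = best_val.argsort(descending=True)           # merge the r most similar pairs
    return order[:r], best_idx, order[r:]               # merged A indices, partners, kept A indices

def merge_tokens(x: torch.Tensor, sizes: torch.Tensor, merged_a, best_idx, kept_a):
    """x: (N, d) token features; sizes: (N,) patch count each token represents (float)."""
    a_x, b_x = x[0::2], x[1::2].clone()
    a_s, b_s = sizes[0::2], sizes[1::2].clone()
    dst = best_idx[merged_a]
    # Size-weighted average so merged tokens keep their collective weight
    # (assumes distinct destinations for brevity; ToMe uses a scatter reduction).
    b_x[dst] = (b_x[dst] * b_s[dst, None] + a_x[merged_a] * a_s[merged_a, None]) \
               / (b_s[dst, None] + a_s[merged_a, None])
    b_s[dst] = b_s[dst] + a_s[merged_a]
    return torch.cat([a_x[kept_a], b_x], dim=0), torch.cat([a_s[kept_a], b_s], dim=0)

def proportional_attention(q: torch.Tensor, k: torch.Tensor, sizes: torch.Tensor):
    """Add log(size) to the logits so merged tokens influence the softmax
    in proportion to the number of patches they represent."""
    d = q.shape[-1]
    return (q @ k.T / d ** 0.5 + sizes.log()[None, :]).softmax(dim=-1)
```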

2.2. Spatial and Geometric Structure Integration

To preserve 3D scene layout, advanced methods incorporate geometric cues:

  • Spatial Awareness (Huang et al., 24 Jun 2025): ToSA generates pseudo spatial tokens for each patch or voxel from depth or geometric data, computes both semantic and spatial similarity matrices, and fuses them for merging. A layer-wise schedule increases the semantic weight $\alpha$ deeper into the network (see the sketch after this list).
  • Viewpoint-Agnostic Fusion (Shang et al., 2022): Constructs explicit 3D positional embeddings per token via a pseudo-depth estimator and camera-parameter learning, “lifting” 2D tokens to 3D (e.g., $p_n^{\text{cam}} = [(u_n z_n)/c,\ (v_n z_n)/c,\ z_n]$).
  • Motion and History Cues for Multi-View (Zhang et al., 1 Sep 2024): ToC3D utilizes history object queries with motion information to guide token selection, focusing transformer computation on salient tokens with foreground priors and dynamically routing background tokens to lightweight computation (“free path”).
  • Multimodal 3D Fusion (Wang et al., 2022, Thomas et al., 6 Jun 2025): Fuses point-based (e.g., Sonata Point Transformer) 3D features into 2D image tokens via nearest-neighbor assignment or shared indices, enhancing representation and facilitating token merging across fundamentally different sources.
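
A minimal sketch of how a spatially aware criterion can blend semantic and geometric similarity, roughly in the spirit of ToSA and the 3D lifting above. The distance normalization, the blending form, and the layer-wise $\alpha$ schedule shown here are assumptions for illustration.

```python
import torch

def lift_to_3d(u: torch.Tensor, v: torch.Tensor, z: torch.Tensor, c: float) -> torch.Tensor:
    """Back-project pixel coordinates (u, v) with depth z and camera scale c,
    following p_n^cam = [(u*z)/c, (v*z)/c, z]."""
    return torch.stack([(u * z) / c, (v * z) / c, z], dim=-1)

def fused_similarity(sem_tokens: torch.Tensor, spatial_tokens: torch.Tensor, alpha: float):
    """Blend cosine similarity of semantic tokens with negative 3D distance.
    Small alpha (early layers) emphasizes geometry; large alpha emphasizes semantics."""
    sem = sem_tokens / sem_tokens.norm(dim=-1, keepdim=True)
    sem_sim = sem @ sem.T
    dist = torch.cdist(spatial_tokens, spatial_tokens)   # pairwise 3D distances
    spa_sim = -dist / (dist.max() + 1e-6)                 # higher = spatially closer
    return alpha * sem_sim + (1.0 - alpha) * spa_sim

def alpha_for_layer(layer_idx: int, num_layers: int,
                    alpha_min: float = 0.3, alpha_max: float = 0.9) -> float:
    """Simple linear schedule increasing semantic weight with depth (assumed form)."""
    t = layer_idx / max(num_layers - 1, 1)
    return alpha_min + t * (alpha_max - alpha_min)
```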

2.3. Adaptive, Dynamic, and Hybrid Merging Protocols

  • Adaptive/Thresholded Merging (Lee et al., 21 May 2025, Saghatchian et al., 1 Jan 2025): Methods like ATM set per-layer, decaying similarity thresholds, $\theta^l = \max\{\alpha - (e^{\beta (l-1)} - 1),\ \theta_{\min}\}$, so that only sufficiently similar tokens are merged, with special merging strategies (e.g., size-distinctive matching) in late layers to minimize loss from combining already-aggregated tokens (a minimal sketch of the schedule follows this list).
  • Token Fusion (ToFu) (Kim et al., 2023): Combines pruning and merging, with early layers performing pruned merging (removing redundant tokens after similarity-based matching) and later layers using norm-preserving average merging (MLERP), designed to maintain the feature’s norm and avoid distributional shifts, enhancing deployment on edge devices.
  • Cached or History-Aided Merging (Saghatchian et al., 1 Jan 2025): Leverages token pair stability over time or sequential steps by caching merge indices, validated by Jaccard distance, thereby reducing redundant computation and enhancing temporal efficiency in both static and dynamic (e.g., diffusion or video) contexts.
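
The ATM-style threshold schedule above can be sketched as follows; the parameter values ($\alpha$, $\beta$, $\theta_{\min}$) and the pair-selection routine are illustrative assumptions rather than the published configuration.

```python
import math
import torch

def merge_threshold(layer: int, alpha: float = 0.95, beta: float = 0.05,
                    theta_min: float = 0.7) -> float:
    """theta^l = max(alpha - (exp(beta*(l-1)) - 1), theta_min): decays with depth."""
    return max(alpha - (math.exp(beta * (layer - 1)) - 1.0), theta_min)

def thresholded_merge_pairs(keys: torch.Tensor, layer: int):
    """Return token index pairs whose cosine similarity exceeds the layer threshold."""
    k = keys / keys.norm(dim=-1, keepdim=True)
    sim = k @ k.T
    sim.fill_diagonal_(-1.0)                          # ignore self-similarity
    theta = merge_threshold(layer)
    i, j = torch.where(torch.triu(sim, diagonal=1) > theta)
    return list(zip(i.tolist(), j.tolist())), theta
```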

3. Empirical Impact and Efficiency Gains

Across a broad range of experiments and datasets, token merging strategies for 3D vision have demonstrated:

| Method | FLOPs/Runtime Reduction | Accuracy Drop | Key Features/Benchmarks |
| --- | --- | --- | --- |
| PatchMerger | 49–53% | ≤ 0.5% | ViT-H/14, V-MoE, large images (Renggli et al., 2022) |
| ToMe | up to 2× | 0.2–0.4% | ViT-L/H/MAE, video, audio (Bolya et al., 2022) |
| PiToMe | 40–60% | 0.5–0.7% | ViT-MAE-H, CLIP, LLaVa (Tran et al., 25 May 2024) |
| ToC3D | 30% (backbone) | ≤ 1% | nuScenes 3D detection (Zhang et al., 1 Sep 2024) |
| ATM | 30–40% | 0% | DeiT-T/S, training-free (Lee et al., 21 May 2025) |
| UMIFormer | N/A | SOTA IoU/F | ShapeNet, multi-view reconstruction (Zhu et al., 2023) |

These improvements manifest not only in image/video classification, but also in high-complexity 3D detection, volumetric segmentation, and embodied reasoning tasks. Notably, methods such as ToC3D and UMIFormer demonstrate substantial gains in real-time 3D perception, owing to adaptive token sparsification and cross-view semantic/structural merging.

4. Specialized Designs for 3D and Multimodal Vision

Token merging techniques for 3D vision often build upon or extend general ViT token reduction with the following domain-specific approaches:

  • Multimodal Substitution and Alignment (TokenFusion (Wang et al., 2022)): Dynamically identifies uninformative tokens in the point cloud or image modality and substitutes them with inter-modal features, resolving redundancy while preserving alignment via residual positional embeddings (a minimal sketch follows this list).
  • Rectification and Inter-View Clustering (UMIFormer (Zhu et al., 2023)): For unstructured multi-view 3D reconstruction, correlates tokens across views with inter-view KNN, rectifies with learned offsets, and compresses through clustering-based merging (STM) into compact fixed-size representations for robust 3D shape decoding.
  • Heterogeneous/Hierarchical Merging (MonoATT (Zhou et al., 2023)): Assigns finer (higher resolution) tokens to regions of 3D importance (e.g., object outlines, distant geometry), coarser tokens to background—grouping, merging, and reconstructing pixel-level maps via multi-stage feature fusion for downstream 3D object detection.
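
A minimal sketch of modality-aware token substitution in the spirit of TokenFusion, assuming image and point-cloud tokens have already been aligned one-to-one (e.g., by nearest-neighbor projection of points onto image patches); the scoring MLP and the threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):
    """Replace low-importance image tokens with the aligned point-cloud token."""
    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                   nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.threshold = threshold

    def forward(self, img_tokens: torch.Tensor, pc_tokens: torch.Tensor) -> torch.Tensor:
        """img_tokens, pc_tokens: (N, d) tokens aligned one-to-one across modalities."""
        s = self.score(img_tokens)                     # (N, 1) importance per image token
        mask = (s > self.threshold).float()
        # Keep informative image tokens; substitute the rest with point-cloud features.
        return mask * img_tokens + (1.0 - mask) * pc_tokens
```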

5. Considerations for Information Preservation and Model Robustness

Central to successful merging in 3D tasks is the safeguarding of critical geometric and semantic signals:

  • Spectrum Preservation (Tran et al., 25 May 2024): PiToMe demonstrates that energy-based ordering and careful merging allow the Laplacian eigen-spectrum, which encodes structural and segmentation information, to be preserved up to a small perturbation during merging (a minimal verification sketch follows this list).
  • Spatial Consistency and Position Restoration (Mao et al., 30 Mar 2025, Huang et al., 24 Jun 2025): “Prune and Merge” with reconstruct matrices and spatial shortcut connections supports high-fidelity restoration of pruned spatial information. ToSA synchronizes merging for both visual and spatial token branches, offering layer-wise scheduled control over semantic vs. spatial criteria.
  • Token Diversity and Attentive Decoupling (Long et al., 2022): Dual-stage merging ensures that local attentive tokens (typically salient geometry, object edges) are preserved, while inattentive (often globally redundant) tokens are merged via density or clustering-based methods to preserve background diversity.
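
A minimal sketch of how spectrum preservation can be checked empirically, by comparing Laplacian eigenvalues of the token similarity graph before and after merging; the affinity construction and comparison metric here are assumptions for illustration, not PiToMe's formal criterion.

```python
import torch

def token_graph_laplacian(tokens: torch.Tensor) -> torch.Tensor:
    """Unnormalized graph Laplacian of the cosine-affinity token graph."""
    x = tokens / tokens.norm(dim=-1, keepdim=True)
    w = (x @ x.T).clamp(min=0.0)            # non-negative affinities
    w.fill_diagonal_(0.0)
    return torch.diag(w.sum(dim=-1)) - w

def spectral_shift(tokens_before: torch.Tensor, tokens_after: torch.Tensor) -> float:
    """Mean absolute difference of the k smallest Laplacian eigenvalues,
    where k is the number of tokens remaining after merging."""
    k = min(tokens_before.shape[0], tokens_after.shape[0])
    ev_b = torch.linalg.eigvalsh(token_graph_laplacian(tokens_before))[:k]
    ev_a = torch.linalg.eigvalsh(token_graph_laplacian(tokens_after))[:k]
    return (ev_b - ev_a).abs().mean().item()
```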

6. Future Directions and Open Challenges

Major research prospects in token merging for 3D vision include:

  • Dynamic, Content-Dependent Schedules: Extending fixed or layer-wise merging rates to content-driven, context-aware schedules (e.g., as in ATM, CA-ToMe), potentially for real-time scene understanding and autonomous systems.
  • End-to-End Learning of Merge Policies: Decoupled embedding modules (e.g., DTEM (Lee et al., 13 Dec 2024)) and differentiable, soft grouping operators enable adaptive merging, and future work may further unify content encoding and merge policy learning for 3D transformers.
  • 3D-centric Graphical and Spectral Theories: Application of spectral graph and geometric reasoning (cf. PiToMe, GTP-ViT (Xu et al., 2023)) to better preserve segmentation, topology, and non-Euclidean geometric structures present in raw 3D data.
  • Extending to Real-Time and Edge Deployment: Hardware-friendly designs, caching (CA-ToMe), and norm/stability preserving interpolations (MLERP in ToFu) are critical for deployment in latency-constrained or memory-limited scenarios such as robotics, AR/VR, or connected vehicles.

7. Summary Table of Key 3D Token Merging Approaches

| Approach | 3D-Specific Innovations | Merging Criterion | Applications |
| --- | --- | --- | --- |
| PatchMerger (Renggli et al., 2022) | Fixed-output, position-independent softmax routing | Attention w/ learned W | ViT backbone, 3D patch tokens |
| TokenFusion (Wang et al., 2022) | Modality-aware substitution, residual PE | Importance + projection | RGB-depth, point cloud fusion, 3D detection |
| ToMe (Bolya et al., 2022) | Proportional attention, size-aware matching | Cosine sim. + size | Video, audio, point cloud, multi-frame |
| PiToMe (Tran et al., 25 May 2024) | Energy-based, spectrum preserving | Energy-ordered BSM | Accelerated 3D ViTs, spectral fidelity |
| ToC3D (Zhang et al., 1 Sep 2024) | History query guided, dynamic routing | Attention to query | Multi-view 3D detection, nuScenes |
| ToSA (Huang et al., 24 Jun 2025) | Depth-informed, spatial pseudo-tokens | Spatial + semantic sim. | VQA, embodied QA, spatial reasoning |
| DTEM (Lee et al., 13 Dec 2024) | Decoupled, learned merging embedding | Differentiable adj. | Segmentation, captioning, point clouds |
| UMIFormer (Zhu et al., 2023) | Inter-view KNN, STM clustering | Edge features + KNN | Multi-view 3D reconstruction |

This area remains under active development, with an increasing diversity of approaches for preserving both computational efficiency and the structural integrity necessary for robust 3D perception, reasoning, and generative modeling.
