Token Merging for 3D Vision

Updated 5 September 2025
  • Token merging for 3D vision is a method that aggregates redundant tokens based on semantic, spatial, and geometric cues to reduce computational load while preserving essential scene structure.
  • These approaches extend 2D transformer strategies by incorporating motion, multi-view, and multimodal fusion to handle the unique challenges of 3D data.
  • Empirical studies show significant efficiency gains—up to 2x runtime reduction and minimal accuracy drop—with applications in 3D detection, segmentation, and reconstruction.

Token merging for 3D vision encompasses a family of techniques that aggressively reduce the computational and memory burden of transformer-based models in three-dimensional spatial or spatiotemporal tasks by aggregating or fusing tokens (i.e., structured representations of spatial patches, voxels, or points) based on semantic or geometric similarity. These methods preserve essential scene information by designing merging criteria that are spatially, semantically, or dynamically informed, thereby enabling efficient scaling of transformers to high-resolution, multi-view, or volumetric 3D vision benchmarks. As detailed in numerous primary sources, most contemporary strategies for 3D token merging build upon advances from the 2D transformer literature but introduce additional constraints and innovations to handle spatial consistency, geometric fidelity, motion cues, and multimodal (e.g., image, point cloud, or depth) integration.

1. Motivation and Fundamental Principles

The quadratic complexity of self-attention with respect to token count is the dominant computational bottleneck in Vision Transformers (ViTs), especially acute in 3D vision tasks, which often require high-resolution spatial grids, dense point clouds, or aggregating multi-view observations. Token merging reduces the number of tokens as data propagates through the model, drawing inspiration from the pyramidal spatial reduction of convolutional neural networks. In 3D, redundancy is further exacerbated by overlapping multi-view projections, volumetric similarity in point clouds or voxels, and temporal correlations in video or dynamic scenes (Renggli et al., 2022, Bolya et al., 2022, Zhang et al., 1 Sep 2024).

Key theoretical underpinnings are:

  • Identification of redundant, low-salience, or similar tokens, typically by semantic similarity, spatial proximity, or geometric/temporal cues.
  • Careful design of the merging criterion to avoid collapsing critical geometric or semantic content.
  • Emphasis on spatial and structural consistency, often requiring explicit 3D or temporal awareness in the merging process.

2. Main Token Merging Methodologies for 3D Vision

2.1. Token Similarity and Structured Merging

Core merging mechanisms extend bipartite soft matching (Bolya et al., 2022) to multi-modal, dynamic, and spatially-aware 3D data:

  • PatchMerger (Renggli et al., 2022): Inserts a module between transformer layers, reducing $N$ input tokens to $M$ through a softmax-weighted linear combination, $Y = \operatorname{softmax}((XW)^T)\,X$. This is conceptually similar to fixed-query bottom-up attention and supports downstream usage at varying input granularity.
  • ToMe (Bolya et al., 2022): Employs bipartite soft matching, partitioning tokens into two sets, matching similar pairs based on key (self-attention) similarity (often cosine), and merging via averaging. Proportional attention ensures that merged tokens correctly influence subsequent softmaxes by adjusting $A = \operatorname{softmax}\left(QK^T/\sqrt{d} + \log s\right)$, where $s$ reflects token sizes (a minimal sketch follows this list).
  • Energy-based and Spectral Approaches (Tran et al., 25 May 2024): Constructs a token graph using cosine similarities, computes a per-token “energy score” (reflecting redundancy), and merges only high-energy (redundant) clusters, rigorously preserving the intrinsic spectrum (eigen-structure) of the original token graph (see spectrum consistency theorem in PiToMe).
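
To make the core mechanism concrete, the following is a minimal PyTorch sketch of bipartite soft matching with size-weighted merging and proportional attention in the spirit of ToMe. The function names, the fixed merge budget `r`, and the simplifying assumption that each destination token receives at most one merge partner are illustrative, not the reference implementation.

```python
import torch

def bipartite_soft_matching(keys: torch.Tensor, r: int):
    """keys: (N, d) attention keys; r: number of token pairs to merge."""
    keys = keys / keys.norm(dim=-1, keepdim=True)      # cosine similarity
    a, b = keys[0::2], keys[1::2]                       # alternate split into sets A and B
    scores = a @ b.T                                    # (|A|, |B|) pairwise similarity
    best_val, best_idx = scores.max(dim=-1)             # best partner in B for each A token
    order = best_val.argsort(descending=True)           # merge the r most similar pairs
    return order[:r], best_idx, order[r:]               # merged A indices, partners, kept A indices

def merge_tokens(x: torch.Tensor, sizes: torch.Tensor, merged_a, best_idx, kept_a):
    """x: (N, d) token features; sizes: (N,) patch count each token represents (float)."""
    a_x, b_x = x[0::2], x[1::2].clone()
    a_s, b_s = sizes[0::2], sizes[1::2].clone()
    dst = best_idx[merged_a]
    # Size-weighted average so merged tokens keep their collective weight
    # (assumes distinct destinations for brevity; ToMe uses a scatter reduction).
    b_x[dst] = (b_x[dst] * b_s[dst, None] + a_x[merged_a] * a_s[merged_a, None]) \
               / (b_s[dst, None] + a_s[merged_a, None])
    b_s[dst] = b_s[dst] + a_s[merged_a]
    return torch.cat([a_x[kept_a], b_x], dim=0), torch.cat([a_s[kept_a], b_s], dim=0)

def proportional_attention(q: torch.Tensor, k: torch.Tensor, sizes: torch.Tensor):
    """Add log(size) to the logits so merged tokens influence the softmax
    in proportion to the number of patches they represent."""
    d = q.shape[-1]
    return (q @ k.T / d ** 0.5 + sizes.log()[None, :]).softmax(dim=-1)
```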

2.2. Spatial and Geometric Structure Integration

To preserve 3D scene layout, advanced methods incorporate geometric cues:

  • Spatial Awareness (Huang et al., 24 Jun 2025): ToSA generates pseudo spatial tokens for each patch or voxel from depth or geometric data, computes both semantic and spatial similarity matrices, and fuses them for merging. A layer-wise schedule increases the semantic weight $\alpha$ deeper into the network (see the sketch after this list).
  • Viewpoint-Agnostic Fusion (Shang et al., 2022): Constructs explicit 3D positional embeddings per token via a pseudo-depth estimator and camera-parameter learning, “lifting” 2D tokens to 3D (e.g., $p_n^{\text{cam}} = [(u_n z_n)/c,\ (v_n z_n)/c,\ z_n]$).
  • Motion and History Cues for Multi-View (Zhang et al., 1 Sep 2024): ToC3D utilizes history object queries with motion information to guide token selection, focusing transformer computation on salient tokens with foreground priors and dynamically routing background tokens to lightweight computation (“free path”).
  • Multimodal 3D Fusion (Wang et al., 2022, Thomas et al., 6 Jun 2025): Fuses point-based (e.g., Sonata Point Transformer) 3D features into 2D image tokens via nearest-neighbor assignment or shared indices, enhancing representation and facilitating token merging across fundamentally different sources.
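
A minimal sketch of how a spatially aware criterion can blend semantic and geometric similarity, roughly in the spirit of ToSA and the 3D lifting above. The distance normalization, the blending form, and the layer-wise $\alpha$ schedule shown here are assumptions for illustration.

```python
import torch

def lift_to_3d(u: torch.Tensor, v: torch.Tensor, z: torch.Tensor, c: float) -> torch.Tensor:
    """Back-project pixel coordinates (u, v) with depth z and camera scale c,
    following p_n^cam = [(u*z)/c, (v*z)/c, z]."""
    return torch.stack([(u * z) / c, (v * z) / c, z], dim=-1)

def fused_similarity(sem_tokens: torch.Tensor, spatial_tokens: torch.Tensor, alpha: float):
    """Blend cosine similarity of semantic tokens with negative 3D distance.
    Small alpha (early layers) emphasizes geometry; large alpha emphasizes semantics."""
    sem = sem_tokens / sem_tokens.norm(dim=-1, keepdim=True)
    sem_sim = sem @ sem.T
    dist = torch.cdist(spatial_tokens, spatial_tokens)   # pairwise 3D distances
    spa_sim = -dist / (dist.max() + 1e-6)                 # higher = spatially closer
    return alpha * sem_sim + (1.0 - alpha) * spa_sim

def alpha_for_layer(layer_idx: int, num_layers: int,
                    alpha_min: float = 0.3, alpha_max: float = 0.9) -> float:
    """Simple linear schedule increasing semantic weight with depth (assumed form)."""
    t = layer_idx / max(num_layers - 1, 1)
    return alpha_min + t * (alpha_max - alpha_min)
```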

2.3. Adaptive, Dynamic, and Hybrid Merging Protocols

  • Adaptive/Thresholded Merging (Lee et al., 21 May 2025, Saghatchian et al., 1 Jan 2025): Methods like ATM set per-layer, decaying similarity thresholds, $\theta^l = \max\{\alpha - (e^{\beta (l-1)} - 1),\ \theta_{\min}\}$, so that only sufficiently similar tokens are merged, with special merging strategies (e.g., size-distinctive matching) in late layers to minimize loss from combining already-aggregated tokens (a minimal sketch of the schedule follows this list).
  • Token Fusion (ToFu) (Kim et al., 2023): Combines pruning and merging, with early layers performing pruned merging (removing redundant tokens after similarity-based matching) and later layers using norm-preserving average merging (MLERP), designed to maintain the feature’s norm and avoid distributional shifts, enhancing deployment on edge devices.
  • Cached or History-Aided Merging (Saghatchian et al., 1 Jan 2025): Leverages token pair stability over time or sequential steps by caching merge indices, validated by Jaccard distance, thereby reducing redundant computation and enhancing temporal efficiency in both static and dynamic (e.g., diffusion or video) contexts.
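
The ATM-style threshold schedule above can be sketched as follows; the parameter values ($\alpha$, $\beta$, $\theta_{\min}$) and the pair-selection routine are illustrative assumptions rather than the published configuration.

```python
import math
import torch

def merge_threshold(layer: int, alpha: float = 0.95, beta: float = 0.05,
                    theta_min: float = 0.7) -> float:
    """theta^l = max(alpha - (exp(beta*(l-1)) - 1), theta_min): decays with depth."""
    return max(alpha - (math.exp(beta * (layer - 1)) - 1.0), theta_min)

def thresholded_merge_pairs(keys: torch.Tensor, layer: int):
    """Return token index pairs whose cosine similarity exceeds the layer threshold."""
    k = keys / keys.norm(dim=-1, keepdim=True)
    sim = k @ k.T
    sim.fill_diagonal_(-1.0)                          # ignore self-similarity
    theta = merge_threshold(layer)
    i, j = torch.where(torch.triu(sim, diagonal=1) > theta)
    return list(zip(i.tolist(), j.tolist())), theta
```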

3. Empirical Impact and Efficiency Gains

Across a broad range of experiments and datasets, token merging strategies for 3D vision have demonstrated:

| Method | FLOPs/Runtime Reduction | Accuracy Drop | Key Features/Benchmarks |
| --- | --- | --- | --- |
| PatchMerger | 49–53% | ≤ 0.5% | ViT-H/14, V-MoE, large images (Renggli et al., 2022) |
| ToMe | up to 2× | 0.2–0.4% | ViT-L/H/MAE, video, audio (Bolya et al., 2022) |
| PiToMe | 40–60% | 0.5–0.7% | ViT-MAE-H, CLIP, LLaVa (Tran et al., 25 May 2024) |
| ToC3D | 30% (backbone) | ≤ 1% | nuScenes 3D detection (Zhang et al., 1 Sep 2024) |
| ATM | 30–40% | 0% | DeiT-T/S, training-free (Lee et al., 21 May 2025) |
| UMIFormer | N/A | SOTA IoU/F | ShapeNet, multi-view reconstruction (Zhu et al., 2023) |

These improvements manifest not only in image/video classification, but also in high-complexity 3D detection, volumetric segmentation, and embodied reasoning tasks. Notably, methods such as ToC3D and UMIFormer demonstrate substantial gains in real-time 3D perception, owing to adaptive token sparsification and cross-view semantic/structural merging.

4. Specialized Designs for 3D and Multimodal Vision

Token merging techniques for 3D vision often build upon or extend general ViT token reduction with the following domain-specific approaches:

  • Multimodal Substitution and Alignment (TokenFusion (Wang et al., 2022)): Dynamically identifies uninformative tokens in the point cloud or image modality and substitutes them with inter-modal features, resolving redundancy while preserving alignment via residual positional embeddings (a minimal sketch follows this list).
  • Rectification and Inter-View Clustering (UMIFormer (Zhu et al., 2023)): For unstructured multi-view 3D reconstruction, correlates tokens across views with inter-view KNN, rectifies with learned offsets, and compresses through clustering-based merging (STM) into compact fixed-size representations for robust 3D shape decoding.
  • Heterogeneous/Hierarchical Merging (MonoATT (Zhou et al., 2023)): Assigns finer (higher resolution) tokens to regions of 3D importance (e.g., object outlines, distant geometry), coarser tokens to background—grouping, merging, and reconstructing pixel-level maps via multi-stage feature fusion for downstream 3D object detection.
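
A minimal sketch of modality-aware token substitution in the spirit of TokenFusion, assuming image and point-cloud tokens have already been aligned one-to-one (e.g., by nearest-neighbor projection of points onto image patches); the scoring MLP and the threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):
    """Replace low-importance image tokens with the aligned point-cloud token."""
    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                   nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.threshold = threshold

    def forward(self, img_tokens: torch.Tensor, pc_tokens: torch.Tensor) -> torch.Tensor:
        """img_tokens, pc_tokens: (N, d) tokens aligned one-to-one across modalities."""
        s = self.score(img_tokens)                     # (N, 1) importance per image token
        mask = (s > self.threshold).float()
        # Keep informative image tokens; substitute the rest with point-cloud features.
        return mask * img_tokens + (1.0 - mask) * pc_tokens
```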

5. Considerations for Information Preservation and Model Robustness

Central to successful merging in 3D tasks is the safeguarding of critical geometric and semantic signals:

  • Spectrum Preservation (Tran et al., 25 May 2024): PiToMe demonstrates that energy-based ordering and careful merging allow the Laplacian eigen-spectrum, which encodes structural and segmentation information, to be preserved up to a small perturbation during merging (a minimal verification sketch follows this list).
  • Spatial Consistency and Position Restoration (Mao et al., 30 Mar 2025, Huang et al., 24 Jun 2025): “Prune and Merge” with reconstruct matrices and spatial shortcut connections supports high-fidelity restoration of pruned spatial information. ToSA synchronizes merging for both visual and spatial token branches, offering layer-wise scheduled control over semantic vs. spatial criteria.
  • Token Diversity and Attentive Decoupling (Long et al., 2022): Dual-stage merging ensures that local attentive tokens (typically salient geometry, object edges) are preserved, while inattentive (often globally redundant) tokens are merged via density or clustering-based methods to preserve background diversity.
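
A minimal sketch of how spectrum preservation can be checked empirically, by comparing Laplacian eigenvalues of the token similarity graph before and after merging; the affinity construction and comparison metric here are assumptions for illustration, not PiToMe's formal criterion.

```python
import torch

def token_graph_laplacian(tokens: torch.Tensor) -> torch.Tensor:
    """Unnormalized graph Laplacian of the cosine-affinity token graph."""
    x = tokens / tokens.norm(dim=-1, keepdim=True)
    w = (x @ x.T).clamp(min=0.0)            # non-negative affinities
    w.fill_diagonal_(0.0)
    return torch.diag(w.sum(dim=-1)) - w

def spectral_shift(tokens_before: torch.Tensor, tokens_after: torch.Tensor) -> float:
    """Mean absolute difference of the k smallest Laplacian eigenvalues,
    where k is the number of tokens remaining after merging."""
    k = min(tokens_before.shape[0], tokens_after.shape[0])
    ev_b = torch.linalg.eigvalsh(token_graph_laplacian(tokens_before))[:k]
    ev_a = torch.linalg.eigvalsh(token_graph_laplacian(tokens_after))[:k]
    return (ev_b - ev_a).abs().mean().item()
```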

6. Future Directions and Open Challenges

Major research prospects in token merging for 3D vision include:

  • Dynamic, Content-Dependent Schedules: Extending fixed or layer-wise merging rates to content-driven, context-aware schedules (e.g., as in ATM, CA-ToMe), potentially for real-time scene understanding and autonomous systems.
  • End-to-End Learning of Merge Policies: Decoupled embedding modules (e.g., DTEM (Lee et al., 13 Dec 2024)) and differentiable, soft grouping operators enable adaptive merging, and future work may further unify content encoding and merge policy learning for 3D transformers.
  • 3D-centric Graphical and Spectral Theories: Application of spectral graph and geometric reasoning (cf. PiToMe, GTP-ViT (Xu et al., 2023)) to better preserve segmentation, topology, and non-Euclidean geometric structures present in raw 3D data.
  • Extending to Real-Time and Edge Deployment: Hardware-friendly designs, caching (CA-ToMe), and norm/stability preserving interpolations (MLERP in ToFu) are critical for deployment in latency-constrained or memory-limited scenarios such as robotics, AR/VR, or connected vehicles.

7. Summary Table of Key 3D Token Merging Approaches

| Approach | 3D-Specific Innovations | Merging Criterion | Applications |
| --- | --- | --- | --- |
| PatchMerger (Renggli et al., 2022) | Fixed-output, position-independent softmax routing | Attention w/ learned W | ViT backbone, 3D patch tokens |
| TokenFusion (Wang et al., 2022) | Modality-aware substitution, residual PE | Importance + projection | RGB-depth, point cloud fusion, 3D detection |
| ToMe (Bolya et al., 2022) | Proportional attention, size-aware matching | Cosine sim. + size | Video, audio, point cloud, multi-frame |
| PiToMe (Tran et al., 25 May 2024) | Energy-based, spectrum preserving | Energy-ordered BSM | Accelerated 3D ViTs, spectral fidelity |
| ToC3D (Zhang et al., 1 Sep 2024) | History query guided, dynamic routing | Attention to query | Multi-view 3D detection, nuScenes |
| ToSA (Huang et al., 24 Jun 2025) | Depth-informed, spatial pseudo-tokens | Spatial + semantic sim. | VQA, embodied QA, spatial reasoning |
| DTEM (Lee et al., 13 Dec 2024) | Decoupled, learned merging embedding | Differentiable adj. | Segmentation, captioning, point clouds |
| UMIFormer (Zhu et al., 2023) | Inter-view KNN, STM clustering | Edge features + KNN | Multi-view 3D reconstruction |

This area remains under active development, with an increasing diversity of approaches for preserving both computational efficiency and the structural integrity necessary for robust 3D perception, reasoning, and generative modeling.
