MAT: Merging by Adjacent Token Similarity
- MAT is a token reduction technique that merges adjacent tokens based on cosine similarity, reducing computational complexity from quadratic to linear.
- It preserves local context by operating on spatial or temporal neighbors, ensuring minimal semantic loss in applications like vision, ASR, and time series.
- Empirical results demonstrate significant speedups and memory savings in transformer architectures while maintaining or improving model accuracy across diverse domains.
Merging by Adjacent Token Similarity (MAT) is a token reduction technique developed to improve computational efficiency in transformer architectures by identifying and merging locally similar tokens. Unlike global token similarity methods, which require quadratic computation, MAT constrains similarity calculations to spatially or temporally adjacent tokens, exploiting data locality and thereby reducing both computational cost and information loss. MAT is widely used across vision transformers, large vision-language models (LVLMs), ASR encoders, and time series transformers, offering a linear-complexity alternative to global token matching.
1. Definition and Core Methodology
Merging by Adjacent Token Similarity (MAT) identifies pairs or small groups of adjacent tokens in a sequence whose representations are most similar by a chosen metric—typically cosine similarity—and merges them into a single representative token. The procedure is formally defined as follows (Li et al., 28 Dec 2025):
Let $\mathbf{x}_i \in \mathbb{R}^d$ denote the feature vector for token $i$ (after applying a spatial locality-preserving ordering, such as the Hilbert curve for images). For an adjacent pair $(i, i+1)$ inside the mergeable token window (excluding protected tokens), the cosine similarity is:

$$s_i = \frac{\mathbf{x}_i \cdot \mathbf{x}_{i+1}}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_{i+1} \rVert}.$$

The MAT module computes similarities for all adjacent token pairs, selects the top $k$ (by similarity) for merging while avoiding overlapping intervals, and replaces each group $G$ of highly similar adjacent tokens with their (typically unweighted) average:

$$\mathbf{x}_G = \frac{1}{\lvert G \rvert} \sum_{i \in G} \mathbf{x}_i.$$

Token assignment and merging bookkeeping ensure each token is merged at most once per layer. The sequence is rebuilt with protected and merged (or unmerged) tokens, allowing the subsequent transformer block to process fewer tokens and thus run in significantly reduced time.
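A minimal NumPy sketch of this per-layer procedure is given below, assuming a fixed merge budget `k`, a boolean `protected` mask, and a greedy non-overlapping selection of the top-$k$ adjacent pairs; the function name and these implementation choices are illustrative rather than the reference implementation of Li et al.

```python
import numpy as np

def mat_merge(tokens: np.ndarray, k: int, protected: np.ndarray) -> np.ndarray:
    """Merge up to k most-similar adjacent token pairs into their averages.

    tokens:    (N, d) token features, already ordered so that adjacent
               indices are spatial/temporal neighbors.
    k:         per-layer merge budget (number of pairs to merge).
    protected: (N,) boolean mask of tokens excluded from merging (e.g., CLS).
    """
    N = tokens.shape[0]
    # Cosine similarity of each adjacent pair (i, i+1): O(N d) in total.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)          # shape (N-1,)
    # A pair is mergeable only if neither endpoint is protected.
    sims = np.where(protected[:-1] | protected[1:], -np.inf, sims)

    # Greedily take the top-k pairs by similarity, skipping pairs that overlap
    # an already-selected pair so each token is merged at most once per layer.
    used = np.zeros(N, dtype=bool)
    merge_start = set()
    for i in np.argsort(-sims):
        if len(merge_start) == k or sims[i] == -np.inf:
            break
        if used[i] or used[i + 1]:
            continue
        merge_start.add(i)
        used[i] = used[i + 1] = True

    # Rebuild the sequence: merged pairs become their unweighted average,
    # all remaining tokens (including protected ones) pass through unchanged.
    out, i = [], 0
    while i < N:
        if i in merge_start:
            out.append(tokens[i:i + 2].mean(axis=0))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return np.stack(out)
```

For images, the tokens are assumed to have been reordered along a locality-preserving curve beforehand, so that index adjacency reflects spatial adjacency (Section 2).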
2. Distinctive Features and Locality-Preserving Strategies
MAT is fundamentally distinguished by its restriction to adjacent (or nearly adjacent) tokens, tailored to domains where local continuity is a salient property—such as natural images, audio, or time series. In images, spatial adjacency is best preserved by reordering tokens along a Hilbert curve, which maintains 2D neighborhood relationships in 1D token sequences, ensuring that merges occur between semantically related regions (Li et al., 28 Dec 2025). In time series and speech, adjacency directly corresponds to temporal proximity (Götz et al., 2024, Li et al., 2023).
This locality constraint sharply reduces the required number of pairwise similarity computations from $O(N^2)$ to $O(N)$ (where $N$ is the sequence length), as only a linear number of comparisons are necessary. MAT does not rely on global clustering or expensive assignment, avoiding information loss at semantic boundaries by limiting merges to local contexts.
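For concreteness, a common way to obtain such an ordering is the classical iterative Hilbert-index computation sketched below; it assumes a square patch grid whose side is a power of two, and the helper names (`hilbert_index`, `hilbert_order`) are illustrative rather than the routine used in the cited work (non-power-of-two grids require padding or a generalized curve).

```python
import numpy as np

def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance of cell (x, y) along the Hilbert curve of an n x n grid
    (n must be a power of two)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(grid: int) -> np.ndarray:
    """Permutation from row-major patch indices to Hilbert-curve order."""
    keys = [hilbert_index(grid, x, y)
            for y in range(grid) for x in range(grid)]   # row-major scan
    return np.argsort(keys)

# Example: reorder 16x16 = 256 patch tokens so that neighbors in the 1D
# sequence are also neighbors in the 2D grid, then merge adjacently.
perm = hilbert_order(16)
tokens = np.random.randn(256, 384).astype(np.float32)
tokens_hilbert = tokens[perm]
```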
3. Algorithmic Variants and Domain-Specific Adaptations
Several notable algorithmic variants and enhancements of MAT have been proposed across domains:
- Hilbert-Curve Reordering for Vision Transformers: Spatial tokens are reordered so adjacent indices in the sequence are also spatial neighbors, maximizing the efficacy of adjacent merging schemes (Li et al., 28 Dec 2025).
- Geometry-Aware Anchor Selection for 3D Vision: LiteVGGT introduces geometry-aware anchor selection by scoring each token by a combination of edge magnitude (Sobel-filtered gradient map) and local semantic variability (variance map), protecting the top 10% of tokens and using the remainder for MAT-style merging. The merge assignments are cached and reused across layers to exploit interlayer similarity stability, amortizing the expensive assignment cost (Shu et al., 4 Dec 2025).
- Temporal Adjacency in Video and Speech: FrameFusion merges tokens across video frames based on the cosine similarity between corresponding spatial locations in consecutive frames. Chains of highly similar tokens form transitive merge groups, which are then merged before downstream pruning (Fu et al., 2024). Adjacent Token Merging (A-ToMe) for ASR operates along the temporal axis, merging only strictly consecutive tokens with high key-value similarity (Li et al., 2023).
- Local Radius Control in Time Series: MAT can generalize to a local window of radius $r$, merging tokens within neighborhoods of size $2r+1$; $r = 1$ corresponds to strictly adjacent merging. Selection can be threshold-based (all pairs with similarity above a threshold $\tau$) or by a fixed merge budget per layer (Götz et al., 2024); see the sketch after this list.
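As an illustration of the strictly adjacent, threshold-based case ($r = 1$), the sketch below merges every maximal run of consecutive time-series tokens whose pairwise similarity exceeds a threshold; the function name and the `tau` parameter are illustrative, not taken from Götz et al. (2024).

```python
import numpy as np

def threshold_merge_adjacent(tokens: np.ndarray, tau: float) -> np.ndarray:
    """Merge maximal runs of adjacent tokens whose cosine similarity > tau.

    tokens: (N, d) temporally ordered token features.
    tau:    similarity threshold; higher values merge fewer tokens.
    """
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)      # adjacent similarities
    out, run = [], [tokens[0]]
    for i in range(1, len(tokens)):
        if sims[i - 1] > tau:
            run.append(tokens[i])                    # extend the current run
        else:
            out.append(np.mean(run, axis=0))         # close the run: average
            run = [tokens[i]]
    out.append(np.mean(run, axis=0))
    return np.stack(out)

# A slowly varying series produces long runs and a much shorter output:
series = np.repeat(np.random.randn(8, 16), 4, axis=0)    # 32 tokens, 8 segments
print(threshold_merge_adjacent(series, tau=0.99).shape)   # (8, 16)
```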
4. Computational Complexity and Efficiency
By restricting merging to adjacent tokens, MAT achieves linear computational complexity in the number of tokens, a substantial improvement over the $O(N^2)$ complexity of global-pairwise schemes like ToMe. Let $N$ be the token count and $d$ the feature dimension:
- Similarity calculation: $O(Nd)$ per layer.
- Top-$k$ selection (merge budget): $O(N \log N)$ with sorting, or $O(N)$ with partial selection.
- Interval detection and merging bookkeeping: $O(N)$.
- Overall Layer Cost: $O(Nd)$, linear in the token count (Li et al., 28 Dec 2025).
The resulting reduction in token count, from $N$ to a smaller $N'$, directly translates to subsequent attention layers operating at $O(N'^2 d)$ rather than $O(N^2 d)$, yielding significant runtime and memory gains. For example, LiteVGGT reports substantial speedup and memory reduction in large-scale 3D reconstruction scenarios (Shu et al., 4 Dec 2025).
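As a back-of-the-envelope check of the quadratic saving, assuming the attention score computation scales as $N^2 d$ and ignoring constant factors:

```python
def attention_cost(n_tokens: int, dim: int) -> int:
    """Rough attention score cost, proportional to N^2 * d."""
    return n_tokens ** 2 * dim

N, d = 4096, 768                         # e.g., a long visual token sequence
kept = int(N * 0.5)                      # merge away half of the tokens
print(attention_cost(kept, d) / attention_cost(N, d))   # 0.25: ~4x cheaper
```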
5. Empirical Performance and Accuracy-Efficiency Tradeoffs
Across domains, MAT achieves strong accuracy-efficiency tradeoffs by merging only sufficiently similar, locally adjacent tokens and preserving tokens critical to semantic or geometric fidelity. Empirical results include:
- Vision Transformers: MAT matches or slightly exceeds the accuracy of global schemes like ToMe at moderate merge rates (e.g., 79.3% Top-1 accuracy for DeiT-S/224 with 3.3G FLOPs) while significantly reducing FLOPs and matching throughput, particularly as model size increases (Li et al., 28 Dec 2025).
- 3D Vision (LiteVGGT): When merging is guided by geometry-aware cues, LiteVGGT demonstrates lower geometric error (e.g., Chamfer Distance 0.428 on ScanNet-50) while running faster and using less memory than previous fast merging baselines (Shu et al., 4 Dec 2025).
- Video LVLMs (FrameFusion): A large reduction in vision tokens yields both LLM inference and end-to-end speedups, with only a small drop in retrieval accuracy (Fu et al., 2024).
- ASR (A-ToMe): A substantial fraction of tokens can be merged with measurable GPU speedup and no notable WER increase at moderate thresholds (Li et al., 2023).
- Time Series: For deep sequence models, MAT yields consistent speedups with negligible MSE degradation, and further speedups in foundation models such as Chronos (Götz et al., 2024).
Ablation studies reveal that cosine similarity (computed on mean-pooled key vectors) yields the best accuracy, and that locality-preserving orderings (e.g., Hilbert curve) further improve results by aligning adjacency with true spatial neighborhood structure (Li et al., 28 Dec 2025).
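A sketch of that similarity metric, assuming per-layer attention keys of shape (N, heads, d_head); the head-averaging before the cosine computation is the design choice the ablation refers to, while the shapes and names here are illustrative.

```python
import numpy as np

def adjacent_key_similarity(keys: np.ndarray) -> np.ndarray:
    """Cosine similarity of adjacent tokens computed on head-averaged keys.

    keys: (N, H, d_head) attention keys from one transformer layer.
    Returns an (N-1,) array of similarities between tokens i and i+1.
    """
    pooled = keys.mean(axis=1)                                    # (N, d_head)
    unit = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
    return np.sum(unit[:-1] * unit[1:], axis=1)

keys = np.random.randn(196, 6, 64)            # e.g., 14x14 patches, 6 heads
print(adjacent_key_similarity(keys).shape)    # (195,)
```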
6. Applications and Domain-Specific Considerations
MAT has been integrated into multiple domains and architectures:
- Vision Transformers: Applied during or after each attention block; requires a space-filling ordering for 2D inputs. MAT is differentiable (if required) and can be applied off-the-shelf at inference without retraining.
- 3D Vision and Video: Geometry-aware merging and temporal alignment, respectively. MAT is adapted to preserve structure and temporal continuity, critical for high-fidelity scene reconstruction or video understanding (Shu et al., 4 Dec 2025, Fu et al., 2024).
- ASR and Time Series: Merging leverages temporal locality, with fixed- or adaptive-radius neighborhoods and controlled merge rates to preserve information content where it matters (Li et al., 2023, Götz et al., 2024).
- Token Importance and Protection: Critical tokens (e.g., CLS tokens, or top-scoring tokens by geometric or semantic measures) are marked as protected and excluded from merging.
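A simple way to construct such a protection mask, loosely following the geometry-aware scoring described earlier (edge magnitude plus local variance) but with an illustrative score combination and a hypothetical `protect_fraction` parameter:

```python
import numpy as np

def protection_mask(patch_intensity: np.ndarray, patch_var: np.ndarray,
                    protect_fraction: float = 0.1) -> np.ndarray:
    """Mark the highest-scoring fraction of patch tokens as protected.

    patch_intensity: (H, W) per-patch mean intensity over the image grid.
    patch_var:       (H, W) per-patch feature variance (semantic variability).
    Returns a flattened (H*W,) boolean mask; True = excluded from merging.
    """
    # Edge magnitude from image gradients (a simple stand-in for a Sobel map).
    gy, gx = np.gradient(patch_intensity)
    edge = np.hypot(gx, gy)
    # Combine normalized edge strength and variance into one score per patch.
    score = edge / (edge.max() + 1e-8) + patch_var / (patch_var.max() + 1e-8)
    flat = score.ravel()
    k = max(1, int(protect_fraction * flat.size))
    threshold = np.partition(flat, -k)[-k]        # k-th largest score
    return flat >= threshold

mask = protection_mask(np.random.rand(16, 16), np.random.rand(16, 16))
print(int(mask.sum()), "of", mask.size, "tokens protected")    # 25 of 256
```

The resulting mask can then serve as the set of protected tokens excluded from merging, as in the sketch of Section 1.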
Empirical guidance includes careful tuning of merge rates or thresholds, as overly aggressive merging may degrade accuracy, especially in regions of high semantic or geometric complexity.
7. Relationship to Other Token Reduction Techniques
MAT is positioned as a linear-complexity alternative to global token merging or pruning methods. It contrasts with:
- Global Pairwise Merging/Clustering: Schemes such as ToMe and K-Medoids require $O(N^2)$ pairwise comparisons and often ignore locality, potentially merging dissimilar or non-contiguous tokens (Li et al., 28 Dec 2025).
- Hard Pruning Based on Importance: MAT complements or precedes pruning, as in FrameFusion, which merges highly similar tokens first, then prunes based on cumulative attention-based importance scores (Fu et al., 2024); a minimal sketch of this merge-then-prune ordering follows the list.
- Hybrid and Domain-Guided Strategies: MAT can be augmented with geometric information (LiteVGGT), importance weighting, or clustering in feature space, though such modifications either increase computational cost or require retraining to avoid accuracy loss (Shu et al., 4 Dec 2025).
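The sketch below illustrates that ordering only (merge near-duplicates first, then hard-prune by importance); it is a simplified stand-in, not FrameFusion's actual pipeline, and the importance scores are taken as given.

```python
import numpy as np

def merge_then_prune(tokens: np.ndarray, importance: np.ndarray,
                     merge_tau: float, keep: int) -> np.ndarray:
    """Merge near-duplicate adjacent tokens, then keep the top-`keep` by score.

    tokens:     (N, d) token features in temporal/spatial order.
    importance: (N,) per-token importance (e.g., attention-derived) scores.
    """
    # Stage 1: merge runs of adjacent tokens whose similarity exceeds
    # merge_tau, carrying each group's maximum importance forward.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    groups, scores, run, run_imp = [], [], [tokens[0]], [importance[0]]
    for i in range(1, len(tokens)):
        if sims[i - 1] > merge_tau:
            run.append(tokens[i])
            run_imp.append(importance[i])
        else:
            groups.append(np.mean(run, axis=0))
            scores.append(max(run_imp))
            run, run_imp = [tokens[i]], [importance[i]]
    groups.append(np.mean(run, axis=0))
    scores.append(max(run_imp))
    merged, scores = np.stack(groups), np.asarray(scores)

    # Stage 2: hard-prune the merged sequence to the `keep` most important
    # tokens, preserving their original order.
    top = np.sort(np.argsort(-scores)[:min(keep, len(merged))])
    return merged[top]
```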
Overall, MAT exploits local continuity properties inherent to images, video, speech, and time series, delivering near-optimal accuracy at a fraction of the computational cost of global approaches. Its modularity and compatibility with existing architectures have made it a practical standard for efficient large-model deployment.