Neighbor-Aware Pruning (NAP)

Updated 4 January 2026
  • Neighbor-Aware Pruning (NAP) is a token reduction method that leverages local neighbor similarities to efficiently prune redundant elements in transformer models.
  • NAP computes adjacent token similarities to preserve critical local context while significantly lowering computational cost and memory usage in attention layers.
  • By integrating with models in vision, speech, and time-series domains, NAP offers a plug-and-play solution to accelerate inference without retraining.

Merging by Adjacent Token Similarity (MAT) is a class of token reduction techniques for Transformer and related sequential models in vision, speech, and time-series domains. MAT exploits the local similarity structure among adjacent tokens—those neighboring in input space, in time, or after reordering—by identifying and aggregating highly similar adjacent token pairs (or small neighborhoods). This local merging both preserves critical local context and dramatically reduces the computational cost of attention and subsequent model layers. Unlike global token matching, MAT achieves linear or near-linear scaling with the sequence length and can be applied without retraining, making it a practical plug-in for broad classes of models (Li et al., 28 Dec 2025).

1. Foundational Concepts and Motivation

The core observation underlying MAT is that many sequences processed by deep learning models—images, audio, videos, and time series—exhibit strong local redundancy. In images, for example, patches corresponding to spatially adjacent regions are often highly similar due to local continuity. In speech or time series, temporally adjacent segments typically encode slowly varying signals or repeated patterns. Traditional global merging algorithms (e.g., ToMe's bipartite matching) incur O(T²) cost and can blur non-local semantics, while MAT uses adjacency—often enforced by reordering (e.g., Hilbert curve for images (Li et al., 28 Dec 2025))—to guide efficient local aggregation. This approach is widely validated: MAT matches or exceeds global methods in recognition accuracy at moderate merge rates, while drastically reducing attention costs and preserving local contextual information.

2. Formal Similarity Measure and Token Selection Strategy

The defining step in MAT algorithms is computing a similarity score between adjacent token pairs (or small neighborhoods). The canonical similarity metric is cosine similarity:

$$\text{similarity}(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\|_2\,\|x_j\|_2}$$

where $x_i$ and $x_j$ are the feature vectors of adjacent tokens in the current model layer. In vision, tokens are often pre-reordered (e.g., Hilbert curve) to maximize spatial neighborhood continuity so that adjacencies align with true semantic neighbors (Li et al., 28 Dec 2025). An analogous strategy applies across the temporal dimension for audio or sequential data (Li et al., 2023, Götz et al., 2024).
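
As a concrete illustration, the adjacent-pair similarities can be computed in a single vectorized pass. The sketch below is not taken from the cited papers; it simply assumes token features stacked in a [T, C] tensor and uses PyTorch's built-in cosine similarity.

```python
# Illustrative only: cosine similarity between each adjacent token pair (i, i+1)
# for features x of shape [T, C] in the current layer's ordering.
import torch
import torch.nn.functional as F

def adjacent_similarities(x: torch.Tensor) -> torch.Tensor:
    """x: [T, C] token features; returns a [T-1] vector of cosine similarities."""
    return F.cosine_similarity(x[:-1], x[1:], dim=-1)

sims = adjacent_similarities(torch.randn(8, 4))   # e.g. 8 tokens, 4-dim features
```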

The merging procedure typically selects the top-r adjacent pairs per layer, or merges all pairs exceeding a similarity threshold τ. Overlapping merges are resolved using greedy interval detection to ensure each token participates in at most one aggregation per merge step. After selecting candidate pairs, merged representations are computed as

$$\tilde{x}_{(i,j)} = \alpha x_i + (1-\alpha) x_j.$$

Most published MAT algorithms use uniform averaging (α = 0.5), though weighting by similarity is also viable (Li et al., 28 Dec 2025).
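
A minimal sketch of this selection-and-merge step, assuming the adjacent cosine similarities above and uniform averaging (α = 0.5). Greedy acceptance of pairs in similarity order stands in here for the greedy interval detection described in the text; merge_adjacent is an illustrative name, not an API from the cited work.

```python
# Illustrative sketch: merge up to r non-overlapping adjacent pairs per step.
import torch
import torch.nn.functional as F

def merge_adjacent(x: torch.Tensor, r: int, alpha: float = 0.5) -> torch.Tensor:
    """x: [T, C] tokens; returns a shorter [T - m, C] sequence with m <= r merges."""
    T = x.shape[0]
    sims = F.cosine_similarity(x[:-1], x[1:], dim=-1)       # [T-1] adjacent similarities
    order = torch.argsort(sims, descending=True).tolist()   # most similar pairs first

    used = [False] * T
    merged_at = {}                                           # left index -> merged vector
    for i in order:                                          # greedy conflict resolution
        if len(merged_at) == r:
            break
        if used[i] or used[i + 1]:                           # token already part of a merge
            continue
        merged_at[i] = alpha * x[i] + (1 - alpha) * x[i + 1]
        used[i] = used[i + 1] = True

    out, skip = [], False
    for i in range(T):                                       # rebuild sequence in original order
        if skip:                                             # right half of a merged pair
            skip = False
            continue
        if i in merged_at:
            out.append(merged_at[i])
            skip = True
        else:
            out.append(x[i])
    return torch.stack(out)
```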

3. Algorithmic Structure and Implementation

MAT modules are inserted between or within Transformer layers, usually after self-attention and before the feed-forward network. The canonical pipeline is:

  1. Reorder tokens (vision: Hilbert curve; audio/time-series: temporal order).
  2. Compute pairwise (local) similarities between all adjacent (or within-window) token pairs.
  3. Rank or threshold similarities to select top-r pairs for merging.
  4. Resolve overlapping merges via greedy interval detection.
  5. Aggregate merged tokens via weighted or uniform average; non-merged tokens are passed through unchanged.
  6. Update the token sequence for downstream attention/MLP blocks.
  7. (Optional) Unmerge after global blocks if downstream dense heads require full sequence (Shu et al., 4 Dec 2025).

This structure allows for both static (fixed merge rates) and adaptive (dynamic threshold or per-layer merge count) schedules. MAT typically requires no additional training and can be applied at inference in a plug-and-play manner. For tasks demanding preservation of certain tokens (e.g., class tokens in ViTs), MAT accommodates protected token sets, which are excluded from merging (Li et al., 28 Dec 2025).
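
The sketch below shows one plausible placement of such a MAT step inside a single Transformer block, reusing the merge_adjacent sketch above and protecting a leading class token from merging. The block layout (pre-norm attention followed by an MLP) and all module names are assumptions for illustration, not the architecture of any specific cited model.

```python
# Illustrative pre-norm Transformer block with a MAT step between attention and MLP.
# Assumes merge_adjacent() from the sketch above; operates on a single sample [1, T, C].
import torch
import torch.nn as nn

class MATBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, r: int, n_protected: int = 1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.r, self.n_protected = r, n_protected             # merge budget, protected tokens

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [1, T, C]
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # self-attention + residual
        tokens = x[0]
        protected = tokens[: self.n_protected]                 # e.g. the class token
        rest = merge_adjacent(tokens[self.n_protected:], self.r)   # MAT token reduction
        x = torch.cat([protected, rest], dim=0).unsqueeze(0)
        return x + self.mlp(self.norm2(x))                     # feed-forward on the shorter sequence
```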

4. Domain-Specific Extensions and Variants

Vision Transformers (ViTs):

  • MAT in ViTs leverages spatial adjacency via row-major or space-filling curves (Hilbert, Z-order) for 1D sequence construction. The Hilbert curve ordering preserves more locality than row-major, enhancing efficiency and accuracy (Li et al., 28 Dec 2025); a minimal ordering sketch follows this list.
  • Integration with 3D vision foundation models (e.g., VGGT) augments MAT with geometry-aware anchor selection based on Sobel-filtered gradient maps and local feature variance, selectively protecting tokens of geometric importance from merging and anchoring merges to low-importance spatial locations (Shu et al., 4 Dec 2025).
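
A minimal sketch of a Hilbert-curve ordering for square patch grids whose side is a power of two (the standard d2xy construction). The permutation it returns can reorder tokens before adjacent merging, and its inverse can restore the original order afterwards; function names are illustrative.

```python
# Illustrative Hilbert-curve ordering for an n x n patch grid (n a power of two).
def hilbert_d2xy(n: int, d: int):
    """Map position d along the Hilbert curve to (x, y) grid coordinates."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate/flip the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_permutation(grid: int):
    """Permutation taking row-major patch indices to Hilbert-curve order."""
    perm = []
    for d in range(grid * grid):
        x, y = hilbert_d2xy(grid, d)
        perm.append(y * grid + x)         # row-major index of the d-th curve point
    return perm

# e.g. tokens_hilbert = tokens[hilbert_permutation(16)]   # for a 16x16 patch grid
```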

Speech and Time Series:

  • In speech models, MAT (A-ToMe) merges the adjacent frames with the highest cosine similarity, using either fixed merge ratios or similarity thresholds. In this domain, MAT significantly reduces sequence length early in the pipeline (up to 57% token reduction and ~70% GPU speedup) without sacrificing transcription quality (Li et al., 2023).
  • For long-sequence time series and state-space models, MAT generalizes to local-neighborhood merging, where the merge window size (k) interpolates between pure adjacency (k=1, linear cost) and wider, near-global neighborhoods (k≈t/2, quadratic cost). Empirical evaluation on foundation models (e.g., Chronos) demonstrates that local merging yields up to 54× acceleration (Götz et al., 2024).
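
As an illustration of how the window size trades cost for coverage, the sketch below enumerates candidate pairs within a window of k positions: k = 1 recovers pure adjacency with O(T) candidates, while k approaching T/2 approaches the O(T²) global candidate set. The function name and the use of cosine similarity for scoring are assumptions for illustration.

```python
# Illustrative local-window candidate generation for merging.
import torch
import torch.nn.functional as F

def local_candidate_pairs(x: torch.Tensor, k: int):
    """x: [T, C] tokens; yields (i, j, similarity) for all pairs with 0 < j - i <= k."""
    xn = F.normalize(x, dim=-1)                              # unit-norm features
    for offset in range(1, k + 1):
        sims = (xn[:-offset] * xn[offset:]).sum(-1)          # cosine similarity at this offset
        for i, s in enumerate(sims.tolist()):
            yield i, i + offset, s
```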

Video and Multi-frame Models:

  • MAT can operate across both spatial and temporal axes by, for example, merging tokens at the same spatial position across adjacent video frames when their features are highly similar (Fu et al., 2024). This provides a powerful mechanism for compressing redundant tokens in long sequences while cascading seamlessly into importance-based pruning.
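
A hedged sketch of this spatio-temporal variant, assuming two adjacent frames with aligned patch grids of shape [N, C] and a similarity threshold τ: positions above the threshold collapse to their average, and all other tokens are kept. The threshold value and function name are illustrative, and output token ordering is ignored for brevity.

```python
# Illustrative temporal merge of two adjacent, spatially aligned video frames.
import torch
import torch.nn.functional as F

def merge_frame_pair(frame_a: torch.Tensor, frame_b: torch.Tensor, tau: float = 0.8) -> torch.Tensor:
    """frame_a, frame_b: [N, C] patch tokens on the same grid; returns merged tokens."""
    sim = F.cosine_similarity(frame_a, frame_b, dim=-1)       # [N] per-position similarity
    merge = sim >= tau                                        # positions redundant across frames
    averaged = 0.5 * (frame_a[merge] + frame_b[merge])        # one token per merged position
    kept = torch.cat([frame_a[~merge], frame_b[~merge]])      # keep both tokens where dissimilar
    return torch.cat([averaged, kept])
```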

5. Computational Complexity and Trade-Offs

MAT is distinguished by its asymptotic efficiency: all published MAT strategies restrict similarity computation to local neighborhoods, typically yielding O(T·C) cost per block (T = token count, C = feature dimension), compared to O(T²·C) for global merge schemes (Li et al., 28 Dec 2025). The total cost includes O(T·C) for similarity computation, O(T + r log T) for merge candidate selection, and minimal additional overhead for bookkeeping.
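
For a sense of scale, a back-of-the-envelope comparison of the similarity step for a ViT-B/16-sized input (T = 196 patch tokens, C = 768 channels); these are illustrative operation counts, not measurements from the cited papers.

```python
# Illustrative dot-product counts for the similarity step (one dot product = C multiply-adds).
T, C = 196, 768                        # ViT-B/16: 14x14 patch tokens, 768-dim features
local_cost  = (T - 1) * C              # adjacent pairs only (MAT): ~150k multiply-adds
global_cost = T * (T - 1) // 2 * C     # all unordered pairs (global matching): ~14.7M
print(global_cost / local_cost)        # ratio = T / 2 = 98.0
```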

MAT’s speed-accuracy Pareto frontier is characterized by tunable merge ratios or thresholds. Empirical studies show:

  • Moderate merge rates (30–50%) yield near-baseline accuracy (≤0.1–0.3% degradation) in ViTs/DeiT across a range of scales (Li et al., 28 Dec 2025).
  • Aggressive token reduction (up to 70%) can result in modest accuracy drop (<3%) while providing 3–4× LLM speedups in video-LLMs (Fu et al., 2024).
  • For ASR models, WER remains unchanged or slightly improves at moderate merge rates (Li et al., 2023).

Domain-specific variants (e.g., geometry-aware anchors for 3D vision (Shu et al., 4 Dec 2025)) further optimize this trade-off by prioritizing perceptual or task-defined feature regions, ensuring essential structural information is preserved.

6. Experimental Results and Comparative Analysis

MAT achieves competitive or state-of-the-art accuracy-efficiency trade-offs across domains. In vision:

  • MAT matches or slightly exceeds ToMe and other global-matching methods at the same FLOPs on multiple ViT backbones, with improvements ranging from +0.1% to +0.4% Top-1 accuracy at moderate merge rates (Li et al., 28 Dec 2025).
  • MAT outperforms K-Medoids and LoTM at equal cost and is less sensitive to the choice of similarity threshold due to conflict-avoidance and locality.
  • The use of locality-preserving orderings (e.g., Hilbert) consistently improves resilience to merge-induced information loss (Li et al., 28 Dec 2025).

In 3D vision (LiteVGGT):

  • Geometry-aware anchor selection and merge index caching result in 10× latency reduction and ~40% memory savings with no increase (and even a slight reduction) in geometric error (Chamfer distance, CD) compared to both pure MAT and unoptimized variants. Retention of edge and texture detail is substantially enhanced relative to pure MAT (Shu et al., 4 Dec 2025).

In time-series and speech, MAT enables linear scaling and can provide nontrivial denoising, as the merging operation acts as a low-pass filter in the spectral domain (Götz et al., 2024). The practical guideline is to merge tokens only if adjacent similarity exceeds a domain- and task-tuned threshold (typically 0.7–0.9), to avoid conflation of semantically distinct signals.

7. Limitations, Extensions, and Future Directions

MAT is most effective where local redundancy is strong and semantic discontinuities are sparse. Its performance may degrade in regions of rapid or non-local signal change, where local adjacency is a poor proxy for similarity. Geometry- or importance-aware variants can address this but require additional feature computation.

Several promising extensions include:

  • Data-dependent merge weights (e.g., softmaxed similarities), fully-differentiable merges, and learned merge scheduling.
  • Integration with global pruning, clustering, or entropy-based adaptive merge ratios, particularly for multi-modal or highly heterogeneous sequences (Shu et al., 4 Dec 2025, Fu et al., 2024).
  • Streaming and online variants for real-time inference in audio or event-driven domains (Li et al., 2023).
  • Causal (autoregressive) merging for transformer decoders in time series (Götz et al., 2024).

A plausible implication is that as model and sequence sizes continue to increase across domains, MAT and its domain-adapted extensions will remain a critical strategy for scaling, as well as for real-world deployment of transformer-class models under stringent efficiency constraints.
