PiToMe: Spectrum-Preserving Token Merging

Updated 16 April 2026
  • PiToMe is a spectrum-preserving token merging framework that protects informative tokens using an energy score derived from spectral graph theory.
  • It reduces computational overhead by merging redundant tokens, achieving 40–60% FLOPs savings while limiting performance drops to ≤0.5%.
  • PiToMe refines Bipartite Soft Matching algorithms with theoretical guarantees on preserving token similarity graph spectra, enhancing efficiency across vision and language tasks.

PiToMe (“Protect Informative Tokens before Merging”) is a spectrum-preserving framework for accelerating Transformer-based models by merging redundant token representations while safeguarding informative tokens. By introducing an energy-based token selection strategy rooted in spectral graph theory, PiToMe achieves substantial reductions in computational overhead—saving 40–60% of floating-point operations (FLOPs) across language and vision tasks—while incurring minimal performance degradation (≤0.5% drops in standard benchmarks). PiToMe refines the Bipartite Soft Matching (BSM) family of token merging algorithms, providing both theoretical guarantees on the preservation of spectral properties in the token similarity graph and empirically validated improvements over existing merging methodologies such as ToMe, DiffRate, and ToFu (Tran et al., 2024).

1. Token Merging in Transformer Architectures

Transformer models exhibit compute and memory requirements that scale quadratically with the input sequence length $N$, which is especially costly for high-resolution images or long text sequences. Token merging frameworks were proposed to mitigate this bottleneck by reducing the effective number of tokens at intermediate layers. The generic Bipartite Soft Matching (BSM) framework operates via: (i) partitioning the $N$ tokens into sets $A$ and $B$ with $|A| = |B|$, (ii) computing pairwise affinities (typically cosine similarity) between tokens in $A$ and $B$, (iii) selecting the $k$ highest-similarity pairs for merging via weighted averaging, and (iv) propagating the reduced token set to the next layer. Algorithms in this family, such as ToMe, ToFu, and DiffRate, differ in their partitioning heuristics and candidate ranking strategies. However, prior approaches are sensitive to the token-splitting scheme and risk merging informative or unique tokens, especially in deeper layers, resulting in unnecessary performance loss (Tran et al., 2024).
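
Steps (i) to (iv) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of ToMe or any other published method: the alternating split, the function names `cosine` and `bsm_merge`, and the use of an unweighted mean (every token counted with size 1) are all choices made here for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def bsm_merge(tokens, k):
    """One generic bipartite-soft-matching step; removes up to k tokens.

    Assumes len(tokens) >= 2 so that both partitions are non-empty.
    """
    # (i) Alternating split into A and B (an assumed heuristic).
    set_a, set_b = tokens[0::2], tokens[1::2]
    # (ii) For each token in A, find its most similar partner in B.
    best = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        best.append((sims[j], i, j))
    # (iii) Keep only the k highest-similarity pairs for merging.
    best.sort(reverse=True)
    merged_pairs = best[:k]
    merged_a = {i for _, i, _ in merged_pairs}
    # Accumulate sums and counts so each merged B token becomes a mean.
    sums = [list(b) for b in set_b]
    counts = [1] * len(set_b)
    for _, i, j in merged_pairs:
        sums[j] = [s + x for s, x in zip(sums[j], set_a[i])]
        counts[j] += 1
    merged_b = [[s / c for s in row] for row, c in zip(sums, counts)]
    # (iv) Unmerged A tokens plus the (possibly averaged) B tokens move on.
    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged_a]
    return kept_a + merged_b
```

Because the split in step (i) ignores token content, two near-duplicate tokens that land in the same partition can never be merged with each other, which is one way the sensitivity to splitting schemes shows up.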

2. Energy Score: Spectral Metric for Token Informativeness

PiToMe advances the BSM paradigm by introducing an energy score to quantify token informativeness. The energy score leverages the topology of the token similarity graph $G = (V, W)$, whose edge weights $W[i, j] = 1 - \cos(v_i, v_j)$ are cosine distances between token vectors in the embedding space. For a token $i$ with key vector $v_i$, the energy score is defined as

$$E_i(v_i, W[i,:]) = \frac{1}{N} \sum_{j \in \mathcal{N}(i)} f_m(\cos(v_i, v_j)), \qquad f_m(x) = \begin{cases} x & \text{if } x \ge m, \\ \alpha(\exp(x - m) - 1) & \text{otherwise,} \end{cases}$$

where $\mathcal{N}(i)$ denotes the neighbors of token $i$, $\alpha$ is a scaling coefficient, and $m$ is a layer-dependent margin: $m = 0.9 - 0.9 \times l_i / l$ (with $l_i$ denoting the current layer and $l$ the total number of layers). High energy scores correspond to tokens embedded in large, redundant clusters (e.g., backgrounds in images), while low scores identify isolated, potentially informative tokens, which PiToMe seeks to protect from merging (Tran et al., 2024).
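
To make the behavior of the score concrete, the sketch below computes a margin-based energy for a small set of key vectors. It rests on several assumptions: the piecewise map passes similarities at or above the margin unchanged and exponentially damps those below it, every other token counts as a neighbor, and the margin `m` and coefficient `alpha` are plain arguments with illustrative values. The names `f_margin` and `energy_scores` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def f_margin(x, m, alpha=1.0):
    """Assumed piecewise map: identity at or above the margin m,
    a damped (negative) contribution below it."""
    return x if x >= m else alpha * (math.exp(x - m) - 1.0)

def energy_scores(keys, m, alpha=1.0):
    """Energy of each token: mean margin-mapped similarity to the others.

    Assumption: the neighbor set of token i is every token j != i.
    """
    n = len(keys)
    scores = []
    for i in range(n):
        total = sum(
            f_margin(cosine(keys[i], keys[j]), m, alpha)
            for j in range(n)
            if j != i
        )
        scores.append(total / n)
    return scores

# Three identical "background" tokens and one isolated token.
keys = [[1, 0], [1, 0], [1, 0], [0, 1]]
scores = energy_scores(keys, m=0.5)
```

In this toy input the three clustered tokens share the highest energy, while the isolated token receives the lowest and would therefore be the one protected from merging.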

3. PiToMe Algorithmic Pipeline

The PiToMe merging workflow proceeds as follows:

  1. Compute token key vectors $v_i$ from hidden token states $X$.
  2. Construct the token similarity graph via $W[i, j] = 1 - \cos(v_i, v_j)$.
  3. Calculate the energy score $E_i$ for each token using the piecewise-defined $f_m$.
  4. Let $k = N - rN$ be the number of tokens to remove, where $r$ is the target retain rate.
  5. Sort $E$ in descending order to obtain an index array $s$; designate the top $2k$ tokens as mergeable and the bottom $N - 2k$ as protected.
  6. Partition the mergeable tokens into sets $A$ (odd indices) and $B$ (even indices).
  7. Using BSM, for each $a \in A$ select the closest $b \in B$ (by affinity), and merge via weighted averaging of their embeddings and patch counts.
  8. The merged and protected tokens are concatenated for propagation to the next layer, yielding $rN$ tokens for subsequent processing (Tran et al., 2024).
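
The eight steps above can be sketched end to end as follows. This is a hedged illustration rather than the authors' code: it uses the token vectors themselves as keys, treats every other token as a neighbor, rounds the retained count down, caps the number of removed tokens at half the sequence, and takes the margin `m` and coefficient `alpha` as plain arguments. The function name `pitome_step` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pitome_step(tokens, sizes, r, m, alpha=1.0):
    """Sketch of one protect-then-merge step; returns (tokens, sizes)."""
    n = len(tokens)
    # Number of tokens to remove; rounding and the n // 2 cap are assumptions.
    k = min(n - math.floor(r * n), n // 2)
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]

    def f(x):
        # Assumed piecewise margin function.
        return x if x >= m else alpha * (math.exp(x - m) - 1.0)

    energy = [sum(f(sim[i][j]) for j in range(n) if j != i) / n
              for i in range(n)]
    # Highest-energy tokens are mergeable; the rest are protected.
    order = sorted(range(n), key=lambda i: energy[i], reverse=True)
    mergeable, protected = order[:2 * k], order[2 * k:]
    # Alternate positions in energy order to form the two BSM sets.
    set_a, set_b = mergeable[0::2], mergeable[1::2]

    # Size-weighted sums for each B token; A tokens are folded into them.
    sums = {b: [x * sizes[b] for x in tokens[b]] for b in set_b}
    count = {b: sizes[b] for b in set_b}
    for a in set_a:
        b = max(set_b, key=lambda j: sim[a][j])  # closest partner in B
        sums[b] = [s + x * sizes[a] for s, x in zip(sums[b], tokens[a])]
        count[b] += sizes[a]

    out_tokens = [list(tokens[i]) for i in protected]
    out_sizes = [sizes[i] for i in protected]
    for b in set_b:
        out_tokens.append([s / count[b] for s in sums[b]])
        out_sizes.append(count[b])
    return out_tokens, out_sizes
```

Calling `pitome_step(tokens, [1] * len(tokens), r=0.75, m=0.5)` on eight tokens returns six: the low-energy tokens pass through unchanged and the patch counts in `sizes` record how many original tokens each output represents.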

4. Spectrum Preservation: Theoretical Guarantees

PiToMe formalizes token merging as a graph-coarsening operation, ensuring preservation of the graph Laplacian spectrum under mild assumptions. Let $G$ be the token similarity graph, and $G_c$ its coarsened form post-merging. Define $L$ and $L_c$ as the normalized Laplacians of $G$ and $G_c$; the spectral distance is

$$\mathrm{SD}(G, G_c) = \|\lambda - \lambda_l\|_1 = \sum_{i=1}^{N} |\lambda_i - \lambda_{l,i}|,$$

where $L_l$ is the coarsened Laplacian $L_c$ lifted to the original node space, and $\lambda$ and $\lambda_l$ denote the eigenvalues of $L$ and $L_l$. The key theorem asserts that, when clusters are well separated, $\mathrm{SD}(G, G_c) \to 0$ as the intra-cluster cosine similarity converges to $1$, whereas random splitting (e.g., in ToMe) allows $\mathrm{SD}(G, G_c) > 0$ with nonzero probability. The proofs leverage row-wise difference bounds, ordered merging analysis, and matrix perturbation (Weyl's inequality) to establish spectral consistency (Tran et al., 2024).
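
The role of Weyl's inequality can be illustrated with the standard perturbation bound below. It is a sketch of the general argument, not necessarily the exact chain of bounds used in the paper. The notation is assumed here: $L$ is the original Laplacian, $L_l$ the lifted coarsened Laplacian, both symmetric with eigenvalues sorted in the same order, and the spectral distance is taken to be the sum of absolute eigenvalue gaps.

```latex
% Weyl's inequality for symmetric N x N matrices L and L_l,
% with eigenvalues sorted in the same order:
|\lambda_i(L) - \lambda_i(L_l)| \le \|L - L_l\|_2, \qquad i = 1, \dots, N.
% Summing over i bounds the l1 gap between the two spectra:
\sum_{i=1}^{N} |\lambda_i(L) - \lambda_i(L_l)| \le N \, \|L - L_l\|_2.
```

Under this bound, any merging rule that drives the perturbation $\|L - L_l\|_2$ to zero also drives the spectral distance to zero, which is why row-wise bounds on the difference between the two Laplacians are enough to control the spectrum.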

5. Hyperparameters and Ablative Insights

Key hyperparameters for PiToMe include the retain rate $r$, the number of merged tokens per layer $k$, the margin schedule $m$ (increasing selectivity in deeper layers), the scaling coefficient $\alpha$, and the odd/even energy-based splitting for BSM pairing. Ablation experiments reveal that bypassing protection of low-energy tokens results in significantly higher accuracy degradation, and that random rather than sorted-energy splitting also costs accuracy. The energy score outperforms alternatives based on CLS-attention or mean attention in recall and accuracy. Comparing fixed-$r$ and fixed-$k$ scheduling shows that fixed-$k$ is less FLOPs-efficient early on and yields additional accuracy loss (Tran et al., 2024).
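
The difference between the two schedules is easy to see with a little arithmetic. The sketch below compares the token count entering each layer under a fixed retain ratio and under a fixed number of removed tokens, using the sum of squared token counts as a rough proxy for attention cost. The sequence length, layer count, ratio, and per-layer removal are illustrative values chosen here, not settings reported in the source, and the function names are hypothetical.

```python
import math

def counts_fixed_ratio(n, r, layers):
    """Token counts when each layer keeps a fraction r of its input."""
    counts = [n]
    for _ in range(layers):
        counts.append(max(1, math.floor(counts[-1] * r)))
    return counts

def counts_fixed_k(n, k, layers):
    """Token counts when each layer removes a fixed number k of tokens."""
    counts = [n]
    for _ in range(layers):
        counts.append(max(1, counts[-1] - k))
    return counts

def attention_cost(counts):
    """Rough proxy for attention FLOPs: squared token count per layer input."""
    return sum(c * c for c in counts[:-1])

# Illustrative setting: 200 tokens, 12 layers, similar final token counts.
ratio_counts = counts_fixed_ratio(200, 0.9, 12)
fixed_counts = counts_fixed_k(200, 12, 12)
ratio_cost = attention_cost(ratio_counts)
fixed_cost = attention_cost(fixed_counts)
```

Both schedules end with a similar number of tokens, but the ratio-based schedule removes more of them in the first layers, where sequences are longest, so its cumulative cost proxy is lower.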

6. Empirical Results Across Vision and Language Tasks

Comprehensive benchmarks demonstrate PiToMe's efficacy in both efficiency and task performance:

| Task (Backbone) | Baseline Score | PiToMe Score (Reduction/Drop) | Competing Methods |
| --- | --- | --- | --- |
| ImageNet-1k (ViT-MAE-H) | 86.9% (top-1) | 86.4% (−0.5%, ≈60% FLOPs) | ToMe: 85.9% (−1.0%), DiffRate: 85.9% (−1.0%) |
| Flickr30k (CLIP-ViT-L) | 572.24 (Rsum) | 567.58 (−0.45%) @ 38.6 GViT-FLOPs | ToMe: 564.10 (−1.5%), DiffRate: 564.03 (−1.6%) |
| VQA-v2 (LLaVA-7B) | 76.6% | 75.4% (−1.2%, −37% time) | ToMe: 75.2% (−1.4%), DiffRate: 72.0% (−6.0%) |
| IMDb (BERT, 12L) | 94.0% | 93.2% (−0.8%, ×1.9 eval, ×1.8 train) | ToMe: 93.3% (−0.7%), DiffRate: 92.4% (−1.6%) |

Performance metrics favor PiToMe on ImageNet-1k, Flickr30k, and VQA-v2; on IMDb it trails ToMe by 0.1 points (93.2% versus 93.3%) while remaining ahead of DiffRate. Reported speedups reach ×1.9 for evaluation and ×1.8 for training on IMDb (Tran et al., 2024).

7. Context and Significance in the Transformer Ecosystem

PiToMe addresses key deficits in prior BSM-based token merging by robustly preserving informative content through spectral analysis. The spectrum-consistency guarantee distinguishes it from approaches reliant on random or heuristic splitting. Empirical superiority is observed across modalities (vision, text, multimodal), backbones (ViT, CLIP, BERT, LLaVA), and datasets (ImageNet, Flickr30k, VQA-v2, IMDb). A plausible implication is broader applicability in domains where computational cost is a primary constraint, given the maintained task accuracy and theoretical guarantees. PiToMe’s methodological advances highlight the importance of preserving structural properties (via spectrum) in compression and acceleration schemes for deep models (Tran et al., 2024).
