PiToMe: Spectrum-Preserving Token Merging
- PiToMe is a spectrum-preserving token merging framework that protects informative tokens using an energy score derived from spectral graph theory.
- It reduces computational overhead by merging redundant tokens, achieving 40–60% FLOPs savings while limiting performance drops to ≤0.5%.
- PiToMe refines Bipartite Soft Matching algorithms with theoretical guarantees on preserving token similarity graph spectra, enhancing efficiency across vision and language tasks.
PiToMe (“Protect Informative Tokens before Merging”) is a spectrum-preserving framework for accelerating Transformer-based models by merging redundant token representations while safeguarding informative tokens. By introducing an energy-based token selection strategy rooted in spectral graph theory, PiToMe achieves substantial reductions in computational overhead—saving 40–60% of floating-point operations (FLOPs) across language and vision tasks—while incurring minimal performance degradation (≤0.5% drops in standard benchmarks). PiToMe refines the Bipartite Soft Matching (BSM) family of token merging algorithms, providing both theoretical guarantees on the preservation of spectral properties in the token similarity graph and empirically validated improvements over existing merging methodologies such as ToMe, DiffRate, and ToFu (Tran et al., 2024).
1. Token Merging in Transformer Architectures
Transformer models exhibit compute and memory requirements that scale quadratically with the input sequence length $N$, which is especially costly in applications involving high-resolution images or long text sequences. Token merging frameworks were proposed to mitigate this bottleneck by reducing the effective number of tokens at intermediate layers. The generic Bipartite Soft Matching (BSM) framework operates by (i) partitioning tokens into two sets $\mathcal{A}$ and $\mathcal{B}$, (ii) computing pairwise affinities (typically cosine similarity) between tokens in $\mathcal{A}$ and $\mathcal{B}$, (iii) selecting the top-$k$ highest-similarity pairs for merging via weighted averaging, and (iv) propagating the reduced token set to the next layer. Algorithms in this family, such as ToMe, ToFu, and DiffRate, differ in their partitioning heuristics and candidate ranking strategies. However, prior approaches are sensitive to the token-splitting scheme and risk merging informative or unique tokens, especially in deeper layers, resulting in unnecessary performance loss (Tran et al., 2024).
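The four BSM steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any of the cited methods: the function name `bipartite_soft_matching`, the alternating split, and the unweighted averaging are assumptions made for brevity, and tokens are assumed to have nonzero norm.

```python
import numpy as np

def bipartite_soft_matching(x, k):
    """Remove k tokens from x (shape (N, d)) with one BSM-style step.

    Assumptions (illustrative only): even rows form set A, odd rows form
    set B, and merged tokens are combined by a plain (unweighted) mean.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[0::2], x[1::2]                        # (i) split into A and B
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                              # (ii) cosine affinities, (|A|, |B|)
    best_b = sim.argmax(axis=1)                    # closest B token for each A token
    best_sim = sim.max(axis=1)
    order = np.argsort(-best_sim)                  # A tokens, most similar first
    merged_a, kept_a = order[:k], order[k:]        # (iii) top-k pairs get merged
    sums, counts = b.copy(), np.ones(len(b))
    np.add.at(sums, best_b[merged_a], a[merged_a])  # accumulate A into its partner
    np.add.at(counts, best_b[merged_a], 1.0)
    b_out = sums / counts[:, None]
    return np.concatenate([a[kept_a], b_out], axis=0)  # (iv) reduced token set
```

Each call shortens the sequence by exactly `k` tokens, so applying it in every layer shrinks the attention cost progressively.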
2. Energy Score: Spectral Metric for Token Informativeness
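PiToMe advances the BSM paradigm by introducing an energy score to quantify token informativeness. The energy score leverages the topology of the token similarity graph $\mathcal{G}$, where edge weights reflect vector similarities in the embedding space. For a token $i$ with key vector $\mathbf{v}_i$, the energy score is defined as
$$E_i = \frac{1}{N} \sum_{j \in \mathcal{N}(i)} f_m\big(\cos(\mathbf{v}_i, \mathbf{v}_j)\big), \qquad f_m(x) = \begin{cases} x & \text{if } x \ge m, \\ \alpha\,(e^{x - m} - 1) & \text{otherwise,} \end{cases}$$
where $\mathcal{N}(i)$ denotes the neighbors of token $i$, $\alpha$ is a scaling coefficient, and $m$ is a layer-dependent margin: $m = 0.9 - 0.9\, l_i / l$ (with $l_i$ denoting the current layer and $l$ the total number of layers). High energy scores correspond to tokens embedded in large, redundant clusters (e.g., backgrounds in images), while low scores identify isolated, potentially informative tokens, which PiToMe seeks to protect from merging (Tran et al., 2024).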
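A small NumPy sketch may help make the energy score concrete. It assumes a margin-thresholded function that keeps similarities at or above the margin and maps the rest through `alpha * (exp(x - margin) - 1)`; the function name `energy_scores`, the default `alpha=1.0`, and the choice to average over every token (including the token itself) are illustrative assumptions rather than details confirmed by the text.

```python
import numpy as np

def energy_scores(keys, margin, alpha=1.0):
    """Energy score of each token from its key vector (keys: shape (N, d)).

    Assumptions (illustrative only): all tokens count as neighbours,
    including the token itself, and alpha=1.0 is a placeholder default.
    """
    keys = np.asarray(keys, dtype=float)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = kn @ kn.T                                  # pairwise cosine similarity
    # Similarities at or above the margin count fully; the others are
    # mapped to a negative value in (-alpha, 0).
    f = np.where(sim >= margin, sim, alpha * (np.exp(sim - margin) - 1.0))
    return f.mean(axis=1)

# Three near-duplicate tokens and one isolated token.
keys = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
scores = energy_scores(keys, margin=0.5)
```

In this toy input the three duplicates receive a high score and the isolated fourth token receives the lowest one, which is the token a protection step would keep out of the merge.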
3. PiToMe Algorithmic Pipeline
The PiToMe merging workflow proceeds as follows:
- Compute token key vectors $\mathbf{v}_i$ from the hidden token states $\mathbf{x}_i$.
- Construct the token similarity graph $\mathcal{G}$ from the pairwise cosine similarities $\cos(\mathbf{v}_i, \mathbf{v}_j)$.
- Calculate the energy score $E_i$ for each token using the piecewise-defined margin function $f_m$.
- Let $k = N - \lfloor rN \rfloor$ be the number of tokens to remove, where $r$ is the target retain rate.
- Sort the scores $E_i$ in descending order to obtain an index array $\mathbf{s}$; designate the top $2k$ tokens as mergeable and the bottom $N - 2k$ as protected.
- Partition the mergeable tokens into sets $\mathcal{A}$ (odd indices) and $\mathcal{B}$ (even indices).
- Using BSM, for each $a \in \mathcal{A}$ select the closest $b \in \mathcal{B}$ (by affinity), and merge the pair by averaging their embeddings weighted by patch counts, summing the counts.
- The merged and protected tokens are concatenated for propagation to the next layer, yielding $N - k$ tokens for subsequent processing (Tran et al., 2024).
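The workflow above can be sketched end to end as follows. This is a simplified, single-sequence illustration under stated assumptions, not the authors' code: the function name `pitome_step`, the explicit `margin` and `alpha` arguments, the use of `floor` when computing the number of removed tokens, and the inclusion of the self-similarity term in the energy mean are all choices made here for the example.

```python
import numpy as np

def pitome_step(x, keys, sizes, r, margin, alpha=1.0):
    """One energy-guided merge step; returns (tokens, sizes).

    x: (N, d) hidden states, keys: (N, d_k) key vectors, sizes: (N,) patch
    counts, r: retain rate. Names and defaults are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    keys = np.asarray(keys, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    n = x.shape[0]
    k = n - int(np.floor(r * n))                   # number of tokens to remove
    if k == 0:
        return x, sizes
    assert 2 * k <= n, "pairing 2k tokens requires roughly r >= 0.5"
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = kn @ kn.T                                # token similarity graph
    energy = np.where(sim >= margin, sim,
                      alpha * (np.exp(sim - margin) - 1.0)).mean(axis=1)
    order = np.argsort(-energy)                    # descending energy
    mergeable, protected = order[:2 * k], order[2 * k:]
    # 1st, 3rd, ... ranked tokens form A; 2nd, 4th, ... form B.
    a_idx, b_idx = mergeable[0::2], mergeable[1::2]
    dst = sim[np.ix_(a_idx, b_idx)].argmax(axis=1)  # closest B for each A
    num = sizes[b_idx][:, None] * x[b_idx]         # size-weighted sums
    den = sizes[b_idx].copy()
    np.add.at(num, dst, sizes[a_idx][:, None] * x[a_idx])
    np.add.at(den, dst, sizes[a_idx])
    merged = num / den[:, None]                    # size-weighted average
    return (np.concatenate([x[protected], merged], axis=0),
            np.concatenate([sizes[protected], den]))

# Four duplicate tokens plus two isolated ones; keys equal the hidden states.
x = np.array([[1.0, 0], [1, 0], [1, 0], [1, 0], [0, 1], [-1, 0]])
tokens, new_sizes = pitome_step(x, x, np.ones(6), r=0.7, margin=0.5)
```

With these inputs the four duplicates are selected as mergeable and collapse into two tokens, while the two isolated tokens pass through unchanged, so six tokens become four and the patch counts still sum to six.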
4. Spectrum Preservation: Theoretical Guarantees
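PiToMe formalizes token merging as a graph-coarsening operation, ensuring preservation of the graph Laplacian spectrum under mild assumptions. Let $\mathcal{G}$ be the token similarity graph, and $\mathcal{G}_c$ its coarsened form post-merging. Define $\mathcal{L}$ and $\mathcal{L}_c$ as the normalized Laplacians of $\mathcal{G}$ and $\mathcal{G}_c$; the spectral distance is
$$\mathrm{SD}(\mathcal{G}, \mathcal{G}_c) = \lVert \boldsymbol{\lambda} - \boldsymbol{\lambda}_l \rVert_1 = \sum_{i=1}^{N} \lvert \lambda_i - \lambda_{l,i} \rvert,$$
where $\mathcal{L}_l$ is the Laplacian of the coarsened graph lifted to the original node space, and $\boldsymbol{\lambda}$ and $\boldsymbol{\lambda}_l$ denote the eigenvalues of $\mathcal{L}$ and $\mathcal{L}_l$. The key theorem asserts that, under intra-cluster cosine similarity converging to $1$ and well-separated clusters, $\mathrm{SD}(\mathcal{G}, \mathcal{G}_{\text{PiToMe}}) \to 0$ as intra-cluster similarity increases, whereas random splitting (e.g., in ToMe) allows $\mathrm{SD}(\mathcal{G}, \mathcal{G}_{\text{ToMe}})$ to remain strictly positive with nonzero probability. The proofs leverage row-wise difference bounds, ordered merging analysis, and matrix perturbation (Weyl's inequality) to establish spectral consistency (Tran et al., 2024).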
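The spectral distance can be checked numerically on a toy graph. The sketch below is an illustration under assumptions, not the paper's procedure: it lifts the coarsened graph by giving every pair of original nodes the mean edge weight between their clusters, keeps self-loops in the similarity matrix, and uses hypothetical helper names (`normalized_laplacian_eigs`, `spectral_distance`).

```python
import numpy as np

def normalized_laplacian_eigs(w):
    """Sorted eigenvalues of I - D^{-1/2} W D^{-1/2} for a symmetric W."""
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    return np.sort(np.linalg.eigvalsh(lap))

def spectral_distance(w, assign):
    """L1 distance between the spectra of a graph and its lifted coarsening.

    assign[i] is the cluster (merged token) that node i belongs to.
    Assumption: lifting assigns each node pair the mean weight between
    their clusters.
    """
    assign = np.asarray(assign)
    n, n_c = len(w), int(assign.max()) + 1
    p = np.zeros((n, n_c))
    p[np.arange(n), assign] = 1.0                  # membership matrix
    size = p.sum(axis=0)
    w_c = p.T @ w @ p                              # summed weights between clusters
    w_l = p @ (w_c / np.outer(size, size)) @ p.T   # lifted to the original nodes
    return np.abs(normalized_laplacian_eigs(w) - normalized_laplacian_eigs(w_l)).sum()

# Two clusters of exact duplicates: tokens 0,1 and tokens 2,3.
tok = np.array([[1.0, 0.0], [1.0, 0.0], [0.6, 0.8], [0.6, 0.8]])
w = tok @ tok.T
sd_within = spectral_distance(w, [0, 0, 1, 1])     # merge duplicates
sd_across = spectral_distance(w, [0, 1, 0, 1])     # merge across clusters
```

Merging within clusters of identical tokens leaves the spectrum unchanged (distance near zero), while merging across the two clusters shifts one eigenvalue from 0.75 to 1 and yields a distance of 0.25, which mirrors the qualitative contrast drawn by the theorem.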
5. Hyperparameters and Ablative Insights
Key hyperparameters for PiToMe include the reduction rate $r$, the number of merged tokens per layer $k$, the margin schedule $m$ (increasing selectivity in deeper layers), the scaling coefficient $\alpha$, and the odd/even energy-based splitting for BSM pairing. Ablation experiments reveal that bypassing protection of low-energy tokens results in significantly higher accuracy degradation, and that random rather than sorted-energy splitting costs additional accuracy. The energy score outperforms alternatives based on CLS-attention or mean attention in recall and accuracy. Comparing fixed-$r$ with fixed-$k$ scheduling shows that fixed-$k$ is less FLOPs-efficient in early layers and yields additional accuracy loss (Tran et al., 2024).
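To see why the choice of schedule affects early-layer cost, the following sketch compares per-layer token counts when a fixed fraction of tokens is retained in each layer against removing a fixed number of tokens per layer. The function names and the numbers (64 tokens, two layers, ratio 0.5, 24 tokens per layer) are hypothetical values chosen so that both schedules end at the same count; they are not settings from the paper.

```python
import math

def schedule_fixed_ratio(n, r, num_layers):
    """Token count entering each layer when a fraction r is retained per layer."""
    counts = [n]
    for _ in range(num_layers):
        n = math.floor(r * n)
        counts.append(n)
    return counts

def schedule_fixed_count(n, k, num_layers):
    """Token count entering each layer when k tokens are removed per layer."""
    counts = [n]
    for _ in range(num_layers):
        n = max(n - k, 1)          # never drop below a single token
        counts.append(n)
    return counts

ratio_counts = schedule_fixed_ratio(64, 0.5, 2)   # [64, 32, 16]
count_counts = schedule_fixed_count(64, 24, 2)    # [64, 40, 16]
```

Both schedules finish with 16 tokens, but the fixed-ratio schedule already runs the second layer on 32 tokens instead of 40, so it removes more of the quadratic attention cost earlier in the network.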
6. Empirical Results Across Vision and Language Tasks
Comprehensive benchmarks demonstrate PiToMe's efficacy in both efficiency and task performance:
| Task (Backbone) | Baseline Score | PiToMe Score (Reduction/Drop) | Competing Methods |
|---|---|---|---|
| ImageNet-1k (ViT-MAE-H) | 86.9% (top-1) | 86.4% (−0.5%, ≈60% FLOPs) | ToMe: 85.9% (−1.0%), DiffRate: 85.9% (−1.0%) |
| Flickr30k (CLIP-ViT-L) | 572.24 (Rsum) | 567.58 (−0.45%) @ 38.6 GViT-FLOPs | ToMe: 564.10 (−1.5%), DiffRate: 564.03 (−1.6%) |
| VQA-v2 (LLaVA-7B) | 76.6% | 75.4% (−1.2%, −37% time) | ToMe: 75.2% (−1.4%), DiffRate: 72.0% (−6.0%) |
| IMDb (BERT, 12L) | 94.0% | 93.2% (−0.8%, ×1.9 eval, ×1.8 train) | ToMe: 93.3% (−0.7%), DiffRate: 92.4% (−1.6%) |
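The reported metrics largely favor PiToMe: in the settings above it has the smallest accuracy drop among the compared methods on every task except IMDb, where ToMe is 0.1 points ahead. The listed speedups reach ×1.9 for evaluation and ×1.8 for training (BERT on IMDb) and a 37% reduction in inference time (LLaVA-7B), with reported drops between 0.45% and 1.2% (Tran et al., 2024).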
7. Context and Significance in the Transformer Ecosystem
PiToMe addresses key deficits in prior BSM-based token merging by robustly preserving informative content through spectral analysis. The spectrum-consistency guarantee distinguishes it from approaches reliant on random or heuristic splitting. Empirical superiority is observed across modalities (vision, text, multimodal), backbones (ViT, CLIP, BERT, LLaVA), and datasets (ImageNet, Flickr30k, VQA-v2, IMDb). A plausible implication is broader applicability in domains where computational cost is a primary constraint, given the maintained task accuracy and theoretical guarantees. PiToMe’s methodological advances highlight the importance of preserving structural properties (via spectrum) in compression and acceleration schemes for deep models (Tran et al., 2024).