Multi-criteria Token Fusion in ViTs
- MCTF is an advanced token reduction method that integrates similarity, informativeness, and token size criteria to guide effective token fusion in Vision Transformers.
- It employs a composite attraction score and a predictive one-step-ahead attention mechanism to minimize information loss during token merging.
- Empirical results show that MCTF reduces self-attention FLOPs significantly while maintaining or slightly improving accuracy across various ViT backbones.
Multi-criteria Token Fusion (MCTF) is an advanced token reduction technique for Vision Transformers (ViTs) that addresses the efficiency-accuracy dilemma inherent in reducing the quadratic computational cost of self-attention. Unlike prior approaches that employ either token pruning (dropping tokens) or single-criterion fusing (typically similarity-based), MCTF integrates multiple semantically meaningful criteria to guide token fusion. It further incorporates a predictive “one-step-ahead” attention mechanism for robust informativeness estimation and introduces a consistency-based fine-tuning objective. Empirical evaluations demonstrate that MCTF consistently ameliorates the conventional speed-accuracy trade-off and is applicable across a range of ViT backbones (Lee et al., 2024).
1. Motivation and Background
ViTs exhibit $\mathcal{O}(N^2 d)$ complexity per layer, with $N$ tokens and embedding dimension $d$, principally due to the global self-attention mechanism. Token reduction methods (pruning or merging) are therefore necessary for deployment on compute-constrained platforms or for increasing data throughput. Standard single-criterion fusion strategies, such as those based solely on token similarity, frequently collapse semantically distinct regions or indiscriminately eliminate critical tokens. The resulting loss of representational fidelity yields a significant drop in downstream performance as token reduction becomes more aggressive. MCTF is designed to fuse tokens while minimizing information loss by jointly considering similarity, informativeness, and token size, and by anticipating information flow using the subsequent layer's attention map (Lee et al., 2024).
2. Multi-Criteria Token Attraction
MCTF determines token merge candidates via a composite "attraction score" $W_{i,j}$, synthesizing three distinct criteria: similarity, informativeness, and size of merged tokens. For input tokens $x_i, x_j$, the attraction score is:

$$W_{i,j} = \left(W^{\mathrm{sim}}_{i,j}\right)^{1/\tau_{\mathrm{sim}}} \cdot \left(W^{\mathrm{info}}_{i,j}\right)^{1/\tau_{\mathrm{info}}} \cdot \left(W^{\mathrm{size}}_{i,j}\right)^{1/\tau_{\mathrm{size}}}$$

where the temperatures $\tau_{\mathrm{sim}}$, $\tau_{\mathrm{info}}$, $\tau_{\mathrm{size}}$ modulate the influence of each criterion. The constituent weights are:
- Similarity: $W^{\mathrm{sim}}_{i,j} = \frac{1}{2}\left(\cos(x_i, x_j) + 1\right)$
Normalizes cosine similarity to $[0, 1]$, promoting the fusion of redundant tokens.
- Informativeness: $W^{\mathrm{info}}_{i,j} = (1 - a_i)(1 - a_j)$
where $\hat{A}$ is the attention map of the next ($l{+}1$)-th layer, and $a_j = \frac{1}{N}\sum_i \hat{A}_{i,j}$ indicates the average attention token $j$ receives. Tokens with low global influence (i.e., low $a_j$) are easier to merge.
- Size: $W^{\mathrm{size}}_{i,j} = \frac{1}{s_i \, s_j}$
where $s_i$ is the size of a super-token (initialized as 1), deterring excessive fusion that would form regionally dominant tokens (Lee et al., 2024).
The fusion preference is thus not dictated by a single metric, but by a temperature-weighted combination that balances redundancy, saliency, and spatial granularity.
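As a concrete illustration, the sketch below composes the three criteria into a single attraction matrix. It is a minimal sketch under the reconstructed forms above: the function name `attraction_scores`, the temperature defaults, and the exact informativeness/size expressions are assumptions for illustration, not a verbatim reproduction of the paper's implementation.

```python
import torch
import torch.nn.functional as F

def attraction_scores(x, a, s, tau_sim=1.0, tau_info=1.0, tau_size=1.0):
    """Compose the multi-criteria attraction matrix W (sketch).

    x: (N, d) tokens; a: (N,) informativeness in [0, 1]; s: (N,) super-token sizes.
    """
    x_hat = F.normalize(x, dim=-1)
    w_sim = (x_hat @ x_hat.T + 1.0) / 2.0              # cosine similarity mapped to [0, 1]
    w_info = (1.0 - a)[:, None] * (1.0 - a)[None, :]   # pairs of low-attention tokens attract
    w_size = 1.0 / (s[:, None] * s[None, :])           # discourage growing large super-tokens
    w = (w_sim ** (1.0 / tau_sim)
         * w_info ** (1.0 / tau_info)
         * w_size ** (1.0 / tau_size))                 # temperature-weighted product
    w.fill_diagonal_(0.0)                              # exclude self-loops
    return w
```

For instance, with the 197 tokens of a DeiT model this yields a $197 \times 197$ score matrix from which merge pairs are drawn.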
3. One-Step-Ahead Attention for Informativeness
Informative token preservation is critical to prevent the collapse of task-relevant features. Previous works relied on the current layer's attention matrix $A^{(l)}$ to estimate informativeness. MCTF observes that attention patterns can drift significantly across layers, causing information-critical tokens to be mistakenly fused. By pivoting to the "one-step-ahead" attention matrix $\hat{A}^{(l+1)}$—computed with the tokens as expected for the next layer, immediately after fusion—MCTF achieves a more accurate estimate of which tokens will have high impact in subsequent computation:

$$\hat{A}^{(l+1)} = \operatorname{softmax}\!\left(\frac{\hat{Q}^{(l+1)} \left(\hat{K}^{(l+1)}\right)^{\!\top}}{\sqrt{d}}\right)$$

where $\hat{Q}^{(l+1)}$ and $\hat{K}^{(l+1)}$ are the query and key projections after candidate token fusion. For practical efficiency, the attention values for fused tokens are approximated by weighted aggregation, circumventing the need for an additional quadratic attention calculation (Lee et al., 2024).
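The sketch below shows one way to derive the per-token informativeness $a_j$ from a one-step-ahead attention map, assuming a timm-style fused `qkv` linear layer in the next block. For clarity it computes the attention directly rather than using the paper's weighted-aggregation approximation; `next_qkv` is a hypothetical handle on the next block's projection.

```python
import torch

def one_step_ahead_informativeness(x, next_qkv, num_heads):
    """Mean attention each token would receive in the NEXT layer (sketch).

    x: (N, d) current tokens; next_qkv: the next block's fused qkv nn.Linear.
    Returns a: (N,) with a_j = mean over heads and queries of A_hat[:, :, j].
    """
    n, d = x.shape
    head_dim = d // num_heads
    qkv = next_qkv(x).reshape(n, 3, num_heads, head_dim)
    q, k = qkv[:, 0], qkv[:, 1]                            # (N, H, head_dim) each
    logits = torch.einsum('nhd,mhd->hnm', q, k) / head_dim ** 0.5
    attn = logits.softmax(dim=-1)                          # A_hat: (H, N, N)
    return attn.mean(dim=(0, 1))                           # average attention received per token
```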
4. Fusion Procedure and Algorithm
The fusion algorithm proceeds as follows for each layer:
- Randomly partition the input tokens into two subsets $\mathcal{A}$ and $\mathcal{B}$ of size $N/2$.
- Compute $W_{i,j}$ for all $(x_i, x_j) \in \mathcal{A} \times \mathcal{B}$ pairs using the multi-criteria formula (self-loops excluded).
- For each $x_i \in \mathcal{A}$, select its highest-scoring partner $x_j \in \mathcal{B}$; from these, identify the top-$r$ pairs globally.
- For each selected pair $(x_i, x_j)$, merge their tokens as a weighted average:

$$\hat{x} = \frac{a_i s_i \, x_i + a_j s_j \, x_j}{a_i s_i + a_j s_j}$$

where $a_i$ and $s_i$ are the informativeness and size weights of each constituent.
- Replace the originals in both subsets with the fused tokens, yielding $N - r$ tokens.
- Repeat bipartite matching in the reverse direction to balance matchings.
This soft bipartite matching ensures local optimality and bidirectionality, yielding a reliable reduction with limited approximation error. The matching cost is $\mathcal{O}(N^2)$ per layer—a negligible overhead relative to the savings from self-attention with a reduced token count (Lee et al., 2024).
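A minimal sketch of one matching-and-merge pass follows, under the reconstructed weighting above. For brevity it uses an alternating split as a stand-in for the random partition, ignores collisions where two sources pick the same destination, and omits the size/informativeness updates and the reverse-direction pass; `mctf_fuse` is a hypothetical name.

```python
import torch

def mctf_fuse(x, w, a, s, r):
    """One bipartite matching-and-merge pass (sketch).

    x: (N, d) tokens; w: (N, N) attraction scores; a: (N,) informativeness;
    s: (N,) super-token sizes; r: number of pairs to fuse. Returns (N - r, d).
    """
    n = x.shape[0]
    src, dst = torch.arange(0, n, 2), torch.arange(1, n, 2)   # stand-in for a random split
    scores = w[src][:, dst]                                   # cross-set attraction only
    best_val, best_idx = scores.max(dim=-1)                   # best partner per source token
    top = best_val.topk(r).indices                            # top-r pairs globally
    i, j = src[top], dst[best_idx[top]]
    wi = (a[i] * s[i]).unsqueeze(-1)                          # informativeness-and-size weight
    wj = (a[j] * s[j]).unsqueeze(-1)
    fused = (wi * x[i] + wj * x[j]) / (wi + wj)               # weighted-average merge
    out = x.clone()
    out[j] = fused                                            # collisions overwrite (brevity)
    keep = torch.ones(n, dtype=torch.bool)
    keep[i] = False                                           # drop merged source tokens
    return out[keep]                                          # N - r tokens remain
```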
5. Token Reduction Consistency in Training
To stabilize the model across various token reduction ratios at inference, MCTF introduces a token-reduction consistency (TRC) loss during fine-tuning. Let $\hat{y}_r$ be the model output for reduction level $r$:

$$\mathcal{L}_{\mathrm{TRC}} = \mathrm{CE}\!\left(\hat{y}_{r}, y\right) + \mathrm{CE}\!\left(\hat{y}_{r'}, y\right) + \left\lVert z_{r} - z_{r'} \right\rVert_2^2$$

where each reduction level is sampled from $\{0, 1, \dots, r_{\max}\}$ per batch, $z_r$ denotes the class token embedding prior to the final prediction head, and CE is the cross-entropy loss. This regularizes representation alignment under different reduction intensities, bolstering performance consistency and robustness (Lee et al., 2024).
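A hedged sketch of such a consistency objective follows. The sampling range, the unit loss weighting, and the `model(images, r=...)` interface returning both logits and the class-token embedding are assumptions for illustration, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def trc_loss(model, images, labels, r_max=16):
    """Token-reduction-consistency fine-tuning objective (sketch).

    Assumes a hypothetical interface model(images, r=...) that returns
    (logits, cls_embedding) for a given per-layer reduction number r.
    """
    r1, r2 = torch.randint(0, r_max + 1, (2,)).tolist()  # reduction levels for this batch
    logits1, z1 = model(images, r=r1)
    logits2, z2 = model(images, r=r2)
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    consistency = (z1 - z2).pow(2).sum(dim=-1).mean()    # align class tokens across levels
    return ce + consistency
```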
6. Computational Complexity and Empirical Performance
MCTF reduces self-attention FLOPs from $\mathcal{O}(N^2 d)$ to $\mathcal{O}((N - r)^2 d)$ per layer, with bipartite matching accounting for only $\mathcal{O}(N^2)$ additional operations. Empirical results on ImageNet-1K with different ViT architectures are as follows:
| Backbone | Baseline Accuracy (%) | MCTF Accuracy (%) | Baseline GFLOPs | MCTF GFLOPs | Speedup (%) | Accuracy Gain (%) |
|---|---|---|---|---|---|---|
| DeiT-Tiny | 72.2 | 72.7 | 1.20 | 0.71 | 43.6 | +0.5 |
| DeiT-Small | 79.8 | 80.1 | 4.60 | 2.60 | 43.5 | +0.3 |
| T2T-ViT-t14 | 81.7 | 81.8 | 6.10 | 4.19 | 31.4 | +0.1 |
| LV-ViT-Small | 83.3 | 83.4 | 6.60 | 4.16 | 36.0 | +0.1 |
MCTF thus not only preserves accuracy under large token reductions but also, in the DeiT family, yields modest accuracy improvements despite a ≈44% reduction in FLOPs (Lee et al., 2024).
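To make the FLOPs arithmetic tangible, the rough estimate below counts only the quadratic attention products as the token count shrinks by $r$ per layer. The settings approximate DeiT-Tiny and the constants are illustrative; the attention-only reduction it prints is larger than the whole-network GFLOPs figures in the table, since linear-cost components (projections, MLPs) shrink more slowly.

```python
def attention_flops(num_tokens, dim, layers, r=0):
    """Rough per-image FLOPs of the QK^T and AV products alone (sketch)."""
    total, n = 0, num_tokens
    for _ in range(layers):
        total += 4 * n * n * dim     # ~2*N^2*d multiply-adds each for QK^T and AV
        n = max(n - r, 1)            # r tokens fused away per layer
    return total

# Illustrative DeiT-Tiny-like settings: 197 tokens, dim 192, 12 layers, r = 16
base = attention_flops(197, 192, 12)
mctf = attention_flops(197, 192, 12, r=16)
print(f"attention-only FLOPs reduction: {1 - mctf / base:.1%}")  # ~61.5%
```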
7. Relation to and Distinction from Other Token Fusion Techniques
Alternative token reduction strategies, such as Token Fusion (ToFu) (Kim et al., 2023), hybridize token pruning and merging based on functional linearity and norm-preserving "MLERP" merging. However, ToFu's decision pipeline is limited to two axes (a functional-linearity threshold and optional sensitivity), whereas MCTF leverages a joint attraction over similarity, informativeness, and token size, and anchors informativeness in predictive one-step-ahead attention rather than only the current layer's. Whereas MCTF relies on token-reduction-consistency fine-tuning, ToFu targets training-free deployment, though fine-tuning is still recommended for optimal accuracy. Both approaches achieve substantial throughput gains without increasing parameter count. A plausible implication is that MCTF's explicit multi-criteria fusion and predictive informativeness estimation may offer more robust representational preservation under aggressive token count reduction (Lee et al., 2024; Kim et al., 2023).