Abstract: We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
The paper introduces Token Merging (ToMe) as a technique that reduces redundant tokens to double throughput with only a 0.2–0.3% accuracy drop.
It employs a bipartite soft matching algorithm and proportional attention to merge tokens efficiently during both inference and training.
Empirical results across images, video, and audio confirm that ToMe significantly speeds up ViT models, offering a practical balance between speed and accuracy.
The paper "Token Merging: Your ViT But Faster" (2210.09461) introduces Token Merging (ToMe), a method for increasing the throughput of Vision Transformer (ViT) models without retraining. ToMe combines similar tokens in a transformer using a matching algorithm. The paper demonstrates that ToMe can double the throughput of ViT-L at 512 and ViT-H at 518 on images and increase the throughput of ViT-L on video by 2.2x, with an accuracy drop of 0.2-0.3\% in each case. The paper also shows that ToMe can be applied during training, improving training speed by up to 2x for Masked Autoencoder (MAE) fine-tuning on video.
Here's a more detailed breakdown:
Introduction
The paper addresses the efficiency gap between vanilla ViTs and domain-specific transformer hybrids like Swin, MViT, and LeViT. While vanilla ViTs offer simplicity, support for self-supervised pre-training (e.g., MAE), and cross-modality applicability, their computational demands are high. The paper positions ToMe as a way to merge, rather than prune, tokens at runtime, yielding faster models without the drawbacks of token pruning, such as information loss, the need for re-training, and the difficulty of batched inference.
Related Work
The paper discusses existing approaches to improve transformer efficiency, including faster attention mechanisms, pruning heads or features, and domain-specific modules. It distinguishes ToMe from token reduction methods, which require training, and highlights its ability to be applied during both inference and training. The paper also addresses the limited work on combining tokens, noting that previous approaches have not offered a reasonable speed-accuracy trade-off without training.
Token Merging
The paper details ToMe, which merges tokens in each block of a transformer, reducing the token count by r per block. Over the network's L blocks this gradually removes rL tokens, and the value of r sets the speed-accuracy trade-off. The token merging step is applied between the attention and Multilayer Perceptron (MLP) branches of each transformer block.
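As a rough illustration (numbers chosen here purely for concreteness): a 24-block ViT-L/16 on 224x224 images starts with 197 tokens (196 patches plus the class token), so a setting of r = 8 merges 8 x 24 = 192 tokens by the final block, leaving only a handful of tokens in the last layers.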
To determine token similarity, the paper uses a dot product similarity metric (cosine similarity in practice) over the keys (K) produced by each block's QKV self-attention. Bipartite soft matching is then used to decide efficiently which tokens to merge, avoiding iterative clustering and keeping the changes gradual. The algorithm partitions the tokens into two sets, draws an edge from each token in one set to its most similar token in the other, keeps the r highest-similarity edges, and merges the connected tokens by averaging their features.
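To make the matching step concrete, here is a minimal PyTorch-style sketch of one merging step under the description above. The function name, tensor shapes, and the simplification that at most one source token merges into each destination are assumptions for illustration; the released implementation additionally protects the class token and handles collisions between sources, which this sketch omits.

```python
import torch

def bipartite_soft_matching(x, metric, size, r):
    """Sketch of a ToMe-style merging step (illustrative, not the authors' code).

    x:      token features,        shape [B, N, C]
    metric: attention keys (K),    shape [B, N, C], used to measure similarity
    size:   current token sizes,   shape [B, N, 1] (patches each token represents)
    r:      number of tokens to remove in this block
    """
    # Normalize the keys so the dot product becomes cosine similarity.
    metric = metric / metric.norm(dim=-1, keepdim=True)

    # Alternately partition tokens into two sets, A and B.
    a, b = metric[:, ::2], metric[:, 1::2]
    scores = a @ b.transpose(-1, -2)            # similarity of every A token to every B token

    # Each A token proposes an edge to its most similar B token; keep the r best edges.
    node_max, node_idx = scores.max(dim=-1)
    edge_order = node_max.argsort(dim=-1, descending=True)
    src_idx = edge_order[:, :r]                 # A tokens to merge away
    unm_idx = edge_order[:, r:]                 # A tokens that stay unmerged
    dst_idx = node_idx.gather(-1, src_idx)      # their targets in B

    xa, xb = x[:, ::2], x[:, 1::2]
    sa, sb = size[:, ::2], size[:, 1::2]
    B_, _, C = x.shape

    # Merge by a size-weighted average: dst := (dst*s_dst + src*s_src) / (s_dst + s_src).
    src_x = xa.gather(1, src_idx.unsqueeze(-1).expand(B_, r, C))
    src_s = sa.gather(1, src_idx.unsqueeze(-1))
    dst_x = xb.gather(1, dst_idx.unsqueeze(-1).expand(B_, r, C))
    dst_s = sb.gather(1, dst_idx.unsqueeze(-1))
    merged_x = (dst_x * dst_s + src_x * src_s) / (dst_s + src_s)
    merged_s = dst_s + src_s

    # Write merged tokens back into B (assumes one source per destination for simplicity).
    xb = xb.scatter(1, dst_idx.unsqueeze(-1).expand(B_, r, C), merged_x)
    sb = sb.scatter(1, dst_idx.unsqueeze(-1), merged_s)

    # Keep the unmerged A tokens plus all (possibly updated) B tokens: N - r tokens remain.
    keep_x = xa.gather(1, unm_idx.unsqueeze(-1).expand(B_, unm_idx.shape[1], C))
    keep_s = sa.gather(1, unm_idx.unsqueeze(-1))
    return torch.cat([keep_x, xb], dim=1), torch.cat([keep_s, sb], dim=1)
```

Because the two sets are fixed and each token proposes only one edge, the whole step is a handful of sorts and gathers rather than an iterative clustering loop, which is what keeps the matching roughly as fast as pruning.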
The paper also introduces proportional attention, which adjusts the softmax attention computation using the token size s, i.e., the number of patches each token represents. The formula is:
softmax(QKᵀ / √d + log s)
where:
Q is the query matrix
K is the key matrix
d is the dimension of the keys (so √d is the usual attention scaling factor)
s is the row vector containing the size of each token
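As a rough illustration of the formula above, the following PyTorch-style snippet adds log s to the attention logits before the softmax, so a token standing in for s patches effectively counts s times. The function name and tensor layout are assumptions for this sketch, not the paper's actual code.

```python
import torch

def proportional_attention(q, k, v, size):
    """Sketch of proportional attention: bias attention logits by log(token size).

    q, k, v: [B, heads, N, d] query/key/value tensors
    size:    [B, N] number of patches each token currently represents
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-1, -2) / d ** 0.5     # standard scaled dot-product logits
    logits = logits + size.log()[:, None, None, :]  # bias along the key axis by log(s)
    attn = logits.softmax(dim=-1)
    return attn @ v
```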
When training with ToMe, the method is treated as a pooling operation, and gradients are backpropagated through the merged tokens.
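Since the merge is an ordinary differentiable (weighted) average, this amounts to letting autograd handle the pooling. The toy snippet below, which substitutes a naive pairwise average for ToMe's actual merge, illustrates that gradients reach every original token without any custom backward pass.

```python
import torch

# Toy stand-in for ToMe's merge: a plain pairwise average over neighboring tokens.
x = torch.randn(2, 8, 16, requires_grad=True)   # [batch, tokens, channels], arbitrary sizes
merged = 0.5 * (x[:, ::2] + x[:, 1::2])         # differentiable "pooling" step
merged.sum().backward()
print(x.grad.shape)                             # torch.Size([2, 8, 16]): every token gets a gradient
```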
Image Experiments
The paper presents experiments on ImageNet-1k using ViT models trained with AugReg, MAE, SWAG, and DeiT. Throughput is measured during inference on a V100 GPU. Ablation studies validate design choices, including using the attention keys (K) for token similarity, cosine similarity as the distance function, and weighted averaging for combining tokens.
Key findings include:
Using attention keys (K) for merging is more accurate than using token features (X).
Cosine similarity is the best choice for speed and accuracy in measuring token distance.
Proportional attention is necessary for supervised models but not for MAE models.
Bipartite matching achieves a balance between accuracy and speed compared to pruning or clustering algorithms.
A constant merging schedule is close to optimal compared to randomly sampled schedules.
The paper applies ToMe to several state-of-the-art off-the-shelf ViT models, varying r to construct throughput vs. accuracy curves. Results show that a constant schedule can double the throughput, with larger models experiencing minimal accuracy drops.
Comparison to Other Works
The paper compares ToMe to state-of-the-art models trained on ImageNet-1k, including EfficientNet, Swin, CSWin, and MViTv2. The results indicate that ToMe makes ViT models comparable in speed to models one tier smaller without shrinking the feature dimension. The paper also compares ToMe to token pruning works, showing competitive accuracy while exceeding the throughput of existing token pruning methods.
Visualizations
The paper includes visualizations of merged tokens, showing that ToMe merges parts of objects, resembling part segmentation.
Video Experiments
The paper applies ToMe to Spatiotemporal MAE for video classification on Kinetics-400. Results show that ToMe can match the throughput of Swin-B while outperforming MViTv2-L, even without training.
Key findings include:
ToMe can double throughput with a negligible accuracy drop.
The method cuts training time in half.
The same object or part is merged into one across multiple frames of video.
Audio Experiments
The paper presents experiments on an Audio MAE, where a spectrogram of the audio signal is rasterized and fed into a ViT model, evaluated on AudioSet-2M. Results show that ToMe can double the throughput with a minimal mAP drop.
Conclusion
The paper concludes by highlighting ToMe's ability to increase the throughput of ViT models by merging tokens, exploiting input redundancy across modalities. It suggests that ToMe could be combined with other methods or applied to tasks like segmentation and could be a component of training large models.