- The paper shows that the Token Compensator (ToCom), a lightweight plug-in module, recovers much of the semantic segmentation accuracy lost to token compression, in which tokens are merged before the FFNs and unmerged afterward.
- It pairs a DeiT-B (384) encoder with a Mask Transformer decoder on the ADE20k dataset, achieving consistent mIoU improvements across token sampling rates.
- The approach maintains computational efficiency (GFLOPs) while boosting performance, offering a promising strategy for real-time and resource-limited applications.
An Examination of Token Compression for Dense Prediction Tasks
The paper authored by Shibo Jie, Yehui Tang, and Zhi-Hong Deng investigates the Token Compensator (ToCom), a plug-in module that offsets the accuracy drop caused by token compression, and evaluates it on dense prediction tasks such as semantic segmentation. This summary covers the methodology, experimental setup, results, and potential implications of their findings.
Token Compression and Semantic Segmentation
In dense prediction tasks, a prediction is needed for every token, which imposes substantial computational demands and means compressed tokens cannot simply be discarded. The compression scheme studied here therefore merges tokens before the Feed-Forward Network (FFN) layers and unmerges them afterward, so the FFN operates on a reduced token set while the full token grid is preserved for the decoder, cutting computation; ToCom is then plugged into the compressed model to recover the accuracy this costs.
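To make the merge/unmerge step concrete, here is a minimal sketch, not the authors' implementation: tokens are split into two alternating sets, the r most redundant tokens of one set are averaged into their nearest neighbours in the other, the FFN runs on the reduced set, and the merged tokens then copy their destination's output. The function name `merge_unmerge_ffn` and the alternating-split pairing heuristic are illustrative choices; real token-merging code such as ToMe's bipartite soft matching is more careful about how pairs are chosen and weighted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def merge_unmerge_ffn(x: torch.Tensor, ffn: nn.Module, r: int) -> torch.Tensor:
    """Run a token-wise FFN on a reduced token set.

    The r most redundant tokens are averaged into their nearest neighbours
    before the FFN and copy that neighbour's output afterward, so the result
    keeps the full (B, N, C) shape needed for dense prediction.
    """
    B, N, C = x.shape
    if r <= 0:
        return ffn(x)

    # Split tokens into two alternating sets; tokens in `b` may merge into `a`.
    a, b = x[:, ::2], x[:, 1::2]
    sim = F.normalize(b, dim=-1) @ F.normalize(a, dim=-1).transpose(1, 2)
    score, dst = sim.max(dim=-1)            # best match in `a` for each `b` token
    order = score.argsort(dim=-1, descending=True)
    merged_idx, kept_idx = order[:, :r], order[:, r:]  # merge the most similar `b` tokens

    def expand(idx):
        return idx.unsqueeze(-1).expand(-1, -1, C)

    # Average each merged `b` token into its destination `a` token.
    src = torch.gather(b, 1, expand(merged_idx))
    dst_idx = expand(torch.gather(dst, 1, merged_idx))
    a_merged = a.clone()
    a_merged.scatter_reduce_(1, dst_idx, src, reduce="mean", include_self=True)

    reduced = torch.cat([a_merged, torch.gather(b, 1, expand(kept_idx))], dim=1)
    out = ffn(reduced)                      # FFN now sees N - r tokens

    # Unmerge: merged tokens reuse the output of the token they were merged into.
    out_a, out_b_kept = out[:, : a.shape[1]], out[:, a.shape[1]:]
    out_b = torch.empty(B, b.shape[1], C, device=x.device, dtype=out.dtype)
    out_b.scatter_(1, expand(kept_idx), out_b_kept)
    out_b.scatter_(1, expand(merged_idx), torch.gather(out_a, 1, dst_idx))

    y = torch.empty(B, N, C, device=x.device, dtype=out.dtype)
    y[:, ::2] = out_a
    y[:, 1::2] = out_b
    return y
```

The essential property is that input and output shapes match, which is what lets the full token grid reach the decoder even though the FFN only processes N - r tokens.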
Experimental Methodology
The authors conducted experiments on the ADE20k dataset, using a DeiT-B pre-trained at 384x384 resolution as the encoder and the Mask Transformer from the Segmenter model as the decoder. The token sampling rate r controls the compression: 32r tokens are merged before the FFNs, with r = 0 as the uncompressed source setting and larger r as the compressed targets. ADE20k images were processed at 512x512. The following points summarize the experimental details, followed by a back-of-the-envelope sketch of the resulting token counts:
- Dataset: ADE20k resized to 512x512
- Training: Reduced to 2 epochs
- Evaluation: Single-scale (SS) evaluation at r = 0 (source) and r = 8, 12, 16 (target)
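Reading the setup in terms of token counts helps. The short calculation below is not taken from the paper; it assumes DeiT-B's 16x16 patches and that 32r tokens are merged before an FFN, ignoring the class token.

```python
# Back-of-the-envelope arithmetic (not from the paper): token counts at 512x512
# with 16x16 patches, assuming 32*r tokens are merged before an FFN.
PATCH = 16
tokens = (512 // PATCH) ** 2            # 32 x 32 = 1024 patch tokens at 512x512

for r in [0, 8, 12, 16]:
    remaining = tokens - 32 * r         # tokens actually processed by the FFN
    print(f"r={r:2d}: {remaining:4d}/{tokens} tokens reach the FFN "
          f"({100 * remaining / tokens:.0f}% of the per-FFN cost)")
```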
The results, reproduced below from the paper's segmentation table, indicate the benefits of ToCom on the ADE20k validation set. The key metric is mean Intersection over Union (mIoU) under single-scale (SS) evaluation; GFLOPs measure computational cost.
| r | mIoU (SS) | GFLOPs |
|--|--|--|
| 0 (source) | 48.7 | 106.2 |
| 8 (baseline) | 48.0 | 91.8 |
| 8 (+ToCom) | 48.3 | 91.8 |
| 12 (baseline) | 46.4 | 84.5 |
| 12 (+ToCom) | 47.2 | 84.5 |
| 16 (baseline) | 41.3 | 77.3 |
| 16 (+ToCom) | 43.4 | 77.3 |
The results show that ToCom improves mIoU over the compressed baseline at every target r while adding no GFLOPs, and the gain widens as compression becomes more aggressive.
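For reference, the small script below (not from the paper) recomputes the per-rate gains and the compute saved relative to the uncompressed source model directly from the table above.

```python
# Values copied from the table: target r -> (baseline mIoU, +ToCom mIoU, GFLOPs).
results = {
    8:  (48.0, 48.3, 91.8),
    12: (46.4, 47.2, 84.5),
    16: (41.3, 43.4, 77.3),
}
SOURCE_MIOU, SOURCE_GFLOPS = 48.7, 106.2   # r = 0, no compression

for r, (base, tocom, gflops) in results.items():
    gain = tocom - base
    saved = 100 * (1 - gflops / SOURCE_GFLOPS)
    print(f"r={r:2d}: ToCom adds {gain:+.1f} mIoU at {saved:.0f}% fewer GFLOPs than the source")
```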
Implementation and Hyperparameter Tuning
The authors describe their implementation in detail: experiments are built on PyTorch and the timm library, run on GPUs, and hyperparameters are tuned and reported per dataset. This level of detail supports reproducibility of the results.
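To give a flavour of that tooling, the snippet below loads the encoder with timm. The model identifier `deit_base_patch16_384` is a real timm name; the commented-out `attach_tocom` call is a purely hypothetical placeholder for the compression/compensator plug-in, which is not part of timm and not shown in this summary.

```python
import timm
import torch

# DeiT-B pre-trained at 384x384, the encoder used in the segmentation experiments.
encoder = timm.create_model("deit_base_patch16_384", pretrained=True)

# Hypothetical plug-in point (not a real API): wrap the blocks so that tokens
# are merged before each FFN and unmerged afterward, as sketched earlier.
# encoder = attach_tocom(encoder, r=8)

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = encoder.to(device).eval()

with torch.no_grad():
    # 384x384 here for simplicity; the segmentation experiments run at 512x512.
    dummy = torch.randn(1, 3, 384, 384, device=device)
    tokens = encoder.forward_features(dummy)  # token features (incl. class token)
print(tokens.shape)
```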
Implications and Future Directions
The findings indicate that ToCom makes token compression viable for dense prediction tasks without a significant performance penalty. This is promising for real-time applications and systems with limited computational resources. Extending ToCom to other vision tasks beyond semantic segmentation could further broaden its practical applicability.
Future research could focus on:
- Adapting ToCom for other dense prediction tasks like instance segmentation and object detection.
- Experimenting with different token compression and decompression strategies.
- Evaluating the benefits of ToCom in a broader range of models and architectures.
Conclusion
Overall, the paper presents a compelling case for the efficacy of token compression in dense prediction tasks. Through rigorous experimentation and detailed performance analysis, the authors demonstrate ToCom's potential to improve computational efficiency while preserving or enhancing model performance. This technique offers a promising avenue for further research and development in the field of computer vision.