Accelerating Transformers with Spectrum-Preserving Token Merging (2405.16148v2)

Published 25 May 2024 in cs.LG

Abstract: Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top-k most similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered low-energy and preserved. Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (0.5% average performance drop for ViT-MAE-H compared to 2.6% for baselines), image-text retrieval (0.3% average performance drop for CLIP on Flickr30k compared to 4.5% for other methods), and analogously on visual question answering with LLaVa-7B. Furthermore, PiToMe is theoretically shown to preserve the intrinsic spectral properties of the original token space under mild conditions.
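To make the energy-score idea concrete, here is a minimal, hypothetical PyTorch sketch: each token's energy is taken as the average of a margin-thresholded, ELU-weighted cosine similarity to all other tokens, so tokens sitting inside large clusters of near-duplicates score high and isolated tokens score low; the highest-energy tokens are then averaged into their most similar kept token. The margin value, the ELU weighting, and the merge-by-nearest-kept-token step are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of energy-score-based token merging (not the exact PiToMe
# algorithm): high-energy tokens (members of large, similar clusters) are merged,
# low-energy (isolated) tokens are preserved.
import torch
import torch.nn.functional as F


def energy_scores(tokens: torch.Tensor, margin: float = 0.9) -> torch.Tensor:
    """tokens: (N, D) embeddings -> (N,) energy scores (assumed formulation)."""
    x = F.normalize(tokens, dim=-1)           # unit-norm rows for cosine similarity
    sim = x @ x.t()                           # (N, N) pairwise cosine similarity
    # Keep similarities above the margin, softly damp the rest (ELU-style weighting).
    weighted = torch.where(sim > margin, sim, F.elu(sim - margin))
    return weighted.mean(dim=-1)              # average interaction = "energy"


def merge_high_energy(tokens: torch.Tensor, r: int, margin: float = 0.9) -> torch.Tensor:
    """Merge the r highest-energy tokens into their most similar kept token."""
    e = energy_scores(tokens, margin)
    order = e.argsort(descending=True)
    merge_idx, keep_idx = order[:r], order[r:]

    x = F.normalize(tokens, dim=-1)
    sim = x[merge_idx] @ x[keep_idx].t()      # (r, N - r) similarities to kept tokens
    target = sim.argmax(dim=-1)               # nearest kept token for each merged one

    out = tokens[keep_idx].clone()
    counts = torch.ones(len(keep_idx), 1)
    out.index_add_(0, target, tokens[merge_idx])      # accumulate merged tokens
    counts.index_add_(0, target, torch.ones(r, 1))
    return out / counts                               # average each merged group


if __name__ == "__main__":
    toks = torch.randn(197, 768)              # e.g. ViT-B patch tokens
    print(merge_high_energy(toks, r=16).shape)  # torch.Size([181, 768])
```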
