
TaylorShift: Shifting the Complexity of Self-Attention from Squared to Linear (and Back) using Taylor-Softmax (2403.02920v2)

Published 5 Mar 2024 in cs.LG and cs.AI

Abstract: The quadratic complexity of the attention mechanism represents one of the biggest hurdles for processing long sequences using Transformers. Current methods, relying on sparse representations or stateful recurrence, sacrifice token-to-token interactions, which ultimately leads to compromises in performance. This paper introduces TaylorShift, a novel reformulation of the Taylor softmax that enables computing full token-to-token interactions in linear time and space. We analytically determine the crossover points where employing TaylorShift becomes more efficient than traditional attention, aligning closely with empirical measurements. Specifically, our findings demonstrate that TaylorShift enhances memory efficiency for sequences as short as 800 tokens and accelerates inference for inputs of approximately 1700 tokens and beyond. For shorter sequences, TaylorShift scales comparably with the vanilla attention. Furthermore, a classification benchmark across five tasks involving long sequences reveals no degradation in accuracy when employing Transformers equipped with TaylorShift. For reproducibility, we provide access to our code under https://github.com/tobna/TaylorShift.
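
The core idea can be sketched in a few lines of NumPy. The snippet below is a minimal illustration, not the authors' implementation (see the linked repository for that): it replaces exp(x) in the softmax with its second-order Taylor polynomial 1 + x + x^2/2, which is strictly positive, and exploits the fact that the resulting attention weights factor through sums over the keys. The function names, the 1/sqrt(d) scaling, and the tensor shapes are assumptions made for illustration; the paper's exact normalization scheme and its mechanism for switching back to the quadratic form on short sequences are omitted.

```python
import numpy as np


def taylor_attention_quadratic(Q, K, V):
    """Reference O(N^2) attention with Taylor-softmax weights f(x) = 1 + x + x^2/2.

    f(x) = ((x + 1)^2 + 1) / 2 is strictly positive, so normalizing the
    weights over the key axis is always well defined.
    """
    d = Q.shape[1]
    S = (Q @ K.T) / np.sqrt(d)        # (N, N) score matrix
    W = 1.0 + S + 0.5 * S**2          # 2nd-order Taylor expansion of exp(S)
    return (W / W.sum(axis=1, keepdims=True)) @ V


def taylor_attention_linear(Q, K, V):
    """Same output without ever forming the (N, N) matrix.

    Each Taylor term factors through sums over the keys, so the cost is
    O(N * d^2 * d_v): linear in the sequence length N, at the price of a
    d^2 factor, which is the source of the efficiency crossover points
    the abstract mentions.
    """
    N, d = Q.shape
    Qs = Q / np.sqrt(d)                            # fold the 1/sqrt(d) scaling into the queries

    # Key/value summaries, computed once and reused by every query.
    s0_v = V.sum(axis=0)                           # (d_v,)      zeroth-order term
    s1_v = K.T @ V                                 # (d, d_v)    first-order term
    s2_v = np.einsum('nd,ne,nf->def', K, K, V)     # (d, d, d_v) second-order term
    s1_n = K.sum(axis=0)                           # (d,)        normalizer, 1st order
    s2_n = K.T @ K                                 # (d, d)      normalizer, 2nd order

    num = (s0_v[None, :]
           + Qs @ s1_v
           + 0.5 * np.einsum('nd,ne,def->nf', Qs, Qs, s2_v))   # (N, d_v)
    den = (N
           + Qs @ s1_n
           + 0.5 * np.einsum('nd,ne,de->n', Qs, Qs, s2_n))     # (N,)
    return num / den[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(128, 16))
    K = rng.normal(size=(128, 16))
    V = rng.normal(size=(128, 32))
    assert np.allclose(taylor_attention_quadratic(Q, K, V),
                       taylor_attention_linear(Q, K, V))
```

Both functions produce identical outputs; the difference is purely in cost. The quadratic form scales as O(N^2 * d), while the factorized form scales as O(N * d^2 * d_v), so the linear variant only pays off beyond a sequence-length threshold, which is the crossover behavior the abstract quantifies (around 800 tokens for memory and roughly 1700 tokens for inference speed in the authors' measurements).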
