CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers (2402.04239v1)

Published 6 Feb 2024 in cs.LG

Abstract: The Transformer architecture has been shown to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, an inherently expensive operation with quadratic computational complexity: memory usage and compute time increase quadratically with the length of the input sequence, limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST) to optimize the attention computation and achieve efficient transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, which is used to cluster the input sequence and generate novel cluster summaries. The self-attention from within each cluster is then combined with the cluster summaries of other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from $O(N^2)$ to $O(\alpha N)$, where $N$ is the sequence length and $\alpha$ is a constant determined by the number of clusters and the number of samples per cluster. We show that CAST performs better than or comparably to baseline Transformers on long-range sequence modeling tasks, while also achieving better time and memory efficiency than other efficient transformers.
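The abstract's description suggests a natural implementation outline: project tokens to queries/keys/values, score each token against a small set of learnable surrogate tokens to form the cluster affinity matrix, run self-attention only within each cluster, and then mix in summaries of the other clusters. The PyTorch sketch below illustrates that flow under explicit assumptions (hard top-k token-to-cluster routing, mean-pooled cluster summaries, and a simple broadcast for cross-cluster mixing); it is not the authors' implementation, and every name in it (`CASTAttentionSketch`, `n_clusters`, `cluster_size`) is hypothetical.

```python
# Minimal sketch of a CAST-style attention layer, based only on the abstract.
import torch
import torch.nn as nn


class CASTAttentionSketch(nn.Module):
    def __init__(self, dim, n_clusters, cluster_size):
        super().__init__()
        self.n_clusters = n_clusters
        self.cluster_size = cluster_size
        # Learnable surrogate tokens, one per cluster (assumed parameterization).
        self.surrogates = nn.Parameter(torch.randn(n_clusters, dim))
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Cluster affinity matrix between tokens and surrogate tokens: (B, N, C).
        affinity = torch.einsum("bnd,cd->bnc", x, self.surrogates)

        # Hard routing: each cluster takes its top-k tokens by affinity
        # (one plausible reading of "cluster the input sequence").
        top = affinity.transpose(1, 2).topk(self.cluster_size, dim=-1).indices  # (B, C, S)

        def gather(t):
            # Gather per-cluster token groups: (B, C, S, D).
            idx = top.unsqueeze(-1).expand(-1, -1, -1, D)
            return t.unsqueeze(1).expand(-1, self.n_clusters, -1, -1).gather(2, idx)

        qc, kc, vc = gather(q), gather(k), gather(v)

        # Self-attention within each cluster: O(C * S^2) instead of O(N^2).
        attn = torch.softmax(qc @ kc.transpose(-1, -2) / D ** 0.5, dim=-1)
        intra = attn @ vc  # (B, C, S, D)

        # Cluster summaries (here: mean-pooled) provide cross-cluster information
        # flow; the abstract does not specify the exact summary/mixing rule.
        summaries = intra.mean(dim=2)                     # (B, C, D)
        global_ctx = summaries.mean(dim=1, keepdim=True)  # (B, 1, D)
        mixed = intra + global_ctx.unsqueeze(1)           # broadcast over clusters and slots

        # Scatter cluster outputs back to their sequence positions (tokens selected
        # by several clusters keep the last write -- a sketch-level simplification).
        out = torch.zeros_like(x)
        out.scatter_(1, top.reshape(B, -1, 1).expand(-1, -1, D), mixed.reshape(B, -1, D))
        return self.out(out)


# Tiny shape check of the sketch.
layer = CASTAttentionSketch(dim=64, n_clusters=8, cluster_size=32)
y = layer(torch.randn(2, 256, 64))
print(y.shape)  # torch.Size([2, 256, 64])
```

With a fixed cluster size S and C = N / S clusters, each cluster's attention costs O(S^2), so the total is O(C * S^2) = O(S * N), consistent with the abstract's $O(\alpha N)$ where $\alpha$ depends on the clustering parameters.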
