Gated Linear Attention Transformers with Hardware-Efficient Training (2312.06635v6)

Published 11 Dec 2023 in cs.LG and cs.CL

Abstract: Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a standalone layer even on short sequence lengths (e.g., 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer (Touvron et al., 2023) as well as recent linear-time-inference baselines such as RetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale language modeling experiments. The GLA Transformer is especially effective at length generalization, enabling a model trained on 2K to generalize to sequences longer than 20K without significant perplexity degradations. For training speed, the GLA Transformer has higher throughput than a similarly-sized Mamba model.
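To make the RNN view described in the abstract concrete, here is a minimal NumPy sketch of unnormalized linear attention computed recurrently with a matrix-valued state. The function name and shapes are illustrative only, and the feature map and normalizer used in practical linear attention are omitted; this is not the paper's implementation.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Recurrent (RNN-style) view of unnormalized linear attention.

    Q, K have shape (T, d_k); V has shape (T, d_v). The hidden state S is a
    (d_k, d_v) matrix updated once per token, so inference is linear in the
    sequence length T and the memory footprint is independent of T.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])  # accumulate the key-value outer product
        out[t] = Q[t] @ S             # read the state out with the query
    return out
```

Because the state has a fixed (d_k, d_v) size, the per-token cost of generation stays constant no matter how long the sequence grows, which is the source of the linear-time inference claim.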

Overview

The paper "Gated Linear Attention Transformers with Hardware-Efficient Training" presents advancements in Transformer architectures that leverage linear attention mechanisms to improve computational efficiency, particularly in hardware-limited environments. The proposed Gated Linear Attention (GLA) Transformer introduces a hardware-efficient algorithm for linear attention that strategically manages memory movement and parallelizability. This approach, known as FlashLinearAttention, is benchmarked against softmax attention-based Transformers and other linear attention variants, showing competitive performance on both training speed and model accuracy.

Contributions

The primary contributions of the paper include:

  1. FlashLinearAttention Algorithm: The paper introduces FlashLinearAttention, a novel linear attention algorithm optimized for hardware efficiency. It addresses the inefficiencies of prior linear attention methods by avoiding excessive memory movements and better utilizing parallel computation resources.
  2. Data-Dependent Gating Mechanism: The paper extends linear attention with data-dependent gates, creating Gated Linear Attention (GLA). This mechanism replaces the fixed, data-independent decay rate of prior models with a more expressive, data-aware variant that improves model flexibility and performance.
  3. Empirical Benchmarking: Extensive experiments validate the GLA Transformer against existing models such as LLaMA, RetNet, and Mamba across a range of benchmarks. The results indicate that GLA Transformers match or exceed the performance of these baselines on language modeling tasks and exhibit strong length-generalization capabilities.

Technical Details

FlashLinearAttention

FlashLinearAttention achieves hardware efficiency through two key strategies:

  1. Tiling and Memory Hierarchy Awareness: The algorithm breaks down computations into tiles that fit into fast, on-chip memory (SRAM), significantly reducing the reliance on slower global memory (HBM).
  2. Parallel and Sequential I/O Operations: Depending on memory constraints, it employs either a materialization approach, which holds intermediate chunk-level states in HBM for increased parallelism, or a non-materialization approach, which recomputes those states to save memory at the cost of additional computation (a simplified chunkwise sketch follows this list).
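As a rough illustration of the chunkwise strategy behind these two modes, the sketch below splits the sequence into chunks, carries cross-chunk interactions through a running state, and computes within-chunk interactions with dense masked matrix products. It assumes the simplified unnormalized setting from the earlier sketch and is not the paper's fused Triton kernel.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk_size=64):
    """Chunkwise-parallel linear attention (simplified, unnormalized).

    Each chunk's output combines an inter-chunk term, read from the running
    state S, with an intra-chunk term computed by a causally masked matmul
    that maps well onto tensor cores. In FlashLinearAttention the per-chunk
    states are either materialized in HBM (more sequence-level parallelism)
    or recomputed (less memory traffic); this toy version simply keeps S as
    a running variable.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for start in range(0, T, chunk_size):
        q = Q[start:start + chunk_size]
        k = K[start:start + chunk_size]
        v = V[start:start + chunk_size]
        causal = np.tril(np.ones((len(q), len(q))))      # mask within the chunk
        out[start:start + len(q)] = q @ S + (q @ k.T * causal) @ v
        S = S + k.T @ v                                  # carry the state forward
    return out
```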

Gated Linear Attention (GLA)

GLA introduces a gating mechanism that dynamically adjusts based on the input data:

  1. Matrix-Valued Gates: Instead of using a fixed decay factor, GLA uses data-dependent gates computed by a low-rank linear transformation followed by a sigmoid. This allows finer control over how information is retained in the matrix-valued state across time steps (see the recurrence sketch after this list).
  2. Parallel Computation Form: The paper also establishes a parallel form for GLA, demonstrating how efficient chunkwise parallel computation can be achieved despite the complexity added by the gates.
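A minimal recurrent sketch of the GLA update described above is shown below. The low-rank gate projection (Wa1, Wa2) and all weight names are placeholders, and normalization, the gate temperature, and the output gate used in the full model are omitted for clarity; this is a simplification for illustration, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gla_recurrent(X, Wq, Wk, Wv, Wa1, Wa2):
    """Simplified recurrent form of gated linear attention (GLA).

    The decay is data dependent: alpha_t comes from a low-rank projection
    (Wa1 then Wa2) of the input followed by a sigmoid, and it rescales the
    rows of the matrix-valued state before the new key-value outer product
    is written in.
    """
    T = X.shape[0]
    d_k, d_v = Wk.shape[1], Wv.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        q, k, v = X[t] @ Wq, X[t] @ Wk, X[t] @ Wv
        alpha = sigmoid(X[t] @ Wa1 @ Wa2)         # data-dependent decay in (0, 1)^{d_k}
        S = alpha[:, None] * S + np.outer(k, v)   # gate the old state, then write
        out[t] = q @ S
    return out
```

Setting alpha to a constant scalar recovers a fixed-decay model in the spirit of RetNet, which is the contrast the data-dependent gating mechanism is designed to improve on.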

Empirical Results

The empirical evaluation of the GLA Transformer encompasses several dimensions:

  1. Synthetic Tasks: On the Multi-Query Associative Recall (MQAR) task, GLA outperforms scalar-decay models such as RetNet, validating the effectiveness of the data-dependent gating mechanism (a toy version of the task is sketched after this list).
  2. Language Modeling: On language modeling benchmarks, GLA Transformers exhibit competitive perplexity and accuracy, closely matching or outperforming the state-of-the-art, including the LLaMA architecture.
  3. Training Efficiency: GLA Transformers offer superior training throughput compared to Mamba and traditional Transformers, particularly when leveraging the materialization strategy for handling longer sequences.
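For readers unfamiliar with MQAR, a toy generator in the spirit of the task is sketched below; the vocabulary sizes, token layout, and function name are illustrative and may differ from the paper's experimental setup.

```python
import random

def make_mqar_example(num_pairs=8, num_queries=4,
                      key_vocab=range(100, 200), value_vocab=range(200, 300)):
    """Toy multi-query associative recall (MQAR) example.

    The prompt lists key-value pairs; the model is then queried with several
    of the keys and must recall each associated value.
    """
    keys = random.sample(list(key_vocab), num_pairs)
    values = random.choices(list(value_vocab), k=num_pairs)
    kv = dict(zip(keys, values))
    prompt = [tok for pair in zip(keys, values) for tok in pair]  # k1 v1 k2 v2 ...
    queries = random.sample(keys, num_queries)
    targets = [kv[k] for k in queries]
    return prompt + queries, targets
```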

Future Directions

The findings in this paper point towards several future research avenues:

  1. Scaling Up: Given the promising empirical results at moderate scales, the next step involves scaling GLA Transformers to larger models and datasets to explore their potential at industry-relevant scales.
  2. Cross-Modal Applications: Extending GLA mechanisms to other domains, such as vision and audio, where long-range dependencies are critical, could further validate its versatility and efficiency.
  3. Further Optimization: Continued enhancements in hardware-aware algorithms, potentially integrating emerging memory technologies or specialized computation units, could further improve the efficiency and performance of GLA Transformers.

Conclusion

The paper offers a significant step forward in the development of efficient Transformer architectures by integrating gated mechanisms into linear attention frameworks and optimizing their implementation for hardware. The GLA Transformer, underpinned by the FlashLinearAttention algorithm, presents a compelling alternative to conventional models, balancing computational efficiency and modeling power. This work opens new pathways for deploying large-scale neural models in resource-constrained environments, maintaining high performance standards.

References (63)
  1. The sciqa scientific question answering benchmark for scholarly knowledge. Scientific Reports, 13(1):7240, May 2023. ISSN 2045-2322. doi: 10.1038/s41598-023-33607-z. URL https://doi.org/10.1038/s41598-023-33607-z.
  2. Using fast weights to attend to the recent past. Advances in neural information processing systems, 29, 2016.
  3. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  7432–7439, 2020.
  4. Guy E. Blelloch. Prefix sums and their applications. 1990. URL https://api.semanticscholar.org/CorpusID:60459178.
  5. Striped attention: Faster ring attention for causal transformers. ArXiv, abs/2311.09431, 2023. URL https://api.semanticscholar.org/CorpusID:265220849.
  6. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  7. 3.2 the a100 datacenter gpu and ampere architecture. In 2021 IEEE International Solid-State Circuits Conference (ISSCC), volume 64, pp.  48–50, 2021. doi: 10.1109/ISSCC42613.2021.9365803.
  8. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
  9. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  10. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  11. Accelerating reduction and scan using tensor core units. In Rudolf Eigenmann, Chen Ding, and Sally A. McKee (eds.), Proceedings of the ACM International Conference on Supercomputing, ICS 2019, Phoenix, AZ, USA, June 26-28, 2019, pp.  46–57. ACM, 2019. doi: 10.1145/3330345.3331057. URL https://doi.org/10.1145/3330345.3331057.
  12. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691, 2023. doi: 10.48550/ARXIV.2307.08691. URL https://doi.org/10.48550/arXiv.2307.08691.
  13. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html.
  14. Flashfftconv: Efficient convolutions for long sequences with tensor cores. CoRR, abs/2311.05908, 2023. doi: 10.48550/ARXIV.2311.05908. URL https://doi.org/10.48550/arXiv.2311.05908.
  15. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.
  16. Learning to forget: Continual prediction with LSTM. Neural Comput., 12(10):2451–2471, 2000. doi: 10.1162/089976600300015015. URL https://doi.org/10.1162/089976600300015015.
  17. Mamba: Linear-time sequence modeling with selective state spaces. 2023. URL https://api.semanticscholar.org/CorpusID:265551773.
  18. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a. URL https://openreview.net/forum?id=uYLFoz1vlAC.
  19. Efficiently modeling long sequences with structured state spaces, 2022b.
  20. Franz A. Heinsen. Efficient parallelization of an ubiquitous sequential computation. 2023. URL https://api.semanticscholar.org/CorpusID:265149785.
  21. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pp.  177–186, 1987.
  22. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  23. Transformer quality in linear time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  9099–9117. PMLR, 2022. URL https://proceedings.mlr.press/v162/hua22a.html.
  24. Going beyond linear transformers with recurrent fast weight programmers. Advances in Neural Information Processing Systems, 34:7703–7717, 2021.
  25. Finetuning pretrained transformers into rnns. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp.  10630–10643. Association for Computational Linguistics, 2021a. doi: 10.18653/V1/2021.EMNLP-MAIN.830. URL https://doi.org/10.18653/v1/2021.emnlp-main.830.
  26. Finetuning pretrained transformers into RNNs. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  10630–10643, Online and Punta Cana, Dominican Republic, November 2021b. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.830. URL https://aclanthology.org/2021.emnlp-main.830.
  27. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp.  5156–5165. PMLR, 2020.
  28. Tobias Katsch. Gateloop: Fully data-controlled linear recurrence for sequence modeling. ArXiv, abs/2311.01927, 2023. URL https://api.semanticscholar.org/CorpusID:265018962.
  29. tcfft: Accelerating half-precision FFT through tensor cores. CoRR, abs/2104.11471, 2021. URL https://arxiv.org/abs/2104.11471.
  30. Lightseq: Sequence level parallelism for distributed training of long context transformers. ArXiv, abs/2310.03294, 2023a. URL https://api.semanticscholar.org/CorpusID:263671659.
  31. Sequence parallelism: Long sequence training from system perspective. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2391–2404, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.134. URL https://aclanthology.org/2023.acl-long.134.
  32. Lucas D. Lingle. Transformer-vq: Linear-time transformers via vector quantization. CoRR, abs/2309.16354, 2023. doi: 10.48550/ARXIV.2309.16354. URL https://doi.org/10.48550/arXiv.2309.16354.
  33. Ring attention with blockwise transformers for near-infinite context. ArXiv, abs/2310.01889, 2023. URL https://api.semanticscholar.org/CorpusID:263608461.
  34. Fixing weight decay regularization in adam. 2018.
  35. Huanru Henry Mao. Fine-tuning pre-trained transformers into decaying fast weights. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  10236–10242, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.697. URL https://aclanthology.org/2022.emnlp-main.697.
  36. Parallelizing linear recurrent neural nets over sequence length. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HyUNwulC-.
  37. Long range language modeling via gated state spaces. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=5MkYIYCbva.
  38. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  39. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
  40. RWKV: reinventing rnns for the transformer era. CoRR, abs/2305.13048, 2023. doi: 10.48550/ARXIV.2305.13048. URL https://doi.org/10.48550/arXiv.2305.13048.
  41. Random feature attention. arXiv preprint arXiv:2103.02143, 2021.
  42. Accelerating non-power-of-2 size fourier transforms with GPU tensor cores. In 35th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, OR, USA, May 17-21, 2021, pp.  507–516. IEEE, 2021. doi: 10.1109/IPDPS49936.2021.00059. URL https://doi.org/10.1109/IPDPS49936.2021.00059.
  43. Recurrent linear transformers. CoRR, abs/2310.15719, 2023. doi: 10.48550/ARXIV.2310.15719. URL https://doi.org/10.48550/arXiv.2310.15719.
  44. The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022.
  45. Scaling transnormer to 175 billion parameters. arXiv preprint arXiv:2307.14995, 2023a.
  46. Hierarchically gated recurrent neural network for sequence modeling. CoRR, abs/2311.04823, 2023b. doi: 10.48550/ARXIV.2311.04823. URL https://doi.org/10.48550/arXiv.2311.04823.
  47. Swish: a self-gated activation function. arXiv: Neural and Evolutionary Computing, 2017. URL https://api.semanticscholar.org/CorpusID:196158220.
  48. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019.
  49. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  50. Linear transformers are secretly fast weight programmers. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp.  9355–9366. PMLR, 2021. URL http://proceedings.mlr.press/v139/schlag21a.html.
  51. Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
  52. Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  53. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=Ai8Hw3AXqks.
  54. Roformer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  55. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
  56. Triton: an intermediate language and compiler for tiled neural network computations. In Tim Mattson, Abdullah Muzahid, and Armando Solar-Lezama (eds.), Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL@PLDI 2019, Phoenix, AZ, USA, June 22, 2019, pp.  10–19. ACM, 2019. doi: 10.1145/3315508.3329973. URL https://doi.org/10.1145/3315508.3329973.
  57. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  58. Jos van der Westhuizen and Joan Lasenby. The unreasonable effectiveness of the forget gate. CoRR, abs/1804.04849, 2018. URL http://arxiv.org/abs/1804.04849.
  59. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  60. Pretraining without attention. CoRR, abs/2212.10544, 2022. doi: 10.48550/ARXIV.2212.10544. URL https://doi.org/10.48550/arXiv.2212.10544.
  61. Diffusion models without attention. 2023. URL https://api.semanticscholar.org/CorpusID:265506646.
  62. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  63. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
Authors (5)
  1. Songlin Yang (42 papers)
  2. Bailin Wang (34 papers)
  3. Yikang Shen (62 papers)
  4. Rameswar Panda (79 papers)
  5. Yoon Kim (92 papers)
Citations (89)