The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry (2402.04347v1)

Published 6 Feb 2024 in cs.LG and cs.CL

Abstract: Linear attentions have shown potential for improving Transformer efficiency, reducing attention's quadratic complexity to linear in sequence length. This holds exciting promise for (1) training linear Transformers from scratch, (2) "finetuned-conversion" of task-specific Transformers into linear versions that recover task performance, and (3) "pretrained-conversion" of Transformers such as LLMs into linear versions finetunable on downstream tasks. However, linear attentions often underperform standard softmax attention in quality. To close this performance gap, we find prior linear attentions lack key properties of softmax attention tied to good performance: low-entropy (or "spiky") weights and dot-product monotonicity. We further observe surprisingly simple feature maps that retain these properties and match softmax performance, but are inefficient to compute in linear attention. We thus propose Hedgehog, a learnable linear attention that retains the spiky and monotonic properties of softmax attention while maintaining linear complexity. Hedgehog uses simple trainable MLPs to produce attention weights mimicking softmax attention. Experiments show Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions up to 6 perplexity points on WikiText-103 with causal GPTs, and up to 8.7 GLUE score points on finetuned bidirectional BERTs. Hedgehog also enables pretrained-conversion. Converting a pretrained GPT-2 into a linear attention variant achieves state-of-the-art 16.7 perplexity on WikiText-103 for 125M subquadratic decoder models. We finally turn a pretrained Llama-2 7B into a viable linear attention Llama. With low-rank adaptation, Hedgehog-Llama2 7B achieves 28.1 higher ROUGE-1 points over the base standard attention model, where prior linear attentions lead to 16.5 point drops.

Authors (4)
  1. Michael Zhang
  2. Kush Bhatia
  3. Hermann Kumbong
  4. Christopher Ré
Citations (31)

Summary

Introduction

Linear attention mechanisms offer the exciting prospect of replacing traditional softmax attention, whose computational complexity is quadratic in sequence length, with linear-complexity alternatives. Despite these efficiency benefits, previously proposed linear attentions often deliver substantially lower model quality than their softmax counterparts.
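
To make the efficiency contrast concrete, the sketch below compares the two formulations. It is a minimal illustration rather than the paper's implementation: it assumes single-head, non-causal attention and uses the elu(x) + 1 feature map from prior work (Katharopoulos et al., 2020) as a stand-in for a generic kernel feature map.

```python
# Minimal sketch (not the paper's code): softmax attention materializes an
# N x N weight matrix, while kernelized linear attention reorders the
# computation to avoid it, scaling linearly in sequence length.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim); O(N^2 d) time, O(N^2) memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, feature_map=lambda x: F.elu(x) + 1):
    # Rewrites attention as phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1),
    # which costs O(N d^2) time and never forms the N x N matrix.
    q, k = feature_map(q), feature_map(k)
    kv = k.transpose(-2, -1) @ v                                     # (dim, dim)
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # (N, 1)
    return (q @ kv) / (normalizer + 1e-6)

# Both calls return (2, 128, 64) outputs; only the softmax version
# builds the full 128 x 128 attention matrix.
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
out_soft, out_lin = softmax_attention(q, k, v), linear_attention(q, k, v)
```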

Bridging the Performance Gap

The paper identifies two properties of softmax attention that prior linear variants lack and that are tied to good performance: low-entropy ("spiky") weight distributions and dot-product monotonicity. Building on this observation, the proposed method, dubbed Hedgehog, uses trainable single-layer MLPs as feature maps to produce attention weights that closely mimic softmax attention, recovering both its spiky and monotonic behavior. Hedgehog preserves linear computational complexity while performing strongly across several regimes, including training from scratch and finetuned conversion.
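
To illustrate what a learnable, softmax-mimicking feature map might look like in practice, the sketch below pairs a small trainable map with a distillation-style objective against a frozen softmax attention head. The layer size, the exponential activation, the concatenation of exp(z) and exp(-z), and the KL-divergence loss are illustrative assumptions, not the paper's exact parameterization or training objective.

```python
# Hedged sketch of the idea described above: a small trainable feature map whose
# outputs replace softmax, trained so that the resulting linear attention weights
# mimic the softmax attention weights of a teacher layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpikyFeatureMap(nn.Module):
    """Single-layer MLP feature map with an exponential activation, intended to
    yield low-entropy ('spiky') attention weights (illustrative parameterization)."""
    def __init__(self, head_dim: int, feature_dim: int):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x):
        z = self.proj(x)
        # Concatenating exp(z) and exp(-z) keeps the map positive and symmetric.
        return torch.cat([torch.exp(z), torch.exp(-z)], dim=-1)

def linear_attention_weights(q, k, feature_map):
    """Row-normalized weights phi(q_i) . phi(k_j); formed explicitly here only
    so they can be compared against softmax weights during training."""
    qf, kf = feature_map(q), feature_map(k)
    scores = qf @ kf.transpose(-2, -1)
    return scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-6)

def mimicry_loss(q, k, feature_map):
    """KL divergence from frozen teacher softmax weights to student linear weights."""
    with torch.no_grad():
        teacher = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    student = linear_attention_weights(q, k, feature_map)
    return F.kl_div(student.clamp_min(1e-9).log(), teacher, reduction="batchmean")

# Toy training step on random queries/keys standing in for a frozen attention head.
fmap = SpikyFeatureMap(head_dim=64, feature_dim=64)
opt = torch.optim.Adam(fmap.parameters(), lr=1e-3)
q, k = torch.randn(2, 128, 64), torch.randn(2, 128, 64)
opt.zero_grad()
loss = mimicry_loss(q, k, fmap)
loss.backward()
opt.step()
```

At inference time, a feature map trained this way would be used inside the linear attention computation shown earlier, so the quadratic weight matrix is never formed outside of training.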

Empirical Validation

Numerous experiments validate the effectiveness of Hedgehog, showing performance that surpasses prior linear attention formulations. In training-from-scratch scenarios, Hedgehog performs strongly on standard benchmarks such as the Long Range Arena (LRA) tasks and WikiText-103 language modeling, closing the performance gap to softmax attention by 68.6% on the latter. In the train-from-scratch and finetuned-conversion settings, Hedgehog recovers over 99% of standard Transformer quality, outpacing prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs and up to 8.7 GLUE score points with finetuned bidirectional BERTs.

Contributions and Scalability

The method makes a compelling case for the practicality and scalability of linear attentions in Transformers: converting a pretrained GPT-2 yields state-of-the-art WikiText-103 perplexity among 125M-parameter subquadratic decoder models, and a converted, scaled-up pretrained Llama-2 7B delivers significant gains on the SAMSum summarization task. Notably, Hedgehog's attention preserves fidelity as sequence length increases and transfers effectively to new tasks, evidencing its adaptability and generalization. The findings suggest that by effectively mimicking softmax attention, it is possible to achieve near-equivalent performance with linear complexity, offering a blend of efficiency and expressivity not attained by prior linear attentions.