
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models (2406.07368v2)

Published 11 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Autoregressive LLMs have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation compared to prior linear attention methods. Codes and models are available at https://github.com/GATECH-EIC/Linearized-LLM.
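For context on the linear-attention side of the abstract, here is a minimal sketch (not the paper's specific augmentation; the elu(x)+1 feature map, shapes, and function names are illustrative assumptions) of how kernel-based linear attention replaces quadratic softmax attention by reassociating the matrix products:

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard attention: cost grows quadratically with sequence length n."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (batch, n, n) score matrix
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    """Kernel-based linear attention with phi(x) = elu(x) + 1 (illustrative choice).

    Reassociating phi(Q) @ (phi(K)^T @ V) avoids materializing the (n, n)
    score matrix, so cost grows linearly with sequence length.
    """
    q_phi = F.elu(q) + 1                              # (batch, n, d)
    k_phi = F.elu(k) + 1                              # (batch, n, d)
    kv = k_phi.transpose(-2, -1) @ v                  # (batch, d, d_v) key/value summary
    z = k_phi.sum(dim=-2)                             # (batch, d) normalizer terms
    num = q_phi @ kv                                  # (batch, n, d_v)
    den = (q_phi * z.unsqueeze(-2)).sum(dim=-1, keepdim=True) + eps  # (batch, n, 1)
    return num / den

# Toy shapes: batch of 2, 16 tokens, head dimension 8 (illustrative only)
q, k, v = (torch.randn(2, 16, 8) for _ in range(3))
out = linear_attention(q, k, v)   # (2, 16, 8)
```

The paper pairs this kind of linearized attention with speculative decoding, in which a smaller draft model proposes several tokens that the target LLM then verifies in parallel, addressing the sequential-generation bottleneck mentioned in the abstract.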

Authors (5)
  1. Haoran You (33 papers)
  2. Yichao Fu (18 papers)
  3. Zheng Wang (400 papers)
  4. Amir Yazdanbakhsh (38 papers)
  5. Yingyan Celine Lin (19 papers)
Citations (1)