LoLCATs: On Low-Rank Linearizing of Large Language Models (2410.10254v3)

Published 14 Oct 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Recent works show we can linearize LLMs -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However, linearizing LLMs often significantly degrades model quality, still requires training over billions of tokens, and remains limited to smaller 1.3B to 7B LLMs. We thus propose Low-rank Linear Conversion via Attention Transfer (LoLCATs), a simple two-step method that improves LLM linearizing quality with orders of magnitudes less memory and compute. We base these steps on two findings. First, we can replace an LLM's softmax attentions with closely-approximating linear attentions, simply by training the linear attentions to match their softmax counterparts with an output MSE loss ("attention transfer"). Then, this enables adjusting for approximation errors and recovering LLM quality simply with low-rank adaptation (LoRA). LoLCATs significantly improves linearizing quality, training efficiency, and scalability. We significantly reduce the linearizing quality gap and produce state-of-the-art subquadratic LLMs from Llama 3 8B and Mistral 7B v0.1, leading to 20+ points of improvement on 5-shot MMLU. Furthermore, LoLCATs does so with only 0.2% of past methods' model parameters and 0.4% of their training tokens. Finally, we apply LoLCATs to create the first linearized 70B and 405B LLMs (50x larger than prior work). When compared with prior approaches under the same compute budgets, LoLCATs significantly improves linearizing quality, closing the gap between linearized and original Llama 3.1 70B and 405B LLMs by 77.8% and 78.1% on 5-shot MMLU.

Summary

  • The paper introduces LoLCATs, a novel approach that replaces softmax attention with efficient linear attention to reduce computational load while maintaining model quality.
  • It employs an attention transfer mechanism and LoRA fine-tuning to approximate softmax behavior with minimal parameter updates.
  • Experiments close 77.8% and 78.1% of the 5-shot MMLU gap to the original Llama 3.1 70B and 405B models, while using only 0.2% of prior methods' model parameters and 0.4% of their training tokens.

LoLCATs: On Low-Rank Linearizing of LLMs

The paper introduces LoLCATs, which stands for Low-rank Linear Conversion via Attention Transfer. The method addresses the challenges of linearizing LLMs by converting conventional Transformer architectures, whose softmax attention scales quadratically with sequence length, into more efficient subquadratic forms. The researchers focus on preserving LLM quality while significantly reducing the memory and compute required for the conversion.

Key Contributions

  1. Efficient Linear Attention: The paper replaces the softmax-attention mechanisms in LLMs with linear attention variants. The authors argue that this shift substantially improves computational efficiency, especially at large sequence lengths, while refinement strategies such as low-rank adaptation recover quality close to softmax attention (the first sketch after this list contrasts the two).
  2. Attention Transfer Mechanism: A pivotal part of LoLCATs is the attention transfer process, in which linear attentions are trained to approximate softmax attentions by minimizing the mean squared error (MSE) between their outputs. Remarkably, this bypasses computationally intensive full-model training, leading to faster and more cost-effective model adaptation (the second sketch after this list illustrates the training step).
  3. Low-Rank Adaptation (LoRA): After attention transfer, LoLCATs employs LoRA to fine-tune the model and correct residual approximation errors. This low-rank fine-tuning updates only a small fraction of model parameters, facilitating scalability and efficiency (a generic LoRA layer is shown in the third sketch below).
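
The following sketch makes the efficiency argument in item 1 concrete. It is a minimal illustration rather than the paper's implementation: softmax attention materializes a sequence-length-by-sequence-length score matrix, while kernelized linear attention reorders the computation so the cost scales with the head dimension instead. The feature_map used here is a generic placeholder; LoLCATs instead learns feature maps trained to imitate softmax attention.

```python
# Minimal sketch (not the paper's code): quadratic softmax attention vs.
# kernelized linear attention, whose associativity avoids the n x n score matrix.
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim); builds a (seq_len x seq_len) matrix.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, feature_map=lambda x: F.elu(x) + 1):
    # feature_map is a placeholder; LoLCATs learns maps that approximate softmax.
    q, k = feature_map(q), feature_map(k)
    kv = k.transpose(-2, -1) @ v                            # (head_dim x head_dim) summary
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6
    return (q @ kv) / normalizer
```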
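The attention-transfer step in item 2 can be pictured roughly as follows, assuming per-layer teacher (softmax) and student (linear) attention modules that operate on the same hidden states; the module and argument names are illustrative, not the paper's API.

```python
# Minimal sketch of attention transfer: fit the linear attention's outputs to the
# frozen softmax attention's outputs with an MSE loss, training only the student's
# (feature-map) parameters.
import torch
import torch.nn.functional as F

def attention_transfer_step(softmax_attn, linear_attn, hidden_states, optimizer):
    with torch.no_grad():
        target = softmax_attn(hidden_states)   # frozen "teacher" softmax outputs
    pred = linear_attn(hidden_states)          # trainable "student" linear attention
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```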
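For item 3, the fine-tuning applied after attention transfer follows the standard LoRA recipe, sketched generically below: the base weight stays frozen and only a rank-r update, a tiny fraction of the parameters, is trained.

```python
# Generic LoRA sketch: a frozen base linear layer plus a trainable low-rank update,
# so only r * (in_features + out_features) parameters are learned per adapted weight.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the original weights frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```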

Numerical Results

The paper reports that LoLCATs significantly narrows the performance gap between linearized LLMs and their original forms. For instance, it closes 77.8% and 78.1% of the 5-shot MMLU gap between the linearized models and the original Llama 3.1 70B and 405B models, underscoring the efficacy of the method.

Furthermore, LoLCATs uses only 0.2% of previous methods' model parameters and 0.4% of their training tokens, illustrating its computational and resource efficiency.

Implications and Future Directions

The implications of LoLCATs are two-fold. Practically, it reduces the computational footprint enough to make LLM deployment feasible in resource-constrained environments. Theoretically, it sets a precedent for future research on model adaptation strategies that prioritize efficiency without sacrificing capability.

Future research may delve into enhancing the robustness of linear attention approximations, possibly through novel feature map designs or hybrid approaches combining subquadratic architectures. Additionally, exploring LoLCATs' contributions to more comprehensive LLM capabilities, like few-shot learning and generalization across varied domains, will be crucial.

Overall, LoLCATs signifies a positive step toward more sustainable AI development and deployment practices in large-scale language modeling.
