Linearizing Large Language Models (2405.06640v1)

Published 10 May 2024 in cs.CL

Abstract: Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training LLMs requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models. Our code and models can be found at https://github.com/TRI-ML/linear_open_lm.

Understanding SUPRA: A New Approach to Linearize Pre-trained Transformers into RNNs

Overview of the Proposed Method

Scalable UPtraining for Recurrent Attention (SUPRA) is a cost-effective way to convert pre-trained transformers into Recurrent Neural Networks (RNNs). The approach aims to combine the strengths of both architectures: the strong pre-training and benchmark performance of transformers, and the cheap, fixed-memory inference of RNNs.

The Challenge with Linear Transformers

Conventional transformers train efficiently because attention parallelizes across the sequence, but their inference cost grows with context length: the key-value cache expands with every generated token. RNNs, by contrast, maintain a fixed-size hidden state and are therefore generally more memory-efficient at inference.

Linear transformers were introduced to keep the parallel training of standard transformers while gaining the memory efficiency of a recurrent state, but they typically fall short of compute-matched conventional transformers on demanding natural language processing benchmarks.
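
To make the trade-off concrete, here is the standard kernelized formulation of linear attention, sketched in the general form used across this family of models rather than SUPRA's exact parameterization: the softmax is replaced by a feature map φ, which lets attention be carried as a fixed-size running state.

```latex
% Causal softmax attention for token t attends over the entire prefix:
%   y_t = \sum_{i \le t} \mathrm{softmax}\!\left(q_t^{\top} k_i\right) v_i
% Linear attention replaces \exp(q^{\top} k) with \phi(q)^{\top} \phi(k),
% which factors the sum into a recurrent state:
y_t = \frac{\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i)\, v_i^{\top}}
           {\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i)}
    = \frac{\phi(q_t)^{\top} S_t}{\phi(q_t)^{\top} z_t},
\qquad
S_t = S_{t-1} + \phi(k_t)\, v_t^{\top}, \quad
z_t = z_{t-1} + \phi(k_t).
```

Because S_t and z_t have a fixed size, each decoding step costs constant memory rather than growing with the context, which is the inference advantage described above. As the Process section below notes, SUPRA departs from this classic formulation by normalizing the output with GroupNorm instead of the running denominator z_t.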

Enter SUPRA: A Hybrid Training Approach

SUPRA takes a middle path through uptraining: continuing to train an existing model after modifying its architecture. The method starts from a strong pre-trained transformer and adjusts it so that, at inference time, it can be run as an RNN.

The Process

  1. Linearization Technique: Replace the softmax normalization in attention with GroupNorm, which normalizes the attention output directly and allows attention to be computed in a linear, recurrent form (a minimal code sketch follows this list).
  2. Positional Encoding Adjustment: Use rotary positional embeddings, which carry over cleanly to the recurrent formulation and avoid the problems that absolute positional encodings tend to cause in RNNs.
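
The sketch below shows what one recurrent decode step might look like after these two changes: rotary embeddings applied to the query and key, a fixed-size state updated with the current key and value, and GroupNorm in place of the softmax denominator. All names here (recurrent_attention_step, the elu-based feature map phi, the toy dimensions) are hypothetical illustrations, not the authors' implementation, which is available in the linked repository; in SUPRA the projections that produce q, k, and v are inherited from the pre-trained transformer and refined during uptraining.

```python
import torch
import torch.nn.functional as F

def rotary(x, position, base=10000.0):
    """Apply rotary position embeddings (RoPE) to the last dimension of x."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = position * freqs                                     # (half,)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def phi(x):
    """Feature map standing in for softmax; elu+1 is one common, simple choice."""
    return F.elu(x) + 1.0

@torch.no_grad()
def recurrent_attention_step(q, k, v, state, group_norm, position):
    """One decode step of linear attention in recurrent form.

    q, k, v: (num_heads, head_dim) projections for the current token.
    state:   (num_heads, head_dim, head_dim) running sum of phi(k) v^T.
    """
    q = rotary(q, position)
    k = rotary(k, position)
    state = state + torch.einsum('hd,he->hde', phi(k), v)   # fixed-size recurrent state
    y = torch.einsum('hd,hde->he', phi(q), state)            # read the state out with the query
    # GroupNorm (one group per head) replaces the softmax denominator.
    y = group_norm(y.reshape(1, -1)).reshape(y.shape)
    return y, state

# Toy usage: 4 heads of dimension 16, random projections standing in for real ones.
num_heads, head_dim = 4, 16
gn = torch.nn.GroupNorm(num_groups=num_heads, num_channels=num_heads * head_dim)
state = torch.zeros(num_heads, head_dim, head_dim)
for t in range(8):
    q, k, v = (torch.randn(num_heads, head_dim) for _ in range(3))
    y, state = recurrent_attention_step(q, k, v, state, gn, position=torch.tensor(float(t)))
print(y.shape)  # torch.Size([4, 16])
```

Note that the state tensor keeps the same shape no matter how many tokens have been processed, which is what keeps per-token inference cost flat; the GroupNorm takes over the role the softmax denominator plays in standard attention.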

Because it starts from an already-trained transformer, SUPRA sidesteps the main cost facing other linear approaches: it uses only a small fraction of the original training tokens for uptraining (the paper puts the cost at about 5% of pre-training) while remaining competitive.

Testing the Performance

The uptrained models were rigorously evaluated across several benchmarks:

  • Standard Language Benchmarks: SUPRA models were competitive with leading pre-trained recurrent models while using notably less data and compute.
  • In-Context and Long-Context Tasks: The linearized models showed persistent shortfalls on in-context learning and long-context tasks, underscoring a gap that remains relative to conventional transformers.

The Implications and Future Prospects

SUPRA changes how large pre-trained models can be converted into more efficient forms without an enormous compute overhead. Practically, this could make recurrent inference viable again for applications where inference cost and resource efficiency are critical.

On the Theoretical Side

SUPRA suggests there is a rich vein to explore in hybrid approaches that reuse transformer pre-training inside recurrent architectures, setting the stage for future research on optimizing such models.

Looking Forward

While SUPRA demonstrates a promising approach, the models' struggles with long-context tasks point to a need for further refinement. Innovations such as richer gating mechanisms and alternative normalization techniques could help close the observed performance gap.

Conclusion

SUPRA presents an intriguing prospect in the quest for efficient AI modeling, offering a new toolkit for those looking to harness the strengths of transformers and RNNs alike. With continued development, SUPRA or its derivatives might soon become a staple in reducing computational costs while sustaining high performance across a range of AI tasks.

Authors (7)
  1. Jean Mercat (15 papers)
  2. Igor Vasiljevic (20 papers)
  3. Sedrick Keh (8 papers)
  4. Kushal Arora (13 papers)
  5. Achal Dave (31 papers)
  6. Adrien Gaidon (84 papers)
  7. Thomas Kollar (27 papers)
Citations (9)

HackerNews

  1. Linearizing Large Language Models (2 points, 0 comments)