Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (2402.19427v1)

Published 29 Feb 2024 in cs.LG and cs.CL

Abstract: Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.


Summary

  • The paper presents Hawk, a recurrent model built on a gated linear recurrent layer (RG-LRU), and Griffin, a hybrid architecture that mixes RG-LRU blocks with local attention to handle long sequences efficiently.
  • Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches Transformer baselines such as Llama-2 despite being trained on over six times fewer tokens.
  • Both models extrapolate to sequences significantly longer than those seen during training, match the hardware efficiency of Transformers during training, and offer lower latency and higher throughput at inference.

Efficient Scaling of LLMs with Hawk and Griffin: Bridging RNNs and Local Attention

Introduction

The landscape of NLP has shifted decisively towards Transformer models, owing to their ability to exploit modern hardware and deliver strong performance across a wide range of tasks. Despite these advantages, Transformers scale poorly with sequence length because of the quadratic cost of global attention. This paper introduces two architectures: Hawk, built around a gated linear recurrent layer called the RG-LRU (Real-Gated Linear Recurrent Unit), and Griffin, a hybrid model that mixes the RG-LRU with local attention. These models combine the inference efficiency of RNNs on long sequences with performance comparable to large Transformers, even when trained on significantly fewer tokens.

Model Architecture

The core of the work is the RG-LRU, a gated linear recurrent layer that summarizes the sequence in a fixed-size state, so memory and compute per generated token stay constant at inference regardless of context length. The paper details the architecture of both Hawk and Griffin, with Griffin combining the RG-LRU with local attention; an illustrative sketch of such a gated recurrence is given after the list below.

  • Hawk relies entirely on the RG-LRU layer for temporal mixing, and scales efficiently to increasingly long sequences.
  • Griffin is a hybrid that interleaves RG-LRU layers with blocks of local attention. This design lets Griffin retain the memory efficiency of RNNs while using local attention to model recent context precisely.
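
To make the temporal-mixing idea concrete, here is a minimal NumPy sketch of a gated linear recurrence in the spirit of the RG-LRU. The gate parameterization, the weight shapes, and the constant c are illustrative assumptions rather than the paper's exact formulation, and the real layer is implemented with custom kernels for hardware efficiency.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_recurrence(x, W_a, W_x, lam, c=8.0):
    """Toy gated linear recurrence in the spirit of the RG-LRU (not the paper's exact layer).

    x   : (T, D) input sequence for one block
    W_a : (D, D) recurrence-gate weights   (hypothetical shapes)
    W_x : (D, D) input-gate weights        (hypothetical shapes)
    lam : (D,)   parameter setting the base per-channel decay
    """
    T, D = x.shape
    h = np.zeros(D)
    a_base = sigmoid(lam)                 # base decay, each entry in (0, 1)
    out = np.zeros((T, D))
    for t in range(T):
        r_t = sigmoid(x[t] @ W_a)         # recurrence gate
        i_t = sigmoid(x[t] @ W_x)         # input gate
        a_t = a_base ** (c * r_t)         # gated per-channel decay
        # normalised update keeps the hidden state bounded
        h = a_t * h + np.sqrt(1.0 - a_t ** 2) * (i_t * x[t])
        out[t] = h
    return out
```

Because the state h is a fixed-size vector, per-token inference cost does not grow with context length, in contrast to the growing key-value cache of global attention.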

Evaluation and Performance

The evaluation of Hawk and Griffin spans held-out loss, hardware efficiency during training, and throughput and latency during inference. Hawk exceeds the reported performance of Mamba on downstream tasks, and Griffin matches or slightly surpasses Llama-2 despite being trained on over six times fewer tokens.

One of the standout findings is the models' ability to efficiently extrapolate beyond the sequence lengths observed during training, underscoring their potential for handling tasks characterized by long dependencies. This capability is particularly pronounced in Griffin, which balances the memory efficiency of RNNs with the contextual richness provided by local attention.
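
The role of local attention in this behaviour can be illustrated with a toy sketch. The snippet below builds a causal sliding-window attention pattern for a single head in NumPy; the window size, shapes, and function name are illustrative assumptions, not the paper's configuration, which embeds local attention inside larger residual blocks alongside the recurrent layers.

```python
import numpy as np

def local_attention_weights(q, k, window=4):
    """Causal sliding-window attention for a single head (illustrative only).

    Each query attends to itself and at most `window - 1` preceding positions,
    so compute grows linearly with sequence length and the state kept at
    inference time is bounded by the window size.
    """
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    idx = np.arange(T)
    # causal mask restricted to a fixed-size local window
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # stable softmax over allowed keys
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: attention weights outside the window (and in the "future") are exactly zero.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
attn = local_attention_weights(q, k, window=4)     # shape (16, 16)
```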

Implications and Future Directions

The implications of this work are twofold. Practically, Hawk and Griffin offer a pathway to more resource-efficient training and inference in LLMs, especially pertinent for sequences of extended lengths. Theoretically, these architectures contribute to the ongoing discourse on the optimal balance between global and local processing mechanisms in sequence modeling.

Looking ahead, the scalability and efficiency demonstrated by Hawk and Griffin prompt a reconsideration of the prevailing reliance on global attention mechanisms, especially for tasks where sequence length poses a distinct challenge. Further exploration of hybrid models, as exemplified by Griffin, may yield even more efficient architectures capable of navigating the trade-offs between computational resources, sequence length, and performance.

Conclusion

In summary, this paper presents a significant advance in the understanding and application of recurrent neural networks for efficient language modeling. Hawk and Griffin not only challenge the current Transformer-dominated paradigm by offering comparable performance but also illuminate a path forward for models that can more adeptly manage long sequences. As the field of NLP continues to evolve, efficient, scalable architectures of this kind are likely to play a pivotal role in shaping future research directions and applications.
