Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (2402.19427v1)
Abstract: Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
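The two ingredients the abstract names, gated linear recurrences and local attention, can be illustrated with a short JAX sketch. This is a minimal stand-in under assumed shapes and an assumed gating form, not the paper's actual recurrence unit or block layout; the names `gated_linear_recurrence` and `local_attention` and all parameter shapes below are illustrative.

```python
# Hedged sketch in JAX: a generic gated linear recurrence scanned over time,
# plus causal attention restricted to a sliding window. Both are illustrative
# stand-ins for the components named in the abstract, not the paper's exact design.
import jax
import jax.numpy as jnp


def gated_linear_recurrence(x, w_gate, w_in, log_decay):
    """Diagonal linear recurrence with input-dependent gating (illustrative).

    x:         (T, D) input sequence
    w_gate:    (D, D) projection for the recurrence gate (assumed shape)
    w_in:      (D, D) projection for the input branch (assumed shape)
    log_decay: (D,)   per-channel decay parameter
    """
    gate = jax.nn.sigmoid(x @ w_gate)            # (T, D) input-dependent gate
    decay = jnp.exp(-jnp.exp(log_decay) * gate)  # gated per-step decay in (0, 1)
    u = (x @ w_in) * (1.0 - decay)               # scaled input contribution

    def step(h, inputs):
        a_t, u_t = inputs
        h = a_t * h + u_t                        # elementwise: h_t = a_t * h_{t-1} + u_t
        return h, h

    h0 = jnp.zeros(x.shape[-1])
    _, hs = jax.lax.scan(step, h0, (decay, u))
    return hs                                    # (T, D) hidden states


def local_attention(q, k, v, window):
    """Causal attention restricted to a sliding window of `window` tokens."""
    T, d = q.shape
    scores = (q @ k.T) / jnp.sqrt(d)             # (T, T) raw attention scores
    pos = jnp.arange(T)
    causal = pos[:, None] >= pos[None, :]
    local = pos[:, None] - pos[None, :] < window
    scores = jnp.where(causal & local, scores, -jnp.inf)
    return jax.nn.softmax(scores, axis=-1) @ v


# Tiny usage example with random inputs and identity projections.
key = jax.random.PRNGKey(0)
T, D = 8, 4
x = jax.random.normal(key, (T, D))
h = gated_linear_recurrence(x, jnp.eye(D), jnp.eye(D), jnp.zeros(D))
y = local_attention(x, x, x, window=4)
print(h.shape, y.shape)  # (8, 4) (8, 4)
```

The sketch reflects why such a hybrid can be efficient at inference: the scanned recurrence carries a fixed-size state per channel, and the sliding-window attention bounds the key/value cache to the window length rather than the full sequence.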
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
- Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.
- JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022a.
- Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022b.
- Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941. PMLR, 2017.
- J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
- Gemini Team Google. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- It’s raw! Audio generation with state-space models. In International Conference on Machine Learning, pages 7616–7633, 2022.
- A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- HiPPO: Recurrent memory with optimal polynomial projections. In Advances in Neural Information Processing Systems, volume 33, pages 1474–1487, 2020.
- Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021a.
- Combining recurrent, convolutional, and continuous-time models with linear state space layers. In Advances in Neural Information Processing Systems, volume 34, pages 572–585, 2021b.
- On the parameterization and initialization of diagonal state space models. arXiv preprint arXiv:2206.11893, 2022.
- D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Repeat after me: Transformers are better than state space models at copying. arXiv preprint arXiv:2402.01032, 2024.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1–14, 2023.
- Ten lessons from three generations shaped Google’s TPUv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. IEEE, 2021.
- R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82, 1960.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- T. Katsch. GateLoop: Fully data-controlled linear recurrence for sequence modeling. arXiv preprint arXiv:2311.01927, 2023.
- The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024.
- Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 2002.
- Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- NVIDIA tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531. IEEE, 2018.
- E. Martin and C. Cundy. Parallelizing linear recurrent neural nets over sequence length. arXiv preprint arXiv:1709.04057, 2017.
- Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947, 2022.
- Recurrent neural network based language model. In INTERSPEECH 11th Annual Conference of the International Speech Communication Association, pages 1045–1048, 2010.
- Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
- The design process for Google’s training chips: TPUv2 and TPUv3. IEEE Micro, 41(2):56–63, 2021.
- On the universality of linear recurrences followed by nonlinear projections. arXiv preprint arXiv:2307.11888, 2023a.
- Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349, 2023b.
- RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866, 2023.
- Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- N. Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
- N. Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991. ISSN 0893-9659.
- Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
- Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
- MambaByte: Token-free selective state space model. arXiv preprint arXiv:2401.13660, 2024.
- P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
- Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
- On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
- An attention free transformer. arXiv preprint arXiv:2105.14103, 2021.
- B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
- Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.