Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (2402.19427v1)

Published 29 Feb 2024 in cs.LG and cs.CL

Abstract: Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.


Summary

  • The paper presents Hawk, a recurrent model built on a gated linear recurrent layer (RG-LRU), and Griffin, a hybrid architecture that mixes RG-LRU blocks with local attention to handle long sequences efficiently.
  • Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches Transformer baselines such as Llama-2 despite being trained on over six times fewer tokens.
  • Both models extrapolate to sequences significantly longer than those seen during training, match the hardware efficiency of Transformers during training, and offer lower latency and higher throughput at inference.

Efficient Scaling of LLMs with Hawk and Griffin: Bridging RNNs and Local Attention

Introduction

The landscape of NLP has shifted decisively towards Transformer models, owing to their ability to exploit modern hardware and deliver strong performance across a wide range of tasks. Despite these advantages, Transformers scale poorly with sequence length because of the quadratic cost of global attention. This paper introduces two architectures: Hawk, built around a gated linear recurrent layer called the RG-LRU (Real-Gated Linear Recurrent Unit), and Griffin, a hybrid model that mixes the RG-LRU with local attention. These models combine the inference efficiency of RNNs on long sequences with performance comparable to large Transformers, even when trained on significantly fewer tokens.

Model Architecture

The core of the work is the RG-LRU, a gated linear recurrent layer that summarizes the sequence in a fixed-size state, so memory and compute per generated token stay constant at inference regardless of context length. The paper details the architecture of both Hawk and Griffin, with Griffin combining the RG-LRU with local attention; an illustrative sketch of such a gated recurrence is given after the list below.

  • Hawk relies entirely on the RG-LRU layer for temporal mixing, and scales efficiently to increasingly long sequences.
  • Griffin is a hybrid that interleaves RG-LRU layers with blocks of local attention. This design lets Griffin retain the memory efficiency of RNNs while using local attention to model recent context precisely.
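
To make the temporal-mixing idea concrete, here is a minimal NumPy sketch of a gated linear recurrence in the spirit of the RG-LRU. The gate parameterization, the weight shapes, and the constant c are illustrative assumptions rather than the paper's exact formulation, and the real layer is implemented with custom kernels for hardware efficiency.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_recurrence(x, W_a, W_x, lam, c=8.0):
    """Toy gated linear recurrence in the spirit of the RG-LRU (not the paper's exact layer).

    x   : (T, D) input sequence for one block
    W_a : (D, D) recurrence-gate weights   (hypothetical shapes)
    W_x : (D, D) input-gate weights        (hypothetical shapes)
    lam : (D,)   parameter setting the base per-channel decay
    """
    T, D = x.shape
    h = np.zeros(D)
    a_base = sigmoid(lam)                 # base decay, each entry in (0, 1)
    out = np.zeros((T, D))
    for t in range(T):
        r_t = sigmoid(x[t] @ W_a)         # recurrence gate
        i_t = sigmoid(x[t] @ W_x)         # input gate
        a_t = a_base ** (c * r_t)         # gated per-channel decay
        # normalised update keeps the hidden state bounded
        h = a_t * h + np.sqrt(1.0 - a_t ** 2) * (i_t * x[t])
        out[t] = h
    return out
```

Because the state h is a fixed-size vector, per-token inference cost does not grow with context length, in contrast to the growing key-value cache of global attention.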

Evaluation and Performance

The evaluation of Hawk and Griffin spans held-out loss, hardware efficiency during training, and throughput and latency during inference. Hawk exceeds the reported performance of Mamba on downstream tasks, and Griffin matches or slightly surpasses Llama-2 despite being trained on over six times fewer tokens.

One of the standout findings is the models' ability to efficiently extrapolate beyond the sequence lengths observed during training, underscoring their potential for handling tasks characterized by long dependencies. This capability is particularly pronounced in Griffin, which balances the memory efficiency of RNNs with the contextual richness provided by local attention.
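
The role of local attention in this behaviour can be illustrated with a toy sketch. The snippet below builds a causal sliding-window attention pattern for a single head in NumPy; the window size, shapes, and function name are illustrative assumptions, not the paper's configuration, which embeds local attention inside larger residual blocks alongside the recurrent layers.

```python
import numpy as np

def local_attention_weights(q, k, window=4):
    """Causal sliding-window attention for a single head (illustrative only).

    Each query attends to itself and at most `window - 1` preceding positions,
    so compute grows linearly with sequence length and the state kept at
    inference time is bounded by the window size.
    """
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    idx = np.arange(T)
    # causal mask restricted to a fixed-size local window
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # stable softmax over allowed keys
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: attention weights outside the window (and in the "future") are exactly zero.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(16, 8))
attn = local_attention_weights(q, k, window=4)     # shape (16, 16)
```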

Implications and Future Directions

The implications of this work are twofold. Practically, Hawk and Griffin offer a pathway to more resource-efficient training and inference in LLMs, especially pertinent for sequences of extended lengths. Theoretically, these architectures contribute to the ongoing discourse on the optimal balance between global and local processing mechanisms in sequence modeling.

Looking ahead, the scalability and efficiency demonstrated by Hawk and Griffin prompt a reconsideration of the prevailing reliance on global attention mechanisms, especially for tasks where sequence length poses a distinct challenge. Further exploration of hybrid models, as exemplified by Griffin, may yield even more efficient architectures capable of navigating the trade-offs between computational resources, sequence length, and performance.

Conclusion

In summary, this paper presents a significant advance in the understanding and application of recurrent neural networks for efficient language modeling. Hawk and Griffin not only challenge the current Transformer-dominated paradigm by offering comparable performance but also illuminate a path forward for models that can more adeptly manage long sequences. As the field of NLP continues to evolve, efficient, scalable architectures of this kind are likely to play a pivotal role in shaping future research directions and applications.
