- The paper presents a recurrent architecture that compresses input sequences into fixed-size states, enabling faster inference and lower memory usage versus traditional transformers.
- It introduces a key modification to the Griffin architecture by scaling input embeddings and omitting weight decay on recurrent layers to enhance stability and performance.
- Evaluations show that despite using fewer training tokens than Gemma-2B, RecurrentGemma-2B achieves competitive performance and excels in long-sequence processing.
Exploring RecurrentGemma-2B: A High-Performance, Efficient Inference Model on Long Sequences
Introduction to RecurrentGemma-2B
RecurrentGemma-2B is an open model built on the Griffin architecture, with a focus on efficient processing of long sequences. Whereas traditional transformers must maintain a KV cache that grows with sequence length, RecurrentGemma-2B compresses the input sequence into a fixed-size state, achieving faster inference and lower memory usage without compromising benchmark performance. The model remains competitive with Gemma-2B, underlining its significance in the domain of LLMs.
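To make this contrast concrete, here is a toy sketch in plain Python/NumPy with hypothetical dimensions (not RecurrentGemma-2B's real configuration): during step-by-step generation, a transformer-style KV cache gains one entry per token, while a recurrent state of fixed size is simply overwritten in place.

```python
import numpy as np

# Hypothetical, illustrative dimensions -- not RecurrentGemma-2B's real configuration.
d_model, n_layers, steps = 256, 4, 1000

kv_cache = [[] for _ in range(n_layers)]                  # grows with every generated token
state = [np.zeros(d_model) for _ in range(n_layers)]      # fixed size, never grows

x = np.random.randn(d_model)                              # stand-in for the current token's activations
for _ in range(steps):
    for layer in range(n_layers):
        kv_cache[layer].append((x.copy(), x.copy()))      # transformer: keep K and V for every past token
        state[layer] = 0.9 * state[layer] + 0.1 * x       # recurrent: overwrite the state in place

print("cached entries per layer:", len(kv_cache[0]))      # 1000, and still growing
print("recurrent state shape per layer:", state[0].shape) # (256,), constant
```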
Model Architecture
Architecturally, RecurrentGemma-2B follows the Griffin architecture, which interleaves gated linear recurrences with local attention, and makes a single modification to it: input embeddings are scaled by the square root of the model width. In addition, weight decay is not applied to the parameters of the recurrent layers during training, a deliberate choice aimed at improving stability and performance.
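A minimal sketch of both choices is below, written in plain Python/NumPy with small assumed sizes; the parameter names and the decoupled weight-decay update are illustrative stand-ins, not the paper's actual training code. It shows input embeddings multiplied by the square root of the model width, and the decay term simply skipped for parameters belonging to recurrent layers.

```python
import math
import numpy as np

# Small, assumed sizes for illustration only.
width, vocab = 256, 1000
rng = np.random.default_rng(0)

# 1) Scale input embeddings by the square root of the model width.
embedding_table = rng.normal(size=(vocab, width)).astype(np.float32)
token_ids = np.array([5, 17, 42])
x = embedding_table[token_ids] * math.sqrt(width)          # scaled input embeddings

# 2) Apply weight decay everywhere except the recurrent-layer parameters.
#    Parameter names below are hypothetical labels, not the real checkpoint keys.
params = {
    "mlp/w": rng.normal(size=(width, width)),
    "recurrent/rg_lru_gate": rng.normal(size=(width,)),
}
grads = {name: np.ones_like(p) for name, p in params.items()}
lr, weight_decay = 1e-3, 0.1

for name, p in params.items():
    update = grads[name]
    if not name.startswith("recurrent/"):                  # recurrent layers are exempt from decay
        update = update + weight_decay * p                 # decoupled (AdamW-style) weight decay term
    params[name] = p - lr * update
```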
Training Details and Instruction Tuning
The paper details the training protocol: RecurrentGemma-2B was pre-trained on 2T tokens drawn from data curated to minimize the risk of propagating unwanted outputs. Notably, it achieves performance on par with Gemma-2B despite the latter being trained on 50% more tokens. The model was then fine-tuned with a novel Reinforcement Learning from Human Feedback (RLHF) algorithm to make it adept at instruction following and dialogue.
Evaluation Across Benchmarks
The evaluation compares RecurrentGemma-2B against Gemma-2B across a range of benchmarks, including academic and safety-oriented metrics. The model performs competitively on the majority of tasks, and its standout advantage is inference speed, specifically on longer sequences, a strength that follows directly from its design.
Inference Speed and Practical Implications
A significant portion of the discussion focuses on inference-speed benchmarks, where RecurrentGemma-2B surpasses its transformer counterpart, particularly when generating long sequences. This efficiency is attributed to its compact state size, which lets the model run at larger batch sizes and achieve higher throughput. These qualities make RecurrentGemma-2B well suited to deployment in resource-constrained environments, potentially unlocking new applications for small, highly capable LLMs.
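The back-of-the-envelope comparison below illustrates the mechanism; the layer counts, head sizes, and state width are assumed placeholders rather than RecurrentGemma-2B's published configuration (and the model's local-attention cache is likewise bounded by its window size rather than the full sequence). Per-sequence KV-cache memory grows linearly with sequence length, while a fixed-size recurrent state does not, so far more sequences fit in the same device-memory budget at long context lengths.

```python
# Illustrative arithmetic only -- all dimensions are placeholders, not the published config.
BYTES = 2                                  # bf16
n_layers, n_kv_heads, head_dim = 24, 1, 256
state_dim = 2560                           # assumed fixed recurrent-state width per layer
seq_len = 8192

# Transformer: K and V for every past token, at every layer.
kv_cache_bytes = n_layers * 2 * n_kv_heads * head_dim * seq_len * BYTES

# Recurrent model: one fixed-size state per layer, independent of seq_len.
recurrent_state_bytes = n_layers * state_dim * BYTES

print(f"KV cache per sequence:        {kv_cache_bytes / 2**20:.1f} MiB")
print(f"Recurrent state per sequence: {recurrent_state_bytes / 2**20:.2f} MiB")

# More sequences fit in the same memory budget -> larger batches, higher throughput.
budget = 8 * 2**30                         # an assumed 8 GiB set aside for per-sequence state
print("max concurrent sequences (transformer):", budget // kv_cache_bytes)
print("max concurrent sequences (recurrent):  ", budget // recurrent_state_bytes)
```

In this toy setup the fixed-state model fits orders of magnitude more concurrent sequences into the same budget, which is the mechanism behind the higher long-sequence throughput the paper reports.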
Responsible Deployment
The authors also take a responsible-deployment perspective, describing the safety protocols and ethical considerations followed during development. RecurrentGemma-2B underwent rigorous safety and ethics evaluations, though the paper advises users to perform use-case-specific safety analyses before deployment.
Conclusion and Future Directions
In conclusion, RecurrentGemma-2B emerges as a robust model that balances performance and efficiency, particularly for long-sequence processing. Its architectural choices and training methodology mark a meaningful advance for LLMs, pointing toward future work that reduces computational demands while maintaining or improving model capability.
As AI research continues to evolve, models like RecurrentGemma-2B underscore the importance of optimizing not just for accuracy but also for efficiency and ethical responsibility, setting a precedent for future innovations in the field.