RecurrentGemma: Moving Past Transformers for Efficient Open Language Models (2404.07839v2)

Published 11 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce RecurrentGemma, a family of open LLMs which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both. Our models achieve comparable performance to similarly-sized Gemma baselines despite being trained on fewer tokens.

Authors (62)
  1. Aleksandar Botev (17 papers)
  2. Soham De (38 papers)
  3. Anushan Fernando (3 papers)
  4. George-Cristian Muraru (2 papers)
  5. Ruba Haroun (3 papers)
  6. Leonard Berrada (14 papers)
  7. Razvan Pascanu (138 papers)
  8. Pier Giuseppe Sessa (26 papers)
  9. Robert Dadashi (25 papers)
  10. Léonard Hussenot (25 papers)
  11. Johan Ferret (24 papers)
  12. Sertan Girgin (24 papers)
  13. Olivier Bachem (52 papers)
  14. Alek Andreev (7 papers)
  15. Kathleen Kenealy (11 papers)
  16. Thomas Mesnard (18 papers)
  17. Cassidy Hardin (5 papers)
  18. Surya Bhupatiraju (11 papers)
  19. Shreya Pathak (12 papers)
  20. Laurent Sifre (21 papers)

Summary

Exploring RecurrentGemma-2B: A High-Performance Open Model with Efficient Inference on Long Sequences

Introduction to RecurrentGemma-2B

RecurrentGemma-2B is an open model built on the Griffin architecture, designed for efficient processing of long sequences. Unlike transformers, whose KV caches grow with sequence length, RecurrentGemma-2B compresses the input sequence into a fixed-size state, which reduces memory use and speeds up inference without sacrificing benchmark performance. The model is competitive with Gemma-2B, making it a notable entry among open LLMs.
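To make the fixed-size state concrete, the toy sketch below (plain NumPy, not the authors' code) updates a constant-size recurrent state token by token; the dimensions and the decay/input coefficients are illustrative assumptions, not Griffin's actual gated recurrence.

```python
import numpy as np

def recurrence_step(state, x_t, a, b):
    """One step of a simple linear recurrence: the state is a fixed-size
    vector updated for every new token, so its memory footprint does not
    grow with sequence length (unlike a transformer KV cache)."""
    return a * state + b * x_t  # elementwise decay plus input injection

# Toy dimensions, not the RecurrentGemma configuration.
width = 8
a = np.full(width, 0.9)   # per-channel decay (learned and gated in Griffin)
b = np.full(width, 0.1)   # per-channel input scale

state = np.zeros(width)
for _ in range(10_000):   # process a long sequence...
    x_t = np.random.randn(width)
    state = recurrence_step(state, x_t, a, b)

print(state.shape)        # (8,) -- constant, however long the sequence gets
```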

Model Architecture

RecurrentGemma-2B follows the Griffin architecture with a single modification: input embeddings are scaled by the square root of the model width. In addition, weight decay is not applied to the parameters of the recurrent layers during training, a deliberate choice aimed at stability and performance.
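A minimal sketch of these two choices follows; the parameter names and dimensions are hypothetical, introduced only for illustration, and the code is not taken from the released RecurrentGemma implementation.

```python
import math
import numpy as np

# Toy dimensions for illustration only.
vocab_size, width = 1_000, 16
embed_table = (np.random.randn(vocab_size, width) * 0.02).astype(np.float32)

def embed(token_ids):
    # Scale input embeddings by the square root of the model width.
    return embed_table[token_ids] * math.sqrt(width)

def weight_decay_mask(param_name):
    # Apply weight decay everywhere except the recurrent-layer parameters.
    # The "recurrent/" naming convention is an assumption for this sketch.
    return 0.0 if param_name.startswith("recurrent/") else 1.0

params = ["mlp/w_in", "recurrent/a_gate", "attention/w_q"]
print({name: weight_decay_mask(name) for name in params})
# {'mlp/w_in': 1.0, 'recurrent/a_gate': 0.0, 'attention/w_q': 1.0}
```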

Training Details and Instruction Tuning

The paper details the training protocol: RecurrentGemma-2B was pre-trained on 2T tokens drawn from a dataset curated to reduce the risk of propagating unwanted outputs. The model matches the performance of Gemma-2B even though Gemma-2B was trained on 50% more tokens. The pre-trained model was then fine-tuned with a novel Reinforcement Learning from Human Feedback (RLHF) method to make it adept at instruction following and dialogue.

Evaluation Across Benchmarks

RecurrentGemma-2B was compared against Gemma-2B across a range of benchmarks, including academic and safety-oriented evaluations. It is competitive on most tasks and clearly superior in inference speed, especially on longer sequences, which is the principal advantage of its design.

Inference Speed and Practical Implications

A significant portion of the discussion focuses on inference-speed benchmarks, where RecurrentGemma-2B achieves substantially higher throughput than Gemma-2B, particularly when generating long sequences. This efficiency stems from its compact state, which lets the model run at larger batch sizes and sustain higher throughput. These qualities make RecurrentGemma-2B well suited to resource-constrained environments, potentially unlocking new applications for small, highly performant LLMs.
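The back-of-envelope comparison below illustrates why a fixed-size state permits larger batches; all hyperparameters are assumed toy values, not the published Gemma or RecurrentGemma configurations.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # Keys and values for every layer, KV head, and past token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

def fixed_state_bytes(n_layers, state_dim, bytes_per_val=2):
    # One constant-size recurrent state per layer, independent of sequence length.
    return n_layers * state_dim * bytes_per_val

for seq_len in (1_024, 8_192, 65_536):
    kv = kv_cache_bytes(seq_len, n_layers=26, n_kv_heads=8, head_dim=128)
    rs = fixed_state_bytes(n_layers=26, state_dim=2_560)
    print(f"{seq_len:>6} tokens: KV cache {kv/1e6:8.1f} MB | fixed state {rs/1e6:5.2f} MB")
```

Per sequence, the KV cache grows linearly with length while the recurrent state stays constant, so under a fixed memory budget the recurrent model can serve far more sequences in parallel.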

Responsible Deployment

Adopting a responsible lens, the authors elaborate on safety protocols and ethical considerations adhered to during the development of RecurrentGemma-2B. The model underwent rigorous safety and ethics evaluations, though the paper advises users to perform use-case specific safety analyses prior to deployment.

Conclusion and Future Directions

In conclusion, RecurrentGemma-2B emerges as a robust model that balances performance and efficiency, particularly for long-sequence processing. Its architectural choices and training methodology represent a meaningful advance for LLMs, pointing toward future work that reduces computational demands while maintaining or improving capability.

As AI research continues to evolve, models like RecurrentGemma-2B underscore the importance of optimizing not just for accuracy but efficiency and ethical responsibility, setting a precedent for future innovations in the field.
