Gemma: Open Models Based on Gemini Research and Technology (2403.08295v4)

Published 13 Mar 2024 in cs.CL and cs.AI

Abstract: This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.

Gemma: Expanding the Horizon of Open Models with Gemini's Technological Insights

Overview of Gemma Models

Google DeepMind's Gemma models represent a significant stride in the development of open models, derived from the research and technology behind the Gemini models. Drawing on architectural design, training regimes, and safety protocols developed for Gemini, Gemma provides a pair of lightweight, high-performance models suited to a variety of applications. Balancing computational efficiency with responsible AI development, the models demonstrate strong results in language understanding, reasoning, and safety, outperforming similarly sized open models on 11 of 18 text-based benchmark tasks.

Model Architectures and Parameters

Gemma models are built on the transformer decoder architecture, with several enhancements that improve performance and efficiency: rotary positional embeddings (RoPE), GeGLU activation functions in the feed-forward layers, and RMSNorm normalization. The release comprises two sizes: a 2-billion-parameter model, optimized for CPU and on-device applications, which uses multi-query attention, and a more capable 7-billion-parameter model, aimed at GPU and TPU deployment, which uses standard multi-head attention. Both variants share a SentencePiece tokenizer with a 256k-token vocabulary, a design choice that favors versatility across domains and scalability.
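To make these components concrete, the sketch below is a minimal, unofficial PyTorch rendering of a Gemma-style decoder block combining RMSNorm, RoPE, a GeGLU feed-forward, and multi-query attention with a single shared key/value head (the 2B configuration). All dimensions are illustrative placeholders, not the published hyperparameters, and the code is a sketch rather than the released implementation.

```python
# Minimal, unofficial sketch of a Gemma-style decoder block: RMSNorm,
# rotary position embeddings (RoPE), a GeGLU feed-forward, and multi-query
# attention with a single shared K/V head (as in the 2B model).
# Dimensions are illustrative, not the published hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale by the root-mean-square of the activations; no mean centering.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


def apply_rope(x, base: float = 10000.0):
    # x: (batch, heads, seq, head_dim). Rotate channel pairs by a
    # position-dependent angle ("rotate half" convention).
    *_, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class MultiQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * self.head_dim, bias=False)  # one shared K/V head
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        k = k.unsqueeze(1)  # (b, 1, t, head_dim): a single shared key head
        v = v.unsqueeze(1)
        q, k = apply_rope(q), apply_rope(k)
        # Broadcast the single K/V head across all query heads; causal masking.
        k = k.expand(b, self.n_heads, t, self.head_dim)
        v = v.expand(b, self.n_heads, t, self.head_dim)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, d))


class GeGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # GELU-gated linear unit: gate(x) passes through GELU and gates up(x).
        return self.down(F.gelu(self.gate(x)) * self.up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, ffn_hidden: int = 2048):
        super().__init__()
        self.attn_norm, self.attn = RMSNorm(dim), MultiQueryAttention(dim, n_heads)
        self.ffn_norm, self.ffn = RMSNorm(dim), GeGLU(dim, ffn_hidden)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        return x + self.ffn(self.ffn_norm(x))


block = DecoderBlock()
print(block(torch.randn(1, 16, 512)).shape)  # torch.Size([1, 16, 512])
```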

Training Procedures and Data Utilization

Gemma models are trained on Google's TPU infrastructure, with a training stack built for efficient large-scale distribution across accelerators. The models are pretrained on large corpora of primarily English web documents, mathematical content, and code, roughly 2 trillion tokens for the 2B model and 6 trillion for the 7B model. Instruction-tuned variants are then produced through supervised fine-tuning on prompt-response pairs, followed by reinforcement learning from human feedback (RLHF), aligning model outputs with human preferences and improving both helpfulness and safety.
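As a small illustration of the instruction-tuning setup, the sketch below formats a dialogue with the control tokens the paper describes for its fine-tuned checkpoints; exact whitespace and special-token handling here are assumptions of this sketch, not a specification.

```python
# Sketch of the dialogue formatting used for Gemma's instruction-tuned
# checkpoints. The <start_of_turn>/<end_of_turn> control tokens and the
# "user"/"model" roles follow the paper's formatting description; exact
# whitespace handling is an assumption of this sketch.
def format_dialogue(turns: list[tuple[str, str]]) -> str:
    """turns: (role, text) pairs, with role in {"user", "model"}."""
    parts = [f"<start_of_turn>{role}\n{text}<end_of_turn>\n" for role, text in turns]
    parts.append("<start_of_turn>model\n")  # cue the model to produce the next turn
    return "".join(parts)


print(format_dialogue([("user", "What is a good place for travel in the US?")]))
```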

Evaluations and Benchmarks

Systematic evaluations show Gemma performing strongly against comparable open models, including LLaMA-2 (7B and 13B) and Mistral 7B, across a broad suite of benchmarks, outperforming similarly sized open models on 11 of 18 text-based tasks. The advantage is most pronounced on mathematics and coding benchmarks such as GSM8K, MATH, and HumanEval, alongside competitive results in question answering and commonsense reasoning, reflecting capability in both generalist tasks and specialized domains.

Safety Measures and Responsible Deployment

DeepMind reports rigorous safety assessments and multiple layers of mitigation to address the risks of releasing Gemma's weights. These include filtering the pretraining data to reduce sensitive and personal information, structured development and release protocols, automated and human safety evaluations of the fine-tuned checkpoints, analyses of verbatim memorization of training data, and a commitment to ongoing evaluation and refinement to guard against both unintentional harms and deliberate misuse of the technology.
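The memorization analysis follows the general recipe of prior training-data extraction work: prompt the model with a prefix drawn from a training document and check whether greedy decoding reproduces the true continuation verbatim. The sketch below shows that test in outline; the model interface and the 50-token window are illustrative assumptions of this sketch, not the paper's exact protocol.

```python
# Outline of a verbatim-memorization probe in the spirit of the paper's
# analysis: prompt with a prefix from a training document and check whether
# greedy decoding reproduces the true continuation. The model interface and
# the 50-token window are hypothetical stand-ins for this sketch.
from typing import Protocol, Sequence


class CausalLM(Protocol):
    def generate_greedy(self, prompt_ids: Sequence[int], max_new_tokens: int) -> list[int]:
        """Return only the newly generated token ids (hypothetical interface)."""


def is_verbatim_memorized(
    model: CausalLM,
    token_ids: Sequence[int],
    prefix_len: int = 50,
    suffix_len: int = 50,
) -> bool:
    if len(token_ids) < prefix_len + suffix_len:
        return False
    prefix = list(token_ids[:prefix_len])
    true_suffix = list(token_ids[prefix_len:prefix_len + suffix_len])
    # Greedy decoding: sampling would understate exact-match memorization.
    generated = model.generate_greedy(prefix, max_new_tokens=suffix_len)
    return generated[:suffix_len] == true_suffix
```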

Future Directions and Conclusion

Gemma represents a pivotal advancement in the landscape of open AI models, driven by methodological innovation and a commitment to ethical AI principles. By providing access to both pretrained and fine-tuned models, the Gemma project invites exploration and development within the research community, promising to catalyze further breakthroughs in AI capabilities. While acknowledging the inherent limitations and areas for further research, the deployment of Gemma models is a calculated step toward democratizing AI research and enabling a new generation of AI applications.
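For readers who want to experiment with the released pretrained and instruction-tuned checkpoints, one common route is Hugging Face Transformers. The snippet below assumes that release channel and the publicly listed model IDs (gated access must be granted first); these distribution details are not drawn from the paper itself.

```python
# Example of loading and prompting an instruction-tuned Gemma checkpoint via
# Hugging Face Transformers. The release channel and model ID are assumptions
# of this example, not details from the paper; access to the gated repository
# must be granted on the Hub first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # "google/gemma-2b", "google/gemma-7b", etc. are also published
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "<start_of_turn>user\n"
    "Explain rotary position embeddings in two sentences.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```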

Authors (108)
  1. Gemma Team
  2. Thomas Mesnard
  3. Cassidy Hardin
  4. Robert Dadashi
  5. Surya Bhupatiraju
  6. Shreya Pathak
  7. Laurent Sifre
  8. Morgane Rivière
  9. Mihir Sanjay Kale
  10. Juliette Love
  11. Pouya Tafti
  12. Léonard Hussenot
  13. Aakanksha Chowdhery
  14. Adam Roberts
  15. Aditya Barua
  16. Alex Botev
  17. Alex Castro-Ros
  18. Ambrose Slone
  19. Amélie Héliou
  20. Andrea Tacchetti