
Gemma 2: Improving Open Language Models at a Practical Size (2408.00118v3)

Published 31 Jul 2024 in cs.CL and cs.AI

Abstract: In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.

Gemma 2: Enhancing Small Open LLMs with Advanced Techniques

The paper "Gemma 2: Improving Open Language Models at a Practical Size" by the Gemma Team at Google DeepMind details the advances behind the Gemma 2 family of LLMs. These models range from 2 billion to 27 billion parameters and aim to deliver high performance at a practical size. The primary focus of the work is applying several advanced architecture and training techniques that substantially improve the performance of smaller models without a proportional increase in their size.

Model Architecture and Training Innovations

The Gemma 2 models build on the transformer architecture and incorporate enhancements such as interleaving local-global attentions and group-query attention (GQA). These modifications are crucial for balancing computational efficiency and model performance.

Key Innovations:

  1. Interleaving Local-Global Attentions: This technique alternates between local sliding-window attention and global attention across layers, combining detailed local interactions with broader context awareness. The sliding window spans 4096 tokens, while global layers attend over a span of 8192 tokens (a toy sketch of this masking pattern follows this list).
  2. Grouped-Query Attention (GQA): GQA shares each key-value head across a group of query heads, shrinking the key-value cache and speeding up inference while preserving downstream performance (also sketched after this list).
  3. Knowledge Distillation: Instead of conventional next-token prediction, the smaller 2B and 9B models are trained with knowledge distillation: a larger teacher model's probability distribution over next tokens provides a richer training signal, simulating a longer training run (a toy formulation of the objective appears under Ablations and Insights below).
  4. Training Data Scale: The 2B, 9B, and 27B models are trained on 2 trillion, 8 trillion, and 13 trillion tokens, respectively, drawn from sources such as web documents, code, and scientific articles to ensure a broad and robust training corpus.
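
To make the first two innovations concrete, here is a minimal sketch in PyTorch of how alternating local/global attention masks and grouped-query attention can be expressed. It is not the paper's implementation: the function names, the toy window of 4 tokens, and the 8-query/2-key-value head split below are illustrative assumptions, while the 4096/8192 spans cited above come from the paper.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask where each query attends only to the most recent `window` keys."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape [seq_len, 1]
    k = torch.arange(seq_len).unsqueeze(0)  # key positions, shape [1, seq_len]
    return (k <= q) & (q - k < window)

def causal_mask(seq_len: int) -> torch.Tensor:
    """Full causal mask: each query may attend to every earlier position."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def layer_masks(num_layers: int, seq_len: int, window: int) -> list[torch.Tensor]:
    """Alternate local (sliding-window) and global (full causal) attention by layer."""
    return [sliding_window_mask(seq_len, window) if i % 2 == 0 else causal_mask(seq_len)
            for i in range(num_layers)]

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """GQA: queries keep all their heads, while keys/values use fewer heads that
    are shared across groups of query heads, shrinking the KV cache."""
    num_q_heads, num_kv_heads = q.shape[1], k.shape[1]
    group = num_q_heads // num_kv_heads          # query heads per shared KV head
    k = k.repeat_interleave(group, dim=1)        # expand KV heads to match queries
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy example only: 8 query heads sharing 2 KV heads, a window of 4 over 16 tokens.
batch, seq_len, head_dim = 1, 16, 32
q = torch.randn(batch, 8, seq_len, head_dim)
k = torch.randn(batch, 2, seq_len, head_dim)
v = torch.randn(batch, 2, seq_len, head_dim)
masks = layer_masks(num_layers=4, seq_len=seq_len, window=4)
out = grouped_query_attention(q, k, v, masks[0])   # attention for a "local" layer
```

The intent of the interleaving is that only the global layers pay the full quadratic attention cost over the long span, while the shared key-value heads in GQA cut the memory held in the KV cache at inference time.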

Architecture-Specific Enhancements

The paper discusses additional architectural decisions that contribute to the superior performance of Gemma 2 models:

  • Logit Soft-Capping: A tanh-based cap applied to the attention logits in each layer and to the final output logits, smoothly bounding them so that extreme values cannot destabilize training (a short sketch follows this list).
  • RMSNorm for Normalization: RMSNorm stabilizes training by normalizing the input and output of each transformer sub-layer (also sketched below).
  • Deeper Networks: At a fixed parameter budget, deeper networks were found to slightly outperform wider ones, motivating the choice of increased depth.
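
As a rough illustration of the two stabilization pieces above, the sketch below implements generic logit soft-capping, cap * tanh(logits / cap), and RMSNorm (Zhang and Sennrich, 2019). The cap values of 50.0 for attention logits and 30.0 for final logits follow the Gemma 2 report; everything else (names, shapes, epsilon) is a toy assumption, not the released implementation.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap) via cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """RMSNorm: rescale by the root-mean-square of the features, then apply a
    learned per-feature gain (no mean subtraction, unlike LayerNorm)."""
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

# Toy check: the report caps attention logits at 50.0 and final logits at 30.0.
attn_logits = torch.randn(2, 8, 16, 16) * 1000.0   # deliberately extreme values
assert soft_cap(attn_logits, cap=50.0).abs().max() < 50.0

hidden = torch.randn(2, 16, 64)                    # [batch, seq, features]
gain = torch.ones(64)                              # learned weight, initialized to 1
normed = rms_norm(hidden, gain)
```

Because tanh saturates gradually, soft-capping keeps useful gradients for moderate logits while still bounding their magnitude, unlike hard clipping, which zeroes gradients beyond the threshold.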

Performance Evaluation

The Gemma 2 models undergo rigorous evaluation across a broad suite of benchmarks:

  • Automated Benchmarks: Results on benchmarks like MMLU, GSM8K, ARC-c, HellaSwag, and others demonstrate that Gemma 2 models significantly outperform previous iterations and are competitive with larger models. The 27B model, for instance, remains competitive with the much larger LLaMA-3 70B model.
  • Human Evaluations: The instruction-tuned Gemma 2 models also show marked improvements in human preference evaluations and safety assessments. These models exhibit low violation rates across several safety metrics and maintain robust performance under adversarial conditions.

Ablations and Insights

The paper provides insightful ablations examining the impact of various architectural and training choices:

  • Knowledge Distillation vs. Training from Scratch: Distilled models significantly outperform those trained from scratch, even when trained on an equivalent amount of data (a toy formulation of the distillation objective follows this list).
  • Scaling Effects: Distillation continues to benefit larger models, indicating the scalable advantages of the technique.
  • Attention Mechanisms: Switching to GQA from traditional multi-head attention provides inference speed benefits with minimal performance trade-offs.
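
To see the training-signal difference concretely, here is a minimal, hypothetical sketch of a token-level distillation loss: the student matches the teacher's full next-token distribution instead of a one-hot target. It is a generic KL formulation in PyTorch with toy shapes, not the Gemma team's training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-position KL(teacher || student), averaged over batch and positions.

    Both tensors have shape [batch, seq_len, vocab_size]. The teacher's soft
    distribution replaces the usual one-hot next-token target, giving the
    student a richer signal at every position.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # reduction="none" keeps the per-element terms; sum over the vocabulary to
    # get the KL at each position, then average over batch and positions.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(dim=-1)
    return kl.mean()

# Toy usage with random "teacher" and "student" logits over a 100-token vocabulary.
batch, seq_len, vocab = 2, 8, 100
loss = distillation_loss(torch.randn(batch, seq_len, vocab),
                         torch.randn(batch, seq_len, vocab))
```

Minimizing this KL is equivalent, up to the teacher's entropy (a constant with respect to the student), to the cross-entropy against the teacher's soft targets.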

Implications and Future Directions

The findings from Gemma 2 have broad implications for the development of efficient LLMs:

  • Practical Scaling: Enhancements like interleaved attention mechanisms and knowledge distillation allow smaller models to achieve performance levels previously reserved for much larger models, democratizing access to advanced language understanding capabilities.
  • Efficiency in Training and Inference: The adoption of techniques like GQA and logit soft-capping ensures that the models remain computationally feasible during both training and deployment, making them accessible to a wider range of applications and environments.

The successful implementation of these techniques in Gemma 2 opens up avenues for further research into efficient and scalable model training. Future endeavors could involve the exploration of more advanced training signals, adaptive learning mechanisms, and further optimizations in attention mechanisms to push the envelope of performance versus practicality in LLM development.

In conclusion, Gemma 2 represents a significant step forward in the quest for high-performance yet practical LLMs. The advancements presented not only enhance the current state of small-scale models but also set the stage for future innovations in the field.

References (58)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.
  3. AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  4. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
  5. The falcon series of open language models, 2023.
  6. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  7. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
  8. Pathways: Asynchronous distributed dataflow for ml, 2022.
  9. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016. URL http://arxiv.org/abs/1611.09940.
  10. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020a.
  11. Longformer: The long-document transformer. CoRR, abs/2004.05150, 2020b. URL https://arxiv.org/abs/2004.05150.
  12. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
  13. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
  14. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
  15. Chatbot arena: An open platform for evaluating llms by human preference, 2024.
  16. Boolq: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019. URL http://arxiv.org/abs/1905.10044.
  17. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
  18. Gemini Team. Gemini: A family of highly capable multimodal models, 2023.
  19. Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
  20. Gemma Team. Gemma: Open models based on gemini research and technology, 2024.
  21. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024.
  22. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.
  23. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  24. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  25. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.
  26. Mistral 7b, 2023.
  27. Llm comparator: Visual analytics for side-by-side evaluation of large language models, 2024. URL https://arxiv.org/abs/2402.10524.
  28. Evaluating language-model agents on realistic autonomous tasks, 2024. URL https://arxiv.org/abs/2312.11671.
  29. T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012.
  30. Madlad-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662, 2023.
  31. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
  32. Malla: Demystifying real-world large language model integrated malicious services, 2024. URL https://arxiv.org/abs/2401.03315.
  33. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025, 2015. URL http://arxiv.org/abs/1508.04025.
  34. Personal Communication, 2024.
  35. Towards agile text classifiers for everyone, 2023. URL https://arxiv.org/abs/2302.06541.
  36. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
  37. Evaluating frontier models for dangerous capabilities, 2024. URL https://arxiv.org/abs/2403.13793.
  38. Language models are unsupervised multitask learners, 2019.
  39. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.
  40. Warp: On the benefits of weight averaged rewarded policies, 2024.
  41. Zero-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
  42. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8, 2023.
  43. WINOGRANDE: an adversarial winograd schema challenge at scale. CoRR, abs/1907.10641, 2019. URL http://arxiv.org/abs/1907.10641.
  44. N. Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  45. Model evaluation for extreme risks, 2023. URL https://arxiv.org/abs/2305.15324.
  46. Roformer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  47. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022.
  48. Qwen Team. Introducing Qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/.
  49. The language interpretability tool: Extensible, interactive visualizations and analysis for nlp models, 2020. URL https://arxiv.org/abs/2008.05122.
  50. Llama: Open and efficient foundation language models, 2023.
  51. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
  52. Ethical and social risks of harm from language models, 2021. URL https://arxiv.org/abs/2112.04359.
  53. xAI. grok-1, 2024. URL https://github.com/xai-org/grok-1.
  54. XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla.
  55. GSPMD: general and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021. URL https://arxiv.org/abs/2105.04663.
  56. Intercode: Standardizing and benchmarking interactive coding with execution feedback, 2023. URL https://arxiv.org/abs/2306.14898.
  57. B. Zhang and R. Sennrich. Root mean square layer normalization. CoRR, abs/1910.07467, 2019. URL http://arxiv.org/abs/1910.07467.
  58. Lmsys-chat-1m: A large-scale real-world llm conversation dataset. arXiv preprint arXiv:2309.11998, 2023.
Authors (198)
  1. Gemma Team (3 papers)
  2. Shreya Pathak (12 papers)
  3. Pier Giuseppe Sessa (26 papers)
  4. Cassidy Hardin (5 papers)
  5. Surya Bhupatiraju (11 papers)
  6. Léonard Hussenot (25 papers)
  7. Thomas Mesnard (18 papers)
  8. Bobak Shahriari (16 papers)
  9. Alexandre Ramé (23 papers)
  10. Johan Ferret (24 papers)
  11. Peter Liu (4 papers)
  12. Pouya Tafti (5 papers)
  13. Abe Friesen (5 papers)
  14. Michelle Casbon (3 papers)
  15. Sabela Ramos (10 papers)
  16. Ravin Kumar (10 papers)
  17. Charline Le Lan (15 papers)
  18. Sammy Jerome (5 papers)
  19. Anton Tsitsulin (29 papers)
  20. Nino Vieillard (22 papers)
Citations (229)