Nemotron-4 15B Technical Report (2402.16819v2)

Published 26 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual LLM trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.

Comprehensive Analysis of Nemotron-4 15B: A Multifaceted LLM

Introduction to Nemotron-4 15B

The landscape of LLM development has shifted toward balancing model size with the breadth of training data, a principle supported by the Chinchilla scaling laws. Within this context, Nemotron-4 15B emerges as a significant contribution to the field. This 15-billion-parameter model, trained on a dataset of 8 trillion tokens, not only sets a new benchmark in multilingual and coding task performance but also competes strongly across English evaluation benchmarks.
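
To make the data-versus-parameters trade-off concrete, the back-of-the-envelope check below compares the reported 8 trillion training tokens against the roughly 20-tokens-per-parameter heuristic commonly associated with the Chinchilla results. The constant is an approximation used only for illustration, not a figure from the report.

    # Rough check of tokens-per-parameter for Nemotron-4 15B, using the
    # ~20 tokens/parameter rule of thumb often associated with Chinchilla.
    # The constant is approximate and used here only for illustration.
    params = 15e9            # model parameters
    tokens = 8e12            # training tokens reported in the abstract

    chinchilla_tokens = 20 * params        # ~= 3.0e11 (0.3T) tokens
    print(tokens / params)                 # ~= 533 tokens per parameter
    print(tokens / chinchilla_tokens)      # ~= 26.7x the heuristic budget

In other words, the model is trained on far more data per parameter than a strictly compute-optimal recipe would prescribe, consistent with the emphasis on data scale described above.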

Architecture and Training Strategy

Nemotron-4 15B uses a decoder-only transformer architecture with causal attention masks. Its configuration, including Rotary Position Embeddings and a SentencePiece tokenizer, contributes to its capabilities. Training drew on a blend of English, multilingual, and coding data, with careful deduplication and quality filtering to keep the corpus robust.
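
As a concrete illustration of the positional scheme named above, the snippet below is a minimal sketch of rotary position embeddings in the common "rotate-half" formulation. The sequence length, head dimension, and base value are illustrative assumptions, not the model's actual configuration.

    # Minimal sketch of rotary position embeddings (RoPE) applied to one
    # attention head's query matrix, using the "rotate-half" layout found in
    # many open implementations. All sizes here are illustrative.
    import numpy as np

    def apply_rope(x, base=10000.0):
        """x: (seq_len, head_dim) with head_dim even; returns rotated x."""
        seq_len, head_dim = x.shape
        half = head_dim // 2
        inv_freq = base ** (-np.arange(half) / half)      # per-pair frequencies
        angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        # Rotate each (x1_i, x2_i) pair by a position-dependent angle.
        return np.concatenate([x1 * cos - x2 * sin,
                               x1 * sin + x2 * cos], axis=-1)

    q = np.random.randn(16, 64)    # 16 positions, head dimension 64 (assumed)
    q_rot = apply_rope(q)
    print(q_rot.shape)             # (16, 64)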

Training ran on 384 NVIDIA H100 nodes with a combination of tensor and data parallelism. These measures, together with staged batch-size adjustments and a methodical training schedule, allowed the model to be trained efficiently at scale.
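
The sketch below illustrates how tensor- and data-parallel degrees relate to the reported node count. The GPUs-per-node figure and the example parallel degrees are assumptions for illustration; the report's exact layout may differ.

    # Rough accounting of the training hardware described above. The GPUs-per-
    # node figure and the parallelism degrees below are assumptions, not values
    # taken from the report.
    nodes = 384
    gpus_per_node = 8                      # typical H100 server config (assumed)
    total_gpus = nodes * gpus_per_node     # 3072 GPUs

    # In a Megatron-style layout, tensor-parallel (TP) and data-parallel (DP)
    # group sizes multiply to the total GPU count when no pipeline parallelism
    # is used.
    tensor_parallel = 8                    # hypothetical: one TP group per node
    data_parallel = total_gpus // tensor_parallel
    print(total_gpus, tensor_parallel * data_parallel)   # 3072 3072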

Empirical Evaluation

Nemotron-4 15B's performance was rigorously evaluated across a diverse range of tasks. Its proficiency in commonsense reasoning, popular aggregated benchmarks (namely MMLU and BBH), mathematical reasoning, coding tasks, and multilingual benchmarks underscores its broad capability and versatility.

  • Commonsense Reasoning: Nemotron-4 15B demonstrated robust performance, outperforming several prominent models in average scores.
  • Popular Aggregated Benchmarks: It achieved strong results on BBH, surpassing other models of similar scale by a significant margin.
  • Math and Code: The model posted solid results on math and code benchmarks, notably outperforming specialized code models on low-resource programming languages (the pass@k metric commonly used for such code evaluations is sketched after this list).
  • Multilingual Competencies: Nemotron-4 15B excelled in its multilingual capabilities, showcasing superior performance over models trained explicitly for multilingual tasks. Its performance on tasks such as XCOPA, TyDiQA-GoldP, MGSM, and FLORES-101 validates its exceptional understanding and generative abilities across languages.
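
As referenced in the Math and Code item above, code benchmarks of this kind are typically scored with pass@k. The snippet below is a minimal sketch of the standard unbiased pass@k estimator; the sample counts are made up for illustration and are not results from the report.

    # Unbiased pass@k estimator (Chen et al., 2021), the metric commonly used
    # for code-generation benchmarks such as HumanEval and MultiPL-E.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """n: samples per problem, c: samples that pass the tests, k: budget."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Hypothetical example: 200 samples drawn for a problem, 37 pass the tests.
    print(round(pass_at_k(n=200, c=37, k=1), 3))    # 0.185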

Implications and Future Directions

The success of Nemotron-4 15B across a spectrum of benchmarks underscores the efficacy of scaling data alongside model parameters within a fixed computational budget. It also highlights the potential of general-purpose models to surpass specialized models across diverse domains, provided the training data is sufficiently expansive and diverse. From a practical standpoint, Nemotron-4 15B's efficiency and scalability suggest its applicability in real-world scenarios, potentially reducing the latency and computational demands of deploying LLMs.

Theoretically, the findings contribute to our understanding of LLM training dynamics, offering empirical evidence that supports the Chinchilla scaling laws. For future research, the performance of Nemotron-4 15B opens up avenues to explore further optimizations in training regimes, architectural innovations, and the integration of even more diverse data sources to enhance model performance across an expanded range of languages and tasks.

In conclusion, Nemotron-4 15B represents a significant stride in LLM development, combining efficiency with strong multilingual and coding capabilities. Its achievements point to a promising trajectory for future research in AI and natural language processing, with broad potential applications and deeper insight into the inner workings of large-scale LLMs.

Authors (27)
  1. Jupinder Parmar (10 papers)
  2. Shrimai Prabhumoye (40 papers)
  3. Joseph Jennings (10 papers)
  4. Mostofa Patwary (34 papers)
  5. Sandeep Subramanian (24 papers)
  6. Dan Su (101 papers)
  7. Chen Zhu (103 papers)
  8. Deepak Narayanan (26 papers)
  9. Aastha Jhunjhunwala (5 papers)
  10. Ayush Dattagupta (3 papers)
  11. Vibhu Jawa (2 papers)
  12. Jiwei Liu (5 papers)
  13. Ameya Mahabaleshwarkar (1 paper)
  14. Osvald Nitski (4 papers)
  15. Annika Brundyn (4 papers)
  16. James Maki (2 papers)
  17. Miguel Martinez (19 papers)
  18. Jiaxuan You (51 papers)
  19. John Kamalu (8 papers)
  20. Patrick LeGresley (7 papers)