Nemotron-4 15B Technical Report (2402.16819v2)
Published 26 Feb 2024 in cs.CL, cs.AI, and cs.LG
Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves performance competitive with the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.
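For a sense of scale, here is a minimal back-of-the-envelope sketch of the data-to-parameter ratio implied by the abstract's two headline numbers (15B parameters, 8T tokens). The ~20 tokens-per-parameter comparison point is the commonly cited Chinchilla compute-optimal heuristic, used here only as an assumed reference point, not a figure from the report:

```python
# Back-of-the-envelope arithmetic using only the two numbers stated in the
# abstract: 15 billion parameters and 8 trillion training tokens.
params = 15e9   # model parameters
tokens = 8e12   # training tokens

ratio = tokens / params
print(f"tokens per parameter: {ratio:.0f}")  # ~533

# Assumption: ~20 tokens/parameter is the commonly cited "compute-optimal"
# Chinchilla heuristic, used here purely as a reference point.
chinchilla_ratio = 20
print(f"data-heavy factor vs. heuristic: {ratio / chinchilla_ratio:.1f}x")  # ~26.7x
```

In other words, the model is trained on far more data per parameter than a compute-optimal recipe would suggest, which is consistent with the abstract's emphasis on strong downstream performance at a comparatively small model size.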
Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley