Nemotron-4 15B Technical Report (2402.16819v2)
Published 26 Feb 2024 in cs.CL, cs.AI, and cs.LG
Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves performance competitive with the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.
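For a sense of scale, here is a minimal back-of-the-envelope sketch of the data-to-parameter ratio implied by the abstract's two headline numbers (15B parameters, 8T tokens). The ~20 tokens-per-parameter comparison point is the commonly cited Chinchilla compute-optimal heuristic, used here only as an assumed reference point, not a figure from the report:

```python
# Back-of-the-envelope arithmetic using only the two numbers stated in the
# abstract: 15 billion parameters and 8 trillion training tokens.
params = 15e9   # model parameters
tokens = 8e12   # training tokens

ratio = tokens / params
print(f"tokens per parameter: {ratio:.0f}")  # ~533

# Assumption: ~20 tokens/parameter is the commonly cited "compute-optimal"
# Chinchilla heuristic, used here purely as a reference point.
chinchilla_ratio = 20
print(f"data-heavy factor vs. heuristic: {ratio / chinchilla_ratio:.1f}x")  # ~26.7x
```

In other words, the model is trained on far more data per parameter than a compute-optimal recipe would suggest, which is consistent with the abstract's emphasis on strong downstream performance at a comparatively small model size.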
Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley