HLAT: High-quality Large Language Model Pre-trained on AWS Trainium (2404.10630v2)
Abstract: Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices as well as a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of expensive conventional accelerators (such as GPUs), underscoring the need for alternative specialized accelerators that are scalable and cost-efficient. AWS Trainium is the second-generation machine learning accelerator purpose-built for training large deep learning models. However, training LLMs with billions of parameters on AWS Trainium is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a family of 7B and 70B decoder-only LLMs pre-trained using 4096 AWS Trainium accelerators over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open-source models including LLaMA and OpenLLaMA, which were trained on NVIDIA GPUs and Google TPUs, respectively. Across a range of evaluation tasks, we show that HLAT achieves model quality on par with baselines of similar model size. We also open-source all the training scripts and configurations of HLAT (https://github.com/awslabs/HLAT) and share best practices for using NeuronX Distributed Training (NxDT), a customized distributed training library for AWS Trainium. Our work demonstrates that AWS Trainium powered by NxDT is able to pre-train state-of-the-art LLMs with high performance and cost-effectiveness.
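As rough context for how PyTorch code targets Trainium, the sketch below runs a single causal-LM training step on an XLA device via PyTorch/XLA, which is the interface through which Trainium NeuronCores are exposed. This is not the authors' NxDT setup: the tiny LLaMA-style config, optimizer settings, and dummy batch are illustrative assumptions, and the actual HLAT recipe (tensor/pipeline/data parallelism across 4096 accelerators) lives in the open-sourced scripts.

```python
# Minimal sketch, assuming a trn1 instance with the Neuron SDK, torch_xla, and
# transformers installed. Not the HLAT/NxDT training code.
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoConfig, AutoModelForCausalLM

device = xm.xla_device()  # maps to a NeuronCore when running on Trainium

# Tiny decoder-only config purely for illustration (HLAT itself is 7B/70B, LLaMA-style).
config = AutoConfig.for_model(
    "llama",
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=1024,
    vocab_size=32000,
)
model = AutoModelForCausalLM.from_config(config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def train_step(input_ids):
    # Causal-LM loss: the model shifts the labels internally.
    outputs = model(input_ids=input_ids, labels=input_ids)
    outputs.loss.backward()
    xm.optimizer_step(optimizer)  # reduces gradients across replicas (if any) and steps
    optimizer.zero_grad()
    xm.mark_step()                # cut and execute the accumulated XLA graph
    return outputs.loss.detach()

# Dummy batch standing in for tokenized pre-training data.
batch = torch.randint(0, config.vocab_size, (2, 128), device=device)
loss = train_step(batch)
```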
- TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
- The Falcon Series of Open Language Models. arXiv:2311.16867 [cs.CL]
- PaLM 2 Technical Report. arXiv:2305.10403 [cs.CL]
- Apache Arrow. 2020. Apache Arrow, a cross-language development platform for in-memory analytics. https://arrow.apache.org/.
- PIQA: Reasoning about Physical Commonsense in Natural Language. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7432–7439.
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet. Advances in Neural Information Processing Systems 36 (2024).
- Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
- A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (2023).
- Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 [cs.LG]
- Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174 [cs.LG]
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In NAACL.
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. ArXiv abs/1803.05457 (2018). https://api.semanticscholar.org/CorpusID:3922816
- Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168 (2021).
- Together Computer. 2023. RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset. https://github.com/togethercomputer/RedPajama-Data
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
- FairScale authors. 2021. FairScale: A general purpose modular PyTorch library for high performance and large scale training. https://github.com/facebookresearch/fairscale.
- A framework for few-shot language model evaluation. Version v0.0.1, Sept. 2021.
- Xinyang Geng and Hao Liu. 2023. OpenLLaMA: An Open Reproduction of LLaMA. https://github.com/openlm-research/open_llama
- OLMo: Accelerating the Science of Language Models. arXiv:2402.00838 [cs.CL]
- Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (Lille, France) (ICML’15). JMLR.org, 1737–1746.
- NeMo: a toolkit for Conversational AI and Large Language Models. https://github.com/NVIDIA/NeMo
- Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
- Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
- How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. arXiv preprint arXiv:2302.09210 (2023).
- Training Compute-Optimal Large Language Models. arXiv:2203.15556 [cs.CL]
- Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv:1902.09506 [cs.CL]
- TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv preprint arXiv:1705.03551 (2017).
- Reducing Activation Recomputation in Large Transformer Models. arXiv:2205.05198 [cs.LG]
- Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association of Computational Linguistics (2019).
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
- Pretrained language models for text generation: A survey. arXiv preprint arXiv:2201.05273 (2022).
- StarCoder: may the source be with you! (2023). arXiv:2305.06161 [cs.CL]
- TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958 [cs.CL]
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. https://openreview.net/forum?id=Bkg6RiCqY7
- Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
- N.A. [n. d.]. Paper under review.
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online.
- OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023). https://arxiv.org/abs/2303.08774
- The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116 (2023). arXiv:2306.01116 https://arxiv.org/abs/2306.01116
- Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. ArXiv. https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/
- DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
- Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale. Commun. ACM 64, 9 (2021), 99–106.
- Noam Shazeer. 2020. GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202 (2020).
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022).
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv preprint arXiv:2210.09261 (2022).
- Spike No More: Stabilizing the Pre-training of Large Language Models. arXiv:2312.16903 [cs.CL]
- LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
- Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
- HellaSwag: Can a Machine Really Finish Your Sentence? arXiv preprint arXiv:1905.07830 (2019).
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2309.01219 (2023).
- A Survey of Large Language Models. arXiv preprint arXiv:2303.18223 (2023). http://arxiv.org/abs/2303.18223
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277 (2023).
- Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023).
Authors: Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan