InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding (2401.09149v3)
Abstract: Large language models (LLMs) with long sequences are beginning to power fundamentally new applications in everyday use. However, existing methods for long-sequence LLM training are neither efficient nor compatible with commonly used training algorithms such as FlashAttention. We design InternEvo to address these issues. InternEvo decouples all of the sharding dimensions into a new hierarchical space and systematically analyzes the memory and communication cost of LLM training; it then generates an effective hybrid parallelism strategy. We design a new selective overlap mechanism to mitigate the communication overhead introduced by the hybrid parallelism, and implement memory management techniques to reduce GPU memory fragmentation. Evaluation results show that InternEvo generates parallelization strategies that match or outperform existing methods in model FLOPs utilization.
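To make the strategy-search idea in the abstract concrete, below is a minimal, self-contained sketch of a cost-model-driven search over a hierarchical sharding space (data, tensor, pipeline, and sequence parallelism). It is not InternEvo's implementation: the `Strategy` class, the memory and communication formulas, and all constants (`GPUS`, `MEM_BUDGET_GB`, the 7B-parameter / 32K-token defaults) are illustrative assumptions used only to show the shape of such a search.

```python
# Illustrative sketch (not InternEvo's actual code) of searching a hierarchical
# sharding space with a toy cost model. All formulas and constants are assumed.
from dataclasses import dataclass
from itertools import product

GPUS = 128          # assumed cluster size
MEM_BUDGET_GB = 80  # assumed per-GPU memory budget (e.g. 80 GB devices)

@dataclass(frozen=True)
class Strategy:
    dp: int   # data-parallel size
    tp: int   # tensor-parallel size
    pp: int   # pipeline-parallel size
    sp: int   # sequence-parallel size

def memory_gb(s: Strategy, params_b: float = 7.0, seq_len: int = 32768) -> float:
    """Toy memory model: sharded parameter/optimizer states plus activations."""
    states = params_b * 16 / (s.tp * s.pp * s.dp)                    # ~16 bytes/param, fully sharded
    activations = 0.004 * seq_len * params_b / (s.tp * s.sp * s.pp)  # rough activation estimate
    return states + activations

def comm_cost(s: Strategy) -> float:
    """Toy communication model: larger TP/SP groups are collective-heavy."""
    return 2.0 * (s.tp - 1) + 1.5 * (s.sp - 1) + 1.0 * (s.pp - 1) + 0.5 * (s.dp - 1)

def search():
    """Prune by memory first, then pick the cheapest-communication strategy."""
    best, best_cost = None, float("inf")
    sizes = [1, 2, 4, 8, 16]
    for dp, tp, pp, sp in product(sizes, repeat=4):
        s = Strategy(dp, tp, pp, sp)
        if dp * tp * pp * sp != GPUS:
            continue                      # must use exactly all GPUs
        if memory_gb(s) > MEM_BUDGET_GB:
            continue                      # prune strategies that do not fit
        if (c := comm_cost(s)) < best_cost:
            best, best_cost = s, c
    return best

if __name__ == "__main__":
    print("selected strategy:", search())
```

The design point mirrored here is that candidate configurations are first pruned by an estimated per-GPU memory footprint and only then ranked by estimated communication cost; a real system would replace both toy formulas with measured or analytically derived models.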
Authors: Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, Peng Sun