EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism (2312.04916v3)
Abstract: We present EE-LLM, a framework for large-scale training and inference of early-exit LLMs. While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM takes a foundational step toward scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting, including a lightweight method that facilitates backpropagation for the early-exit training objective under pipeline parallelism, techniques for leveraging idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches to early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves high training efficiency with negligible computational overhead compared to standard LLM training, as well as substantial inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.
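As a concrete illustration of the early-exit inference idea summarized above, the sketch below shows confidence-based early-exit decoding with a per-layer KV cache in PyTorch. It is a minimal toy, not EE-LLM's implementation: the model (`ToyEarlyExitLM`), its sizes, the 0.9 confidence threshold, and the way skipped layers' KV entries are filled (propagating the exit layer's hidden state) are all assumptions for illustration; EE-LLM's two KV-cache-compatible inference approaches are described in the paper itself.

```python
# Minimal PyTorch sketch of confidence-based early-exit decoding with a per-layer
# KV cache. Every name and hyperparameter here (ToyEarlyExitLM, d_model=32, the
# 0.9 confidence threshold) is an illustrative assumption, not EE-LLM code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDecoderLayer(nn.Module):
    """Single-head self-attention + MLP with an explicit KV cache."""

    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, cache):
        # x: [1, d_model] hidden state of the newest token; append its K/V to the cache.
        k, v = self.k_proj(x), self.v_proj(x)
        cache["k"] = k if cache["k"] is None else torch.cat([cache["k"], k], dim=0)
        cache["v"] = v if cache["v"] is None else torch.cat([cache["v"], v], dim=0)
        attn = F.softmax(self.q_proj(x) @ cache["k"].T / cache["k"].shape[-1] ** 0.5, dim=-1)
        x = x + attn @ cache["v"]
        return x + self.mlp(x)


class ToyEarlyExitLM(nn.Module):
    """Decoder stack with an output (exit) head attached to every layer."""

    def __init__(self, vocab=100, d_model=32, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(ToyDecoderLayer(d_model) for _ in range(n_layers))
        self.exit_heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_layers))

    def new_cache(self):
        return [{"k": None, "v": None} for _ in self.layers]


@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=8, threshold=0.9):
    cache = model.new_cache()
    tokens = list(prompt_ids)
    # Prefill all but the last prompt token at full depth so every layer's cache is
    # populated; the last prompt token is consumed by the first decoding step below.
    for t in prompt_ids[:-1]:
        h = model.embed(torch.tensor([t]))
        for layer, c in zip(model.layers, cache):
            h = layer(h, c)
    for _ in range(max_new_tokens):
        h = model.embed(torch.tensor([tokens[-1]]))
        for i, (layer, head, c) in enumerate(zip(model.layers, model.exit_heads, cache)):
            h = layer(h, c)
            probs = F.softmax(head(h), dim=-1)
            if probs.max().item() >= threshold or i == len(model.layers) - 1:
                # Accept the token at this exit. If layers were skipped, keep their
                # KV caches aligned by filling them from the current hidden state.
                # This is one simple option from the early-exit literature, not
                # necessarily either of EE-LLM's two inference approaches.
                for later_layer, later_c in zip(model.layers[i + 1:], cache[i + 1:]):
                    k, v = later_layer.k_proj(h), later_layer.v_proj(h)
                    later_c["k"] = k if later_c["k"] is None else torch.cat([later_c["k"], k], dim=0)
                    later_c["v"] = v if later_c["v"] is None else torch.cat([later_c["v"], v], dim=0)
                break
        tokens.append(int(probs.argmax()))
    return tokens


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyEarlyExitLM()
    print(generate(model, prompt_ids=[1, 2, 3]))  # untrained weights: output is arbitrary
```

The toy makes visible why KV caching complicates early exiting: once a token exits at layer i, the layers above i never process that token, so their caches must be filled in somehow (here, by reusing the exit layer's hidden state) before later tokens can attend to the full sequence at full depth.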
- Yanxi Chen
- Xuchen Pan
- Yaliang Li
- Bolin Ding
- Jingren Zhou