EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism (2312.04916v3)

Published 8 Dec 2023 in cs.LG, cs.AI, and cs.DC

Abstract: We present EE-LLM, a framework for large-scale training and inference of early-exit LLMs. While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM makes a foundational step towards scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting, including a lightweight method that facilitates backpropagation for the early-exit training objective with pipeline parallelism, techniques of leveraging idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches of early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead compared to standard LLM training, as well as outstanding inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.

Authors (5)
  1. Yanxi Chen (21 papers)
  2. Xuchen Pan (12 papers)
  3. Yaliang Li (117 papers)
  4. Bolin Ding (112 papers)
  5. Jingren Zhou (198 papers)
Citations (18)

Summary

An Overview of "EE-LLM: Large-Scale Training and Inference of Early-Exit LLMs with 3D Parallelism"

The paper "EE-LLM: Large-Scale Training and Inference of Early-Exit LLMs with 3D Parallelism" presents a sophisticated framework for advancing the training and inference of LLMs through early-exit paradigms. This research focuses on mitigating the computational and energy-intensive nature of LLMs by leveraging early-exit strategies, accelerating inference without compromising accuracy, and implementing 3D parallelism for large-scale deployment.

Main Contributions

The paper addresses the main obstacles to training and serving early-exit LLMs at scale through a series of algorithmic innovations and systems optimizations. These include:

  1. Backpropagation Through Pipeline Stages: The authors introduce a lightweight method to support backpropagation of the early-exit training objective under pipeline parallelism, where losses arise at multiple exits on different stages. This is essential because existing frameworks such as Megatron-LM compute a single loss at the final pipeline stage and do not natively support the cross-stage loss aggregation that early-exit models require (a minimal, non-parallel sketch of such a multi-exit objective appears after this list).
  2. Efficiency Optimizations: EE-LLM includes performance optimizations that minimize the computational overhead introduced by early-exit layers. In particular, it exploits idle resources created by bubbles in the original pipeline schedule for early-exit computation and rebalances the workload across pipeline stages, improving both training throughput and memory efficiency.
  3. Inference with KV Caching Compatibility: Two approaches are devised to resolve the conflict between early-exit inference and key-value (KV) caching, a technique that is essential for fast autoregressive generation. The first employs a form of pipeline parallelism that allows for concurrent token generation, and the second uses KV recomputation to maintain both speed and output consistency.
  4. Model Architecture Flexibility: EE-LLM offers flexibility in configuring early-exit layer structures and their distribution across the model pipeline, providing researchers the tools to balance complexity and computational savings.
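
The multi-exit training objective referenced in item 1 can be illustrated with a short, self-contained sketch: the total loss is a weighted sum of next-token losses from each early-exit head and from the final head. The names below (ToyEarlyExitLM, early_exit_loss, exit_weight) are hypothetical and chosen for illustration; EE-LLM implements this objective inside Megatron-LM with the exit heads spread across pipeline stages, which this single-process sketch does not attempt to reproduce.

# Minimal sketch of a multi-exit training objective: the total loss is a
# weighted sum of next-token cross-entropy losses from each early-exit head
# and from the final output head. All module and parameter names here are
# illustrative; this is not the EE-LLM/Megatron-LM API, and pipeline
# parallelism, causal masking, and the input/target shift are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEarlyExitLM(nn.Module):
    def __init__(self, vocab=1000, d_model=128, n_layers=6, exit_layers=(2, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        self.exit_layers = set(exit_layers)
        # One LM head per early exit, plus the final head.
        self.exit_heads = nn.ModuleDict(
            {str(i): nn.Linear(d_model, vocab) for i in exit_layers}
        )
        self.final_head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        h = self.embed(tokens)
        exit_logits = {}
        for i, block in enumerate(self.blocks):
            h = block(h)
            if i in self.exit_layers:
                exit_logits[i] = self.exit_heads[str(i)](h)
        return exit_logits, self.final_head(h)

def early_exit_loss(exit_logits, final_logits, targets, exit_weight=0.3):
    # Weighted sum: final loss + exit_weight * each early-exit loss.
    ce = lambda logits: F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    loss = ce(final_logits)
    for logits in exit_logits.values():
        loss = loss + exit_weight * ce(logits)
    return loss

# One toy training step.
model = ToyEarlyExitLM()
tokens = torch.randint(0, 1000, (2, 16))    # (batch, sequence length)
targets = torch.randint(0, 1000, (2, 16))   # next-token targets
exit_logits, final_logits = model(tokens)
early_exit_loss(exit_logits, final_logits, targets).backward()

In EE-LLM the analogous losses are produced on different pipeline stages, so the framework's contribution lies in propagating and aggregating their gradients across stages rather than within a single module as above.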

Analytical and Empirical Insights

The paper supports its efficiency claims with both analytical and empirical studies. Empirical results confirm that training time increases only marginally when early-exit layers are added, and peak memory usage can remain essentially unchanged when those layers are assigned to well-chosen pipeline stages. This matters because it means early-exit models can be scaled to sizes comparable to conventional LLMs within the same computing budget.

For inference, the pipeline-based method ensures that the benefits of early exiting are realized without stalling the generation of future tokens on missing KV-cache entries. The analysis shows that this method delivers substantial speedups in sequence generation with minimal impact on output quality.
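
To make the interaction between early exiting and KV caching concrete, the following is a minimal, single-process sketch of confidence-thresholded decoding in which an early-exited token's missing cache entries are back-filled so that later full-depth steps remain valid. Everything in it (the block structure, the 0.9 threshold, the names decode_next and run_block) is an illustrative assumption; EE-LLM's actual solutions are the pipeline-parallel scheme and deferred, batched KV recomputation described above, neither of which this sketch reproduces in full.

# Sketch of confidence-thresholded early-exit decoding with cache back-fill.
# Toy linear blocks stand in for transformer layers, and a per-layer list of
# past hidden states stands in for the KV cache. Names, the threshold, and
# the eager back-fill are illustrative simplifications: EE-LLM instead uses
# pipeline parallelism or deferred, batched KV recomputation.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, D, N_LAYERS, EXIT_AT, THRESHOLD = 50, 32, 6, 2, 0.9

embed = nn.Embedding(VOCAB, D)
blocks = nn.ModuleList([nn.Linear(D, D) for _ in range(N_LAYERS)])
exit_head = nn.Linear(D, VOCAB)    # head attached after block EXIT_AT
final_head = nn.Linear(D, VOCAB)
cache = [[] for _ in range(N_LAYERS)]  # per-layer stand-in for a KV cache

def run_block(i, h):
    # A real block would attend over cache[i]; here we only record the state.
    cache[i].append(h)
    return torch.tanh(blocks[i](h))

@torch.no_grad()
def decode_next(token_id):
    h = embed(torch.tensor([token_id]))
    for i in range(EXIT_AT + 1):
        h = run_block(i, h)
    probs = torch.softmax(exit_head(h), dim=-1)
    conf, tok = probs.max(dim=-1)
    if conf.item() >= THRESHOLD:
        # Early exit: emit the token now, but the deeper layers have no cache
        # entry for this position. Back-fill them so later full-depth steps
        # stay consistent (EE-LLM defers and batches this recomputation).
        deep = h
        for i in range(EXIT_AT + 1, N_LAYERS):
            deep = run_block(i, deep)
        return tok.item(), "early"
    for i in range(EXIT_AT + 1, N_LAYERS):
        h = run_block(i, h)
    return final_head(h).argmax(dim=-1).item(), "full"

token, route = 3, None
for _ in range(5):
    token, route = decode_next(token)
    print(token, route)

The point the sketch highlights is that every generated token must eventually have cache entries at all depths, regardless of where it exited; EE-LLM's two methods differ mainly in when and on which pipeline stage that work is performed.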

Implications and Future Directions

EE-LLM extends the scope of 3D parallelism by incorporating early-exit logic into both the training and deployment of LLMs. The implication is that early-exit LLMs, previously studied only at modest scale, can now be trained and served at sizes comparable to conventional models.

The theoretical and practical contributions provide a foundation for further work on early-exit training and inference, such as tuning the number, placement, and structure of exit layers. The framework's support for flexible early-exit configurations also opens avenues for adaptive computation at inference time, where the depth used per token varies with input difficulty. Moreover, future work could integrate early exiting with other conditional-computation strategies, such as sparse mixtures of experts, to compound the efficiency gains.

The EE-LLM framework closes a gap in LLM scaling by combining the benefits of early exiting with 3D parallelism, and its open-source release lowers the barrier to broader adoption of early-exit strategies. Together, these methods could meaningfully reduce the cost of training and deploying large-scale LLMs across applications.