DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation (2402.17812v2)
Abstract: Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and activation memory. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs and activation memory while maintaining accuracy. DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules composed of the undropped layers and residual connections. Additionally, DropBP calculates the sensitivity of each layer to assign an appropriate drop rate, thereby stabilizing the training process. DropBP is not only applicable to full fine-tuning but can also be integrated orthogonally with all types of PEFT by dropping layers during backward propagation. Specifically, DropBP reduces training time by 44% with accuracy comparable to the baseline, accelerates convergence to the same perplexity by 1.5x, and enables training with a 6.2x longer sequence length on a single NVIDIA A100 GPU. Furthermore, DropBP increases throughput by 79% on an NVIDIA A100 GPU and by 117% on an Intel Gaudi2 HPU. The code is available at https://github.com/WooSunghyeon/dropbp.
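To make the core idea concrete, below is a minimal PyTorch sketch of dropping backward propagation through residual blocks: the forward pass is computed as usual, but a randomly selected block's output is detached so no gradients flow through it, leaving only the residual path. This is an illustration under stated assumptions, not the API of the released dropbp library; the `DropBPBlock` wrapper, the toy `sensitivities` values, and the drop-rate allocation heuristic are hypothetical stand-ins for the sensitivity-based scheme described in the paper.

```python
import torch
import torch.nn as nn


class DropBPBlock(nn.Module):
    """Wraps a residual block so its backward pass can be randomly skipped.

    When the block is "dropped", its forward output is unchanged, but the
    output is detached, so gradients flow only through the residual
    connection and the block's parameters receive no gradient.
    """

    def __init__(self, block: nn.Module, p: float = 0.5):
        super().__init__()
        self.block = block
        self.p = p  # probability of skipping this block's backward pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        if self.training and torch.rand(()) < self.p:
            out = out.detach()  # drop backward propagation through this block
        return x + out  # residual connection always carries gradients


# Toy usage: assign higher drop rates to less "sensitive" layers.
# The sensitivity values here are made up; the paper derives them from
# gradient statistics collected during training.
sensitivities = [1.0, 0.3, 0.2, 0.1]   # hypothetical per-layer sensitivities
target_mean_p = 0.5                     # desired average drop rate
inv = torch.tensor([1.0 / s for s in sensitivities])
drop_rates = (inv / inv.sum() * target_mean_p * len(sensitivities)).clamp(max=0.9)

layers = nn.Sequential(*[
    DropBPBlock(
        nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 16)),
        p=float(p),
    )
    for p in drop_rates
])

x = torch.randn(4, 16, requires_grad=True)
layers(x).sum().backward()  # dropped blocks contribute no parameter gradients
```

Because dropped blocks still run forward, the model's outputs and activations for undropped layers are computed exactly; only the backward FLOPs and the activation memory needed for the dropped blocks' gradients are saved, which is where the reported speedups come from.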