
BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models (2404.02827v3)

Published 3 Apr 2024 in cs.LG

Abstract: This work presents BAdam, an optimization method that leverages the block coordinate descent (BCD) framework with Adam's update rule. BAdam offers a memory efficient approach to the full parameter finetuning of LLMs. We conduct a theoretical convergence analysis for BAdam in the deterministic case. Experimentally, we apply BAdam to finetune the Llama 3-8B and Llama 3-70B models using a single RTX3090-24GB GPU and 4 A100-80GB GPUs, respectively. The results confirm BAdam's efficiency in terms of memory usage, running time, and optimization capability. Furthermore, the downstream performance evaluation based on MT-bench and math benchmarks shows that BAdam outperforms existing memory efficient baselines such as LoRA. It also demonstrates that BAdam can achieve comparable or even superior performance compared to Adam. Finally, the ablation study using SGD's update rule illustrates the suitability of BCD for finetuning LLMs. Our code can be easily integrated into any PyTorch-based codebase and is available at https://github.com/Ledzy/BAdam.
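To make the core idea concrete, below is a minimal sketch of block coordinate descent combined with Adam's update rule, in the spirit of BAdam but not the authors' implementation (see the linked repository for that). It assumes a generic PyTorch model whose parameter blocks are taken to be its top-level child modules; the block ordering, `steps_per_block`, and learning rate are illustrative choices. The memory saving comes from holding gradients and Adam states only for the single active block.

```python
# Sketch of block-coordinate finetuning with Adam: only one parameter block
# is trainable at a time, so optimizer state exists for that block alone.
# Hyperparameters and the block partition are illustrative assumptions.
import torch
from torch import nn


def block_coordinate_adam(model: nn.Module, data_loader, loss_fn,
                          steps_per_block: int = 50, lr: float = 1e-5,
                          num_block_epochs: int = 1):
    # Treat each immediate child module (e.g., a transformer layer) as one block.
    blocks = [list(child.parameters()) for child in model.children()
              if any(p.requires_grad for p in child.parameters())]

    data_iter = iter(data_loader)
    for _ in range(num_block_epochs):
        for block in blocks:
            # Freeze all parameters, then activate only the current block so
            # gradients and Adam moments are allocated for this block alone.
            for p in model.parameters():
                p.requires_grad_(False)
            for p in block:
                p.requires_grad_(True)

            # A fresh Adam instance holds state only for the active block.
            optimizer = torch.optim.Adam(block, lr=lr)

            for _ in range(steps_per_block):
                try:
                    inputs, targets = next(data_iter)
                except StopIteration:
                    data_iter = iter(data_loader)
                    inputs, targets = next(data_iter)
                optimizer.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                optimizer.step()

            # Discard the block's optimizer state before moving to the next block.
            del optimizer
```

In this sketch the cost of storing Adam's first and second moments scales with the size of one block rather than the full model, which is the source of the memory savings the abstract describes.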
