Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning (2402.15751v1)

Published 24 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: While fine-tuning LLMs for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, only require forward passes during training, making them more memory-friendly. However, the quality of gradient estimates in zeroth-order optimization often depends on the data dimensionality, potentially explaining why MeZO still exhibits significant performance drops compared to standard fine-tuning across various tasks. Inspired by the success of Parameter-Efficient Fine-Tuning (PEFT), this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet effective parameter selection scheme that yields significant performance gains with Sparse MeZO. Additionally, we develop a memory-optimized implementation for sparse masking, ensuring the algorithm requires only inference-level memory consumption, allowing Sparse MeZO to fine-tune LLaMA-30b on a single A100 GPU. Experimental results illustrate that Sparse MeZO consistently improves both performance and convergence speed over MeZO without any overhead. For example, it achieves a 9% absolute accuracy improvement and 3.5x speedup over MeZO on the RTE task.
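The abstract describes a two-forward-pass zeroth-order update applied only to a masked subset of parameters. Below is a minimal sketch of that idea, assuming a PyTorch model and a user-supplied `loss_fn(model, batch)` that returns a scalar loss. The mask rule shown here (a per-tensor magnitude threshold built by `build_masks`) is a placeholder assumption for illustration, not the paper's actual parameter selection scheme, and the function names are hypothetical.

```python
# Sketch of a sparse, SPSA-style zeroth-order step (two forward passes, no backward).
import torch


def build_masks(model, sparsity=0.75):
    """Placeholder mask: mark the (1 - sparsity) fraction of smallest-magnitude
    entries in each parameter tensor as the trainable subset."""
    masks = {}
    for name, p in model.named_parameters():
        k = max(1, int((1.0 - sparsity) * p.numel()))
        thresh = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() <= thresh).float()
    return masks


@torch.no_grad()
def sparse_zo_step(model, batch, loss_fn, masks, lr=1e-6, eps=1e-3, seed=0):
    """One sparse zeroth-order update: perturb only masked coordinates by +/- eps*z,
    estimate the directional derivative from the loss difference, then step."""

    def perturb(scale):
        # Regenerate the same z from the seed each time, so z never has to be stored.
        gen = torch.Generator(device="cpu").manual_seed(seed)
        for name, p in model.named_parameters():
            z = torch.randn(p.shape, generator=gen).to(p.device)
            p.add_(scale * eps * z * masks[name])

    perturb(+1.0)                               # theta + eps * z (masked)
    loss_plus = loss_fn(model, batch).item()
    perturb(-2.0)                               # theta - eps * z (masked)
    loss_minus = loss_fn(model, batch).item()
    perturb(+1.0)                               # restore original parameters

    grad_proj = (loss_plus - loss_minus) / (2.0 * eps)

    # Replay the same seed so the update direction matches the perturbation.
    gen = torch.Generator(device="cpu").manual_seed(seed)
    for name, p in model.named_parameters():
        z = torch.randn(p.shape, generator=gen).to(p.device)
        p.sub_(lr * grad_proj * z * masks[name])
    return (loss_plus + loss_minus) / 2.0
```

Because the perturbation is reconstructed from a seed and restricted to the mask, the step keeps memory close to inference cost, which is the property the abstract highlights for fine-tuning LLaMA-30b on a single A100.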

Authors (6)
  1. Yong Liu (721 papers)
  2. Zirui Zhu (6 papers)
  3. Chaoyu Gong (5 papers)
  4. Minhao Cheng (43 papers)
  5. Cho-Jui Hsieh (211 papers)
  6. Yang You (173 papers)
Citations (5)