On the Crucial Role of Initialization for Matrix Factorization

Published 24 Oct 2024 in cs.LG, eess.SP, and math.OC (arXiv:2410.18965v3)

Abstract: This work revisits the classical low-rank matrix factorization problem and unveils the critical role of initialization in shaping convergence rates for such nonconvex and nonsmooth optimization. We introduce Nyström initialization, which significantly improves the global convergence of Scaled Gradient Descent (ScaledGD) in both symmetric and asymmetric matrix factorization tasks. Specifically, we prove that ScaledGD with Nyström initialization achieves quadratic convergence in cases where only linear rates were previously known. Furthermore, we extend this initialization to low-rank adapters (LoRA) commonly used for finetuning foundation models. Our approach, NoRA, i.e., LoRA with Nyström initialization, demonstrates superior performance across various downstream tasks and model scales, from 1B to 7B parameters, in large language and diffusion models.
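As a concrete illustration of the setup described in the abstract, the following is a minimal Python sketch of ScaledGD for symmetric low-rank factorization with a Nyström-style sketched initialization. It is a toy under stated assumptions, not the paper's implementation: the initialization X0 = A @ Omega with a Gaussian sketch Omega, the step size, and the helper names nystrom_init and scaled_gd are all illustrative, and the NoRA variant for LoRA adapters is not reproduced here.

import numpy as np

def nystrom_init(A, r, seed=None):
    # Assumed sketched initialization: X0 = A @ Omega, with Omega an n x r Gaussian sketch.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.standard_normal((n, r)) / np.sqrt(n)
    return A @ Omega

def scaled_gd(A, r, eta=0.5, iters=50, seed=None):
    # ScaledGD on f(X) = 0.25 * ||X X^T - A||_F^2:
    #   X <- X - eta * grad(X) @ (X^T X)^{-1}, where the right preconditioner is the "scaling".
    X = nystrom_init(A, r, seed)
    for _ in range(iters):
        grad = (X @ X.T - A) @ X                          # gradient of 0.25 * ||X X^T - A||_F^2
        X = X - eta * np.linalg.solve(X.T @ X, grad.T).T  # preconditioned (scaled) step
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = rng.standard_normal((200, 5))
    A = U @ U.T                                           # rank-5 PSD target
    X = scaled_gd(A, r=5, seed=1)
    print("relative error:", np.linalg.norm(X @ X.T - A) / np.linalg.norm(A))

In this sketch, the (X^T X)^{-1} preconditioner is what distinguishes ScaledGD from plain gradient descent, and the sketched start places X0 in the column space of A rather than at an arbitrary random point; the paper's analysis couples these two ingredients to obtain the improved convergence rates.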

