
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (2404.02948v3)

Published 3 Apr 2024 in cs.LG and cs.AI

Abstract: To parameter-efficiently fine-tune (PEFT) LLMs, the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA freezes the original model $W$ and updates the "Noise & Zero" adapter, which may lead to slow convergence. To overcome this limitation, we introduce Principal Singular values and Singular vectors Adaptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adapter matrices $A$ and $B$ with the principal components of the original matrix $W$, and puts the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$ which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 12 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups. On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16%. Due to the same architecture, PiSSA is also compatible with quantization to further reduce the memory requirement of fine-tuning. Compared to QLoRA, QPiSSA (PiSSA with 4-bit quantization) exhibits smaller quantization errors in the initial stages. Fine-tuning LLaMA-3-70B on GSM8K, QPiSSA attains an accuracy of 86.05%, exceeding the performance of QLoRA at 81.73%. Leveraging a fast SVD technique, PiSSA can be initialized in only a few seconds, presenting a negligible cost for transitioning from LoRA to PiSSA.

PiSSA: Enhancing LLMs via Principal Singular values and Singular vectors Adaptation

Introduction to PiSSA

Recent advances in LLMs, notably their efficacy across diverse tasks, have driven growing interest in fine-tuning methodologies. Given the prohibitive computational cost of full-parameter fine-tuning for LLMs, parameter-efficient fine-tuning (PEFT) methods have emerged. Among these, Principal Singular values and Singular vectors Adaptation (PiSSA) is introduced as a novel technique. PiSSA leverages the low intrinsic dimensionality of pretrained LLMs, optimizing a much smaller parameter space while achieving or even surpassing full-parameter fine-tuning performance at significantly lower computational overhead. This is achieved primarily by initializing two trainable matrices, $A$ and $B$, with the principal singular values and singular vectors of each weight matrix $W$, while the remaining components are kept in a frozen residual matrix.

Theoretical Foundations and Related Works

PiSSA is grounded in the hypothesis, shared with work on intrinsic dimensionality and Low-Rank Adaptation (LoRA), that changes in model parameters during fine-tuning exhibit low-rank structure. Whereas LoRA approximates the change in $W$ with randomly initialized adapters, PiSSA initializes its adapters from the principal components of $W$ obtained via singular value decomposition (SVD). Tuning the essential parts of $W$ while freezing the remaining, less informative components yields a quicker and closer approximation of full-parameter fine-tuning than conventional PEFT initialization.
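
Concretely, the two methods share the same adapter architecture and differ only in how the factors are initialized. Writing the rank-$r$ truncated SVD of a pretrained weight as $U_{[:, :r]}\, S_{[:r]}\, V_{[:, :r]}^{\top}$, the setup can be summarized as follows (a paraphrase of the paper's formulation, with the same shapes as in the abstract):

$$\text{LoRA:}\quad W + \Delta W = W + AB, \qquad A \sim \mathcal{N}(0, \sigma^2),\quad B = 0,\quad W \text{ frozen};$$

$$\text{PiSSA:}\quad W = W^{res} + AB, \qquad A = U_{[:, :r]}\, S_{[:r]}^{1/2},\quad B = S_{[:r]}^{1/2}\, V_{[:, :r]}^{\top},\quad W^{res} = W - AB \text{ frozen}.$$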

Methodology

PiSSA's methodological framework decomposes each pretrained weight matrix with SVD to extract its principal singular values and singular vectors. These initialize the trainable matrices $A$ and $B$, which, together with the frozen residual matrix $W^{res}$, reconstruct the original matrix $W$ at initialization while greatly reducing the number of trainable parameters (a minimal initialization sketch follows the list below).

  • The decomposition separates the essential components (captured by $A$ and $B$) from the residual ones (kept in $W^{res}$), focusing fine-tuning on the model's intrinsic, low-dimensional structure.
  • In practice, PiSSA converges faster and performs better than methods like LoRA because the trainable matrices already encapsulate the model's principal capabilities at initialization.
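
The initialization itself amounts to a single SVD per weight matrix. Below is a minimal sketch in PyTorch: `pissa_init` is a hypothetical helper (not the official implementation), and it uses an exact SVD for clarity, whereas the paper notes that a fast randomized SVD can reduce initialization to a few seconds.

```python
import torch

def pissa_init(W: torch.Tensor, r: int):
    """Hypothetical PiSSA-style initialization sketch (not the official code).

    W: pretrained weight matrix of shape (m, n).
    Returns trainable A (m, r), B (r, n) and a frozen residual W_res (m, n)
    such that W = W_res + A @ B holds exactly at initialization.
    """
    # Thin SVD of the pretrained weight: W = U @ diag(S) @ Vh
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Principal (top-r) singular values/vectors initialize the trainable adapter.
    sqrt_S = torch.sqrt(S[:r])
    A = U[:, :r] * sqrt_S              # shape (m, r)
    B = sqrt_S[:, None] * Vh[:r, :]    # shape (r, n)

    # The remaining components form the frozen residual matrix.
    W_res = W - A @ B
    return A, B, W_res

# Example: rank-16 decomposition of a toy 512 x 512 projection
W = torch.randn(512, 512)
A, B, W_res = pissa_init(W, r=16)
print(torch.allclose(W, W_res + A @ B, atol=1e-4))  # reconstruction check
```

In a training loop, $A$ and $B$ would be registered as trainable parameters while $W^{res}$ is stored as a frozen buffer, mirroring how LoRA adapters are trained.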

Experimental Validation

Extensive experiments, spanning models from 184M to 70B parameters and covering both NLG and NLU tasks, demonstrate that PiSSA not only converges faster than LoRA but also closely approximates full fine-tuning performance with considerably fewer trainable parameters.

  • PiSSA consistently outperforms LoRA across benchmarks and models under identical setups; for example, Mistral-7B fine-tuned with PiSSA reaches 72.86% accuracy on GSM8K, compared with 67.7% for LoRA.
  • The experiments indicate that PiSSA retains LoRA's advantages, including its architecture and parameter efficiency, while addressing its slow convergence by concentrating fine-tuning on the model's principal components.

Practical Implications and Future Outlook

The PiSSA methodology inherits the operational benefits of LoRA, including parameter efficiency and compatibility with model quantization, while changing only how the adapter is initialized. Because the adapter absorbs the principal components, QPiSSA can quantize the frozen residual with smaller initial error than quantizing the full weight matrix (see the sketch below), and the initialization strategy promises broad applicability when adapting LLMs to specific domains or requirements.
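
A rough numerical illustration of the QPiSSA intuition, under loose assumptions: `fake_quant` below is a toy per-row round-to-nearest quantizer (the paper relies on QLoRA-style NF4 quantization, not this scheme), applied to a random stand-in matrix rather than a real pretrained weight. It compares the error of quantizing the full matrix against quantizing only the residual that remains after removing the top singular components.

```python
import torch

def fake_quant(W: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Toy per-row symmetric round-to-nearest quantizer (illustration only;
    QPiSSA in the paper uses NF4/QLoRA-style quantization instead)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax
    return (W / scale).round().clamp(-qmax, qmax) * scale

# Random stand-in for a pretrained weight; a real checkpoint would be used in practice.
W = torch.randn(1024, 1024)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r = 64
W_res = W - (U[:, :r] * S[:r]) @ Vh[:r, :]   # residual after removing the top-r components

# QPiSSA idea: only the frozen residual is stored in 4 bits; the adapter (A, B)
# stays in higher precision and receives all gradient updates.
err_full = (W - fake_quant(W)).norm().item()          # error if the full weight is quantized (as in QLoRA)
err_res = (W_res - fake_quant(W_res)).norm().item()   # error if only the residual is quantized (as in QPiSSA)
print(f"quantization error on W: {err_full:.2f}, on W_res: {err_res:.2f}")
```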

  • The compatibility of PiSSA with existing LLM architectures and its methodological benefits suggest a promising direction for future research in PEFT, including exploring the application of PiSSA across an even broader range of models and tasks.
  • Potential future developments might focus on the integration of PiSSA with advanced model compression techniques or exploring theoretical frameworks to further elucidate the mechanisms behind its efficiency and effectiveness.

In conclusion, PiSSA presents a significant advancement in the fine-tuning of LLMs, offering a practical, efficient, and effective method for leveraging the intrinsic structural properties of pretrained models to achieve superior performance across a range of tasks. Its methodological nuances and experimental successes highlight its potential as a cornerstone in the ongoing development of PEFT techniques for LLMs.

Authors: Fanxu Meng, Zhaohui Wang, Muhan Zhang