
OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning (2405.18380v2)

Published 28 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The rapid advancements in LLMs have revolutionized various natural language processing tasks. However, the substantial size of LLMs presents significant challenges in training or fine-tuning. While parameter-efficient approaches such as low-rank adaptation (LoRA) have gained popularity, they often compromise performance compared to full-rank fine-tuning. In this paper, we propose Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore), a new memory-efficient fine-tuning approach, inspired by the layerwise outlier distribution of LLMs. Unlike LoRA, which adds extra adapters to all layers, OwLore strategically assigns higher sampling probabilities to layers with more outliers, selectively sampling only a few layers and fine-tuning their pre-trained weights. To further increase the number of fine-tuned layers without a proportional rise in memory costs, we incorporate gradient low-rank projection, further boosting the approach's performance. Our extensive experiments across various architectures, including LLaMa2, LLaMa3, and Mistral, demonstrate that OwLore consistently outperforms baseline approaches, including full fine-tuning. Specifically, it achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark, a 3.0% improvement on MMLU, and a notable 10% boost on MT-Bench, while being more memory efficient. OwLore allows us to fine-tune LLaMa2-7B with only 21GB of memory. Code is available at https://github.com/pixeli99/OwLore.

A Comprehensive Analysis of Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore) for LLM Fine-tuning

The substantial capabilities of LLMs have propelled advancements in various NLP tasks. However, the significant size of these models poses considerable challenges in terms of training and fine-tuning, especially regarding memory efficiency. This paper introduces Outlier-weighed Layerwise Sampled Low-Rank Projection (OwLore), a novel and memory-efficient fine-tuning approach that combines insights from Heavy-Tailed Self-Regularization (HT-SR) theory with layerwise outlier distribution for optimal layer sampling and low-rank training.

Key Contributions

This work makes several important contributions to the field of LLM fine-tuning:

  1. Outlier Distribution and Heavy-Tailed Self-Regularization Theory: The authors interpret the layerwise outlier distribution of LLMs through the lens of HT-SR theory, revealing that layers with more outliers exhibit a more heavy-tailed empirical spectral density (ESD) and are therefore better trained. This observation forms the basis for their layerwise sampling strategy; an illustrative sketch of the ESD diagnostic follows this list.
  2. Outlier-weighed Sampling: Inspired by the non-uniform distribution of outliers, OwLore's sampling strategy assigns higher probabilities to layers with more outliers. This principle efficiently utilizes the well-trained layers in pre-trained LLMs, improving the performance of sampling-based fine-tuning methods.
  3. Gradient Low-Rank Projection: To address the memory demands of full-rank training, OwLore integrates gradient low-rank projection. This allows each layer to be efficiently trained within a low-rank subspace, thus mitigating memory costs without compromising performance.
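
To make the HT-SR diagnostic concrete, here is a minimal sketch (not the authors' code) of how one might compute a layer's empirical spectral density and a Hill-type estimate of its tail exponent; a smaller estimate indicates a heavier tail. The function name, the choice of `k_frac`, and the toy matrices are illustrative assumptions.

```python
import torch

def esd_tail_index(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    """Hill-type tail-index estimate of a layer's empirical spectral density (ESD).

    The ESD is the distribution of eigenvalues of W^T W, i.e. the squared
    singular values of W. A smaller returned value means a heavier tail.
    """
    # Squared singular values = eigenvalues of W^T W.
    eigs = torch.linalg.svdvals(weight.float()) ** 2
    eigs, _ = torch.sort(eigs, descending=True)

    # Use the top-k eigenvalues (the tail of the ESD) for the Hill estimator.
    k = max(2, int(k_frac * eigs.numel()))
    tail, threshold = eigs[:k], eigs[k]

    # Hill estimator: k divided by the summed log-ratios of the tail to the threshold.
    alpha = k / torch.log(tail / threshold).sum()
    return alpha.item()

# Toy comparison: a Gaussian matrix (light tail) vs. a Student-t matrix (heavy tail).
torch.manual_seed(0)
light = torch.randn(512, 512)
heavy = torch.distributions.StudentT(df=3.0).sample((512, 512))
print(esd_tail_index(light), esd_tail_index(heavy))  # heavier tail -> smaller estimate
```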

Methodology

OwLore innovates by combining two primary strategies: outlier-weighed sampling and low-rank gradient updates.

  • Outlier-weighed Sampling:

The authors compute the Layerwise Outlier Distribution (LOD) and allocate sampling probabilities proportional to the density of outliers in each layer. This approach creates a "rich-get-richer" phenomenon, where well-trained layers are sampled and fine-tuned more frequently.
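
As an illustration of how such a distribution can drive layer sampling, the sketch below assumes an OWL-style outlier criterion (a weight counts as an outlier when its magnitude-times-activation score exceeds a multiple `m` of the layer mean) and normalizes the per-layer outlier ratios into sampling probabilities. The threshold `m`, the scoring rule, and the number of sampled layers are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def layerwise_outlier_distribution(weights, act_norms, m=5.0):
    """Per-layer outlier ratio (LOD), assuming an OWL-style criterion:
    a weight is an 'outlier' if |W_ij| * ||x_j|| exceeds m times the layer mean."""
    lod = []
    for w, a in zip(weights, act_norms):
        score = w.abs() * a                                  # Wanda-style importance score
        ratio = (score > m * score.mean()).float().mean()    # fraction of outlier weights
        lod.append(ratio.item())
    return lod

def sampling_probs(lod):
    """Normalize outlier ratios into layer-sampling probabilities."""
    t = torch.tensor(lod) + 1e-6      # small floor keeps every layer sampleable
    return t / t.sum()

# Toy example: 4 layers with random weights and activation norms (hypothetical shapes).
torch.manual_seed(0)
weights = [torch.randn(64, 64) for _ in range(4)]
act_norms = [torch.rand(64) for _ in range(4)]               # per-input-channel ||x_j||_2
probs = sampling_probs(layerwise_outlier_distribution(weights, act_norms))

# Each fine-tuning iteration, sample a small subset of layers to unfreeze and train.
sampled = torch.multinomial(probs, num_samples=2, replacement=False)
print(probs, sampled)
```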

  • Low-Rank Gradient Updates:

By adopting the GaLore method, OwLore projects gradients into a low-rank subspace, significantly reducing memory overhead. The optimizer states are updated within this subspace, with the gradient subspace being refreshed periodically to capture dynamic changes during training.
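
The sketch below illustrates the general idea of such a projection, not the authors' implementation: each sampled layer's gradient is projected onto a rank-r subspace obtained from its SVD, Adam-style moments are maintained in that subspace, and the projector is refreshed every `update_gap` steps. All hyperparameters and class names are illustrative.

```python
import torch

class LowRankGradState:
    """Adam-style optimizer state kept in a rank-r gradient subspace (GaLore-style sketch)."""

    def __init__(self, rank=8, update_gap=200, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
        self.rank, self.update_gap = rank, update_gap
        self.lr, self.betas, self.eps = lr, betas, eps
        self.P = None           # left singular vectors spanning the gradient subspace
        self.m = self.v = None  # first/second moments, stored in the low-rank subspace
        self.step = self.t = 0

    @torch.no_grad()
    def update(self, weight: torch.Tensor, grad: torch.Tensor):
        # Periodically refresh the projector from the current gradient's top-r SVD,
        # resetting the subspace moments along with it.
        if self.P is None or self.step % self.update_gap == 0:
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            self.P = U[:, : self.rank]                       # (out, r)
            self.m = torch.zeros(self.rank, grad.shape[1])
            self.v = torch.zeros(self.rank, grad.shape[1])
            self.t = 0
        self.step += 1
        self.t += 1

        g = self.P.T @ grad                                  # project the gradient: (r, in)
        b1, b2 = self.betas
        self.m = b1 * self.m + (1 - b1) * g                  # Adam moments live in the subspace
        self.v = b2 * self.v + (1 - b2) * g * g
        m_hat = self.m / (1 - b1 ** self.t)
        v_hat = self.v / (1 - b2 ** self.t)

        # Map the low-rank step back to the full weight shape and apply it.
        weight -= self.lr * (self.P @ (m_hat / (v_hat.sqrt() + self.eps)))

# Toy usage on one sampled layer (shapes are illustrative).
torch.manual_seed(0)
w = torch.randn(128, 64)
state = LowRankGradState(rank=4)
for _ in range(3):
    grad = torch.randn_like(w)      # stand-in for a real backward pass
    state.update(w, grad)
```

Because the optimizer moments are stored at rank r rather than at the full weight dimensions, their memory cost shrinks accordingly, which is what allows more layers to be fine-tuned per iteration without a proportional rise in memory.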

Experimental Results

The empirical evaluation of OwLore demonstrates its robustness and efficiency across multiple architectures and benchmarks, including LLaMa2, LLaMa3, and Mistral. Noteworthy results include:

  • Commonsense Reasoning:

OwLore achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark and consistently outperforms other fine-tuning approaches, including full fine-tuning.

  • MT-Bench:

OwLore records a 10% improvement in the MT-Bench evaluation, particularly excelling in multi-turn question-answering and instruction-following tasks.

  • MMLU:

OwLore achieves a 3.0% improvement on the MMLU benchmark, highlighting its robustness across diverse knowledge domains.

Additionally, OwLore allows fine-tuning LLaMa2-7B with only 21GB of memory, significantly lower than other methods.

Implications and Future Work

The introduction of OwLore advances the field of LLM fine-tuning by offering a method that balances performance and memory efficiency. Theoretically, it builds on HT-SR theory to provide a principled approach to layerwise sampling. Practically, its memory-efficient design makes it suitable for deploying large-scale LLMs in resource-constrained environments.

Future developments could explore further optimization of the low-rank subspace updating mechanisms and their impacts on training dynamics. Additionally, extending OwLore's principles to other domains such as computer vision and multi-modal models could prove beneficial, given the increasing prevalence of large, multi-task models in these fields.

In summary, OwLore represents a significant step forward in parameter-efficient fine-tuning of LLMs, setting a new benchmark in both memory usage and model performance. The insights derived from its development offer a fertile ground for future research aiming to optimize the fine-tuning process of large-scale neural networks.

Authors (4)
  1. Pengxiang Li (24 papers)
  2. Lu Yin (85 papers)
  3. Xiaowei Gao (10 papers)
  4. Shiwei Liu (75 papers)
Citations (3)