VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections (2405.17991v2)

Published 28 May 2024 in cs.CV and cs.AI

Abstract: LLMs have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.

Authors (4)
  1. Roy Miles (9 papers)
  2. Pradyumna Reddy (11 papers)
  3. Ismail Elezi (28 papers)
  4. Jiankang Deng (96 papers)
Citations (2)

Summary

Vector Projected LoRA (VeLoRA): A Novel Approach for Efficient Training of LLMs

Introduction

The exponential growth in the size of LLMs poses significant challenges in computational expense and memory consumption during training. Recent advances in natural language processing have showcased the potential of LLMs, but their practical use is often bottlenecked by the substantial resources required to store intermediate activations and compute gradients. Several techniques, such as GaLore, gradient checkpointing, and activation offloading, have been developed to mitigate these memory constraints; however, they introduce notable computational overhead, offer only limited memory savings, or require specialized hardware.

Motivation and Objective

Given the primary role of compute power in advancing machine learning, and the expectation that LLM sizes will continue to grow, developing methods that are both efficient and scalable is imperative. This paper introduces Vector Projected LoRA (VeLoRA), a novel approach designed to address the memory consumption issue without compromising model performance. VeLoRA exploits the observation that intermediate activations can be effectively compressed and reconstructed using a fixed one-dimensional projection vector, significantly reducing the memory required for backward propagation.
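To make this idea concrete, here is a rough formalisation (the notation is ours, not taken from the paper): each sub-token x, a vector of length s, is stored during the forward pass only as its scalar projection onto a fixed unit vector v, and is coarsely reconstructed when gradients are computed.

```latex
c = v^{\top} x, \qquad \hat{x} = c\,v, \qquad v \in \mathbb{R}^{s},\ \lVert v \rVert = 1
```

Under this scheme, the memory needed per sub-token for backpropagation drops from s values to a single scalar.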

Proposed Method: VeLoRA

VeLoRA achieves memory efficiency by projecting intermediate activations onto a lower-dimensional subspace during the forward pass and reconstructing them during the backward pass. The process involves the following key steps:

  1. Grouping: Dividing input tokens into smaller sub-tokens.
  2. Projection: Using a single, fixed projection vector initialized with first-order batch statistics to compress these sub-tokens into a one-dimensional subspace.
  3. Reconstruction: Reconstructing the original tokens during the backward pass using the same projection vector.

This compression is computationally light: it avoids the costly singular value decomposition (SVD) used by methods such as GaLore, as well as the recomputation overhead of gradient checkpointing. Furthermore, because the projection vector is fixed, it never needs to be updated during training, further reducing computational overhead.
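The sketch below is a minimal, hypothetical PyTorch illustration of this scheme for a single linear layer; the class and parameter names (VeLoRACompressedLinear, sub_token_size, v) are our own and do not come from the paper's released code. It stores only the rank-1 coefficients during the forward pass and reconstructs a coarse approximation of the input for the weight-gradient computation in the backward pass.

```python
import torch


class VeLoRACompressedLinear(torch.autograd.Function):
    """Linear layer that keeps only rank-1 compressed activations for backward."""

    @staticmethod
    def forward(ctx, x, weight, v, sub_token_size):
        # x: (batch, tokens, d_model); weight: (d_out, d_model)
        # v: fixed unit projection vector of length sub_token_size
        out = x @ weight.t()

        # 1. Grouping: split each token into sub-tokens of length sub_token_size.
        b, t, d = x.shape
        sub = x.reshape(b, t, d // sub_token_size, sub_token_size)

        # 2. Projection: keep only the scalar coefficient along v per sub-token.
        coeffs = sub @ v                        # (b, t, d // sub_token_size)

        ctx.save_for_backward(coeffs, weight, v)
        ctx.x_shape = (b, t, d)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        b, t, d = ctx.x_shape

        # 3. Reconstruction: coarse rank-1 approximation of the original input.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(b, t, d)

        grad_x = grad_out @ weight              # exact gradient w.r.t. the input
        grad_w = grad_out.reshape(-1, weight.shape[0]).t() @ x_hat.reshape(-1, d)
        return grad_x, grad_w, None, None


# Hypothetical usage: v is initialised once from first-order batch statistics
# (here, the normalised mean sub-token of one batch) and then kept fixed.
x = torch.randn(2, 16, 64, requires_grad=True)   # (batch, tokens, d_model)
w = torch.randn(32, 64, requires_grad=True)      # (d_out, d_model)
with torch.no_grad():
    v = x.reshape(-1, 8).mean(dim=0)
    v = v / v.norm()
y = VeLoRACompressedLinear.apply(x, w, v, 8)
y.sum().backward()        # x.grad is exact; w.grad uses the reconstructed x_hat
```

Note that only the weight gradient relies on the reconstructed activations; the gradient flowing to earlier layers uses the layer weights directly, so the approximation error stays local to each layer's parameter update.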

Experimental Results

The efficacy of VeLoRA was validated across different benchmarks, including VTAB-1k, GLUE, and MMLU, as well as tasks involving both moderate-size vision transformers and LLMs such as LLaMA. VeLoRA demonstrated substantial memory reductions without sacrificing performance:

  1. Vision Experiments (VTAB-1k): Combined with various PEFT methods like SSF, Hydra, and LoRA, VeLoRA lowered memory requirements while either maintaining or improving accuracy.
  2. RoBERTa Experiments: On the GLUE benchmark, VeLoRA reduced memory consumption by up to 45% compared to full fine-tuning, with only a minor decrease in performance.
  3. Scaling to LLaMA Models: When combined with QLoRA for fine-tuning LLaMA models, VeLoRA offered additional memory savings (roughly 14.4-15%) while improving performance on Alpaca instruction tuning and on evaluation benchmarks such as MMLU.
  4. Pre-training on C4: VeLoRA outperformed competing methods in pre-training scenarios, showing lower validation perplexity and reducing on-device GPU memory usage.

Implications and Future Directions

The results indicate that VeLoRA can significantly alleviate memory constraints when training LLMs, enabling larger models to be trained on existing hardware. By lowering this hardware barrier, the method could broaden access to large-scale model research, and its compatibility with existing PEFT methods allows more memory-efficient fine-tuning, in line with current trends toward resource-efficient AI development.

Conclusion

VeLoRA represents a significant step forward in addressing the memory bottlenecks associated with training large-scale LLMs. By compressing intermediate activations into a fixed low-dimensional space, VeLoRA offers a practical solution that enhances memory efficiency while maintaining model performance. This method's ability to integrate seamlessly with other PEFT techniques and its broad applicability across various model sizes and tasks underscore its potential to become a standard approach in the efficient training of neural networks.

Limitations and Broader Impact

The applicability of VeLoRA beyond Transformer-based models remains to be explored, and although the method significantly alleviates memory constraints, it does not reduce total training time. While VeLoRA makes high-quality research more accessible to institutions with limited resources, it also raises concerns about the misuse of advanced AI technologies. Ensuring responsible use and continued assessment of the socio-ethical impact remains crucial as the field progresses.