VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections (2405.17991v2)
Abstract: LLMs have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so, we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.
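To make the mechanism described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch for a single linear layer: the forward pass splits each token into sub-tokens and stores only one scalar per sub-token (its projection onto a fixed unit vector `v`), and the backward pass coarsely reconstructs the activations from those scalars to form the weight gradient. The class name `Rank1SubTokenLinear`, the choice of `v`, and the sub-token size are illustrative assumptions, not the paper's actual implementation.

```python
import torch


class Rank1SubTokenLinear(torch.autograd.Function):
    """Sketch: a linear layer that stores rank-1 sub-token projections
    of its input instead of the full activation tensor."""

    @staticmethod
    def forward(ctx, x, weight, v):
        # x: (batch, d_in), weight: (d_out, d_in), v: (sub_dim,) fixed unit vector
        out = x @ weight.t()

        batch, d_in = x.shape
        sub_dim = v.numel()
        # Split each token into sub-tokens of length sub_dim and keep only the
        # scalar projection of each sub-token onto v (rank-1 compression).
        sub_tokens = x.view(batch, d_in // sub_dim, sub_dim)
        coeffs = sub_tokens @ v                      # (batch, d_in // sub_dim)

        ctx.save_for_backward(coeffs, weight, v)
        ctx.x_shape = (batch, d_in)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        batch, d_in = ctx.x_shape

        # Coarse reconstruction of the input from the stored scalars: coeff * v.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(batch, d_in)

        grad_x = grad_out @ weight                   # exact gradient w.r.t. the input
        grad_w = grad_out.t() @ x_hat                # approximate weight gradient
        return grad_x, grad_w, None                  # no gradient for the fixed v
```

A usage sketch under the same assumptions; note that only the per-sub-token scalars are kept between the forward and backward passes, which is where the memory saving would come from:

```python
x = torch.randn(8, 64, requires_grad=True)
w = torch.randn(32, 64, requires_grad=True)
v = torch.full((16,), 1.0 / 16 ** 0.5)               # fixed unit-norm projection vector
y = Rank1SubTokenLinear.apply(x, w, v)
y.sum().backward()                                   # w.grad uses the reconstructed x_hat
```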
- Memory efficient adaptive optimization. In NeurIPS, 2019.
- Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023.
- Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In NeurIPS, 2019.
- LoRA learns less and forgets less, 2024.
- Continual learning in low-rank orthogonal subspaces. In NeurIPS, 2020.
- Non-convex projected gradient descent for generalized low-rank tensor regression. J. Mach. Learn. Res., 2019.
- AdaptFormer: Adapting vision transformers for scalable visual recognition. In NeurIPS, 2022.
- Training deep nets with sublinear memory cost. CoRR, abs/1604.06174, 2016.
- Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. CoRR, abs/1509.03025, 2015.
- 8-bit optimizers via block-wise quantization. In ICLR, 2022.
- QLoRA: Efficient finetuning of quantized LLMs. In NeurIPS, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- A note on lora. CoRR, abs/2404.05086, 2024.
- The reversible residual network: Backpropagation without storing activations. In NeurIPS, 2017.
- Gradient descent happens in a tiny subspace. CoRR, abs/1812.04754, 2018.
- Adaptive gradient sparsification for efficient federated learning: An online learning approach. In Conference on distributed computing systems (ICDCS), 2020.
- LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
- Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.
- S. Jie and Z.-H. Deng. FacT: Factor-tuning for lightweight adaptation on vision transformer. In AAAI, 2023.
- Hydra: Multi-head low-rank adaptation for parameter efficient fine-tuning. CoRR, abs/2309.06922, 2023.
- How many degrees of freedom do we need to train deep networks: a loss landscape perspective. In ICLR, 2022.
- Y. Lee and S. Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In ICML, 2018.
- Memory efficient optimizers with 4-bit states. In ICLR, 2023.
- S. Li and T. Hoefler. Near-optimal sparse allreduce for distributed deep learning. In Symposium on Principles and Practice of Parallel Programming, 2022.
- ReLoRA: High-rank training through low-rank updates. CoRR, abs/2307.05695, 2023.
- Scaling & shifting your features: A new baseline for efficient model tuning. In NeurIPS, 2022.
- Deep gradient compression: Reducing the communication bandwidth for distributed training. In ICLR, 2018.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
- Full parameter fine-tuning for large language models with limited resources. CoRR, abs/2306.09782, 2023.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- C. Park and N. Lee. S³GD-MV: Sparse-SignSGD with majority vote for communication-efficient distributed learning. In IEEE International Symposium on Information Theory (ISIT), 2023.
- Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 2020.
- Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying. CoRR, abs/2311.09578, 2023.
- Stanford alpaca: An instruction-following llama model. Technical report, 2023.
- FetchSGD: Communication-efficient federated learning with sketching. In ICML, 2020.
- Rethinking gradient sparsification as total error minimization. In NeurIPS, 2021.
- Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems, 2019.
- Efficient top-k query processing on massively parallel hardware. In International Conference on Management of Data, 2018.
- N. Shazeer and M. Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In ICML, 2018.
- S-LoRA: Serving thousands of concurrent LoRA adapters. CoRR, abs/2311.03285, 2023.
- Understanding top-k sparsification in distributed deep learning. CoRR, abs/1911.08772, 2019.
- Sparsified SGD with memory. In NeurIPS, 2018.
- VL-ADAPTER: parameter-efficient transfer learning for vision-and-language tasks. In CVPR, 2022.
- R. Sutton. The bitter lesson. https://blog.biocomm.ai/2019/03/13/the-bitter-lesson-rich-sutton-march-13-2019/, 2019.
- LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
- MultiLoRA: Democratizing LoRA for better multi-task learning. CoRR, abs/2311.11501, 2023.
- Gradient sparsification for communication-efficient distributed optimization. In NeurIPS, 2018.
- TernGrad: Ternary gradients to reduce communication in distributed deep learning. In NeurIPS, 2017.
- Chain of LoRA: Efficient fine-tuning of language models via residual learning. CoRR, abs/2401.04151, 2024.
- The visual task adaptation benchmark. CoRR, abs/1910.04867, 2019.
- GaLore: Memory-efficient LLM training by gradient low-rank projection. In ICML, 2024.
- Roy Miles
- Pradyumna Reddy
- Ismail Elezi
- Jiankang Deng