Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning (2401.04151v1)

Published 8 Jan 2024 in cs.LG and cs.CL

Abstract: Fine-tuning is the primary methodology for tailoring pre-trained LLMs to specific tasks. As model scale and the diversity of tasks expand, parameter-efficient fine-tuning methods are of paramount importance. One of the most widely used families of methods is low-rank adaptation (LoRA) and its variants. LoRA encodes the weight update as the product of two low-rank matrices. Despite its advantages, LoRA falls short of full-parameter fine-tuning in terms of generalization error for certain tasks. We introduce Chain of LoRA (COLA), an iterative optimization framework inspired by the Frank-Wolfe algorithm, to bridge the gap between LoRA and full-parameter fine-tuning without incurring additional computational costs or memory overheads. COLA employs a residual learning procedure in which it merges learned LoRA modules into the pre-trained LLM parameters and re-initializes optimization for newly added LoRA modules. We provide theoretical convergence guarantees as well as empirical results to validate the effectiveness of our algorithm. Across various models (OPT and Llama-2) and seven benchmarking tasks, we demonstrate that COLA can consistently outperform LoRA without additional computational or memory costs.

Overview of COLA

Chain of LoRA (COLA) introduces an iterative optimization framework to efficiently fine-tune pre-trained LLMs while striking a balance between computational efficiency and model performance. Advancements in fine-tuning methods are crucial considering the expanding scale of models and the diversity of tasks they are expected to perform. The key to COLA's approach is to apply a series of low-rank updates to the weight matrices of the LLM instead of adjusting the full set of parameters.
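
To make the low-rank update concrete, here is a minimal PyTorch-style sketch of a LoRA-wrapped linear layer. It is an illustration, not the paper's implementation; the names `LoRALinear`, `lora_A`, `lora_B` and the `rank`/`alpha` defaults are assumptions for the example.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank residual:
    y = base(x) + (alpha / r) * x @ A^T @ B^T, i.e. an update of W by B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weights; only the low-rank factors are trained.
        for p in self.base.parameters():
            p.requires_grad_(False)
        out_features, in_features = base.weight.shape
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: starts at the base model
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Because `lora_B` starts at zero, the wrapped layer initially reproduces the pre-trained model exactly, and only the rank-`r` factors carry gradients and optimizer state.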

The Shortcomings of LoRA and COLA's Solution

Parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) restrict themselves to small, low-rank modifications of a model's weights. Despite this efficiency, LoRA sometimes lags behind full-parameter tuning in generalization ability. COLA aims to close this gap through residual learning: a sequence of low-rank modifications is learned one after another, each incrementally improving task-specific performance, with both theoretical and empirical support for the approach.
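
In notation (a paraphrase rather than the paper's exact symbols, with W_0 the pre-trained weight matrix and B_t A_t the t-th low-rank increment), the chain of residual updates amounts to:

```latex
W_T \;=\; W_0 \;+\; \sum_{t=1}^{T} B_t A_t,
\qquad B_t \in \mathbb{R}^{d \times r},\ A_t \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)
```

where the t-th pair (B_t, A_t) is trained while W_0 and all previously merged increments are held fixed.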

The Methodology

COLA starts from a pre-trained LLM and applies low-rank updates through three primary stages: tuning the current LoRA modules, tying a knot (merging the learned update into the base model weights), and initializing a fresh set of LoRA modules. This cycle is repeated, building a chain of updates that progressively refines the model's weights without significantly increasing computational cost. The process embodies the essence of the Frank-Wolfe algorithm, an established optimization technique known for its projection-free approach to constrained optimization problems. A schematic of the loop is sketched below.
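
The sketch below illustrates the tune/merge/re-initialize cycle on a single linear layer standing in for one LLM weight matrix. It is a minimal illustration under assumed hyperparameters (`rank`, `chain_length`, a toy regression dataset), not the paper's training code.

```python
import torch
import torch.nn as nn


def train_adapter(forward_fn, adapter_params, data, epochs=3, lr=1e-3):
    """Stage 1: optimize only the current LoRA factors; the base stays frozen."""
    opt = torch.optim.AdamW(adapter_params, lr=lr)
    for _ in range(epochs):
        for x, y in data:
            loss = nn.functional.mse_loss(forward_fn(x), y)  # placeholder task loss
            opt.zero_grad()
            loss.backward()
            opt.step()


# A single linear layer stands in for one weight matrix of an LLM.
torch.manual_seed(0)
d_in, d_out, rank, chain_length = 16, 16, 4, 3
base = nn.Linear(d_in, d_out)
for p in base.parameters():
    p.requires_grad_(False)
data = [(torch.randn(32, d_in), torch.randn(32, d_out)) for _ in range(8)]

for t in range(chain_length):
    # Stage 1 -- tune LoRA: fresh factors A_t, B_t (B_t zero-initialized).
    A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
    B = nn.Parameter(torch.zeros(d_out, rank))
    train_adapter(lambda x: base(x) + x @ A.T @ B.T, [A, B], data)

    # Stage 2 -- tie a knot: merge the learned residual B_t @ A_t into the base.
    with torch.no_grad():
        base.weight += B @ A

    # Stage 3 -- re-initialize: the next iteration starts a brand-new LoRA module.
```

At every point in the chain only one rank-`r` adapter is trainable, so the per-step memory and compute footprint matches plain LoRA; the frozen base weights simply accumulate the sum of merged residuals.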

Empirical and Theoretical Advancement

The authors validated COLA across several benchmark tasks and showed that it surpasses LoRA's performance without incurring extra computational or memory overhead. COLA's strength rests not only on practice but also on theory: the framework comes with convergence guarantees for nonconvex objectives, in the style of Frank-Wolfe analysis. Experiments with OPT and Llama-2 models highlight COLA's potential, yielding a relative test accuracy gain of up to 6.47% over the LoRA baseline on certain tasks.
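
For context, guarantees of this kind are usually stated in terms of the Frank-Wolfe gap; the display below is the standard nonconvex Frank-Wolfe rate (Lacoste-Julien, 2016) that analyses of this style build on, not a verbatim restatement of the paper's theorem:

```latex
\min_{1 \le t \le T} G_t \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right),
\qquad G_t \;=\; \max_{v \in \mathcal{K}} \big\langle \nabla f(x_t),\, x_t - v \big\rangle
```

where \mathcal{K} is the feasible set and G_t = 0 certifies a first-order stationary point.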

Future Exploration

Moving forward, the research team is investigating how COLA interacts with different base optimizers and applying the framework to more demanding tasks such as generation and summarization. These ongoing efforts should further clarify COLA's advantages and limitations, potentially establishing it as a standard technique for the efficient fine-tuning of ever-larger LLMs.

Authors (3)
  1. Wenhan Xia (13 papers)
  2. Chengwei Qin (28 papers)
  3. Elad Hazan (106 papers)
Citations (41)