BiLoRA: A Bi-level Optimization Framework for Overfitting-Resilient Low-Rank Adaptation of Large Pre-trained Models (2403.13037v1)

Published 19 Mar 2024 in cs.LG and cs.CL

Abstract: Low-rank adaptation (LoRA) is a popular method for fine-tuning large-scale pre-trained models on downstream tasks by learning low-rank incremental matrices. Although LoRA and its variants effectively reduce the number of trainable parameters relative to full fine-tuning, they often overfit the training data, resulting in sub-optimal generalization on test data. To address this problem, we introduce BiLoRA, an overfitting-alleviating fine-tuning approach based on bi-level optimization (BLO). BiLoRA parameterizes the low-rank incremental matrices with a pseudo singular value decomposition and splits the training of the pseudo singular vectors and the pseudo singular values across two different subsets of the training data. Embedding this division in separate levels of the BLO framework mitigates the risk of overfitting to a single data split. Tested on ten datasets covering natural language understanding and generation tasks, and applied to various well-known large pre-trained models, BiLoRA significantly outperforms LoRA methods and other fine-tuning approaches while using a similar number of trainable parameters.
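
The abstract describes two ingredients: a pseudo-SVD parameterization of the low-rank increment (delta-W = P diag(lambda) Q, with P, Q the pseudo singular vectors and lambda the pseudo singular values) and a bi-level split that trains the vectors and the values on different data subsets. The sketch below illustrates both in PyTorch. It is a minimal reading of the abstract, not the authors' implementation: the class name PseudoSVDLoRALinear, the helper bilevel_step, the simple first-order alternating update, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of BiLoRA-style training, based only on the abstract.
# Names, initialization scales, and the first-order alternation are assumptions.
import torch
import torch.nn as nn

class PseudoSVDLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank increment P @ diag(lam) @ Q."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # pre-trained weights stay frozen
        out_f, in_f = base.weight.shape
        # Pseudo singular vectors (trained at the lower level).
        self.P = nn.Parameter(torch.randn(out_f, rank) * 0.01)
        self.Q = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        # Pseudo singular values (trained at the upper level).
        self.lam = nn.Parameter(torch.zeros(rank))

    def forward(self, x):
        delta_w = self.P @ torch.diag(self.lam) @ self.Q   # (out_f, in_f)
        return self.base(x) + x @ delta_w.T

def bilevel_step(model, lower_batch, upper_batch, loss_fn,
                 opt_vectors, opt_values, inner_steps: int = 1):
    """One alternation: fit P, Q on the lower-level split, then lam on the
    upper-level split. This is a first-order approximation of the BLO."""
    for _ in range(inner_steps):              # lower level: pseudo singular vectors
        opt_vectors.zero_grad()
        loss_fn(model, lower_batch).backward()
        opt_vectors.step()
    opt_values.zero_grad()                    # upper level: pseudo singular values
    loss_fn(model, upper_batch).backward()
    opt_values.step()

if __name__ == "__main__":
    torch.manual_seed(0)
    layer = PseudoSVDLoRALinear(nn.Linear(16, 16), rank=4)
    opt_pq = torch.optim.AdamW([layer.P, layer.Q], lr=1e-3)
    opt_lam = torch.optim.AdamW([layer.lam], lr=1e-3)
    # Two disjoint splits of the training data, per the abstract.
    lower = (torch.randn(32, 16), torch.randn(32, 16))
    upper = (torch.randn(32, 16), torch.randn(32, 16))
    mse = lambda m, b: nn.functional.mse_loss(m(b[0]), b[1])
    bilevel_step(layer, lower, upper, mse, opt_pq, opt_lam)
```

Note that a faithful bi-level formulation would differentiate the upper-level loss through the lower-level solution (e.g., via implicit gradients or hypergradients, as supported by multilevel-optimization libraries such as Betty); the plain alternation above only approximates that coupling.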

Authors (3)
  1. Rushi Qiang
  2. Ruiyi Zhang
  3. Pengtao Xie