$\textit{Trans-LoRA}$: towards data-free Transferable Parameter Efficient Finetuning (2405.17258v1)
Abstract: Low-rank adapters (LoRA) and their variants are popular parameter-efficient fine-tuning (PEFT) techniques that closely match full model fine-tune performance while requiring only a small number of additional parameters. These additional LoRA parameters are specific to the base model being adapted. When the base model needs to be deprecated and replaced with a new one, all the associated LoRA modules need to be re-trained. Such re-training requires access to the data used to train the LoRA for the original base model. This is especially problematic for commercial cloud applications where the LoRA modules and the base models are hosted by service providers who may not be allowed to host proprietary client task data. To address this challenge, we propose $\textit{Trans-LoRA}$ -- a novel method for lossless, nearly data-free transfer of LoRAs across base models. Our approach relies on synthetic data to transfer LoRA modules. Using LLMs, we design a synthetic data generator to approximate the data-generating process of the $\textit{observed}$ task data subset. Training on the resulting synthetic dataset transfers LoRA modules to new models. We show the effectiveness of our approach using both the Llama and Gemma model families. Our approach achieves lossless (and often improved) LoRA transfer between models within and across different base model families, and even between different PEFT methods, on a wide variety of tasks.
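To make the recipe concrete, below is a minimal sketch of how such a transfer could be wired up with the Hugging Face `transformers` and `peft` libraries: a generator LLM proposes synthetic task inputs from a few observed examples, the source base model with its LoRA labels them as a teacher, and a fresh LoRA is then trained on the new base model from those synthetic pairs. All model names, adapter paths, prompts, and sample counts are placeholder assumptions; the paper's actual generator design and synthetic-data filtering step are not reproduced here.

```python
# Hedged sketch of the Trans-LoRA-style transfer loop described in the abstract.
# Model names, adapter paths, prompts, and counts are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, PeftModel, get_peft_model

# 1) The source base model plus its trained LoRA acts as the teacher.
src_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
src_base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
teacher = PeftModel.from_pretrained(src_base, "path/to/source-lora")  # placeholder adapter path

# 2) A generator LLM, few-shot prompted with the small *observed* subset of task
#    inputs, proposes new task-like inputs (no access to the full client dataset).
def generate_synthetic_inputs(generator, tokenizer, observed_examples, n=8):
    prompt = "Write new task examples similar to the following:\n" + "\n".join(observed_examples)
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    outs = generator.generate(ids, do_sample=True, max_new_tokens=128, num_return_sequences=n)
    # Strip the prompt tokens and keep only the generated continuations.
    return [tokenizer.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in outs]

# 3) The teacher labels the synthetic inputs, yielding (input, output) pairs.
def label_with_teacher(model, tokenizer, synthetic_inputs):
    pairs = []
    for x in synthetic_inputs:
        ids = tokenizer(x, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=128)
        pairs.append((x, tokenizer.decode(out[0][ids.shape[1]:], skip_special_tokens=True)))
    return pairs

# 4) A fresh LoRA on the *new* base model is then fine-tuned on the synthetic
#    (input, teacher output) pairs with a standard supervised loop (omitted).
tgt_base = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
tgt_lora = get_peft_model(tgt_base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```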
Authors:
- Runqian Wang
- Soumya Ghosh
- David Cox
- Diego Antognini
- Aude Oliva
- Rogerio Feris
- Leonid Karlinsky