Batched Low-Rank Adaptation of Foundation Models (2312.05677v3)
Abstract: Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability to real-time serving for a diverse, global user base is constrained by its inability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request. To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its own low-rank adaptation weights, allowing heterogeneous requests to be batched efficiently. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showing competitive results on the MultiPL-E code generation benchmark spanning 8 programming languages and on a multilingual speech recognition task across 6 languages.
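The mechanism behind FLoRA's batching claim can be illustrated with a short sketch. The code below is a minimal, hypothetical realization rather than the authors' implementation: it contrasts standard LoRA, where one shared adapter pair (A, B) serves the whole batch and is equivalent to merging the low-rank product into the frozen weight, with a FLoRA-style layer in which every example in the minibatch carries its own low-rank factors and the correction is applied through batched matrix multiplication. All tensor names, shapes, and the rank r are illustrative assumptions; the paper's actual fused computation may be organized differently.

```python
# Hypothetical sketch of per-example low-rank adapters (FLoRA-style batching)
# versus a single shared LoRA adapter. Shapes and names are illustrative.
import torch

b, d_in, d_out, r = 4, 512, 256, 8        # batch size, layer dims, adapter rank
W = torch.randn(d_in, d_out) / d_in**0.5  # frozen base weight, shared by all requests
x = torch.randn(b, d_in)                  # one token per example, for brevity

# --- Standard LoRA: one adapter pair (A, B) for the whole batch -------------
A = torch.randn(d_in, r)                  # down-projection
B = torch.zeros(r, d_out)                 # up-projection (zero-initialized, as in LoRA)
y_lora = x @ W + (x @ A) @ B              # equivalent to x @ (W + A @ B)

# --- FLoRA-style batching: each example i has its own (A_i, B_i) ------------
A_batch = torch.randn(b, d_in, r)         # per-request down-projections
B_batch = torch.zeros(b, r, d_out)        # per-request up-projections

# The shared x @ W stays a single dense GEMM; only the rank-r correction is
# per-example, computed with two batched matmuls over the batch dimension.
low_rank = torch.bmm(x.unsqueeze(1), A_batch)              # (b, 1, d_in) @ (b, d_in, r) -> (b, 1, r)
y_flora = x @ W + torch.bmm(low_rank, B_batch).squeeze(1)  # (b, 1, r) @ (b, r, d_out) -> (b, d_out)

# With zero-initialized up-projections, both paths reduce to the frozen layer.
assert torch.allclose(y_lora, x @ W)
assert torch.allclose(y_flora, x @ W)
```

In this sketch the dense product `x @ W` remains a single shared GEMM; only the rank-r correction varies per example, which is why requests carrying different adapters can share one forward pass instead of being routed into per-adapter batches.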
- Performance, design, and autotuning of batched GEMM for GPUs. In ISC High Performance, 2016. URL https://api.semanticscholar.org/CorpusID:2559252.
- Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 4211–4215, 2020.
- Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. ArXiv, abs/2106.10199, 2021. URL https://api.semanticscholar.org/CorpusID:231672601.
- MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49:3675–3691, 2023. URL https://api.semanticscholar.org/CorpusID:258205341.
- One-for-All: Generalized LoRA for parameter-efficient fine-tuning. ArXiv, abs/2306.07967, 2023. URL https://api.semanticscholar.org/CorpusID:259144860.
- Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. ArXiv, abs/2205.14135, 2022. URL https://api.semanticscholar.org/CorpusID:249151871.
- Compacter: Efficient low-rank hypercomplex adapter layers. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235356070.
- QLoRA: Efficient finetuning of quantized LLMs. ArXiv, abs/2305.14314, 2023. URL https://api.semanticscholar.org/CorpusID:258841328.
- Black-box prompt learning for pre-trained language models. ArXiv, abs/2201.08531, 2022. URL https://api.semanticscholar.org/CorpusID:246210164.
- Mixture-of-domain-adapters: Decoupling and injecting domain knowledge to pre-trained language models’ memories. In Annual Meeting of the Association for Computational Linguistics, 2023. URL https://api.semanticscholar.org/CorpusID:259108831.
- Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. ArXiv, abs/2203.06904, 2022. URL https://api.semanticscholar.org/CorpusID:247446969.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23:120:1–120:39, 2021. URL https://api.semanticscholar.org/CorpusID:231573431.
- Making pre-trained language models better few-shot learners. ArXiv, abs/2012.15723, 2021. URL https://api.semanticscholar.org/CorpusID:229923710.
- Parameter-efficient transfer learning with diff pruning. In Annual Meeting of the Association for Computational Linguistics, 2020. URL https://api.semanticscholar.org/CorpusID:229152766.
- Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=0RDcd5Axok.
- HyperPrompt: Prompt-based task-conditioning of transformers. ArXiv, abs/2203.00759, 2022b. URL https://api.semanticscholar.org/CorpusID:247218062.
- Unnatural instructions: Tuning language models with (almost) no human labor. ArXiv, abs/2212.09689, 2022.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2019. URL https://api.semanticscholar.org/CorpusID:59599816.
- LoRA: Low-rank adaptation of large language models. ArXiv, abs/2106.09685, 2021. URL https://api.semanticscholar.org/CorpusID:235458009.
- LoraHub: Efficient cross-task generalization via dynamic LoRA composition. ArXiv, abs/2307.13269, 2023. URL https://api.semanticscholar.org/CorpusID:260155012.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2016. URL https://api.semanticscholar.org/CorpusID:6294674.
- Platypus: Quick, cheap, and powerful refinement of LLMs. ArXiv, abs/2308.07317, 2023. URL https://api.semanticscholar.org/CorpusID:260886870.
- GShard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2020. URL https://api.semanticscholar.org/CorpusID:220265858.
- The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing, 2021. URL https://api.semanticscholar.org/CorpusID:233296808.
- StarCoder: may the source be with you! ArXiv, abs/2305.06161, 2023. URL https://api.semanticscholar.org/CorpusID:258588247.
- Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021. URL https://api.semanticscholar.org/CorpusID:230433941.
- Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. ArXiv, abs/2205.05638, 2022a. URL https://api.semanticscholar.org/CorpusID:248693283.
- P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. ArXiv, abs/2110.07602, 2021. URL https://api.semanticscholar.org/CorpusID:238857040.
- P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Annual Meeting of the Association for Computational Linguistics, 2022b. URL https://api.semanticscholar.org/CorpusID:248780177.
- Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. ArXiv, abs/2106.04489, 2021. URL https://api.semanticscholar.org/CorpusID:235309789.
- Coherence boosting: When your pretrained language model is not paying enough attention. In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:247476407.
- Fine-tuning language models with just forward passes. ArXiv, abs/2305.17333, 2023. URL https://api.semanticscholar.org/CorpusID:258959274.
- UniPELT: A unified framework for parameter-efficient language model tuning. In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:238857301.
- OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
- Measuring the impact of programming language distribution. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:256615914.
- Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
- From sparse to soft mixtures of experts. ArXiv, abs/2308.00951, 2023. URL https://api.semanticscholar.org/CorpusID:260378993.
- Robust speech recognition via large-scale weak supervision, 2022. URL https://arxiv.org/abs/2212.04356.
- Progressive neural networks. ArXiv, abs/1606.04671, 2016. URL https://api.semanticscholar.org/CorpusID:15350923.
- Exploiting cloze-questions for few-shot text classification and natural language inference. In Conference of the European Chapter of the Association for Computational Linguistics, 2020a. URL https://api.semanticscholar.org/CorpusID:210838924.
- It’s not just size that matters: Small language models are also few-shot learners. ArXiv, abs/2009.07118, 2020b. URL https://api.semanticscholar.org/CorpusID:221703107.
- Training neural networks with fixed sparse masks. ArXiv, abs/2111.09839, 2021. URL https://api.semanticscholar.org/CorpusID:244345839.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023. URL https://api.semanticscholar.org/CorpusID:259950998.
- SPoT: Better frozen model adaptation through soft prompt transfer. ArXiv, abs/2110.07904, 2021. URL https://api.semanticscholar.org/CorpusID:239009558.
- TransPrompt: Towards an automatic transferable prompting framework for few-shot text classification. In Conference on Empirical Methods in Natural Language Processing, 2021. URL https://api.semanticscholar.org/CorpusID:243865402.
- AdaMix: Mixture-of-adapter for parameter-efficient tuning of large language models. ArXiv, abs/2205.12410, 2022a. URL https://api.semanticscholar.org/CorpusID:249063002.
- Parameter-efficient tuning of large language models. 2022b. URL https://api.semanticscholar.org/CorpusID:249536106.
- Self-Instruct: Aligning language model with self generated instructions. ArXiv, abs/2212.10560, 2022c.
- Orca: A distributed serving system for transformer-based generative models. In USENIX Symposium on Operating Systems Design and Implementation, 2022. URL https://api.semanticscholar.org/CorpusID:251734964.
- LoRA-FA: Memory-efficient low-rank adaptation for large language models fine-tuning. ArXiv, abs/2308.03303, 2023a. URL https://api.semanticscholar.org/CorpusID:260683267.
- Adaptive budget allocation for parameter-efficient fine-tuning. ArXiv, abs/2303.10512, 2023b. URL https://api.semanticscholar.org/CorpusID:257631760.
Authors: Yeming Wen, Swarat Chaudhuri