LoRA ensembles for large language model fine-tuning (2310.00035v2)

Published 29 Sep 2023 in cs.LG and cs.AI

Abstract: Fine-tuned LLMs often exhibit poor uncertainty quantification, manifesting as overconfidence, poor calibration, and unreliable predictions on test data or out-of-distribution samples. One approach commonly used in vision to alleviate this issue is a deep ensemble, which constructs an ensemble by training the same model multiple times from different random initializations. However, ensembling LLMs poses a major challenge: the most effective LLMs are very large. Keeping a single LLM in memory is already challenging; keeping an ensemble of, e.g., 5 LLMs in memory is impossible in many settings. To address these issues, we propose an ensemble approach using Low-Rank Adapters (LoRA), a parameter-efficient fine-tuning technique. Critically, these low-rank adapters represent a very small number of parameters, orders of magnitude fewer than the underlying pre-trained model. Thus, it is possible to construct large ensembles of LoRA adapters with almost the same computational overhead as using the original model. We find that LoRA ensembles, applied on their own or on top of pre-existing regularization techniques, give consistent improvements in predictive accuracy and uncertainty quantification.
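The approach described in the abstract lends itself to a compact sketch. Below is a minimal, illustrative example (not the authors' code) of ensembling several LoRA adapters that were fine-tuned from the same base model with different random seeds, using the Hugging Face transformers and peft libraries; the checkpoint name, adapter directories, and four-way classification setup are assumptions for illustration only. Each ensemble member shares the frozen pre-trained weights and differs only in its small low-rank adapter, and the ensemble prediction averages the members' softmax outputs.

```python
# Minimal sketch of LoRA-ensemble inference (illustrative, not the authors' exact code).
# Assumes K adapters were fine-tuned from the same pre-trained model with different
# random seeds and saved to the directories in adapter_dirs (hypothetical paths).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

base_name = "meta-llama/Llama-2-7b-hf"               # hypothetical base checkpoint
adapter_dirs = [f"lora_seed_{s}" for s in range(5)]   # hypothetical adapter paths

tokenizer = AutoTokenizer.from_pretrained(base_name)
inputs = tokenizer("Which gas do plants absorb from the air?", return_tensors="pt")

member_probs = []
for path in adapter_dirs:
    # The large pre-trained weights are shared across members; only the
    # low-rank adapter (a tiny fraction of the parameters) differs per member.
    base = AutoModelForSequenceClassification.from_pretrained(base_name, num_labels=4)
    model = PeftModel.from_pretrained(base, path).eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    member_probs.append(torch.softmax(logits, dim=-1))

# Ensemble prediction: average the members' softmax outputs, then take the argmax.
ensemble_probs = torch.stack(member_probs).mean(dim=0)
prediction = ensemble_probs.argmax(dim=-1)
```

In practice only the adapters need to be stored per member, so the memory overhead of the ensemble relative to a single fine-tuned model is small.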

Authors (3)
  1. Xi Wang (275 papers)
  2. Laurence Aitchison (66 papers)
  3. Maja Rudolph (25 papers)
Citations (25)