Bayesian Low-rank Adaptation for Large Language Models (2308.13111v5)
Published 24 Aug 2023 in cs.LG
Abstract: Low-rank adaptation (LoRA) has emerged as a new paradigm for cost-efficient fine-tuning of LLMs. However, fine-tuned LLMs often become overconfident, especially when fine-tuned on small datasets. Bayesian methods, with their inherent ability to estimate uncertainty, serve as potent tools to mitigate overconfidence and enhance calibration. In this work, we introduce Laplace-LoRA, which applies a Bayesian approach to the LoRA parameters. Specifically, Laplace-LoRA applies a Laplace approximation to the posterior over the LoRA parameters, considerably improving the calibration of fine-tuned LLMs.
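The core recipe described in the abstract, fine-tune the LoRA factors as usual and then place a Gaussian (Laplace) posterior over just those adapter parameters, can be illustrated with a small sketch. The code below is a minimal toy example in plain PyTorch, not the paper's implementation: the layer sizes, the diagonal empirical-Fisher stand-in for the Hessian, the unit prior precision, and the Monte Carlo predictive are all simplifying assumptions made here for brevity.

```python
# Sketch of the Laplace-LoRA idea on a toy classifier (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, rank, n_classes = 16, 16, 4, 3

# Frozen "pre-trained" weight plus trainable low-rank LoRA factors A and B.
W0 = torch.randn(d_out, d_in) / d_in**0.5           # frozen base weight
A = nn.Parameter(0.01 * torch.randn(rank, d_in))    # LoRA down-projection
B = nn.Parameter(torch.zeros(d_out, rank))          # LoRA up-projection
head = torch.randn(n_classes, d_out) / d_out**0.5   # frozen task head

def logits(x):
    return (x @ (W0 + B @ A).T) @ head.T            # adapted layer: W0 + B A

# Tiny synthetic "fine-tuning" dataset.
X, y = torch.randn(64, d_in), torch.randint(0, n_classes, (64,))

# 1) Standard LoRA fine-tuning: a MAP estimate of the adapter parameters only.
opt = torch.optim.Adam([A, B], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    F.cross_entropy(logits(X), y).backward()
    opt.step()

# 2) Laplace step: a Gaussian posterior over the LoRA parameters, centred at the
#    MAP estimate, with precision = curvature + prior precision. The curvature
#    here is a diagonal empirical Fisher (summed squared per-example gradients),
#    a simplifying assumption of this sketch.
prior_prec = 1.0
fisher_A, fisher_B = torch.zeros_like(A), torch.zeros_like(B)
for xi, yi in zip(X, y):
    gA, gB = torch.autograd.grad(
        F.cross_entropy(logits(xi[None]), yi[None]), [A, B])
    fisher_A += gA**2
    fisher_B += gB**2
var_A = 1.0 / (fisher_A + prior_prec)
var_B = 1.0 / (fisher_B + prior_prec)

# 3) Predictive: average softmax outputs over posterior samples of (A, B),
#    which spreads probability mass and tempers overconfident predictions.
@torch.no_grad()
def predict(x, n_samples=30):
    probs = 0.0
    for _ in range(n_samples):
        A_s = A + var_A.sqrt() * torch.randn_like(A)
        B_s = B + var_B.sqrt() * torch.randn_like(B)
        probs = probs + F.softmax((x @ (W0 + B_s @ A_s).T) @ head.T, dim=-1)
    return probs / n_samples

print(predict(torch.randn(2, d_in)))
```

Averaging the softmax outputs over posterior samples is what softens the point predictions of standard LoRA fine-tuning; the paper's contribution is to make this Bayesian treatment of the LoRA weights practical and well-calibrated for full-scale LLMs.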
Authors: Adam X. Yang, Maxime Robeyns, Xi Wang, Laurence Aitchison