SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models (2403.02509v1)
Abstract: In recent years, large language models (LLMs) have become increasingly prevalent, offering remarkable text generation capabilities. However, a pressing challenge is their tendency to make confidently wrong predictions, highlighting the critical need for uncertainty quantification (UQ) in LLMs. While previous works have mainly focused on addressing aleatoric uncertainty, the full spectrum of uncertainties, including epistemic uncertainty, remains inadequately explored. Motivated by this gap, we introduce a novel UQ method, sampling with perturbation for UQ (SPUQ), designed to tackle both aleatoric and epistemic uncertainties. The method entails generating a set of perturbations for LLM inputs, sampling outputs for each perturbation, and incorporating an aggregation module that generalizes the sampling uncertainty approach for text generation tasks. Through extensive experiments on various datasets, we investigate different perturbation and aggregation techniques. Our findings show a substantial improvement in model uncertainty calibration, with a reduction in Expected Calibration Error (ECE) of 50% on average, suggesting that SPUQ offers a promising step toward enhancing the reliability and trustworthiness of LLMs.
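The pipeline described in the abstract (perturb the input, sample an output per perturbation, aggregate by inter-sample agreement) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paraphrase templates, the `difflib` similarity measure (the paper also explores metrics such as ROUGE and BERTScore), and the `toy_llm` stub are all placeholders introduced here for demonstration.

```python
from difflib import SequenceMatcher


def perturb(prompt: str, n: int) -> list[str]:
    # Hypothetical input perturbations: simple rephrasing templates.
    # The paper studies several perturbation types; these are illustrative.
    templates = [
        "{}",
        "Please answer: {}",
        "Question: {} Answer concisely.",
        "{} Respond briefly.",
    ]
    return [templates[i % len(templates)].format(prompt) for i in range(n)]


def similarity(a: str, b: str) -> float:
    # Text similarity in [0, 1]; stands in for ROUGE/BERTScore-style metrics.
    return SequenceMatcher(None, a, b).ratio()


def spuq_confidence(llm, prompt: str, n_perturb: int = 4) -> tuple[str, float]:
    """Sample one output per perturbed prompt and aggregate:
    confidence = mean similarity between the original prompt's output
    and each perturbed prompt's output (inter-sample agreement)."""
    base = llm(prompt)
    outputs = [llm(p) for p in perturb(prompt, n_perturb)]
    conf = sum(similarity(base, o) for o in outputs) / len(outputs)
    return base, conf


# Deterministic toy "LLM" stub, used only so the sketch runs end to end.
def toy_llm(prompt: str) -> str:
    return "Paris" if "capital of France" in prompt else "unsure"


answer, conf = spuq_confidence(toy_llm, "What is the capital of France?")
```

With the deterministic stub, all perturbed prompts yield the same answer, so agreement (and hence confidence) is maximal; a model whose answers flip under rephrasing would score lower, which is the epistemic signal SPUQ exploits.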
Authors: Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, Kamalika Das