
Conformal Prediction with Large Language Models for Multi-Choice Question Answering (2305.18404v3)

Published 28 May 2023 in cs.CL, cs.LG, and stat.ML

Abstract: As LLMs continue to be widely developed, robust uncertainty quantification techniques will become crucial for their safe deployment in high-stakes scenarios. In this work, we explore how conformal prediction can be used to provide uncertainty quantification in LLMs for the specific task of multiple-choice question-answering. We find that the uncertainty estimates from conformal prediction are tightly correlated with prediction accuracy. This observation can be useful for downstream applications such as selective classification and filtering out low-quality predictions. We also investigate the robustness of the exchangeability assumption required by conformal prediction when calibrating on one subject and evaluating on out-of-subject questions, which may be a more realistic scenario for many practical applications. Our work contributes towards more trustworthy and reliable usage of LLMs in safety-critical situations, where robust guarantees of error rate are required.
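The abstract describes applying conformal prediction to multiple-choice QA: a held-out calibration set is used to turn the model's softmax scores over answer choices into prediction sets with a coverage guarantee. The following is a minimal sketch of standard split conformal prediction for this setting, not the paper's exact procedure; the function name and the choice of nonconformity score (one minus the probability of the true answer) are illustrative assumptions.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for multiple-choice QA (illustrative sketch).

    cal_probs: (n, k) softmax probabilities over k answer choices (calibration set)
    cal_labels: (n,) indices of the correct choice for each calibration question
    test_probs: (m, k) softmax probabilities for m test questions
    alpha: target miscoverage rate (sets contain the true answer ~ (1 - alpha) of the time)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true answer.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample correction (n + 1),
    # clipped to 1.0 so small calibration sets do not break np.quantile.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, q_level, method="higher")
    # Prediction set: every choice whose nonconformity score is within the threshold.
    return [np.where(1.0 - p <= q)[0].tolist() for p in test_probs]
```

Under exchangeability of calibration and test questions, the returned sets contain the correct answer with probability at least 1 - alpha; the abstract's out-of-subject experiments probe what happens when that assumption is strained.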

Authors (7)
  1. Bhawesh Kumar
  2. Charlie Lu
  3. Gauri Gupta
  4. Anil Palepu
  5. David Bellamy
  6. Ramesh Raskar
  7. Andrew Beam
Citations (49)