API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access (2403.01216v2)

Published 2 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: This study aims to address the pervasive challenge of quantifying uncertainty in LLMs without logit-access. Conformal Prediction (CP), known for its model-agnostic and distribution-free features, is a desired approach for various LLMs and data distributions. However, existing CP methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. In addition, logits are known to be miscalibrated, potentially leading to degraded CP performance. To tackle these challenges, we introduce a novel CP method that (1) is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage. The core idea of this approach is to formulate nonconformity measures using both coarse-grained (i.e., sample frequency) and fine-grained uncertainty notions (e.g., semantic similarity). Experimental results on both close-ended and open-ended Question Answering tasks show our approach can mostly outperform the logit-based CP baselines.
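The core idea described above can be illustrated with a minimal sketch of split conformal prediction using only sampled outputs, no logits. This is not the paper's exact method: the function names are hypothetical, and only the coarse-grained (sample-frequency) nonconformity notion is shown; the fine-grained semantic-similarity term the paper also uses is omitted.

```python
import math
from collections import Counter

def nonconformity(samples, candidate):
    """Coarse-grained score: 1 minus the candidate's sampling frequency
    among K responses drawn from the API-only LLM (lower = more typical)."""
    counts = Counter(samples)
    return 1.0 - counts[candidate] / len(samples)

def calibrate(cal_data, alpha):
    """cal_data: list of (samples, true_answer) pairs from a held-out
    calibration set. Returns the conformal quantile threshold q_hat."""
    scores = sorted(nonconformity(s, y) for s, y in cal_data)
    n = len(scores)
    # Standard split-CP quantile index with finite-sample correction.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k]

def prediction_set(samples, q_hat):
    """Keep every sampled answer whose nonconformity is at most q_hat;
    coverage >= 1 - alpha holds under exchangeability."""
    return {c for c in set(samples) if nonconformity(samples, c) <= q_hat}

# Toy usage with hypothetical sampled answers (4 calibration prompts).
cal = [
    (["A", "A", "A", "B"], "A"),
    (["C", "C", "D", "C"], "C"),
    (["B", "A", "B", "B"], "B"),
    (["D", "D", "D", "D"], "D"),
]
q_hat = calibrate(cal, alpha=0.2)
print(prediction_set(["A", "A", "A", "B"], q_hat))
```

Because the score depends only on how often each answer recurs across samples, the procedure needs nothing from the model beyond its generated text, which is what makes it applicable to API-only LLMs.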

Authors (4)
  1. Jiayuan Su (10 papers)
  2. Jing Luo (77 papers)
  3. Hongwei Wang (150 papers)
  4. Lu Cheng (73 papers)
Citations (10)
