Conformal Language Modeling (2306.10193v2)

Published 16 Jun 2023 in cs.CL and cs.LG

Abstract: We propose a novel approach to conformal prediction for generative language models (LMs). Standard conformal prediction produces prediction sets -- in place of single predictions -- that have rigorous, statistical performance guarantees. LM responses are typically sampled from the model's predicted distribution over the large, combinatorial output space of natural language. Translating this process to conformal prediction, we calibrate a stopping rule for sampling different outputs from the LM that get added to a growing set of candidates until we are confident that the output set is sufficient. Since some samples may be low-quality, we also simultaneously calibrate and apply a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we prove that the sampled set returned by our procedure contains at least one acceptable answer with high probability, while still being empirically precise (i.e., small) on average. Furthermore, within this set of candidate responses, we show that we can also accurately identify subsets of individual components -- such as phrases or sentences -- that are each independently correct (e.g., that are not "hallucinations"), again with statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.

Conformal Language Modeling

The paper "Conformal Language Modeling" introduces a novel approach to applying conformal prediction principles to generative language models (LMs). This research offers a method for constructing prediction sets from LM outputs that maintain rigorous statistical performance guarantees.

Conformal prediction is a statistical technique used to provide reliable prediction sets without strict distributional assumptions, traditionally applied in contexts like classification. The technique is adapted here to accommodate the inherently infinite and combinatorial output space of LMs, such as those used in natural language generation tasks. The proposed methodology centers around a principled stopping rule for sampling, coupled with a rejection rule designed to eliminate low-quality samples. This adaptation is necessary because typical conformal predictors cannot feasibly enumerate all candidate outputs in such expansive output domains.
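The sampling procedure with its two calibrated rules can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_fn`, `quality_score`, `set_confidence`, and the threshold names are hypothetical stand-ins for the calibrated quantities described above.

```python
def conformal_sample_set(sample_fn, quality_score, set_confidence,
                         lambda_reject, lambda_stop, k_max=20):
    """Grow a candidate set by repeated sampling until a calibrated
    stopping rule indicates the set likely covers an acceptable answer.

    sample_fn()       -> draws one response from the LM
    quality_score(y)  -> scalar quality estimate for a single response
    set_confidence(C) -> scalar confidence that set C contains an
                         acceptable answer (e.g., the max sample score)
    lambda_reject, lambda_stop -> thresholds calibrated offline
    """
    candidates = []
    for _ in range(k_max):
        y = sample_fn()
        # Rejection rule: discard low-quality samples to keep the set small.
        if quality_score(y) >= lambda_reject:
            candidates.append(y)
        # Stopping rule: stop once confidence in the current set is high enough.
        if candidates and set_confidence(candidates) >= lambda_stop:
            break
    return candidates
```

In the paper both thresholds are calibrated jointly on held-out data so that the returned set retains the coverage guarantee while staying small on average.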

The paper asserts that, by the end of the sampling process, the constructed set contains at least one acceptable answer with a high probability, thereby ensuring coverage. Importantly, the approach goes beyond this general assurance to identify specific, independently correct subsets of generated text, which is particularly significant given the susceptibility of LMs to producing hallucinated or incorrect content.
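Schematically (the symbols here are generic stand-ins, not the paper's exact notation), the coverage guarantee has the form:

```latex
% With the calibrated threshold \lambda, the returned candidate set
% \mathcal{C}_\lambda(X) contains at least one acceptable answer
% (A(y, X) = 1) with probability at least 1 - \epsilon.
P\big( \exists\, y \in \mathcal{C}_\lambda(X) : A(y, X) = 1 \big) \ge 1 - \epsilon
```

The component-level result is analogous: among the pieces of text flagged as confident, the fraction that are incorrect is controlled at a user-specified level.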

Key contributions of this work include:

  1. Extension of Conformal Prediction: The paper extends traditional conformal prediction to work with generative models, notably modern LMs, overcoming the challenge of unbounded output spaces.
  2. Practical Application: The researchers provide empirical validation across tasks in open-domain question answering, text summarization, and more domain-specific tasks such as radiology report generation. This showcases the applicability of the approach across varied contexts.
  3. Theoretical Guarantees: The authors provide rigorous theoretical underpinnings that ensure the coverage properties of the conformal sets generated by their method, aligning with conventional conformal prediction methodologies while adapting them for generative settings.
  4. Component Confidence: The work also addresses the challenge of phrase or sentence-level evaluation within LM outputs, which enables the identification of non-hallucinated, credible text segments.
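To make the calibration step concrete, the following is a simplified sketch of how a threshold with a controlled risk could be selected on a calibration set. It uses a Hoeffding-style upper confidence bound as a stand-in for the exact p-values of the paper's Learn-Then-Test-style procedure; the function and argument names are assumptions for illustration.

```python
import math

def calibrate_threshold(losses_per_lambda, epsilon, delta, n):
    """Select a threshold whose risk is controlled at level epsilon.

    losses_per_lambda: dict mapping each candidate lambda to a list of
    0/1 losses on the n calibration examples (1 = the set built with
    that lambda contained no acceptable answer). A Hoeffding bound
    converts the empirical risk into a high-probability upper bound.
    """
    slack = math.sqrt(math.log(1.0 / delta) / (2 * n))
    valid = [lam for lam, losses in losses_per_lambda.items()
             if sum(losses) / n + slack <= epsilon]
    # Among risk-controlling thresholds, prefer the smallest (most
    # permissive) one as a simple tie-break; the paper instead
    # optimizes set size directly subject to the risk constraint.
    return min(valid) if valid else None
```

The actual method tests candidate configurations with statistically valid p-values and a multiple-testing correction, which is what yields the rigorous guarantee rather than a heuristic bound.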

The implications of this research are substantial. The ability to quantify uncertainty and provide confidence-backed output sets from LMs can significantly bolster their reliability and trustworthiness, especially in high-stakes or sensitive applications like medical diagnostics or law.

For future work, one potential direction is the integration of more advanced evaluation metrics within the conformal framework to better handle varied and nuanced correctness criteria. Exploring alternative scoring and selection methods may further improve precision and efficiency. Finally, extending the framework to cross-lingual or multi-modal settings presents a fertile avenue for research, particularly in aligning these guarantees with more diverse datasets and more complex input conditions.

Overall, this research contributes a pivotal step toward bridging the gap between theoretical statistical guarantees and the practical deployment of large-scale language models, enhancing their robustness in real-world applications.

Authors (7)
  1. Victor Quach (4 papers)
  2. Adam Fisch (32 papers)
  3. Tal Schuster (33 papers)
  4. Adam Yala (13 papers)
  5. Jae Ho Sohn (6 papers)
  6. Tommi S. Jaakkola (42 papers)
  7. Regina Barzilay (106 papers)
Citations (45)