Multicalibration for Confidence Scoring in LLMs
Abstract: This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by LLMs. Multicalibration asks for calibration not just marginally, but simultaneously across various intersecting groupings of the data. We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation", i.e., asking the LLM various yes-or-no questions about the prompt. We also develop novel variants of multicalibration algorithms that offer performance improvements by reducing their tendency to overfit. Through systematic benchmarking across various question-answering datasets and LLMs, we show that our techniques yield confidence scores with substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
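The core post-processing idea can be illustrated concretely. Below is a minimal sketch, not the paper's exact algorithm, of HKRR-style multicalibration: given raw confidence scores, binary correctness labels, and a collection of groups (e.g., embedding-space clusters or self-annotation answers), iteratively patch the scores within every (group, score-bin) cell until no cell is miscalibrated beyond a tolerance. All names and parameters here are illustrative assumptions.

```python
import numpy as np

def multicalibrate(scores, labels, groups, n_bins=10, alpha=0.01, max_iters=100):
    """Iteratively patch confidence scores until they are calibrated within
    every (group, score-bin) cell -- a simplified sketch of HKRR-style
    multicalibration post-processing (illustrative, not the paper's variant).

    scores: array of confidences in [0, 1]
    labels: array of 0/1 correctness outcomes
    groups: list of boolean masks, one per grouping of the examples
    """
    p = np.asarray(scores, dtype=float).copy()
    y = np.asarray(labels, dtype=float)
    for _ in range(max_iters):
        updated = False
        for g in groups:
            # Re-bin after each round of patches, since scores move between bins.
            bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
            for b in range(n_bins):
                cell = g & (bins == b)
                if not cell.any():
                    continue
                gap = y[cell].mean() - p[cell].mean()
                if abs(gap) > alpha:  # cell miscalibrated: shift scores toward truth
                    p[cell] = np.clip(p[cell] + gap, 0.0, 1.0)
                    updated = True
        if not updated:  # every (group, bin) cell is within tolerance
            break
    return p
```

On data where a model is, say, uniformly 50% confident but one group of prompts is answered correctly 80% of the time and another 30% of the time, marginal recalibration cannot separate the two; the loop above patches each group's cell separately, which is precisely the guarantee multicalibration adds over plain calibration.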