
Multicalibration for Confidence Scoring in LLMs (2404.04689v1)

Published 6 Apr 2024 in stat.ML, cs.CL, and cs.LG

Abstract: This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by LLMs. Multicalibration asks for calibration not just marginally, but simultaneously across various intersecting groupings of the data. We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation" - querying the LLM by asking it various yes-or-no questions about the prompt. We also develop novel variants of multicalibration algorithms that offer performance improvements by reducing their tendency to overfit. Through systematic benchmarking across various question answering datasets and LLMs, we show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.

Multicalibration Techniques Enhance the Trustworthiness of LLMs' Outputs

Overview

Recent advances in LLMs have benefited many domains by enabling sophisticated text generation and question answering. However, these models often "hallucinate": their outputs deviate from factual or logical accuracy. To address this challenge, the paper applies "multicalibration" to improve the reliability of confidence scores attached to LLM outputs. Unlike traditional calibration, which only requires consistency marginally over the whole data distribution, multicalibration requires calibration simultaneously across intersecting subgroups, yielding a more fine-grained and trustworthy confidence signal.
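
To make the distinction concrete, here is a minimal sketch (not from the paper) of measuring marginal calibration error versus its worst-case analogue over a collection of groups; the function names, binning scheme, and error metric are illustrative assumptions.

```python
import numpy as np

def calibration_error(scores, labels, n_bins=10):
    """Marginal expected calibration error over equal-width score bins."""
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            # weight each bin's |confidence - accuracy| gap by its share of the data
            err += in_bin.mean() * abs(scores[in_bin].mean() - labels[in_bin].mean())
    return err

def multicalibration_error(scores, labels, groups, n_bins=10):
    """Worst calibration error over a collection of (possibly overlapping) groups.

    `groups` maps a group name to a boolean membership mask over the data.
    """
    return max(calibration_error(scores[m], labels[m], n_bins)
               for m in groups.values() if m.any())
```

A marginally calibrated scorer can still be badly miscalibrated on particular subgroups; multicalibration targets exactly that worst-case, group-conditional gap.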

Methodology

Generating Groupings for Multicalibration

The paper introduces strategies for forming subgroups suited to multicalibration in the context of LLMs. Because text data rarely come with explicit features on which to define subgroups, the paper proposes two main approaches for grouping prompt/completion pairs:

  1. Clustering within an Embedding Space: prompt/completion pairs are embedded, and clustering over those embeddings captures semantic and contextual similarities that correlate with the model's propensity to hallucinate.
  2. Self-Annotation: the LLM itself is queried with various yes-or-no questions about each prompt/completion pair, producing binary labels that let the model effectively self-assess and categorize the data (a sketch of both strategies follows this list).
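
The sketch below illustrates both grouping strategies under simplifying assumptions; `ask_llm`, the cluster count, and the question set are hypothetical placeholders rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def embedding_cluster_groups(embeddings, n_clusters=20, seed=0):
    """Group prompt/completion pairs by their k-means cluster in embedding space."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    return {f"cluster_{k}": labels == k for k in range(n_clusters)}

def self_annotation_groups(pairs, ask_llm, questions):
    """Group pairs by the LLM's yes/no answers to annotation questions.

    `ask_llm(question, prompt, completion)` is a hypothetical helper that returns
    True or False for a yes-or-no question about the prompt/completion pair.
    """
    groups = {}
    for q in questions:
        answers = np.array([ask_llm(q, p, c) for p, c in pairs], dtype=bool)
        groups[f"yes: {q}"] = answers
        groups[f"no: {q}"] = ~answers
    return groups
```

Both functions return boolean membership masks in the same format, so groups from the two strategies can simply be merged into one collection before calibration.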

Novel Multicalibration Algorithms

To address the challenge of calibrating confidence scores across these dynamically formed groups, the researchers develop variations of existing multicalibration algorithms that are less prone to overfitting. Systematic evaluation across multiple datasets and LLM configurations demonstrates that these novel algorithms significantly improve the calibration and accuracy of confidence scores.
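
As a rough illustration of how group-wise calibration can be enforced, the following sketch implements a generic multicalibration-style patching loop: it repeatedly finds a (group, score-bin) cell whose empirical accuracy deviates from the predicted confidence and shifts scores in that cell toward the empirical accuracy. This is not the paper's algorithm; the threshold, bin count, and minimum cell size are assumptions, and the cell-size floor is only a crude stand-in for the paper's overfitting mitigations.

```python
import numpy as np

def multicalibrate(scores, labels, groups, n_bins=10, alpha=0.02,
                   min_cell=20, max_rounds=50):
    """Generic multicalibration patching loop (a sketch, not the paper's exact method).

    `groups` maps names to boolean masks; `scores` are confidences in [0, 1];
    `labels` are binary correctness indicators.
    """
    scores = scores.astype(float).copy()
    for _ in range(max_rounds):
        patched = False
        for mask in groups.values():
            bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
            for b in range(n_bins):
                cell = mask & (bins == b)
                if cell.sum() < min_cell:  # skip tiny cells; they mostly add noise
                    continue
                gap = labels[cell].mean() - scores[cell].mean()
                if abs(gap) > alpha:
                    # shift the cell's scores toward its empirical accuracy
                    scores[cell] = np.clip(scores[cell] + gap, 0.0, 1.0)
                    patched = True
        if not patched:
            break
    return scores
```

Running the loop on held-out data rather than the data used to form the groups is one simple way to curb the overfitting tendency the paper highlights.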

Implications and Future Directions

Theoretical Implications: This paper contributes to the understanding of multicalibration in the specific context of LLMs. By extending the concept to dynamically generated groups based on both clustering and self-annotation, it opens new avenues for research in calibrated machine learning methods tailored for generative AI.

Practical Implications: From a practical standpoint, the ability to generate calibrated, group-wise confidence scores for LLM outputs can greatly enhance the trustworthiness and reliability of AI-powered solutions. Such advancements could be pivotal for applications where accuracy and fidelity of generated text are crucial, including but not limited to automated journalism, content creation, and educational tools.

Future Developments: As outlined in the research findings, there is ample scope for future work in refining grouping strategies and further mitigating the risks of overfitting in multicalibration algorithms. Additionally, exploring the application of these techniques across broader types of LLM tasks and outputs, including those beyond text generation, presents an exciting frontier.

Conclusion

In sum, this paper presents a thorough investigation into applying multicalibration techniques for the trustworthy assessment of LLM outputs. By introducing innovative grouping methods and enhancing multicalibration algorithms, the paper marks a significant step towards addressing the challenge of hallucination in AI-generated content. As AI continues to evolve and integrate into various facets of life and industry, ensuring the reliability and trustworthiness of its outputs becomes paramount, making the contributions of this research both timely and impactful.

Authors (4)
  1. Gianluca Detommaso
  2. Martin Bertran
  3. Riccardo Fogliato
  4. Aaron Roth
Citations (10)