Few-Shot Recalibration of Language Models (2403.18286v1)

Published 27 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recent work has uncovered promising ways to extract well-calibrated confidence estimates from language models (LMs), where the model's confidence score reflects how likely it is to be correct. However, while LMs may appear well-calibrated over broad distributions, this often hides significant miscalibration within narrower slices (e.g., systemic over-confidence in math can balance out systemic under-confidence in history, yielding perfect calibration in aggregate). To attain well-calibrated confidence estimates for any slice of a distribution, we propose a new framework for few-shot slice-specific recalibration. Specifically, we train a recalibration model that takes in a few unlabeled examples from any given slice and predicts a curve that remaps confidence scores to be more accurate for that slice. Our trained model can recalibrate for arbitrary new slices, without using any labeled data from that slice. This enables us to identify domain-specific confidence thresholds above which the LM's predictions can be trusted, and below which it should abstain. Experiments show that our few-shot recalibrator consistently outperforms existing calibration methods, for instance improving calibration error for PaLM2-Large on MMLU by 16%, as compared to temperature scaling.

Few-Shot Recalibration for Precision-Centric LLMs

Introduction to Recalibration Needs

LLMs have achieved a significant level of accuracy and reliability across a broad spectrum of domains and tasks. However, while these models may exhibit well-calibrated confidence estimates across a combined distribution of tasks, subtler discrepancies emerge upon closer inspection of individual slices or domains within this distribution. These discrepancies manifest as the model being miscalibrated—showing either overconfidence or underconfidence—on these finer-grained slices. This miscalibration, if unaddressed, can limit the practical reliability of LLMs when deployed in real-world scenarios where domain-specific confidence is crucial for decision-making processes.
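
To make this concrete, the following minimal sketch (synthetic numbers, not results from the paper) shows how overconfidence on one slice and underconfidence on another can cancel out, leaving the aggregate expected calibration error (ECE) near zero while each slice remains badly miscalibrated:

```python
# Illustrative sketch (synthetic data, not from the paper): an LM can look
# well-calibrated in aggregate while being miscalibrated on individual slices.
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

rng = np.random.default_rng(0)
n = 20_000
# Both slices report ~0.75 confidence, but accuracy differs:
# "math" is overconfident (60% correct), "history" is underconfident (90% correct).
math_conf, math_correct = np.full(n, 0.75), rng.random(n) < 0.60
hist_conf, hist_correct = np.full(n, 0.75), rng.random(n) < 0.90

print("math ECE:     ", round(expected_calibration_error(math_conf, math_correct), 3))
print("history ECE:  ", round(expected_calibration_error(hist_conf, hist_correct), 3))
print("aggregate ECE:", round(expected_calibration_error(
    np.concatenate([math_conf, hist_conf]),
    np.concatenate([math_correct, hist_correct])), 3))
# Per-slice ECE comes out near 0.15; aggregate ECE comes out near 0.0.
```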

Our Contribution: Few-Shot Recalibration

In response to the need for fine-grained calibration, we propose a few-shot recalibration framework. This framework trains a recalibration model that adjusts a base LLM's confidence estimates for any given slice of a distribution, using only a few unlabeled examples from that slice. Notably, the recalibrator requires no labeled data from the new slice. The recalibration process is geared towards identifying domain-specific confidence thresholds, which delineate the confidence range within which the model's predictions can be trusted.
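
As an illustration of the intended data flow, the hypothetical interface below (class and method names are our own, not the authors') takes a few unlabeled slice examples and returns a predicted precision-versus-confidence-threshold curve for the base LM on that slice:

```python
# Hypothetical interface sketch for the few-shot recalibrator; class and
# method names are illustrative assumptions, not the authors' code.
from typing import Callable, Sequence
import numpy as np

class FewShotRecalibrator:
    """Takes a handful of *unlabeled* examples from a slice and predicts how
    precise the base LM's answers are at each confidence level on that slice."""

    def __init__(self, curve_predictor: Callable[[Sequence[str], np.ndarray], np.ndarray]):
        # A trained model that reads the k unlabeled examples and emits a curve
        # (trained on simulated slices; see the methodology section below).
        self.curve_predictor = curve_predictor

    def predict_precision_curve(self, unlabeled_examples: Sequence[str],
                                thresholds=None) -> np.ndarray:
        """Predicted precision of the base LM at each confidence threshold,
        without requiring any labels from the new slice."""
        if thresholds is None:
            thresholds = np.linspace(0.0, 1.0, 101)
        return np.asarray(self.curve_predictor(unlabeled_examples, thresholds))
```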

Methodology Explained

Our recalibration approach hinges on predicting precision curves as a function of confidence score. Unlike calibration curves, precision curves involve no arbitrary binning decisions, making them more stable and reliable recalibration targets. We train our recalibrator on synthetic data, simulating diverse slices by forming domain mixtures from a corpus of labeled examples. Training minimizes the discrepancy between the predicted precision curve for a slice and its ground-truth counterpart, which is derived from the base LLM's performance on the labeled examples within that slice.
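
The sketch below illustrates these two ingredients, the ground-truth precision curve and the synthetic domain-mixture slices, under assumed data formats (per-domain arrays of base-LM confidences and correctness labels); it paraphrases the described setup and is not the authors' implementation:

```python
# Sketch of the two training ingredients under assumed data formats (a dict
# mapping each domain to arrays of base-LM confidences and 0/1 correctness);
# a paraphrase of the described setup, not the authors' implementation.
import numpy as np

def precision_curve(conf, correct, thresholds):
    """Ground-truth precision curve for a slice: among predictions with
    confidence >= t, the fraction that are correct. No binning is involved."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    curve = []
    for t in thresholds:
        kept = conf >= t
        # If nothing clears the threshold, carry the previous value forward
        # (an edge-case convention we assume here).
        curve.append(correct[kept].mean() if kept.any() else (curve[-1] if curve else 1.0))
    return np.array(curve)

def sample_synthetic_slice(per_domain_data, rng, k_domains=2, n_examples=500):
    """Simulate a slice as a random mixture of domains: pick a few domains,
    draw mixture weights, then sample labeled examples from each in proportion."""
    domains = rng.choice(list(per_domain_data), size=k_domains, replace=False)
    weights = rng.dirichlet(np.ones(k_domains))
    conf, correct = [], []
    for d, w in zip(domains, weights):
        c, y = per_domain_data[d]
        idx = rng.choice(len(c), size=max(1, int(round(w * n_examples))), replace=True)
        conf.append(np.asarray(c)[idx])
        correct.append(np.asarray(y)[idx])
    return np.concatenate(conf), np.concatenate(correct)

# Each simulated slice yields one training pair: a few unlabeled examples from
# the slice as input, and its ground-truth precision curve as the target that
# the recalibrator's predicted curve is trained to match.
```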

Analyzing the Results

Upon evaluation, our few-shot recalibrator consistently surpasses traditional calibration and recalibration methods. It demonstrates superior performance in both identifying confidence thresholds that align with target precision levels and minimizing calibration error across different slices. Remarkably, our approach maintains its efficacy even when extended to slices comprising domains that were unseen during the recalibration model's training phase. These results underscore the recalibrator's adaptability and its potential to enhance the precision and reliability of LLMs across a diverse array of domains.
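
To show how a slice-specific precision curve supports threshold selection, here is a small usage sketch (the curve values and the 0.90 precision target are illustrative, not numbers from the paper): the lowest confidence threshold whose predicted precision clears the target is chosen, and the LM abstains below it.

```python
# Usage sketch: read a slice-specific threshold off a predicted precision
# curve, then abstain below it. The toy curve and the 0.90 target are
# illustrative assumptions, not numbers from the paper.
import numpy as np

def pick_threshold(thresholds, predicted_precision, target_precision):
    """Lowest confidence threshold whose predicted precision meets the target;
    picking the lowest keeps coverage (how often the LM answers) as high as
    possible while still aiming for the target precision."""
    ok = np.flatnonzero(np.asarray(predicted_precision) >= target_precision)
    return float(thresholds[ok[0]]) if ok.size else None

def answer_or_abstain(confidence, threshold):
    return "answer" if threshold is not None and confidence >= threshold else "abstain"

thresholds = np.linspace(0.0, 1.0, 11)
predicted = np.array([0.62, 0.64, 0.67, 0.71, 0.76, 0.82, 0.87, 0.91, 0.94, 0.97, 0.99])

tau = pick_threshold(thresholds, predicted, target_precision=0.90)
print("chosen threshold:", round(tau, 2))   # -> 0.7
print(answer_or_abstain(0.85, tau))         # -> answer
print(answer_or_abstain(0.55, tau))         # -> abstain
```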

Future Directions and Implications

The introduction of few-shot recalibration presents a meaningful advance in the quest for domain-specific accuracy and reliability of LLMs. By enabling precise control over the confidence threshold above which predictions are considered dependable, our framework paves the way for more nuanced and context-aware applications of these models. Future endeavors could explore the extension of this recalibration framework to other model architectures, including those specializing in generative tasks, and its applicability in multimodal contexts.

Closing Thoughts

As LLMs continue to evolve, ensuring their reliable performance across the spectrum of potential applications remains paramount. The few-shot recalibration framework introduced in this paper represents a significant step towards achieving this goal, offering a viable methodology for tuning these models to exhibit high precision in domain-specific contexts. Its application holds the promise of enhancing the practical utility and trustworthiness of LLMs, making them more adaptable and effective tools in a wide range of scenarios.

Authors
  1. Xiang Lisa Li
  2. Urvashi Khandelwal
  3. Kelvin Guu