Few-Shot Recalibration of Language Models (2403.18286v1)

Published 27 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recent work has uncovered promising ways to extract well-calibrated confidence estimates from language models (LMs), where the model's confidence score reflects how likely it is to be correct. However, while LMs may appear well-calibrated over broad distributions, this often hides significant miscalibration within narrower slices (e.g., systemic over-confidence in math can balance out systemic under-confidence in history, yielding perfect calibration in aggregate). To attain well-calibrated confidence estimates for any slice of a distribution, we propose a new framework for few-shot slice-specific recalibration. Specifically, we train a recalibration model that takes in a few unlabeled examples from any given slice and predicts a curve that remaps confidence scores to be more accurate for that slice. Our trained model can recalibrate for arbitrary new slices, without using any labeled data from that slice. This enables us to identify domain-specific confidence thresholds above which the LM's predictions can be trusted, and below which it should abstain. Experiments show that our few-shot recalibrator consistently outperforms existing calibration methods, for instance improving calibration error for PaLM2-Large on MMLU by 16%, as compared to temperature scaling.

Few-Shot Recalibration for Precision-Centric LLMs

Introduction to Recalibration Needs

LLMs have achieved a significant level of accuracy and reliability across a broad spectrum of domains and tasks. However, while these models may exhibit well-calibrated confidence estimates across a combined distribution of tasks, subtler discrepancies emerge upon closer inspection of individual slices or domains within this distribution. These discrepancies manifest as the model being miscalibrated—showing either overconfidence or underconfidence—on these finer-grained slices. This miscalibration, if unaddressed, can limit the practical reliability of LLMs when deployed in real-world scenarios where domain-specific confidence is crucial for decision-making processes.
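
To make this concrete, the following minimal sketch (synthetic numbers, not results from the paper) shows how overconfidence on one slice and underconfidence on another can cancel out, leaving the aggregate expected calibration error (ECE) near zero while each slice remains badly miscalibrated:

```python
# Illustrative sketch (synthetic data, not from the paper): an LM can look
# well-calibrated in aggregate while being miscalibrated on individual slices.
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

rng = np.random.default_rng(0)
n = 20_000
# Both slices report ~0.75 confidence, but accuracy differs:
# "math" is overconfident (60% correct), "history" is underconfident (90% correct).
math_conf, math_correct = np.full(n, 0.75), rng.random(n) < 0.60
hist_conf, hist_correct = np.full(n, 0.75), rng.random(n) < 0.90

print("math ECE:     ", round(expected_calibration_error(math_conf, math_correct), 3))
print("history ECE:  ", round(expected_calibration_error(hist_conf, hist_correct), 3))
print("aggregate ECE:", round(expected_calibration_error(
    np.concatenate([math_conf, hist_conf]),
    np.concatenate([math_correct, hist_correct])), 3))
# Per-slice ECE comes out near 0.15; aggregate ECE comes out near 0.0.
```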

Our Contribution: Few-Shot Recalibration

In response to the need for fine-grained calibration, we propose a few-shot recalibration framework. This framework trains a recalibration model that adjusts a base LLM's confidence estimates for any given slice of a distribution, using only a few unlabeled examples from that slice. Notably, the recalibrator requires no labeled data from the new slice. The recalibration process is geared towards identifying domain-specific confidence thresholds, which delineate the confidence range within which the model's predictions can be trusted.
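
As an illustration of the intended data flow, the hypothetical interface below (class and method names are our own, not the authors') takes a few unlabeled slice examples and returns a predicted precision-versus-confidence-threshold curve for the base LM on that slice:

```python
# Hypothetical interface sketch for the few-shot recalibrator; class and
# method names are illustrative assumptions, not the authors' code.
from typing import Callable, Sequence
import numpy as np

class FewShotRecalibrator:
    """Takes a handful of *unlabeled* examples from a slice and predicts how
    precise the base LM's answers are at each confidence level on that slice."""

    def __init__(self, curve_predictor: Callable[[Sequence[str], np.ndarray], np.ndarray]):
        # A trained model that reads the k unlabeled examples and emits a curve
        # (trained on simulated slices; see the methodology section below).
        self.curve_predictor = curve_predictor

    def predict_precision_curve(self, unlabeled_examples: Sequence[str],
                                thresholds=None) -> np.ndarray:
        """Predicted precision of the base LM at each confidence threshold,
        without requiring any labels from the new slice."""
        if thresholds is None:
            thresholds = np.linspace(0.0, 1.0, 101)
        return np.asarray(self.curve_predictor(unlabeled_examples, thresholds))
```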

Methodology Explained

Our recalibration approach hinges on predicting precision curves as a function of confidence score. Unlike calibration curves, precision curves involve no arbitrary binning decisions, making them more stable and reliable recalibration targets. We train our recalibrator on synthetic data, simulating diverse slices by forming domain mixtures from a corpus of labeled examples. Training minimizes the discrepancy between the predicted precision curve for a slice and its ground-truth counterpart, which is derived from the base LLM's performance on the labeled examples within that slice.
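
The sketch below illustrates these two ingredients, the ground-truth precision curve and the synthetic domain-mixture slices, under assumed data formats (per-domain arrays of base-LM confidences and correctness labels); it paraphrases the described setup and is not the authors' implementation:

```python
# Sketch of the two training ingredients under assumed data formats (a dict
# mapping each domain to arrays of base-LM confidences and 0/1 correctness);
# a paraphrase of the described setup, not the authors' implementation.
import numpy as np

def precision_curve(conf, correct, thresholds):
    """Ground-truth precision curve for a slice: among predictions with
    confidence >= t, the fraction that are correct. No binning is involved."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    curve = []
    for t in thresholds:
        kept = conf >= t
        # If nothing clears the threshold, carry the previous value forward
        # (an edge-case convention we assume here).
        curve.append(correct[kept].mean() if kept.any() else (curve[-1] if curve else 1.0))
    return np.array(curve)

def sample_synthetic_slice(per_domain_data, rng, k_domains=2, n_examples=500):
    """Simulate a slice as a random mixture of domains: pick a few domains,
    draw mixture weights, then sample labeled examples from each in proportion."""
    domains = rng.choice(list(per_domain_data), size=k_domains, replace=False)
    weights = rng.dirichlet(np.ones(k_domains))
    conf, correct = [], []
    for d, w in zip(domains, weights):
        c, y = per_domain_data[d]
        idx = rng.choice(len(c), size=max(1, int(round(w * n_examples))), replace=True)
        conf.append(np.asarray(c)[idx])
        correct.append(np.asarray(y)[idx])
    return np.concatenate(conf), np.concatenate(correct)

# Each simulated slice yields one training pair: a few unlabeled examples from
# the slice as input, and its ground-truth precision curve as the target that
# the recalibrator's predicted curve is trained to match.
```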

Analyzing the Results

Upon evaluation, our few-shot recalibrator consistently surpasses traditional calibration and recalibration methods. It demonstrates superior performance in both identifying confidence thresholds that align with target precision levels and minimizing calibration error across different slices. Remarkably, our approach maintains its efficacy even when extended to slices comprising domains that were unseen during the recalibration model's training phase. These results underscore the recalibrator's adaptability and its potential to enhance the precision and reliability of LLMs across a diverse array of domains.
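
To show how a slice-specific precision curve supports threshold selection, here is a small usage sketch (the curve values and the 0.90 precision target are illustrative, not numbers from the paper): the lowest confidence threshold whose predicted precision clears the target is chosen, and the LM abstains below it.

```python
# Usage sketch: read a slice-specific threshold off a predicted precision
# curve, then abstain below it. The toy curve and the 0.90 target are
# illustrative assumptions, not numbers from the paper.
import numpy as np

def pick_threshold(thresholds, predicted_precision, target_precision):
    """Lowest confidence threshold whose predicted precision meets the target;
    picking the lowest keeps coverage (how often the LM answers) as high as
    possible while still aiming for the target precision."""
    ok = np.flatnonzero(np.asarray(predicted_precision) >= target_precision)
    return float(thresholds[ok[0]]) if ok.size else None

def answer_or_abstain(confidence, threshold):
    return "answer" if threshold is not None and confidence >= threshold else "abstain"

thresholds = np.linspace(0.0, 1.0, 11)
predicted = np.array([0.62, 0.64, 0.67, 0.71, 0.76, 0.82, 0.87, 0.91, 0.94, 0.97, 0.99])

tau = pick_threshold(thresholds, predicted, target_precision=0.90)
print("chosen threshold:", round(tau, 2))   # -> 0.7
print(answer_or_abstain(0.85, tau))         # -> answer
print(answer_or_abstain(0.55, tau))         # -> abstain
```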

Future Directions and Implications

The introduction of few-shot recalibration presents a meaningful advance in the quest for domain-specific accuracy and reliability of LLMs. By enabling precise control over the confidence threshold above which predictions are considered dependable, our framework paves the way for more nuanced and context-aware applications of these models. Future endeavors could explore the extension of this recalibration framework to other model architectures, including those specializing in generative tasks, and its applicability in multimodal contexts.

Closing Thoughts

As LLMs continue to evolve, ensuring their reliable performance across the spectrum of potential applications remains paramount. The few-shot recalibration framework introduced in this paper represents a significant step towards achieving this goal, offering a viable methodology for tuning these models to exhibit high precision in domain-specific contexts. Its application holds the promise of enhancing the practical utility and trustworthiness of LLMs, making them more adaptable and effective tools in a wide range of scenarios.

Authors
  1. Xiang Lisa Li
  2. Urvashi Khandelwal
  3. Kelvin Guu