- The paper introduces Uncertainty Distillation, a method that trains language models to verbalize calibrated semantic confidence alongside their answers.
- It employs Monte Carlo sampling with semantic normalization and post-hoc calibration to convert frequency estimates into accurate confidence scores.
- Empirical results show improved AUROC (e.g., 0.805 on SQuAD vs 0.771 for a calibrated lexical baseline), highlighting both efficiency and interpretability.
LLMs are increasingly used for tasks requiring factual accuracy, such as question answering. However, these models often struggle to communicate how likely their answers are to be correct, sometimes producing incorrect answers with high apparent confidence. While methods exist to quantify uncertainty, many focus on lexical uncertainty, estimating confidence in the specific text generated. This is insufficient when the same meaning can be expressed in multiple ways, or for complex, multi-part answers. Estimating semantic uncertainty – confidence in the underlying meaning regardless of phrasing – is more desirable but often requires computationally expensive inference-time techniques like sampling multiple responses.
The paper "Uncertainty Distillation: Teaching LLMs to Express Semantic Confidence" (2503.14749) proposes Uncertainty Distillation, a practical method to train LLMs to verbalize calibrated semantic confidence alongside their answers. The core idea is to generate training data by estimating semantic uncertainty for a given model on a calibration dataset and then fine-tuning the model to output this estimated uncertainty using natural language phrases (e.g., "very high confidence"). This allows the model to express semantic confidence in a single inference pass, making it efficient for real-world applications.
The Uncertainty Distillation process involves three main steps:
- Monte Carlo Sampling with Semantic Normalization: For a held-out calibration dataset S_cal = {X_cal, Y_cal}, the base LLM is sampled N times for each input x ∈ X_cal, yielding candidate answers ŷ_1, …, ŷ_N. A crucial step is semantic normalization, which groups semantically equivalent answers. For short QA tasks, simple normalization such as removing punctuation and standardizing capitalization may suffice; for more complex outputs, techniques like Natural Language Inference (NLI) or an LLM-as-judge can be used to cluster semantically similar answers. After normalization, the relative frequency f of each unique semantic answer group is computed, serving as an initial estimate of the model's confidence in that semantic answer. The authors use N = 1000 samples in their experiments, finding diminishing returns beyond this number. (A sketch of this step appears after this list.)
- Post-hoc Calibration: The raw frequency estimates f are not necessarily well-calibrated probabilities. To address this, a post-hoc calibration model is trained on the calibration dataset: for each sampled answer ŷ from input x, its frequency f is paired with a binary label indicating whether the gold answer y from Y_cal falls into the semantic group of ŷ. The resulting (f, correctness) pairs are used to fit a calibration function c, for which the paper uses isotonic regression, mapping each frequency f to a calibrated probability p = c(f). The authors note that if the base model is already well-calibrated on the target domain, this step may offer little benefit, and it can even be detrimental if the calibration data overlaps with the base model's training data. (See the second sketch after this list.)
- Self-annotation and Fine-tuning: The calibrated probabilities p for the answers sampled on the calibration dataset are mapped to discrete confidence bins (e.g., five bins from "very low confidence" to "very high confidence"). A new training dataset is built by taking each input x and augmenting the corresponding ground-truth answer y (or the model's sampled answer ŷ) with the verbalized confidence bin b; for example, a correct answer y with a high calibrated probability p becomes the training example (x, y [very high confidence]). The authors find that including incorrect answers with their corresponding low-confidence verbalizations during fine-tuning can improve AUROC but typically decreases overall accuracy, so a hyperparameter controls how many incorrect examples are included. The original LLM is then fine-tuned on this self-annotated dataset; for instruction-tuned models, the prompt is augmented to explicitly ask for a confidence statement. (The third sketch after this list illustrates the self-annotation format.)
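A minimal sketch of the first step, assuming the N samples for one input have already been drawn from the base model. The `normalize` and `semantic_frequencies` helpers and the example strings are illustrative, implementing only the simple short-QA normalization described above:

```python
from collections import Counter

def normalize(answer: str) -> str:
    # Simple short-QA normalization: lowercase, drop punctuation, collapse whitespace.
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def semantic_frequencies(samples: list[str]) -> dict[str, float]:
    # Group the N sampled answers by normalized form and return relative frequencies f.
    counts = Counter(normalize(s) for s in samples)
    return {answer: count / len(samples) for answer, count in counts.items()}

# Illustrative stand-in for N = 1000 generations from the base model on one input x.
samples = ["Paris", "paris.", "PARIS"] * 300 + ["Lyon"] * 100
print(semantic_frequencies(samples))  # {'paris': 0.9, 'lyon': 0.1}
```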
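For the post-hoc calibration step, a sketch using scikit-learn's isotonic regression (the paper's choice of calibrator); the (frequency, correctness) pairs below are made up for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative (frequency, correctness) pairs gathered over the calibration set:
# f is the semantic-group frequency of a sampled answer, and correctness records
# whether the gold answer falls into that answer's semantic group.
freqs = np.array([0.05, 0.10, 0.30, 0.55, 0.80, 0.95])
correct = np.array([0, 0, 0, 1, 1, 1])

# Fit the calibration function c so that p = c(f) is a calibrated probability.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(freqs, correct)

print(calibrator.predict(np.array([0.70])))  # calibrated confidence for a new frequency
```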
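And for self-annotation, a sketch of mapping calibrated probabilities to verbalized bins and building fine-tuning targets. The five-bin scheme and the bracketed "answer [confidence phrase]" format follow the description above; the exact bin edges and the intermediate phrases are assumptions:

```python
import bisect

# Illustrative bin edges and phrases; only "very low"/"very high confidence" and the
# five-bin scheme come from the paper's description, the rest are assumptions.
BIN_EDGES = [0.2, 0.4, 0.6, 0.8]
BIN_PHRASES = ["very low confidence", "low confidence", "medium confidence",
               "high confidence", "very high confidence"]

def verbalize(p: float) -> str:
    # Map a calibrated probability to its confidence phrase.
    return BIN_PHRASES[bisect.bisect_right(BIN_EDGES, p)]

def self_annotate(x: str, answer: str, p: float) -> dict:
    # One fine-tuning example: input x -> "answer [confidence phrase]".
    return {"input": x, "target": f"{answer} [{verbalize(p)}]"}

print(self_annotate("What is the capital of France?", "Paris", 0.93))
# {'input': 'What is the capital of France?', 'target': 'Paris [very high confidence]'}
```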
At inference time, the fine-tuned model directly generates the answer and the verbalized confidence statement in a single output sequence. This avoids the computational cost of sampling multiple outputs, semantic normalization, and running a separate calibration model for each query, making it significantly more efficient than inference-time sampling methods.
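As an illustration of how little post-processing the single-pass output needs, a downstream application only has to split the generated sequence into the answer and the confidence phrase. The bracketed format mirrors the sketch above, and the parsing helper is an assumption rather than part of the paper:

```python
import re
from typing import Optional, Tuple

def split_answer_and_confidence(generation: str) -> Tuple[str, Optional[str]]:
    # Split "answer [confidence phrase]" into its parts; the phrase is None if absent.
    match = re.match(r"^(.*?)\s*\[([^\]]+)\]\s*$", generation.strip())
    return (match.group(1), match.group(2)) if match else (generation.strip(), None)

print(split_answer_and_confidence("Paris [very high confidence]"))
# ('Paris', 'very high confidence')
```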
The authors evaluate Uncertainty Distillation on T5-base fine-tuned on SQuAD and a custom Instruct-T5 model (T5-Large instruction-tuned on SNI tasks excluding the test set) on a set of held-out SNI QA tasks. They also test on FLAN-T5, which has seen the test tasks during its instruction tuning, to analyze the impact of calibration data assumptions. Metrics include AUROC (Area Under the Receiver Operating Characteristic curve), overall accuracy, and high-confidence accuracy. They compare against a lexical baseline which uses calibrated average token probability.
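As a rough sketch of how these metrics can be computed from per-example (verbalized confidence, correctness) pairs, the snippet below uses made-up numbers; "high-confidence accuracy" is taken here as accuracy within the top bin, which is one reasonable reading rather than the paper's exact protocol:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative per-example results: confidence bin index (0 = very low, 4 = very high)
# and whether the generated answer was judged semantically correct.
bins = np.array([4, 4, 3, 1, 0, 2, 4, 1])
correct = np.array([1, 1, 1, 0, 0, 1, 0, 0])

print(roc_auc_score(correct, bins))  # AUROC: how well verbalized confidence ranks correct answers
print(correct[bins == 4].mean())     # high-confidence accuracy, here: accuracy in the top bin
```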
Key findings from the experiments include:
- Uncertainty Distillation achieves comparable or better AUROC than the lexical baseline, even on short answers where the lexical method is strong (0.805 vs 0.771 on SQuAD, 0.751 vs 0.667 on Instruct-T5 SNI). This demonstrates its effectiveness in teaching semantic confidence without relying on token probabilities at inference.
- The verbalized confidences produced by Uncertainty Distillation are well-aligned with observed accuracy within each confidence bin, showing interpretability.
- Using Uncertainty Distillation can impact overall model accuracy. On SQuAD, T5-base accuracy dropped, while on Instruct-T5 SNI, accuracy increased (likely due to additional fine-tuning on unseen tasks).
- There appears to be a practical trade-off where interventions improving AUROC (like including incorrect examples) tend to decrease overall accuracy.
- When the calibration data is part of the base model's instruction-tuning set (FLAN-T5 case), Uncertainty Distillation still works, but the lexical baseline, benefiting from training on the data, performs similarly or slightly better. Post-hoc calibration helps less or might even hurt in this scenario.
- The number of samples N for Monte Carlo estimation impacts performance, with diminishing returns beyond ~1000.
- Comparison to a semantic clustering baseline shows that Uncertainty Distillation achieves better or comparable AUROC with significantly less inference compute (a single generation per query vs. 20 samples for the clustering baseline).
Practical implementation considerations include:
- Computational Cost: Generating the calibration data via sampling is the most expensive part of the pipeline, and fine-tuning adds a one-time training cost on top of the base model. Inference, however, remains a single efficient pass.
- Semantic Normalization: Choosing or developing an effective semantic normalization function appropriate for the task is critical, especially for complex outputs.
- Calibration Data: Access to a held-out calibration dataset is ideal for diagnosing and correcting base model miscalibration. If data contamination is likely, ablating the post-hoc calibration step might be necessary.
- Incorrect Examples: Carefully tuning the hyperparameter for including incorrect examples during fine-tuning is needed to balance calibration improvement (AUROC) against potential drops in overall accuracy.
- Verbalization: The specific phrases used for confidence bins can be arbitrary but should be clearly defined and consistent.
Uncertainty Distillation provides a promising approach for deploying LLMs that can communicate semantic confidence efficiently and interpretably, addressing a key limitation for trustworthy AI applications in domains like factual QA. While the current work focuses on relatively simple tasks, the method is designed to be adaptable, with the potential to scale to more complex tasks given appropriate semantic normalization techniques.