Calibrating LLM Confidence Through Perturbation Analysis
The paper "Calibrating LLM Confidence by Probing Perturbed Representation Stability" introduces a novel method—CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability)—for improving confidence calibration in LLMs. This technique addresses a critical challenge faced by LLMs: the misalignment between their confidence estimates and actual correctness. By offering a new approach to confidence estimation leveraging internal representation stability, the authors present significant insights and improvements relative to existing methodologies.
Methodology
CCPS is grounded in the hypothesis that an LLM's confidence in its answers is reflected in how stable its internal representations remain under adversarial perturbation. The approach involves three main stages (sketched in code after the list):
- Adversarial Perturbation: For each generated token in the LLM's output, CCPS perturbs the final hidden state along the gradient direction that increases the loss, thereby reducing the probability assigned to the generated token.
- Feature Extraction: CCPS extracts features that quantify the impact of these perturbations on the LLM's internal states. These features are designed to capture both the model’s initial response probabilities and their changes due to perturbations.
- Confidence Classification: The extracted feature vectors serve as inputs to a lightweight classification model predicting the correctness probability of the LLM’s answer.
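To make the three stages concrete, here is a minimal PyTorch sketch of the idea. It is an illustration under stated assumptions, not the authors' implementation: the single-step FGSM-style perturbation, the six hand-picked features, the mean pooling over answer tokens, and all names (`perturb_hidden`, `extract_features`, `confidence`, `epsilon`) are hypothetical choices, and in practice the classifier would be trained on answers labeled for correctness.

```python
# Illustrative sketch of the CCPS idea; details are assumptions, not the
# paper's exact implementation.
import torch
import torch.nn.functional as F

def perturb_hidden(h, lm_head, token_id, epsilon=0.1):
    """Perturb a final hidden state along the loss-increasing gradient.

    h:        (hidden_dim,) final-layer hidden state for one generated token
    lm_head:  the model's output projection (hidden_dim -> vocab_size)
    token_id: the token the model actually emitted
    """
    h = h.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(lm_head(h).unsqueeze(0), torch.tensor([token_id]))
    loss.backward()
    with torch.no_grad():
        # Step toward higher loss, i.e. lower probability for the emitted
        # token (an FGSM-style, gradient-normalized perturbation).
        direction = h.grad / (h.grad.norm() + 1e-8)
        return h + epsilon * direction

def extract_features(h, h_pert, lm_head, token_id):
    """Quantify how the perturbation moved the output distribution."""
    with torch.no_grad():
        logp = F.log_softmax(lm_head(h), dim=-1)
        logp_p = F.log_softmax(lm_head(h_pert), dim=-1)
        p, p_p = logp.exp(), logp_p.exp()
        return torch.stack([
            logp[token_id],                        # log-prob before perturbation
            logp_p[token_id],                      # log-prob after perturbation
            logp[token_id] - logp_p[token_id],     # drop caused by perturbation
            -(p * logp).sum(),                     # entropy before
            -(p_p * logp_p).sum(),                 # entropy after
            F.kl_div(logp_p, p, reduction="sum"),  # KL(original || perturbed)
        ])

# A lightweight two-layer probe; 6 = number of features above. In practice
# it would be trained on (features, answer-correctness) pairs.
classifier = torch.nn.Sequential(
    torch.nn.Linear(6, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

def confidence(hiddens, token_ids, lm_head):
    """Mean-pool per-token features over the answer, then classify."""
    feats = torch.stack([
        extract_features(h, perturb_hidden(h, lm_head, t), lm_head, t)
        for h, t in zip(hiddens, token_ids)])
    return torch.sigmoid(classifier(feats.mean(dim=0))).item()
```

Note that `perturb_hidden` never updates the model itself: the gradient is taken with respect to the hidden state alone, which keeps the probing step cheap relative to fine-tuning the LLM.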
Empirical Evidence
The evaluations indicate that CCPS substantially improves calibration across multiple benchmarks, including MMLU and MMLU-Pro, in both multiple-choice and open-ended question formats. Notably, CCPS reduces Expected Calibration Error (ECE) by approximately 55%, lowers Brier score by 21%, and improves accuracy by 5 percentage points relative to prior methods. These results underscore the robustness and efficiency of CCPS, a model-agnostic enhancement applicable across LLM families, including Meta's Llama, Qwen, and Mistral models.
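For reference, the two headline metrics are straightforward to compute from predicted confidences and binary correctness labels. A minimal NumPy sketch follows; the 10 equal-width bins for ECE are a common convention assumed here, not necessarily the paper's setup.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between each
    bin's accuracy and its mean confidence, weighted by bin size."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def brier_score(conf, correct):
    """Mean squared error between confidence and 0/1 correctness."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    return float(np.mean((conf - correct) ** 2))

# Toy example: three correct answers and one confident mistake.
conf    = [0.9, 0.8, 0.95, 0.6]
correct = [1,   1,   0,    1]
print(expected_calibration_error(conf, correct), brier_score(conf, correct))
```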
Implications and Future Directions
The implications of CCPS are both practical and theoretical. Practically, CCPS improves the reliability of LLMs in high-stakes fields such as finance and medicine, where trustworthiness is paramount. Theoretically, the idea that stability under perturbation can serve as a proxy for confidence may prompt further research into stability-centric methods in machine learning. Future work could apply perturbation analysis to other predictive tasks or integrate it into the generation process to mitigate hallucination and improve the robustness of LLM outputs.
Conclusion
CCPS represents a promising advance in the ongoing effort to improve LLM reliability. By focusing on the perturbation-induced stability of internal representations, the method achieves substantial quantitative gains on calibration metrics while introducing a principled way to probe LLM confidence. Its success across diverse model architectures opens the door to further applications in AI model calibration.