
ConfTuner: Calibrating LLM Verbalized Confidence

Updated 2 September 2025
  • ConfTuner is a fine-tuning framework that calibrates large language models’ verbalized confidence by aligning expressed certainty with actual correctness probabilities.
  • It employs a novel tokenized Brier score loss based on proper scoring rules, incentivizing models to generate accurate, linguistically expressed uncertainty.
  • Empirical results show reductions of up to 54.7% in Expected Calibration Error and AUROC gains of up to 14.4%, demonstrating efficacy in both white-box and black-box settings.

ConfTuner is a fine-tuning framework for calibrating the verbalized confidence of LLMs. The primary goal is to incentivize LLMs to express degrees of confidence in textual form that accurately reflect their actual probability of correctness. ConfTuner directly targets the overconfidence problem—the tendency of models to assign high confidence to incorrect answers—which is particularly detrimental in high-stakes domains such as law, healthcare, and scientific research. Unlike prior attempts relying on prompt engineering or heuristics, ConfTuner is grounded in the mathematical theory of proper scoring rules and introduces a novel loss function, the tokenized Brier score, to align the model’s reported confidence with true uncertainty.

1. Motivation and Problem Setting

LLMs deployed in critical domains require reliably calibrated uncertainty expressed in language, e.g., phrases such as, "I am 80% confident that...". Existing models tend to manifest overconfidence, stating high confidence in their outputs regardless of actual correctness, diminishing reliability and trust. Previous calibration methods primarily involved prompt engineering or learning from heuristically generated uncertainty estimates, techniques that neither generalize well nor robustly incentivize truthful confidence expressions. ConfTuner was developed to provide a theoretically principled and practically efficient approach requiring neither external ground-truth confidence scores nor costly proxy estimation procedures (Li et al., 26 Aug 2025).

2. Tokenized Brier Score: Theoretical Foundations

ConfTuner introduces the "tokenized Brier score" as its central loss function. The standard Brier score, widely used as a proper scoring rule in probabilistic calibration, is adapted to the setting of discrete confidence tokens. Instead of producing direct numeric probabilities, the LLM is trained to generate a distribution q over a pre-defined set of confidence tokens (e.g., integer percentages or a 0–9 scale).

Given a model output distribution q over N+1 tokens and an answer-correctness indicator y ∈ {0, 1}, the tokenized Brier score is defined as:

\ell(q, y) = \sum_{i=0}^{N} q_i \, (y - i/N)^2

where i/N is the confidence value associated with token i.

The paper provides a formal theorem and proof that this loss is a proper scoring rule: the expected risk (loss) is minimized when the model’s verbalized confidence matches its actual probability of being correct (Li et al., 26 Aug 2025). This property guarantees incentive alignment—models are optimally rewarded for honest calibration.
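As a quick numerical illustration of this properness property, the sketch below implements the tokenized Brier score in plain Python and checks that, when an answer is correct with probability p, the expected loss is lower for an honest confidence report than for an overconfident one. The helper names are illustrative, not the paper's reference implementation.

```python
def tokenized_brier(q, y, N):
    """Tokenized Brier score: sum_i q_i * (y - i/N)^2."""
    return sum(q_i * (y - i / N) ** 2 for i, q_i in enumerate(q))

def expected_loss(q, p, N):
    """Expected risk when the answer is correct with probability p."""
    return p * tokenized_brier(q, 1, N) + (1 - p) * tokenized_brier(q, 0, N)

# With p = 0.8 on an 11-token scale (0%, 10%, ..., 100%), the expected
# loss is minimized by placing mass on the token nearest p.
N, p = 10, 0.8
honest = [1.0 if i == 8 else 0.0 for i in range(N + 1)]          # says "80%"
overconfident = [1.0 if i == 10 else 0.0 for i in range(N + 1)]  # says "100%"
assert expected_loss(honest, p, N) < expected_loss(overconfident, p, N)
```

Here the honest report incurs an expected loss of 0.16 versus 0.20 for the overconfident one, matching the incentive-alignment guarantee.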

3. ConfTuner Implementation Pipeline

In practice, ConfTuner operates by (i) extracting logits for the set of confidence tokens from the model’s output vocabulary, (ii) applying softmax over these logits to produce a normalized probability vector q, and (iii) applying the tokenized Brier score loss. Training data consists of questions, answers, and correctness indicators, and does not require gold confidence annotations; the supervision signal for confidence is derived directly from answer correctness (i.e., y). The approach is compatible with standard fine-tuning paradigms such as Low-Rank Adaptation (LoRA) and regularized fine-tuning, allowing answer-generation quality to be maintained while calibration improves. This minimal-overhead design makes integration into existing pipelines straightforward.
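Steps (i)–(iii) can be sketched as follows. This is a minimal scalar version assuming a hypothetical set of reserved confidence-token ids; in actual fine-tuning, gradients would flow through the softmax via the framework's autograd.

```python
import math

# Hypothetical vocabulary ids for the confidence tokens "0%".."100%".
CONF_TOKEN_IDS = list(range(100, 111))

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def confidence_loss(vocab_logits, y):
    """Tokenized Brier loss from full-vocabulary logits (list indexed by id)."""
    conf_logits = [vocab_logits[t] for t in CONF_TOKEN_IDS]  # step (i)
    q = softmax(conf_logits)                                 # step (ii)
    N = len(q) - 1
    return sum(q_i * (y - i / N) ** 2 for i, q_i in enumerate(q))  # step (iii)
```

For uniform logits, q is uniform over the 11 tokens, giving a loss of 0.35 for a correct answer; training pushes mass toward the token matching the model's empirical accuracy.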

4. Empirical Results and Calibration Metrics

Comprehensive experimental evaluation covers multiple reasoning tasks and datasets: HotpotQA, GSM8K, TriviaQA, StrategyQA, and TruthfulQA. Quantitative calibration improvements are measured using:

  • Expected Calibration Error (ECE): Reductions of up to approximately 54.7% compared to baselines.
  • Area Under the ROC Curve (AUROC): Improvements up to 14.4%.
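
For reference, Expected Calibration Error can be computed with the common equal-width binning scheme sketched below (the binning choice here is an assumption; the paper may use a different variant).

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins over (0, 1]."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)    # bin accuracy
        conf = sum(confidences[i] for i in idx) / len(idx)  # bin confidence
        total += len(idx) / n * abs(acc - conf)          # weighted gap
    return total
```

A perfectly calibrated model has per-bin accuracy equal to per-bin confidence, so ECE approaches zero; the reported improvements are reductions of this quantity relative to baselines.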

Better calibration resulting from ConfTuner directly translates into practical benefits. In downstream applications, calibrated confidence enables self-correction: models can automatically identify low-confidence answers for revision, leading to enhanced robustness. Additionally, ConfTuner facilitates model cascade systems—only answers below a confidence threshold are delegated to more powerful (but costlier) models for further refinement, yielding computational savings in deployment scenarios.
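
The cascade pattern described above can be sketched in a few lines; `cheap_model` and `strong_model` are hypothetical callables returning an answer together with a verbalized confidence in [0, 1].

```python
def cascade(question, cheap_model, strong_model, threshold=0.7):
    """Route to the stronger model only when the cheap model is unconfident."""
    answer, conf = cheap_model(question)
    if conf >= threshold:
        return answer, "cheap"
    answer, _ = strong_model(question)  # escalate low-confidence queries
    return answer, "strong"
```

The cost savings depend directly on calibration: with well-calibrated confidence, high-confidence answers are reliably correct, so most queries never reach the expensive model.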

5. Generalization to Black-Box Models

A notable result is the demonstrated ability to calibrate black-box LLMs, such as GPT-4o, using ConfTuner’s methodology. The framework does not require access to model internals (logits, probabilities), and calibration can be applied even when verbalized confidence is given linguistically (e.g., “I’m fairly certain…” instead of explicit numeric probabilities). The underlying loss can be instantiated by mapping such phrases to discrete token scales, maintaining compatibility with the tokenized Brier score objective. This suggests applicability to commercial and closed-source LLMs lacking explicit confidence output channels.
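
One way to instantiate this mapping is a lookup from hedge phrases to levels on the discrete scale, after which the same Brier-style scoring applies; the phrase-to-level table below is an assumption for illustration, not taken from the paper.

```python
# Hypothetical mapping of linguistic hedges to levels i on a 0..10 scale,
# where level i corresponds to confidence i/10.
PHRASE_TO_LEVEL = {
    "almost certainly not": 0,
    "unlikely": 2,
    "unsure": 5,
    "fairly certain": 7,
    "highly confident": 9,
    "certain": 10,
}

def phrase_loss(phrase, y, N=10):
    """Brier-style score for a single verbal confidence phrase."""
    i = PHRASE_TO_LEVEL[phrase.lower()]
    return (y - i / N) ** 2
```

Because only the model's textual output is scored, this variant needs no access to logits, which is what makes the objective usable with closed-source APIs.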

6. Implications for Trustworthy AI and Future Directions

ConfTuner establishes a milestone in reliable LLM calibration by bridging proper scoring rule theory and fine-tuning practice. Well-calibrated verbal confidence supports model transparency, interpretability, and safer deployment in high-stakes domains. The authors identify several future research directions:

  • Extending beyond fixed discrete tokens to more sophisticated, context-aware uncertainty expressions.
  • Investigating how data quality, architectural choices, and optimization techniques impact calibration.
  • Leveraging regularized methods to maintain answer quality while improving calibration.

A plausible implication is that the proper scoring rule approach may generalize even further to joint calibration of multiple uncertainty modalities in conversational agents.

7. Summary Table: ConfTuner Properties and Results

| Aspect | Description | Empirical Result |
|---|---|---|
| Loss function | Tokenized Brier score (proper scoring rule) | Theoretically sound; minimizes expected miscalibration |
| Confidence output | Discrete tokens (e.g., % scale or 0–9) | Compatible with both explicit and linguistic forms |
| Calibration metrics | ECE, AUROC, downstream performance | Up to 54.7% better ECE; up to 14.4% AUROC improvement |
| Model type | White-box and black-box LLMs | Generalizes to GPT-4o and similar closed models |
| Downstream impact | Self-correction, model cascades, uncertainty alignment | Enables safe, cost-effective AI system deployment |

ConfTuner represents a theoretically principled and empirically validated approach for aligning the verbalized confidence of LLMs with true correctness probabilities, thereby advancing trust calibration in critical AI deployments (Li et al., 26 Aug 2025).
