
ConfTuner: Calibrating LLM Verbalized Confidence

Updated 2 September 2025
  • ConfTuner is a fine-tuning framework that calibrates large language models’ verbalized confidence by aligning expressed certainty with actual correctness probabilities.
  • It employs a novel tokenized Brier score loss based on proper scoring rules, incentivizing models to generate accurate, linguistically expressed uncertainty.
  • Empirical results show reductions of up to 54.7% in Expected Calibration Error and AUROC gains of up to 14.4%, demonstrating efficacy in both white-box and black-box settings.

ConfTuner is a fine-tuning framework for calibrating the verbalized confidence of LLMs. The primary goal is to incentivize LLMs to express degrees of confidence in textual form that accurately reflect their actual probability of correctness. ConfTuner directly targets the overconfidence problem—the tendency of models to assign high confidence to incorrect answers—which is particularly detrimental in high-stakes domains such as law, healthcare, and scientific research. Unlike prior attempts relying on prompt engineering or heuristics, ConfTuner is grounded in the mathematical theory of proper scoring rules and introduces a novel loss function, the tokenized Brier score, to align the model’s reported confidence with true uncertainty.

1. Motivation and Problem Setting

LLMs deployed in critical domains require reliably calibrated uncertainty expressed in language, e.g., phrases such as, "I am 80% confident that...". Existing models tend to manifest overconfidence, stating high confidence in their outputs regardless of actual correctness, diminishing reliability and trust. Previous calibration methods primarily involved prompt engineering or learning from heuristically generated uncertainty estimates, techniques that neither generalize well nor robustly incentivize truthful confidence expressions. ConfTuner was developed to provide a theoretically principled and practically efficient approach requiring neither external ground-truth confidence scores nor costly proxy estimation procedures (Li et al., 26 Aug 2025).

2. Tokenized Brier Score: Theoretical Foundations

ConfTuner introduces the "tokenized Brier score" as its central loss function. The standard Brier score, widely used as a proper scoring rule in probabilistic calibration, is adapted to the setting of discrete confidence tokens. Instead of producing direct numeric probabilities, the LLM is trained to generate a distribution q over a pre-defined set of confidence tokens (e.g., integer percentages or a 0–9 scale).

Given a model output distribution q over N+1 tokens and an answer-correctness indicator y ∈ {0, 1}, the tokenized Brier score is defined as:

\ell(q, y) = \sum_{i=0}^{N} q_i \, (y - i/N)^2

where i/N is the confidence value associated with token i.

The paper provides a formal theorem and proof that this loss is a proper scoring rule: the expected risk (loss) is minimized when the model’s verbalized confidence matches its actual probability of being correct (Li et al., 26 Aug 2025). This property guarantees incentive alignment—models are optimally rewarded for honest calibration.
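As a quick numerical illustration of this properness property, the sketch below implements the tokenized Brier score in plain Python and checks that, when an answer is correct with probability p, the expected loss is lower for an honest confidence report than for an overconfident one. The helper names are illustrative, not the paper's reference implementation.

```python
def tokenized_brier(q, y, N):
    """Tokenized Brier score: sum_i q_i * (y - i/N)^2."""
    return sum(q_i * (y - i / N) ** 2 for i, q_i in enumerate(q))

def expected_loss(q, p, N):
    """Expected risk when the answer is correct with probability p."""
    return p * tokenized_brier(q, 1, N) + (1 - p) * tokenized_brier(q, 0, N)

# With p = 0.8 on an 11-token scale (0%, 10%, ..., 100%), the expected
# loss is minimized by placing mass on the token nearest p.
N, p = 10, 0.8
honest = [1.0 if i == 8 else 0.0 for i in range(N + 1)]          # says "80%"
overconfident = [1.0 if i == 10 else 0.0 for i in range(N + 1)]  # says "100%"
assert expected_loss(honest, p, N) < expected_loss(overconfident, p, N)
```

Here the honest report incurs an expected loss of 0.16 versus 0.20 for the overconfident one, matching the incentive-alignment guarantee.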

3. ConfTuner Implementation Pipeline

In practice, ConfTuner operates by (i) extracting logits for the set of confidence tokens from the model’s output vocabulary, (ii) applying softmax over these logits to produce a normalized probability vector q, and (iii) applying the tokenized Brier score loss. Training data consists of questions, answers, and correctness indicators, and does not require gold confidence annotations; the supervision signal for confidence is derived directly from answer correctness (i.e., y). The approach is compatible with standard fine-tuning paradigms such as Low-Rank Adaptation (LoRA) and regularized fine-tuning, allowing answer-generation quality to be maintained while calibration improves. This minimal-overhead design makes integration into existing pipelines straightforward.
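Steps (i)–(iii) can be sketched as follows. This is a minimal scalar version assuming a hypothetical set of reserved confidence-token ids; in actual fine-tuning, gradients would flow through the softmax via the framework's autograd.

```python
import math

# Hypothetical vocabulary ids for the confidence tokens "0%".."100%".
CONF_TOKEN_IDS = list(range(100, 111))

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def confidence_loss(vocab_logits, y):
    """Tokenized Brier loss from full-vocabulary logits (list indexed by id)."""
    conf_logits = [vocab_logits[t] for t in CONF_TOKEN_IDS]  # step (i)
    q = softmax(conf_logits)                                 # step (ii)
    N = len(q) - 1
    return sum(q_i * (y - i / N) ** 2 for i, q_i in enumerate(q))  # step (iii)
```

For uniform logits, q is uniform over the 11 tokens, giving a loss of 0.35 for a correct answer; training pushes mass toward the token matching the model's empirical accuracy.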

4. Empirical Results and Calibration Metrics

Comprehensive experimental evaluation covers multiple reasoning tasks and datasets: HotpotQA, GSM8K, TriviaQA, StrategyQA, and TruthfulQA. Quantitative calibration improvements are measured using:

  • Expected Calibration Error (ECE): Reductions of up to approximately 54.7% compared to baselines.
  • Area Under the ROC Curve (AUROC): Improvements up to 14.4%.
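
For reference, Expected Calibration Error can be computed with the common equal-width binning scheme sketched below (the binning choice here is an assumption; the paper may use a different variant).

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins over (0, 1]."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)    # bin accuracy
        conf = sum(confidences[i] for i in idx) / len(idx)  # bin confidence
        total += len(idx) / n * abs(acc - conf)          # weighted gap
    return total
```

A perfectly calibrated model has per-bin accuracy equal to per-bin confidence, so ECE approaches zero; the reported improvements are reductions of this quantity relative to baselines.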

Better calibration resulting from ConfTuner directly translates into practical benefits. In downstream applications, calibrated confidence enables self-correction: models can automatically identify low-confidence answers for revision, leading to enhanced robustness. Additionally, ConfTuner facilitates model cascade systems—only answers below a confidence threshold are delegated to more powerful (but costlier) models for further refinement, yielding computational savings in deployment scenarios.
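
The cascade pattern described above can be sketched in a few lines; `cheap_model` and `strong_model` are hypothetical callables returning an answer together with a verbalized confidence in [0, 1].

```python
def cascade(question, cheap_model, strong_model, threshold=0.7):
    """Route to the stronger model only when the cheap model is unconfident."""
    answer, conf = cheap_model(question)
    if conf >= threshold:
        return answer, "cheap"
    answer, _ = strong_model(question)  # escalate low-confidence queries
    return answer, "strong"
```

The cost savings depend directly on calibration: with well-calibrated confidence, high-confidence answers are reliably correct, so most queries never reach the expensive model.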

5. Generalization to Black-Box Models

A notable result is the demonstrated ability to calibrate black-box LLMs, such as GPT-4o, using ConfTuner’s methodology. The framework does not require access to model internals (logits, probabilities), and calibration can be applied even when verbalized confidence is given linguistically (e.g., “I’m fairly certain…” instead of explicit numeric probabilities). The underlying loss can be instantiated by mapping such phrases to discrete token scales, maintaining compatibility with the tokenized Brier score objective. This suggests applicability to commercial and closed-source LLMs lacking explicit confidence output channels.
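
One way to instantiate this mapping is a lookup from hedge phrases to levels on the discrete scale, after which the same Brier-style scoring applies; the phrase-to-level table below is an assumption for illustration, not taken from the paper.

```python
# Hypothetical mapping of linguistic hedges to levels i on a 0..10 scale,
# where level i corresponds to confidence i/10.
PHRASE_TO_LEVEL = {
    "almost certainly not": 0,
    "unlikely": 2,
    "unsure": 5,
    "fairly certain": 7,
    "highly confident": 9,
    "certain": 10,
}

def phrase_loss(phrase, y, N=10):
    """Brier-style score for a single verbal confidence phrase."""
    i = PHRASE_TO_LEVEL[phrase.lower()]
    return (y - i / N) ** 2
```

Because only the model's textual output is scored, this variant needs no access to logits, which is what makes the objective usable with closed-source APIs.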

6. Implications for Trustworthy AI and Future Directions

ConfTuner establishes a milestone in reliable LLM calibration by bridging proper scoring rule theory and fine-tuning practice. Well-calibrated verbal confidence supports model transparency, interpretability, and safer deployment in high-stakes domains. The authors identify several future research directions:

  • Extending beyond fixed discrete tokens to more sophisticated, context-aware uncertainty expressions.
  • Investigating how data quality, architectural choices, and optimization techniques impact calibration.
  • Leveraging regularized methods to maintain answer quality while improving calibration.

A plausible implication is that the proper scoring rule approach may generalize even further to joint calibration of multiple uncertainty modalities in conversational agents.

7. Summary Table: ConfTuner Properties and Results

| Aspect | Description | Empirical Result |
|---|---|---|
| Loss function | Tokenized Brier score (proper scoring rule) | Theoretically sound; minimizes expected miscalibration |
| Confidence output | Discrete tokens (e.g., % scale or 0–9) | Compatible with both explicit and linguistic forms |
| Calibration metrics | ECE, AUROC, downstream performance | Up to 54.7% better ECE; up to 14.4% AUROC improvement |
| Model type | White-box and black-box LLMs | Generalizes to GPT-4o and similar closed models |
| Downstream impact | Self-correction, model cascades, uncertainty alignment | Enables safe, cost-effective AI system deployment |

ConfTuner represents a theoretically principled and empirically validated approach for aligning the verbalized confidence of LLMs with true correctness probabilities, thereby advancing trust calibration in critical AI deployments (Li et al., 26 Aug 2025).
