GrACE: Generative Confidence in LLMs
- GrACE is a mechanism that provides calibrated confidence estimates for LLM outputs using a novel generative reflection process.
- It appends a dedicated confidence token so that uncertainty can be read out in a single forward pass with negligible extra computational overhead.
- Benchmark results show improved calibration and discrimination metrics, making GrACE effective for high-stakes domains like healthcare and finance.
GrACE (Generative cAlibrated Confidence Elicitation) is a mechanism for real-time, calibrated confidence estimation in LLMs, designed to be both computationally lightweight and robustly reliable for high-stakes applications such as healthcare, finance, and legal domains. GrACE uniquely elicits confidence by leveraging a generative reflection step in the model's output sequence, directly utilizing model internal representations to produce a well-calibrated, discriminative, and instantly available confidence score, all without resorting to sampling, auxiliary models, or significant computation overhead.
1. Motivation and Problem Definition
LLMs, while capable of fluent and contextually relevant text generation, are prone to producing factually incorrect or "hallucinated" outputs. In safety-critical environments, an LLM must not only provide answers but also output a trustworthy assessment of its own uncertainty. Confidence elicitation formalizes this as the computation of a scalar confidence $c \in [0, 1]$ for each prediction $\hat{y}$, with two desiderata:
- Discrimination: The score should reliably separate correct from incorrect outputs, i.e., be systematically higher for correct predictions than for incorrect ones: $\mathbb{E}[c \mid \hat{y}\ \text{correct}] > \mathbb{E}[c \mid \hat{y}\ \text{incorrect}]$.
- Calibration: Empirical correctness should align with confidence: $\Pr(\hat{y}\ \text{correct} \mid c = p) = p$ for all $p \in [0, 1]$.
Existing approaches often require expensive sampling, cumbersome auxiliary models, or produce poorly calibrated confidences, limiting their practical deployment in domains where real-time, actionable uncertainty is mandatory (Zhang et al., 11 Sep 2025).
2. Generative Confidence Mechanism
GrACE's core innovation is the introduction of an explicit "confidence" token, denoted here $\langle\mathrm{conf}\rangle$, appended immediately after each generated model answer in both training and inference. The model, after generating its standard response $\hat{y}$, is prompted to generate $\langle\mathrm{conf}\rangle$, effectively "reflecting" on its just-produced answer within its latent space.
Let:
- $h$ denote the last hidden state at the $\langle\mathrm{conf}\rangle$ position,
- $e_c$ the learned embedding for $\langle\mathrm{conf}\rangle$,
- $E = \{e_v\}_{v \in \mathcal{V}}$ the full embedding matrix over the vocabulary $\mathcal{V}$.
The confidence score is defined as
$$c = \frac{\exp(e_c^\top h)}{\sum_{v \in \mathcal{V}} \exp(e_v^\top h)},$$
i.e., the model's built-in probability of producing $\langle\mathrm{conf}\rangle$ at that position, obtained as a scalar in a single forward pass. Beyond reading out this one extra token's probability, no additional prompts, sampling, or forward passes are required.
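As a minimal illustration (not the paper's released implementation), this readout can be sketched for a Hugging Face causal LM as follows; the token string `<CONF>` and the checkpoint name are placeholders, and the score is only meaningful once the model has been calibrated as described in the next section.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the confidence token (name assumed for illustration).
tokenizer.add_special_tokens({"additional_special_tokens": ["<CONF>"]})
model.resize_token_embeddings(len(tokenizer))
conf_token_id = tokenizer.convert_tokens_to_ids("<CONF>")

@torch.no_grad()
def grace_confidence(prompt: str, answer: str) -> float:
    """Probability of emitting <CONF> at the position right after the answer."""
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    logits = model(input_ids=ids).logits   # (1, seq_len, vocab)
    next_token_logits = logits[0, -1]      # distribution for the next position
    probs = torch.softmax(next_token_logits, dim=-1)
    return probs[conf_token_id].item()     # c = softmax(E h)[<CONF>]
```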
3. Calibration Fine-Tuning Objective
To endow the $\langle\mathrm{conf}\rangle$-based confidence with empirical calibration, GrACE leverages a targeted fine-tuning procedure:
- Dataset Construction: A calibration set $\mathcal{D} = \{(x_i, \hat{y}_i, a_i)\}_{i=1}^{N}$, where $x_i$ is the prompt, $\hat{y}_i$ the model's answer, and $a_i$ the empirical accuracy of the bin into which the initial probe score fell.
- Parameter Tuning: Only LoRA adapters (parameter-efficient adapters) and the $\langle\mathrm{conf}\rangle$ embedding $e_c$ are updated, constituting approximately 0.3–0.4% of total parameters.
- Loss Function: a language-modeling (SFT) term plus a calibration (MSE) term,
$$\mathcal{L} = \mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \frac{1}{N} \sum_{i=1}^{N} \left(c_i - a_i\right)^2,$$
where $c_i$ is the model's confidence for item $i$ and $\lambda$ controls the calibration (MSE) vs. language-modeling (SFT) trade-off.
This enables efficient convergence (3 epochs, batch size 8, on a single RTX 3090) and avoids overfitting, since only low-capacity adapters and a single token embedding are trained.
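The objective can be sketched as below, assuming a batch that already carries, for each item, the index of the last answer token (whose next-token distribution should place mass on $\langle\mathrm{conf}\rangle$) and the bin accuracy $a_i$; the function name and the default $\lambda$ are illustrative, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def grace_loss(model, input_ids, labels, conf_positions, conf_token_id,
               bin_accuracy, lam=1.0):
    """Combined SFT + calibration loss for one batch (illustrative sketch).

    conf_positions: index of each item's last answer token, whose next-token
                    distribution should assign probability mass to <CONF>.
    bin_accuracy:   empirical accuracy a_i of the bin each item fell into.
    lam:            calibration-vs-SFT trade-off (illustrative default).
    """
    out = model(input_ids=input_ids, labels=labels)
    sft_loss = out.loss  # standard causal-LM cross-entropy

    # c_i = softmax probability of the confidence token after the answer.
    batch_idx = torch.arange(input_ids.size(0))
    conf_logits = out.logits[batch_idx, conf_positions]           # (batch, vocab)
    conf = torch.softmax(conf_logits, dim=-1)[:, conf_token_id]   # (batch,)

    mse_loss = F.mse_loss(conf, bin_accuracy)
    return sft_loss + lam * mse_loss
```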
4. Benchmarks, Metrics, and Comparative Evaluation
The efficacy of GrACE is established on open-ended QA datasets (TriviaQA, SciQ) and across multiple LLM backbones (Phi-3-3.8B, Llama2-7B, Llama3.1-8B-Instruct). Correctness is measured by ROUGE-based metrics.
Evaluation proceeds along standard axes:
| Metric | Definition | Target |
|---|---|---|
| ECE | Expected Calibration Error (10 bins) | Lower is better |
| Brier | Mean squared error between confidence and binary correctness | Lower is better |
| AUROC | Area under ROC for discrimination | Higher is better |
| Accuracy | Fraction of correct responses | Should not degrade |
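These metrics can be computed directly from per-item confidence scores and binary correctness labels; the snippet below is a minimal, generic sketch (NumPy and scikit-learn), not part of the GrACE codebase.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1]."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & (conf <= hi) if hi == 1.0 else (conf >= lo) & (conf < hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Toy example: confidences and binary correctness (e.g., from a ROUGE threshold).
conf = np.array([0.92, 0.15, 0.78, 0.40, 0.66])
correct = np.array([1, 0, 1, 0, 1])

ece = expected_calibration_error(conf, correct)
brier = np.mean((conf - correct) ** 2)
auroc = roc_auc_score(correct, conf)   # discrimination
```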
Notably, GrACE outperforms six baselines (Likelihood, Platt scaling, P(True), verbal prompting, Apricot, ActCab) on both calibration and discrimination, while preserving task accuracy. Representative results for Llama2-7B on TriviaQA:
| Method | ECE | Brier | AUROC | Accuracy |
|---|---|---|---|---|
| GrACE | 8.41% | 16.78% | 84.22% | ~59.9% |
| Apricot | 11.88% | 18.55% | 81.61% | ~59.9% |
Absolute ECE and AUROC improvements of 2–5% are observed across architectures and datasets (Zhang et al., 11 Sep 2025).
5. Test-Time Scaling and Sampling Strategies
Building on the ability to assign confidence to each model output, GrACE defines two key Test-Time Scaling (TTS) strategies:
- GrACE-SC (Self-Consistency): Sample $k$ answer–confidence pairs $\{(\hat{y}_j, c_j)\}_{j=1}^{k}$. Select the answer that maximizes the summed confidences: $\hat{y}^{*} = \arg\max_{y} \sum_{j:\, \hat{y}_j = y} c_j$.
- GrACE-ES (Early Stopping): Sequentially sample $(\hat{y}_j, c_j)$. If $c_j \geq \tau$ for a user-chosen threshold $\tau$, immediately return $\hat{y}_j$; otherwise, after $k$ samples, fall back to GrACE-SC. This protocol increases sample efficiency (e.g., 80% of ARC_C queries require only one sample at the reported threshold), while improving accuracy and reducing computational cost relative to naive majority voting or consistency baselines. Both strategies are sketched below.
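A minimal sketch of both strategies, assuming a hypothetical `generate_with_confidence(prompt)` helper that returns one (answer, confidence) pair per sampled completion; the default threshold and sample budget are illustrative, not the paper's settings.

```python
from collections import defaultdict

def grace_sc(generate_with_confidence, prompt, k=8):
    """GrACE-SC: pick the answer with the largest summed confidence over k samples."""
    scores = defaultdict(float)
    for _ in range(k):
        answer, conf = generate_with_confidence(prompt)
        scores[answer] += conf
    return max(scores, key=scores.get)

def grace_es(generate_with_confidence, prompt, tau=0.9, k=8):
    """GrACE-ES: return the first answer whose confidence clears tau;
    otherwise fall back to confidence-weighted voting over the k samples."""
    samples = []
    for _ in range(k):
        answer, conf = generate_with_confidence(prompt)
        if conf >= tau:
            return answer          # early exit: one sample often suffices
        samples.append((answer, conf))
    scores = defaultdict(float)
    for answer, conf in samples:
        scores[answer] += conf
    return max(scores, key=scores.get)
```

The early-exit branch is what yields the sample-efficiency gains: when the first completion is already confident enough, no further sampling is performed.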
6. Practical Considerations and Applications
GrACE incurs negligible computational and memory cost: just one extra token per answer and LoRA-only fine-tuning (≤0.4% of parameters). The method is agnostic to domain; calibration sets of a few thousand QA pairs suffice, and out-of-domain generalization degrades ECE by less than 5%. Confidence estimation is available in a single pass, suitable for integration into real-time systems.
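One possible realization of this parameter-efficient setup with Hugging Face PEFT is sketched below; the LoRA rank, target modules, and the `<CONF>` token name are assumptions rather than the paper's exact configuration, and a gradient mask restricts embedding updates to the new token's row.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add the confidence token (name assumed) and grow the embedding matrices.
tokenizer.add_special_tokens({"additional_special_tokens": ["<CONF>"]})
model.resize_token_embeddings(len(tokenizer))
conf_token_id = tokenizer.convert_tokens_to_ids("<CONF>")

# LoRA adapters only (rank and target modules are illustrative).
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Unfreeze the output embedding, but zero gradients for every row except <CONF>,
# so only that single token vector is trained alongside the adapters.
out_emb = model.get_output_embeddings()
out_emb.weight.requires_grad_(True)

def keep_only_conf_row(grad):
    mask = torch.zeros_like(grad)
    mask[conf_token_id] = 1.0
    return grad * mask

out_emb.weight.register_hook(keep_only_conf_row)
```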
Primary application domains include:
- Healthcare: Flagging low-confidence clinical answers and evidence summaries.
- Finance: Surfacing low-confidence risk assessments and key recommendations.
- Law: Automatically deferring to human review when model confidence is insufficient.
7. Limitations and Open Directions
GrACE supplies a single confidence scalar per entire generated response; step- or sentence-level granularity (e.g., for chain-of-thought explanations) is not currently supported. The approach is restricted to factual correctness—finer axes such as completeness or stylistic suitability are not calibrated. GrACE elicits but does not correct or repair model errors.
Possible research extensions include:
- Generalizing from a single token to hierarchical or multi-token granularity for more detailed introspection.
- Modeling and calibrating multiple quality axes (factuality, relevance, completeness) in parallel.
- Incorporating the confidence signal into online self-improvement or learning routines.
8. Impact and Significance
GrACE establishes a minimally invasive, fast, and robust solution for real-time confidence elicitation in LLMs. Its architecture is directly compatible with any causal LLM (with LoRA adapters), requires no bespoke calibration models, and provides immediate, empirically calibrated confidence estimates at inference time. Empirical results demonstrate improved safety, reliability, and efficiency in downstream deployment settings, with a clear performance lead over prior art in both discriminative power and calibration (Zhang et al., 11 Sep 2025).