GrACE: Generative Confidence in LLMs

Updated 12 November 2025
  • GrACE is a mechanism that provides calibrated confidence estimates for LLM outputs using a novel generative reflection process.
  • It appends a dedicated confidence token to measure uncertainty in a single forward pass without extra computational overhead.
  • Benchmark results show improved calibration and discrimination metrics, making GrACE effective for high-stakes domains like healthcare and finance.

GrACE (Generative cAlibrated Confidence Elicitation) is a mechanism for real-time, calibrated confidence estimation in LLMs, designed to be both computationally lightweight and robustly reliable for high-stakes applications such as healthcare, finance, and legal domains. GrACE elicits confidence through a generative reflection step in the model's output sequence, directly using the model's internal representations to produce a well-calibrated, discriminative, and instantly available confidence score, without resorting to sampling, auxiliary models, or significant computational overhead.

1. Motivation and Problem Definition

LLMs, while capable of fluent and contextually relevant text generation, are prone to producing factually incorrect or "hallucinated" outputs. In safety-critical environments, it is essential for an LLM not only to provide answers, but also to output a trustworthy assessment of its own uncertainty. Confidence elicitation formalizes this as the computation of a scalar $c$ for each prediction $y$, with two desiderata:

  • Discrimination: The score should reliably separate correct from incorrect outputs: $P(c_m > c_n \mid y_m \text{ correct},\, y_n \text{ incorrect}) \approx 1$.
  • Calibration: Empirical correctness should align with confidence: $P(y \text{ correct} \mid c = p) \approx p$ for all $p$.

Existing approaches often require expensive sampling, cumbersome auxiliary models, or produce poorly calibrated confidences, limiting their practical deployment in domains where real-time, actionable uncertainty is mandatory (Zhang et al., 11 Sep 2025).

2. Generative Confidence Mechanism

GrACE's core innovation is the introduction of an explicit "confidence" token, $\langle\mathrm{CNF}\rangle$, appended immediately after each generated model answer in both training and inference. The model, after generating its standard response $y$, is prompted to generate $\langle\mathrm{CNF}\rangle$, effectively "reflecting" on its just-produced answer within its latent space.

Let:

  • $z_L^{\langle\mathrm{CNF}\rangle} \in \mathbb{R}^d$ denote the last hidden state at the $\langle\mathrm{CNF}\rangle$ position,
  • $e^{\langle\mathrm{CNF}\rangle} \in \mathbb{R}^d$ the learned embedding for $\langle\mathrm{CNF}\rangle$,
  • $E \in \mathbb{R}^{(|V|+1) \times d}$ the full embedding matrix.

The confidence score is defined as

$$c = \mathrm{sim}\big(z_L^{\langle\mathrm{CNF}\rangle},\, e^{\langle\mathrm{CNF}\rangle}\big) = \Big[\mathrm{softmax}\big(E\, z_L^{\langle\mathrm{CNF}\rangle}\big)\Big]_{\langle\mathrm{CNF}\rangle}$$

This directly reflects the model's built-in probability of producing $\langle\mathrm{CNF}\rangle$, thus obtaining a scalar $c \in [0, 1]$ in a single forward pass. No additional tokens, prompts, or passes are required beyond the initial completion.
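
The score above can be read off directly from the model's output distribution. The following is a minimal sketch using the Hugging Face `transformers` API, assuming a causal LM whose vocabulary has been extended with a `<CNF>` special token during fine-tuning; the checkpoint name, prompt layout, and token string are illustrative placeholders rather than details from the paper.

```python
# Sketch: single-pass GrACE-style confidence extraction.
# Assumes "<CNF>" was added to the tokenizer/model as a special token
# (mapping to a single id); the checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/grace-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/grace-finetuned-model")
model.eval()

cnf_id = tokenizer.convert_tokens_to_ids("<CNF>")  # id of the confidence token

def grace_confidence(prompt: str, answer: str) -> float:
    """Append <CNF> after the answer and return softmax(E z_L)[<CNF>]."""
    ids = tokenizer(prompt + answer + "<CNF>", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                # (1, seq_len, |V|+1)
    probs = torch.softmax(logits[0, -1], dim=-1)  # distribution at the <CNF> position
    return probs[cnf_id].item()                   # scalar c in [0, 1]

c = grace_confidence("Q: Who wrote Hamlet?\nA: ", "William Shakespeare")
print(f"confidence = {c:.3f}")
```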

3. Calibration Fine-Tuning Objective

To endow the $\langle\mathrm{CNF}\rangle$-based confidence with empirical calibration, GrACE leverages a targeted fine-tuning procedure:

  • Dataset Construction: A calibration set $\mathcal{D}_c = \{(x_i, y_i, t_i)\}$, where $x_i$ is the prompt, $y_i$ the model's answer, and $t_i$ the empirical accuracy for the bin into which the initial probe score $s_i$ fell.
  • Parameter Tuning: Only LoRA adapters (parameter-efficient adapters) and $e^{\langle\mathrm{CNF}\rangle}$ are updated, constituting approximately 0.3–0.4% of total parameters.
  • Loss Function:

$$\mathcal{L}_T = \frac{1}{|\mathcal{D}_c|} \sum_{i} \Big[ (t_i - c_i)^2 - \gamma\, \log p_\theta(y_i \mid x_i) \Big]$$

where $c_i$ is the model's confidence for item $i$ and $\gamma$ controls the calibration (MSE) vs. language modeling (SFT) trade-off.

This enables efficient convergence (3 epochs, batch size 8, $\mathrm{lr} = 1\mathrm{e}{-5}$ on a single RTX 3090) and avoids overfitting, since only low-capacity adapters and a single token vector are trained.
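
A minimal sketch of this objective is given below, assuming batched logits, answer labels, the per-example position scored for $\langle\mathrm{CNF}\rangle$, and binned accuracy targets are already assembled; all argument names and shapes are illustrative rather than taken from the paper. Note that $-\gamma \log p_\theta(y_i \mid x_i)$ is simply $\gamma$ times the usual next-token NLL over the answer.

```python
# Sketch of the GrACE calibration fine-tuning loss (MSE + gamma * NLL).
# In practice only LoRA adapters and the <CNF> embedding receive gradients.
import torch
import torch.nn.functional as F

def grace_loss(logits, labels, cnf_positions, cnf_id, targets, gamma=0.1):
    """
    logits:        (B, T, V) model outputs over prompt + answer (+ <CNF>)
    labels:        (B, T) answer-token ids, -100 at non-answer positions
    cnf_positions: (B,) position whose output distribution is scored for <CNF>
    cnf_id:        vocabulary id of the <CNF> token
    targets:       (B,) binned empirical-accuracy targets t_i in [0, 1]
    gamma:         weight of the language-modeling (SFT) term
    """
    batch_idx = torch.arange(logits.size(0), device=logits.device)
    # Confidence c_i = softmax probability of <CNF> at the designated position.
    c = torch.softmax(logits[batch_idx, cnf_positions], dim=-1)[:, cnf_id]
    calibration = F.mse_loss(c, targets)                  # (t_i - c_i)^2 term
    # -log p_theta(y_i | x_i): next-token NLL over the answer tokens only.
    lm_nll = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return calibration + gamma * lm_nll
```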

4. Benchmarks, Metrics, and Comparative Evaluation

The efficacy of GrACE is established on open-ended QA datasets (TriviaQA, SciQ) and across multiple LLM backbones (Phi-3-3.8B, Llama2-7B, Llama3.1-8B-Instruct). Correctness is measured by ROUGE-based metrics.

Evaluation proceeds along standard axes:

| Metric | Definition | Target |
|---|---|---|
| ECE | Expected Calibration Error (10 bins) | Lower is better |
| Brier | $E[(c - \mathbb{1}\{\text{correct}\})^2]$ | Lower is better |
| AUROC | Area under ROC for discrimination | Higher is better |
| Accuracy | Fraction of correct responses | Should not degrade |
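
For concreteness, these metrics can be computed from per-example confidences and binary correctness labels roughly as follows; the sketch uses NumPy and scikit-learn, and the toy inputs are illustrative only.

```python
# Sketch: ECE (equal-width bins), Brier score, and AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error over equal-width confidence bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & ((conf <= hi) if hi == 1.0 else (conf < hi))
        if mask.any():
            # |mean confidence - empirical accuracy|, weighted by bin occupancy
            total += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return total

def brier(conf, correct):
    """Mean squared error between confidence and the correctness indicator."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))

conf = [0.92, 0.81, 0.35, 0.60]   # toy confidences
correct = [1, 1, 0, 1]            # toy correctness labels
print(ece(conf, correct), brier(conf, correct), roc_auc_score(correct, conf))
```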

Notably, GrACE outperforms six baselines (Likelihood, Platt scaling, P(True), verbal prompting, Apricot, ActCab) on both calibration and discrimination, while preserving task accuracy:

| Method | ECE | Brier | AUROC | Accuracy (Llama2-7B / TriviaQA) |
|---|---|---|---|---|
| GrACE | 8.41% | 16.78% | 84.22% | ~59.9% |
| Apricot | 11.88% | 18.55% | 81.61% | ~59.9% |

Absolute ECE and AUROC improvements of 2–5% are observed across architectures and datasets (Zhang et al., 11 Sep 2025).

5. Test-Time Scaling and Sampling Strategies

Building on the ability to assign confidence to each model output, GrACE defines two key Test-Time Scaling (TTS) strategies:

  • GrACE-SC (Self-Consistency): Sample $T$ answer-confidence pairs $\{(a_i, c_i)\}$. Select the answer $a$ that maximizes the summed confidences: $a^* = \arg\max_a \sum_{i:\, a_i = a} c_i$.
  • GrACE-ES (Early Stopping): Sequentially sample $(a_i, c_i)$. If $c_i \geq \tau$ for a user-chosen threshold $\tau$, immediately return $a_i$; otherwise, after $T$ samples, fall back to GrACE-SC. This protocol increases sample efficiency (e.g., 80% of ARC-C queries require only one sample for $\tau = 0.8$), while improving both accuracy and computational cost relative to naive majority voting or consistency baselines. Both strategies are sketched below.
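
A compact sketch of both strategies follows, assuming a `sample()` callable that returns one (answer, confidence) pair per invocation; the function names and default values are illustrative.

```python
# Sketch: GrACE test-time scaling strategies.
from collections import defaultdict

def grace_sc(samples):
    """Self-consistency: pick the answer with the largest summed confidence."""
    scores = defaultdict(float)
    for answer, conf in samples:
        scores[answer] += conf
    return max(scores, key=scores.get)

def grace_es(sample, tau=0.8, max_samples=8):
    """Early stopping: return the first answer whose confidence reaches tau;
    otherwise fall back to GrACE-SC over all drawn samples."""
    drawn = []
    for _ in range(max_samples):
        answer, conf = sample()          # one (answer, confidence) pair per call
        if conf >= tau:
            return answer                # confident enough: stop sampling early
        drawn.append((answer, conf))
    return grace_sc(drawn)
```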

6. Practical Considerations and Applications

GrACE incurs negligible computational and memory cost: one extra token per answer and LoRA-only fine-tuning (≤0.4% of parameters). The method is domain-agnostic; calibration sets of a few thousand QA pairs suffice, and out-of-domain generalization degrades ECE by less than 5%. Confidence estimation is available in a single pass, making it suitable for integration into real-time systems.

Primary application domains include:

  • Healthcare: Flagging low-confidence clinical answers and evidence summaries.
  • Finance: Surfacing low-confidence risk assessments and key recommendations.
  • Law: Automatically deferring to human review when model confidence is insufficient.

7. Limitations and Open Directions

GrACE supplies a single confidence scalar per generated response; step- or sentence-level granularity (e.g., for chain-of-thought explanations) is not currently supported. The approach is restricted to factual correctness; finer axes such as completeness or stylistic suitability are not calibrated. GrACE elicits confidence but does not correct or repair model errors.

Possible research extensions include:

  • Generalizing from a single $\langle\mathrm{CNF}\rangle$ token to hierarchical or multi-token granularity for more detailed introspection.
  • Modeling and calibrating multiple quality axes (factuality, relevance, completeness) in parallel.
  • Incorporating the confidence signal into online self-improvement or learning routines.

8. Impact and Significance

GrACE establishes a minimally invasive, fast, and robust solution for real-time confidence elicitation in LLMs. Its architecture is directly compatible with any causal LLM (with LoRA adapters), requires no bespoke calibration models, and provides immediate, empirically calibrated confidence estimates at inference time. Empirical results demonstrate improved safety, reliability, and efficiency in downstream deployment settings, with a clear performance lead over prior art in both discriminative power and calibration (Zhang et al., 11 Sep 2025).
