HonestyBench: Confidence Alignment Benchmark

Updated 4 July 2026

The paper introduces HonestyBench, a large-scale benchmark designed to align LLM confidence scores with actual correctness using free-form QA datasets.
It employs both correctness and semantic self-consistency signals to train confidence predictors and evaluate model calibration.
The benchmark supports a two-stage EliCal framework that achieves near-optimal alignment with minimal correctness annotations for improved calibration.

HonestyBench is a large-scale benchmark for honesty alignment of LLMs that aggregates 10 public free-form factual QA datasets and, for each model–question pair, provides multiple generated answers, a correctness signal for each answer, and a self-consistency-based confidence signal derived from semantic agreement across samples (Ni et al., 20 Oct 2025). It is designed both as a training corpus for learning confidence predictors and as a benchmark for evaluating how well models’ confidence aligns with their actual probability of correctness, both in-domain and out-of-domain (Ni et al., 20 Oct 2025). In the formalization used by the benchmark, honesty alignment means learning a mapping from an input question $q$ to a confidence score $\text{Confidence}_\theta(q)\in[0,1]$ that matches the model’s true probability of answering correctly under its decoding policy $\pi$ (Ni et al., 20 Oct 2025). Within the broader honesty literature, this corresponds closely to the self-knowledge axis—recognition of known versus unknown and calibration—rather than to refusal, sycophancy, or deception alone (Li et al., 2024).

1. Conceptual definition and formal objective

HonestyBench treats honesty as confidence alignment. Given a model with parameters $\theta$ , a response distribution $p_\theta^\pi(r\mid q)$ , and a set of correct responses $\mathcal{G}(q)$ , it defines response correctness as

$\text{Accuracy}_{\theta}(q,r) \triangleq \mathbb{I}\!\left[\, r \in \mathcal{G}(q) \,\right] \in \{0,1\},$

and the model’s true capability on $q$ as

$\text{Accuracy}_{\theta}(q) \triangleq \mathbb{E}_{r \sim p_\theta^\pi(\cdot \mid q)} \!\left[\, \text{Accuracy}_{\theta}(q,r) \,\right].$

The ideal honesty objective is then

$\text{Confidence}^*_{\theta}(q) = \text{Accuracy}_{\theta}(q).$

In this formulation, an honest model is one whose reported confidence reflects its true probability of being correct on that question, and which can express this confidence before generation (Ni et al., 20 Oct 2025).

This definition differs from formulations that center honesty on explicit “I don’t know” behavior. “Alignment for Honesty” frames honesty as answering questions a model knows and giving an explicit idk response on questions it does not (Yang et al., 2023). By contrast, HonestyBench centers a scalar confidence prediction over free-form QA. A plausible implication is that it operationalizes honesty at the level of pre-generation epistemic state rather than at the level of post-generation refusal style.

2. Dataset composition and annotation scheme

HonestyBench is split into HonestyBench-Train, with about 560k training samples, and HonestyBench-Eval, with about 70k evaluation instances divided into in-domain and out-of-domain subsets (Ni et al., 20 Oct 2025).

Split	Source datasets	Instances
Train	NQ, TQ, HQ, 2Wiki, ParaRel	567,647
Eval, in-domain	NQ, TQ, HQ, 2Wiki, ParaRel	37,904
Eval, OOD	SQuAD, WQ, CWQ, MuSiQue, PopQA	32,805

The training portion contains NQ Train with 87,925 instances, TQ Train with 87,622, HQ Train with 90,447, 2Wiki Train with 167,454, and ParaRel Split with 134,199 (Ni et al., 20 Oct 2025). The in-domain evaluation portion contains NQ Test with 3,610 instances, TQ Dev with 11,313, HQ Dev with 7,405, 2Wiki Dev with 12,576, and ParaRel Split with 3,000 (Ni et al., 20 Oct 2025). The out-of-domain evaluation portion contains SQuAD Dev with 10,570 instances, WQ Test with 2,032, CWQ Dev with 3,519, MuSiQue Dev with 2,417, and PopQA Dev with 14,267 (Ni et al., 20 Oct 2025).
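The per-dataset counts above can be checked against the split totals in the table. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative and not part of any released HonestyBench tooling.

```python
# Per-dataset instance counts as quoted in the text.
TRAIN = {"NQ Train": 87_925, "TQ Train": 87_622, "HQ Train": 90_447,
         "2Wiki Train": 167_454, "ParaRel Split": 134_199}
EVAL_IN_DOMAIN = {"NQ Test": 3_610, "TQ Dev": 11_313, "HQ Dev": 7_405,
                  "2Wiki Dev": 12_576, "ParaRel Split": 3_000}
EVAL_OOD = {"SQuAD Dev": 10_570, "WQ Test": 2_032, "CWQ Dev": 3_519,
            "MuSiQue Dev": 2_417, "PopQA Dev": 14_267}


def split_total(split):
    """Sum the instance counts of one split."""
    return sum(split.values())


train_total = split_total(TRAIN)
in_domain_total = split_total(EVAL_IN_DOMAIN)
ood_total = split_total(EVAL_OOD)
eval_total = in_domain_total + ood_total

print(train_total, in_domain_total, ood_total, eval_total)
```

The totals agree with the table, and the combined evaluation size is consistent with the "about 70k" figure.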

For each QA pair, the benchmark pre-generates outputs from three instruct models: Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, and Llama3-8B-Instruct (Ni et al., 20 Oct 2025). For each question and each model, it stores 1 greedy response and 20 sampled responses (Ni et al., 20 Oct 2025). Correctness for each generated response is judged with Qwen2.5-32B-Instruct, and semantic consistency between sampled responses and the greedy response is also judged with Qwen2.5-32B-Instruct (Ni et al., 20 Oct 2025).

The benchmark’s second annotation channel is self-consistency. Let $\hat{r}$ be the greedy answer, let $r_1,\dots,r_N$ be the sampled responses, and let $\text{Consis}(r_i,\hat{r})$ denote a binary semantic consistency indicator:

$\text{Consis}(r_i,\hat{r}) \triangleq \mathbb{I}\!\left[\, r_i \text{ is semantically consistent with } \hat{r} \,\right] \in \{0,1\}.$

Then the self-consistency confidence is approximated by

$\text{Confidence}^{\text{consis}}_{\theta}(q) \approx \frac{1}{N}\sum_{i=1}^{N} \text{Consis}(r_i,\hat{r}),$

with $N = 20$ sampled responses (Ni et al., 20 Oct 2025). HonestyBench therefore stores both a correctness-based target and a self-consistency-based proxy target for each model–question pair.
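As a rough illustration of how the two targets relate to the stored judgments, the sketch below averages per-sample binary judgments for one model–question pair. The record layout and field names (`consistent_with_greedy`, `sample_correct`) are hypothetical and do not reflect the actual HonestyBench file format.

```python
def mean_of_judgments(judgments):
    """Average a non-empty list of binary (0/1 or bool) judgments."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(judgments) / len(judgments)


def build_targets(record):
    """Return (self-consistency target, correctness target) for one record.

    `record` is a hypothetical dict holding, for each sampled response,
    whether the judge found it consistent with the greedy answer and
    whether the judge found it correct.
    """
    consistency = mean_of_judgments(record["consistent_with_greedy"])
    accuracy = mean_of_judgments(record["sample_correct"])
    return consistency, accuracy


# Toy record with 20 sampled responses (values invented for illustration).
example = {
    "consistent_with_greedy": [1] * 15 + [0] * 5,
    "sample_correct": [1] * 12 + [0] * 8,
}
consistency_target, accuracy_target = build_targets(example)
print(consistency_target, accuracy_target)  # 0.75 0.6
```

In this toy record the model agrees with its own greedy answer more often than it is correct, which is the kind of gap a correctness-based calibration signal is meant to close.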

3. Evaluation protocol and metrics

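HonestyBench evaluates confidence methods against the correctness of the greedy answer. Its primary discrimination metric is AUROC, which measures how well the model’s confidence scores separate correct from incorrect answers on a per-question basis (Ni et al., 20 Oct 2025). Its primary calibration metric is Expected Calibration Error (ECE):

$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\left|\, \text{acc}(B_m) - \text{conf}(B_m) \,\right|,$

where $B_m$ is the set of questions whose confidence falls in the $m$-th bin, $n$ is the number of evaluated questions, $\text{acc}(B_m)$ is the fraction of correct greedy answers in the bin, and $\text{conf}(B_m)$ is the mean confidence in the bin, with $M$ confidence bins in the reported experiments (Ni et al., 20 Oct 2025).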
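Both metrics can be sketched in a few lines of plain Python. This is a generic illustration rather than the benchmark's evaluation code: the function names are made up here, the ECE uses equal-width bins, and the default of 10 bins is a placeholder assumption, not a value taken from the paper.

```python
def auroc(confidences, labels):
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as half (pairwise definition of AUROC)."""
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(confidences, labels, num_bins=10):
    """ECE with equal-width bins over [0, 1]; num_bins is an assumed default."""
    bins = [[] for _ in range(num_bins)]
    for c, y in zip(confidences, labels):
        index = min(int(c * num_bins), num_bins - 1)  # c == 1.0 goes to last bin
        bins[index].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


confs = [0.95, 0.85, 0.35, 0.15]
correct = [1, 1, 0, 0]
print(auroc(confs, correct), expected_calibration_error(confs, correct))
```

The toy scores rank every correct answer above every incorrect one, so discrimination is perfect even though the calibration error is nonzero; the two metrics measure different properties.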

The benchmark also reports an alignment metric intended for abstention-like operating points. A threshold $\tau$ is selected on 20% of an evaluation set to maximize agreement between the binarized confidence decision and actual correctness, and the resulting threshold is then evaluated on the remaining 80% (Ni et al., 20 Oct 2025). This supports deployment-style questions such as when to trust the model, when to abstain, or when to trigger retrieval.
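The threshold-selection procedure can be sketched as follows. Several details are assumptions made for illustration: the 20% portion is drawn by a seeded random shuffle, candidate thresholds are the observed confidence values, and a question counts as "trusted" when its confidence is at least the threshold. The function names are hypothetical.

```python
import random


def alignment(confidences, labels, tau):
    """Fraction of questions where (confidence >= tau) matches correctness."""
    agree = sum(int(c >= tau) == y for c, y in zip(confidences, labels))
    return agree / len(labels)


def select_threshold(confidences, labels):
    """Pick the candidate threshold with the highest alignment.

    Candidates are the observed confidences plus +inf (trust nothing).
    """
    candidates = sorted(set(confidences)) + [float("inf")]
    return max(candidates, key=lambda t: alignment(confidences, labels, t))


def thresholded_alignment(confidences, labels, dev_fraction=0.2, seed=0):
    """Select tau on a dev portion, then report alignment on the rest."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)  # random split is an assumption
    cut = max(1, int(len(indices) * dev_fraction))
    dev, held_out = indices[:cut], indices[cut:]
    tau = select_threshold([confidences[i] for i in dev],
                           [labels[i] for i in dev])
    score = alignment([confidences[i] for i in held_out],
                      [labels[i] for i in held_out], tau)
    return tau, score


toy_conf = [i / 20 for i in range(20)]
toy_correct = [1 if c >= 0.5 else 0 for c in toy_conf]
tau, score = thresholded_alignment(toy_conf, toy_correct)
print(tau, score)
```

Selecting the threshold on one portion and scoring on the other keeps the reported alignment from being inflated by tuning on the same questions it is measured on.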

HonestyBench compares training-based methods against six training-free confidence baselines: Prob, N-Prob, Verbal-0, Verbal-10, Consis-Lex, and Consis-Sem (Ni et al., 20 Oct 2025). Among these, Consis-Sem is consistently the best training-free approach on HonestyBench-Eval, which is why semantic self-consistency is used as the Stage 1 target in EliCal (Ni et al., 20 Oct 2025).

A notable design choice is that correctness and honesty are explicitly separated. QA accuracy is measured independently as the fraction of greedy answers judged correct, while honesty alignment is assessed through AUROC, ECE, and thresholded alignment (Ni et al., 20 Oct 2025). This allows a model to be accurate but poorly calibrated, or less accurate but well aligned in confidence.

4. Role in the EliCal framework

HonestyBench was released to support Elicitation-Then-Calibration (EliCal), a two-stage framework for annotation-efficient honesty alignment (Ni et al., 20 Oct 2025). The base LLM is frozen, LoRA adapters are inserted into all linear layers, and a linear head is attached to the final-layer hidden state of the last question token to predict a scalar confidence (Ni et al., 20 Oct 2025).

In Stage 1: Confidence Elicitation, the model is trained on the large HonestyBench-Train pool using self-consistency confidence as the target. The objective is mean squared error between predicted confidence and the semantic self-consistency score derived from the 20 sampled responses (Ni et al., 20 Oct 2025). The purpose is to teach the model to approximate, from internal states alone, the confidence that would otherwise require multi-sampling and semantic consistency checking.

In Stage 2: Confidence Calibration, the Stage 1 model is fine-tuned on a much smaller subset with correctness-based targets. Here the target is the approximated capability

$\text{Accuracy}_{\theta}(q) \approx \frac{1}{N}\sum_{i=1}^{N} \text{Accuracy}_{\theta}(q, r_i),$

again estimated from the $N = 20$ sampled responses $r_1,\dots,r_N$ (Ni et al., 20 Oct 2025). Because Stage 1 already captures much of the internal signal, Stage 2 can use very few correctness annotations.
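The two-stage schedule can be illustrated with a deliberately tiny stand-in. The sketch below is not the EliCal implementation: it replaces the frozen LLM, LoRA adapters, and hidden states with a single hypothetical scalar feature per question, and it assumes a sigmoid on top of the linear head so that predictions stay in $[0,1]$. Only the training order (many self-consistency targets first, then a few correctness targets) and the mean squared error objective follow the description above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(params, x):
    """Toy confidence head: sigmoid of a linear function of one feature."""
    w, b = params
    return sigmoid(w * x + b)


def mse(params, data):
    """Mean squared error between predicted confidence and targets."""
    return sum((predict(params, x) - t) ** 2 for x, t in data) / len(data)


def train(params, data, lr=0.5, epochs=200):
    """Full-batch gradient descent on the MSE objective."""
    w, b = params
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, target in data:
            p = sigmoid(w * x + b)
            g = 2.0 * (p - target) * p * (1.0 - p)  # d(loss)/d(logit)
            grad_w += g * x
            grad_b += g
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return (w, b)


# Invented toy data: (feature, target). Stage 1 has more examples with
# self-consistency targets; Stage 2 has a few with correctness targets.
stage1_data = [(-2.0, 0.10), (-1.0, 0.30), (0.0, 0.50), (1.0, 0.80), (2.0, 0.95)]
stage2_data = [(-1.0, 0.20), (1.0, 0.60)]

init = (0.0, 0.0)
stage1 = train(init, stage1_data)               # elicitation
stage2 = train(stage1, stage2_data, epochs=50)  # calibration, warm-started

print(mse(init, stage1_data), mse(stage1, stage1_data))
print(mse(stage1, stage2_data), mse(stage2, stage2_data))
```

The point of the warm start is that Stage 2 begins from parameters that already track the self-consistency signal, so the small correctness set only has to adjust them rather than learn the mapping from scratch.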

The benchmark also supports two comparison regimes: Eli-Only, which trains only on self-consistency signals, and Cal-Only, which trains from scratch only on correctness annotations (Ni et al., 20 Oct 2025). This makes HonestyBench not merely a static evaluation set, but a structured experimental platform for studying annotation efficiency, internal uncertainty elicitation, and post-elicitation calibration.

5. Empirical findings and scaling behavior

HonestyBench establishes a substantial gap between training-free and training-based honesty alignment. For Qwen2.5-7B, the best training-free baseline, Consis-Sem, reaches an in-domain average AUROC of 73.62, whereas Cal-Only(560k) reaches 86.20 and EliCal(560k) reaches 86.49 (Ni et al., 20 Oct 2025). These full-supervision scores serve as the benchmark’s empirical upper bound for that model.

The central result is annotation efficiency. The abstract reports that EliCal achieves near-optimal alignment with only 1k correctness annotations (0.18% of full supervision) (Ni et al., 20 Oct 2025). In the detailed Qwen2.5-7B in-domain example, Cal-Only(1k) obtains 73.41, while EliCal(1k) obtains 84.36, which is approximately 97.9% of the full-supervision 86.20 achieved by Cal-Only(560k) (Ni et al., 20 Oct 2025). On the out-of-domain evaluation for the same model, Consis-Sem obtains 70.20, Cal-Only(1k) obtains 77.32, EliCal(1k) obtains 84.47, and full Cal-Only(560k) obtains 85.75 (Ni et al., 20 Oct 2025).
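The percentages in this paragraph follow directly from the quoted numbers. The short check below reproduces them; the out-of-domain ratio and the gains over Cal-Only(1k) are derived here from the quoted scores and are not figures stated in the source.

```python
TRAIN_POOL = 567_647   # HonestyBench-Train size from the table
ANNOTATIONS = 1_000    # correctness annotations used by EliCal(1k)

# Share of full supervision, in percent.
annotation_share = 100 * ANNOTATIONS / TRAIN_POOL

# EliCal(1k) relative to Cal-Only(560k), in percent.
relative_in_domain = 100 * 84.36 / 86.20
relative_ood = 100 * 84.47 / 85.75      # derived, not stated in the source

# Absolute gain of EliCal(1k) over Cal-Only(1k), in points.
gain_in_domain = 84.36 - 73.41          # derived
gain_ood = 84.47 - 77.32                # derived

print(round(annotation_share, 2))       # 0.18
print(round(relative_in_domain, 1))     # 97.9
print(round(relative_ood, 1))           # 98.5
print(round(gain_in_domain, 2), round(gain_ood, 2))  # 10.95 7.15
```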

The benchmark also supports analysis of generalization beyond free-form factual QA. When models trained only on HonestyBench are evaluated on MMLU, EliCal generalizes better than Cal-Only, even at full annotation (Ni et al., 20 Oct 2025). This suggests that large-scale self-consistency-based elicitation induces a confidence representation that is less tied to dataset-specific correctness labels.

A further finding concerns one-shot inference efficiency. Eli-Only performs similarly to Consis-Sem while avoiding multi-sampling at evaluation time, showing that HonestyBench’s self-consistency annotations are sufficient to train a model to approximate semantic self-consistency directly from hidden states (Ni et al., 20 Oct 2025).

6. Position in the honesty-benchmark landscape and limitations

HonestyBench occupies a specific position within the broader honesty literature. “HonestLLM” introduces HoneSet, a 930-query benchmark centered on “LLM-unable” tasks and evaluates honesty through Honesty Rate and H$^2$ scoring (Gao et al., 2024). “BeHonest” evaluates self-knowledge, non-deceptiveness, and consistency across 10 scenarios such as sycophancy, burglar deception, and prompt-format sensitivity (Chern et al., 2024). “Alignment for Honesty” formalizes honesty through prudence, over-conservativeness, and a combined honesty score built around explicit idk behavior (Yang et al., 2023). Related work extends honesty benchmarking to unanswerable visual questions in MoHoBench (Zhu et al., 29 Jul 2025), to academic-integrity dilemmas in SciIntegrity-Bench (Yang et al., 11 May 2026), and to cheap-talk preference-misalignment games in “Truthful AI Advisors” (Balyani et al., 31 May 2026). Within that ecosystem, HonestyBench is distinctive because it is organized around confidence alignment on free-form factual QA rather than around refusal behavior, deception games, or consistency perturbations (Ni et al., 20 Oct 2025).

Its limitations are correspondingly specific. All QA datasets are factual and mostly Wikipedia-based (Ni et al., 20 Oct 2025). Correctness and semantic consistency are judged by Qwen2.5-32B-Instruct, not by human raters, so annotation quality depends on an LLM judge (Ni et al., 20 Oct 2025). The training resource is free-form QA only, even though the benchmark shows encouraging transfer to MMLU (Ni et al., 20 Oct 2025). Self-consistency is measured by agreement with the greedy answer rather than by clustering the full sample set or measuring semantic entropy (Ni et al., 20 Oct 2025). Finally, the benchmark instantiates honesty alignment with a single internal head design—a linear readout plus LoRA—rather than comparing richer architectural alternatives (Ni et al., 20 Oct 2025).

These constraints do not diminish its role as a major resource for large-scale honesty-alignment research. Rather, they define its domain precisely: HonestyBench is a benchmark for learning and evaluating whether an LLM’s internal confidence matches its actual probability of correctness on free-form factual QA, at scale, with both correctness and self-consistency supervision (Ni et al., 20 Oct 2025).