Reasoning Models Better Express Their Confidence (2505.14489v1)

Published 20 May 2025 in cs.AI and cs.CL

Abstract: Despite their strengths, LLMs often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models-LLMs that engage in extended chain-of-thought (CoT) reasoning-exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models-such as exploring alternative approaches and backtracking-which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that these gains are not exclusive to reasoning models-non-reasoning models also benefit when guided to perform slow thinking via in-context learning.

This paper, "Reasoning Models Better Express Their Confidence" (Yoon et al., 20 May 2025), investigates the ability of LLMs that engage in Chain-of-Thought (CoT) reasoning—termed "reasoning models"—to accurately express their confidence. The core problem addressed is the common tendency of LLMs to sound overconfident, even when incorrect, which limits their reliability. The authors demonstrate that reasoning models exhibit superior confidence calibration compared to non-reasoning (standard instruction-tuned) models, and that this improvement stems from "slow thinking" behaviors inherent in their CoT process.

Experimental Setup

The paper benchmarks six reasoning models, derived from four backbone LLM families (including Qwen, GLM, and EXAONE), against their non-reasoning counterparts. Experiments are conducted on two types of datasets:

  1. Knowledge-focused datasets: TriviaQA and NonambigQA. These are chosen because CoT offers minimal accuracy improvement, allowing for a more direct comparison of calibration abilities independent of task performance.
  2. Reasoning-intensive datasets: Subsets (Math and Non-Math) of SuperGPQA and MMLU-Pro, which are multiple-choice benchmarks requiring complex reasoning.

For all experiments, 1,000 examples were sampled from each dataset/subset.

Inference Procedure:

Models were instructed in a single turn to perform three steps using CoT:

  1. Solution Reasoning: Step-by-step reasoning to arrive at an answer.
  2. Confidence Reasoning: Step-by-step evaluation of their confidence in the derived answer by assessing their own thinking process.
  3. Confidence Verbalization: Mapping their confidence to one of ten predefined bins, each with a linguistic descriptor and a numerical probability range (e.g., "Almost certain (0.9–1.0)").

The prompt used for this process is detailed in Appendix A.5, Listing 1.

First, reason through the question step by step to arrive at an answer.
Then, thoroughly assess your confidence in that answer by evaluating your thinking process so far.
Finally, classify your confidence into one of the following classes based on how likely your answer is to be correct:

- "Almost no chance" (0.0--0.1)
- "Highly unlikely" (0.1--0.2)
- "Chances are slight" (0.2--0.3)
- "Unlikely" (0.3--0.4)
- "Less than even" (0.4--0.5)
- "Better than even" (0.5--0.6)
- "Likely" (0.6--0.7)
- "Very good chance" (0.7--0.8)
- "Highly likely" (0.8--0.9)
- "Almost certain" (0.9--1.0)

Each category reflects the probability that your answer is correct.

At the very end of your output, format your answer and confidence as
**Answer**: $ANSWER
**Confidence**: $CLASS
where CLASS is one of the names (only the names without the probability ranges) of the classes above.
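
As a concrete illustration of the single-turn setup, the snippet below runs this prompt through a Hugging Face chat model. It is a minimal sketch, not the authors' code: the model name, decoding settings, and the way the question is appended to the instructions are assumptions.

```python
# Minimal sketch (not the authors' code) of single-turn confidence elicitation
# with a Hugging Face chat model. Model choice and decoding settings are assumptions;
# device_map="auto" requires the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # one of the benchmarked reasoning models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto", torch_dtype="auto")

ELICITATION_PROMPT = "First, reason through the question step by step ..."  # full Listing 1 text
question = "Who wrote the novel 'The Remains of the Day'?"  # illustrative example question

messages = [{"role": "user", "content": f"{ELICITATION_PROMPT}\n\nQuestion: {question}"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```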

A crucial implementation detail is that some reasoning models (R1-Distill, OR1-Preview, GLM-Z1) rarely engaged in explicit Confidence Reasoning within their thinking block (<think> ... </think>). To address this, these models were forced to perform Confidence Reasoning by:

  1. Generating the CoT up to the </think> token.
  2. Replacing </think> with the prompt: "Okay, now let's assess my overall thinking process so far step-by-step. I need to evaluate how likely my answer is correct."
  3. Performing a second round of inference to complete the Confidence Reasoning and Verbalization.
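
A minimal sketch of this interrupt-and-continue procedure is shown below. The `generate` callable is a stand-in for whatever serving stack is used (it should take a prompt plus an optional stop string and return the completion); it and the stitching details are assumptions, not the paper's exact implementation.

```python
# Sketch of the two-round inference used to force Confidence Reasoning.
# `generate(prompt, stop=...)` is a hypothetical completion helper provided by
# the serving stack; only the overall interrupt-and-continue logic follows the paper.

CONFIDENCE_NUDGE = (
    "Okay, now let's assess my overall thinking process so far step-by-step. "
    "I need to evaluate how likely my answer is correct."
)

def elicit_with_forced_confidence_reasoning(generate, prompt: str) -> str:
    # Round 1: generate the solution CoT, stopping right before </think> is emitted.
    cot = generate(prompt, stop="</think>")
    # Swap the (withheld) </think> token for the confidence-reasoning nudge.
    resumed_prompt = prompt + cot + "\n" + CONFIDENCE_NUDGE + "\n"
    # Round 2: continue generation to obtain Confidence Reasoning and the
    # final verbalized answer and confidence class.
    continuation = generate(resumed_prompt, stop=None)
    return cot + "\n" + CONFIDENCE_NUDGE + "\n" + continuation
```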

Final answers and confidence levels were extracted using rule-based filtering (regex), and answers were matched against ground truth using GPT-4o mini.
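
The pattern matching can be as simple as the sketch below; the exact regular expressions and the bin-midpoint mapping are illustrative assumptions rather than the paper's filters.

```python
import re

# Midpoint of each verbalized bin, used to turn the class name into a number
# for scoring (an assumption; the paper's exact mapping may differ).
CLASS_TO_PROB = {
    "Almost no chance": 0.05, "Highly unlikely": 0.15, "Chances are slight": 0.25,
    "Unlikely": 0.35, "Less than even": 0.45, "Better than even": 0.55,
    "Likely": 0.65, "Very good chance": 0.75, "Highly likely": 0.85,
    "Almost certain": 0.95,
}

def parse_output(text: str):
    """Extract the final answer string and a numeric confidence from model output."""
    answer_match = re.search(r"\*\*Answer\*\*:\s*(.+)", text)
    class_match = re.search(r"\*\*Confidence\*\*:\s*\"?([A-Za-z ]+?)\"?\s*$", text, re.MULTILINE)
    answer = answer_match.group(1).strip() if answer_match else None
    confidence = CLASS_TO_PROB.get(class_match.group(1).strip()) if class_match else None
    return answer, confidence
```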

Evaluation Metrics:

  • Expected Calibration Error (ECE): Measures average discrepancy between accuracy and predicted confidence per bin. Lower is better.
  • Brier Score: Mean squared difference between predicted confidence and true binary outcome. Captures absolute calibration and discriminative ability. Lower is better.
  • Area Under the ROC Curve (AUROC): Measures discriminative ability (assigning higher confidence to correct vs. incorrect predictions). Higher is better.
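
For reference, a compact sketch of the three metrics is given below, where `confs` holds the numeric confidences (e.g., bin midpoints) and `correct` the 0/1 outcomes; the 10-bin ECE layout mirrors the ten verbalized bins, and `roc_auc_score` comes from scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confs, correct, n_bins=10):
    confs, correct = np.asarray(confs, float), np.asarray(correct, float)
    bins = np.minimum((confs * n_bins).astype(int), n_bins - 1)  # bin index per prediction
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |accuracy - mean confidence| in the bin, weighted by the bin's share of examples
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return float(ece)

def brier_score(confs, correct):
    confs, correct = np.asarray(confs, float), np.asarray(correct, float)
    return float(np.mean((confs - correct) ** 2))

def auroc(confs, correct):
    return float(roc_auc_score(correct, confs))
```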

Key Findings

  1. Reasoning Models Show Superior Calibration:

    Across 33 out of 36 settings, reasoning models strictly outperformed their non-reasoning counterparts in all calibration metrics. This held true for both knowledge-focused and reasoning-intensive datasets.

    • On knowledge-focused datasets (Table 1), reasoning models showed better calibration even when their task accuracy was comparable or slightly lower than non-reasoning models. This suggests calibration gains are not merely a byproduct of better problem-solving.
    • For instance, Qwen2.5-Instruct (non-reasoning) on TriviaQA had an ECE of 0.129, while R1-Distill-Qwen (reasoning) achieved 0.042.
    • Figure 2 illustrates how R1-Distill-Qwen uses a wider, more appropriate range of confidence bins compared to the overconfident Qwen2.5-Instruct.
    • The Qwen3 model, which supports both a "Thinking Mode" (reasoning) and "Non-thinking Mode", showed significantly better calibration in Thinking Mode, reinforcing that the CoT process itself is key.
  2. Calibration Improves as CoT Unfolds for Reasoning Models:

    The paper analyzed how calibration changes as the CoT progresses. CoTs were divided into 11 cumulative segments, and confidence was elicited at each point (a code sketch of this segmentation appears after this list).

    • Reasoning models showed a steady, statistically significant improvement in calibration (lower Brier Score, lower ECE, higher AUROC) as more of their CoT was generated (Figure 3, Table 3).
    • Non-reasoning models showed no such trend; their calibration sometimes even worsened as their CoT unfolded. This suggests reasoning models dynamically adjust and refine their confidence during the slow thinking process.
  3. Slow Thinking Behaviors are Crucial for Calibration:

    An ablation study on R1-Distill-Qwen-32B's CoTs (Table 4) identified which components of slow thinking contribute most to calibration:

    • Confidence Reasoning (explicitly thinking about confidence): Removing this had only a minor negative impact, suggesting that thinking about the problem itself is more important for calibration than meta-reasoning about confidence.
    • Epistemic Markers (e.g., "I think", "maybe"): Removing these (using GPT-4.1 to edit CoTs) degraded ECE (models became more overconfident) but improved AUROC. This was because models tended to use fewer, more extreme confidence bins, which, while overconfident, still discriminated correct from incorrect answers well.
    • Non-linear Reasoning (exploring alternatives, backtracking, refining): Removing these non-linear aspects of CoT (linearizing the thought process using GPT-4.1) had the most significant negative impact on all calibration metrics. This highlights that the ability to explore, self-correct, and consider alternatives during CoT is paramount for accurate confidence estimation.
  4. Non-Reasoning Models Can Benefit from Guided Slow Thinking: When non-reasoning models were prompted with few-shot examples of "slow thinking" CoTs from a reasoning model (R1-Distill-Qwen), they also exhibited improved calibration (Table 5). This demonstrates that the benefits of slow thinking are not exclusive to specialized reasoning models but can be induced.
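
The segmentation analysis behind finding 2 can be reproduced with a sketch like the one below. How the 11 cut points are spaced and how confidence is re-elicited at each cut (`elicit_confidence`) are assumptions; only the idea of scoring cumulative CoT prefixes follows the paper.

```python
# Sketch of the cumulative-CoT calibration analysis (finding 2). The spacing of
# the 11 cut points and the `elicit_confidence` helper are assumptions.

def cumulative_prefixes(cot_tokens, n_points=11):
    """Yield 11 cumulative prefixes of the CoT, from 0% up to 100% of its tokens."""
    n = len(cot_tokens)
    for i in range(n_points):
        yield cot_tokens[: round(n * i / (n_points - 1))]

def confidence_trajectory(cot_tokens, elicit_confidence):
    # elicit_confidence(prefix_tokens) -> numeric confidence in [0, 1] (hypothetical),
    # e.g. obtained by appending the confidence-elicitation prompt after the truncated CoT.
    return [elicit_confidence(prefix) for prefix in cumulative_prefixes(cot_tokens)]
```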

Practical Implementation Strategies

  • Prompting for Verbalized Confidence: Use a structured prompt that guides the LLM through solution reasoning, then confidence reasoning, and finally verbalization into predefined bins with linguistic and numerical anchors. The specific prompt and bin structure from the paper (Appendix A.5, Listing 1) can be adopted.
  • Enforcing Confidence Reasoning: If a model doesn't naturally perform detailed confidence reasoning, the two-step inference process (interrupting before </think>, inserting a confidence reasoning prompt, and continuing) can be implemented.
  • Leveraging <think> Tags: For models that support them, ensure reasoning happens within <think> ... </think> blocks to isolate the reasoning phase.
  • Inducing Slow Thinking in Standard LLMs: For non-reasoning models, provide few-shot examples in the prompt that demonstrate "slow thinking" behaviors (exploring alternatives, self-correction, expressing intermediate uncertainty). This can improve their confidence calibration without needing a specialized reasoning model (see the sketch after this list).
  • CoT Analysis for Debugging/Improvement: The methodology of segmenting CoT and tracking confidence can be a useful diagnostic tool to understand how a model arrives at its confidence score and where its self-assessment might be flawed.
  • Ablation of CoT Components: For research or advanced model development, using an LLM like GPT-4.1 with carefully crafted prompts (see Appendix A.6, Listings 2 & 3 for examples) to systematically modify CoTs can help pinpoint which reasoning behaviors most influence calibration. This requires careful instruction and validation.
  • Choice of Confidence Expression: The paper's main experiments found binned verbalized confidence (a linguistic descriptor plus a numerical range) effective. Appendix A.2.2 shows that directly outputting a numerical probability without binning generally degraded calibration, while linguistic-only descriptors were also effective. Binned approaches appear more robust.
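
To illustrate the "Inducing Slow Thinking in Standard LLMs" strategy concretely, the sketch below assembles such a few-shot prompt for a non-reasoning model. The demonstration text is invented for illustration; in the paper, the exemplars were CoTs sampled from R1-Distill-Qwen.

```python
# Sketch: inducing slow thinking in a non-reasoning model via in-context examples.
# The demonstration below is invented for illustration; the paper drew its few-shot
# exemplars from a reasoning model's (R1-Distill-Qwen's) actual CoTs.

SLOW_THINKING_DEMO = """Question: Which planet has the shortest day?
Reasoning: My first thought is Mercury, since it is closest to the Sun. Wait, that
conflates orbital period with rotation. Let me reconsider: Jupiter rotates in about
10 hours, which I am fairly sure is the fastest rotation among the planets. I briefly
considered Saturn (about 10.7 hours), but Jupiter's day is still shorter.
**Answer**: Jupiter
**Confidence**: Highly likely"""

def build_few_shot_prompt(elicitation_prompt: str, demos: list[str], question: str) -> str:
    # Prepend slow-thinking demonstrations to the Listing 1 instructions,
    # then append the target question in the same format as the demos.
    return "\n\n".join([elicitation_prompt, *demos, f"Question: {question}"])
```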

Discussion and Limitations

  • Quality over Quantity of CoT: Simply forcing longer CoTs (e.g., via budget forcing) does not necessarily lead to better calibration. The quality and nature (e.g., non-linearity) of the slow thinking are more important.
  • Scaling with Model Size: The calibration benefits of slow thinking appear to become more pronounced in larger, more capable LLMs, suggesting this is a promising direction as models scale.
  • Room for Improvement: Even reasoning models tend to be somewhat overconfident, rarely using low confidence bins (<55%). Challenges remain, especially when task accuracy is lower (e.g., higher ECE/Brier on NonambigQA vs. TriviaQA).

In conclusion, the paper provides strong evidence that eliciting slow thinking behaviors through CoT allows reasoning models to express their confidence more accurately. This has significant implications for improving the trustworthiness and reliability of LLMs in practical applications. The methods and analyses presented offer concrete pathways for developers to implement and evaluate confidence calibration in their own systems.

Authors (9)
  1. Dongkeun Yoon
  2. Seungone Kim
  3. Sohee Yang
  4. Sunkyoung Kim
  5. Soyeon Kim
  6. Yongil Kim
  7. Eunbi Choi
  8. Yireun Kim
  9. Minjoon Seo