Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision (2506.03723v1)

Published 4 Jun 2025 in cs.CL and cs.AI

Abstract: Uncertainty calibration is essential for the safe deployment of LLMs, particularly when users rely on verbalized confidence estimates. While prior work has focused on classifiers or short-form generation, confidence calibration for chain-of-thought (CoT) reasoning remains largely unexplored. Surprisingly, we find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of LLMs, without any explicit reasoning supervision or reinforcement learning-based rewards. Despite being trained only to produce a verbalized confidence score without any self-verifying examples, the model learns to generate longer and self-checking responses for low-confidence queries while providing more concise answers for high-confidence ones. We further propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty. Experiments on GSM8K and held-out reasoning tasks such as MATH-500 and ARC-Challenge show that our confidence-aware fine-tuning improves both calibration and accuracy, while also enhancing interpretability by aligning the model's reasoning path with its confidence.

This paper introduces a method called Confidence-Supervised Fine-Tuning (CSFT) to improve the calibration of verbalized confidence in LLMs, particularly for Chain-of-Thought (CoT) reasoning tasks. The core idea is that fine-tuning an LLM to predict a verbalized confidence score, derived from its own performance, can lead to better calibrated models that also exhibit emergent self-verification behaviors without explicit reasoning supervision.

Problem Addressed:

LLMs often produce incorrect outputs with high confidence, which is problematic in high-stakes applications. While verbalized confidence is more interpretable for users, calibrating it for CoT reasoning, where models generate explicit reasoning steps, has been largely unexplored. Existing methods often rely on complex procedures or don't generalize well.

Proposed Method: Confidence-Supervised Fine-Tuning (CSFT)

CSFT fine-tunes a pre-trained LLM to generate a response structured as follows (a minimal format sketch appears after the list):

  1. A CoT reasoning trace enclosed in dedicated reasoning tags (e.g., <think> ... </think>).
  2. A final answer enclosed in <answer> ... </answer> tags.
  3. A verbalized confidence score (e.g., a number from 0 to 100) enclosed in <confidence> ... </confidence> tags, elicited by a suffix confidence prompt.
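Below is a minimal sketch of what one fine-tuning target might look like under this format. The `<answer>` and `<confidence>` tags follow the description above; the reasoning tag name and the exact wording of the suffix confidence prompt are assumptions for illustration.

```python
# Illustrative CSFT training target. The <think> tag and the confidence-prompt
# wording are assumptions; only the <answer> and <confidence> tags are given
# in the summary above.
def format_csft_target(reasoning: str, answer: str, confidence: int) -> str:
    return (
        f"<think>{reasoning}</think>\n"
        f"<answer>{answer}</answer>\n"
        "How confident are you in your answer? Give a score from 0 to 100.\n"
        f"<confidence>{confidence}</confidence>"
    )

print(format_csft_target("48 / 2 = 24, then 24 - 4 = 20 ...", "20", 90))
```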

Implementation Details:

  • Self-Confidence Label Generation:

    For each training question $q$, $K$ full generations (reasoning $r^{(i)}$ and answer $a^{(i)}$) are sampled from the model $f_\theta(\cdot \mid q)$. The empirical accuracy $\hat{p}(q)$ is calculated as the fraction of these $K$ sampled answers $a^{(i)}$ that match the gold answer $a^\star$:

    $\hat{p}(q) = \frac{1}{K} \sum_{i=1}^{K} \mathbbm{1}\bigl[a^{(i)} = a^\star\bigr]$

    The self-confidence label $c$ is then obtained by discretizing this accuracy: $c = \lfloor 100 \cdot \hat{p}(q) \rfloor$.
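A toy sketch of this label construction, assuming the $K$ answers have already been sampled with whatever generation API is in use; the helper name and inputs are illustrative.

```python
import math

def self_confidence_label(sampled_answers: list[str], gold_answer: str) -> int:
    """Empirical accuracy over the K sampled answers, discretized to an
    integer confidence label c = floor(100 * p_hat(q))."""
    k = len(sampled_answers)
    p_hat = sum(a == gold_answer for a in sampled_answers) / k
    return math.floor(100 * p_hat)

# With K = 10 samples and 7 matches, the label is 70.
print(self_confidence_label(["72"] * 7 + ["68"] * 3, "72"))  # -> 70
```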

  • Training Objective:

    The model is fine-tuned using Low Rank Adaptation (LoRA). The primary loss function is a masked cross-entropy loss, applied only to the token positions $T_c$ corresponding to the confidence score (including the <confidence> tags):

    $\mathcal{L}_{\text{CSFT}} = - \sum_{t \in T_c} \log p_\theta(y_t \mid y_{<t}, q)$

    Here, $p_\theta(y_t \mid y_{<t}, q)$ is the probability of the token $y_t$ given the preceding tokens and the question, predicted by the LoRA-adapted model $f_\theta$. Optionally, a KL regularization term can be added to keep the model's output distribution for the CoT and answer spans ($T_{\text{KL}}$) close to that of the original pre-trained model $f_{\theta_0}$:

    $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CSFT}} + \lambda \sum_{t \in T_{\text{KL}}} \text{KL}\!\left( p_\theta(\cdot \mid y_{<t}, q) \;\|\; p_{\theta_0}(\cdot \mid y_{<t}, q) \right)$

    where $\lambda$ is a weighting hyperparameter (set to 0 by default in most experiments). The reasoning $r$ and answer $a$ are not directly supervised during this fine-tuning.
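A rough PyTorch sketch of this objective, assuming the confidence span $T_c$ and KL span $T_{\text{KL}}$ are supplied as boolean masks over the token sequence; tensor names, shapes, and the per-span averaging are implementation choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def csft_loss(logits, target_ids, conf_mask, ref_logits=None, kl_mask=None, lam=0.0):
    """Masked cross-entropy over the confidence tokens (T_c), plus an optional
    KL term tying the CoT/answer distribution (T_KL) to the frozen base model.
    logits, ref_logits: (B, T, V); target_ids, conf_mask, kl_mask: (B, T)."""
    # Standard causal-LM shift: the logits at position t predict token t+1.
    logp = F.log_softmax(logits[:, :-1], dim=-1)                  # (B, T-1, V)
    tgt = target_ids[:, 1:]                                       # (B, T-1)
    nll = -logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)         # (B, T-1)

    c = conf_mask[:, 1:].float()
    loss = (nll * c).sum() / c.sum().clamp(min=1.0)               # L_CSFT

    if lam > 0.0 and ref_logits is not None and kl_mask is not None:
        ref_logp = F.log_softmax(ref_logits[:, :-1], dim=-1)
        # kl_div(input=log p_ref, target=log p_theta, log_target=True) yields
        # elementwise terms of KL(p_theta || p_ref); sum over the vocab axis.
        kl = F.kl_div(ref_logp, logp, reduction="none", log_target=True).sum(-1)
        k = kl_mask[:, 1:].float()
        loss = loss + lam * (kl * k).sum() / k.sum().clamp(min=1.0)
    return loss
```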

Key Findings and Contributions:

  1. Improved Calibration and Accuracy: CSFT significantly improves calibration metrics like Expected Calibration Error (ECE) and Brier Score (BS), as well as accuracy, on reasoning tasks. This was demonstrated on LLaMA3.2-3B-Instruct and Qwen2.5-1.5B-Instruct models.
  2. Emergent Self-Verification: A striking finding is that CSFT elicits self-verification behavior without any explicit supervision for it. Models learn to:
    • Generate longer, more detailed, and self-checking reasoning traces (e.g., using phrases like "recalculate" or "let me double-check") for questions where they have low confidence.
    • Produce more concise and decisive answers for high-confidence questions. This behavior was observed through increased output length and higher rates of self-verification (measured by GPT-4.1) in low-confidence bins.
  3. Generalization: The benefits of CSFT, including improved calibration and emergent self-verification, generalize to held-out reasoning tasks (MATH-500, ARC-Challenge) that are structurally and topically different from the training data (GSM8K).
  4. Scalability: CSFT offers a scalable way to build uncertainty-aware LLMs using standard supervised fine-tuning pipelines, without architectural changes or complex reinforcement learning.

Experimental Setup:

  • Models: LLaMA3.2-3B-Instruct, Qwen2.5-1.5B-Instruct.
  • Training Data: GSM8K training split. For each question, $K=10$ CoT traces and answers were sampled to generate self-consistency labels.
  • Evaluation Datasets: GSM8K test set (in-distribution), MATH-500, and ARC-Challenge (out-of-distribution/held-out).
  • Metrics: Accuracy (ACC), Area Under the Receiver Operating Characteristic curve (AUROC), Expected Calibration Error (ECE), Brier Score (BS), and average generation length (toy ECE/Brier implementations are sketched below).
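Toy NumPy implementations of ECE (equal-width bins) and the Brier score for verbalized scores rescaled to [0, 1]; binning details vary across papers, so treat this as one common convention rather than the paper's exact evaluation protocol.

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(correct, dtype=float)
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its sample share.
            total += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return total

def brier(confidences, correct):
    """Brier score: mean squared gap between confidence and correctness."""
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(correct, dtype=float)
    return float(np.mean((conf - acc) ** 2))

# Verbalized 0-100 scores are rescaled to [0, 1] before scoring.
conf = np.array([90, 70, 10, 100]) / 100.0
acc = np.array([1, 1, 0, 1])
print(ece(conf, acc), brier(conf, acc))
```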

Analysis of Self-Verification:

The paper shows a clear inverse relationship between the model's verbalized confidence and the length of its generated CoT. For instance, on GSM8K, low-confidence predictions resulted in significantly longer outputs with phrases indicating self-checking. This suggests that the model learns to adapt its reasoning effort based on its internal uncertainty. Qualitative examples (Figure 3) illustrate this, and a toy per-bin length aggregation is sketched after the examples:

  • Low-confidence example: The CSFT model produces a long, reflective trace, identifies an initial error, recalculates, and arrives at the correct answer, while the zero-shot baseline fails. The CSFT model verbalizes low confidence (e.g., 10).
  • High-confidence example: Both CSFT and zero-shot models answer correctly, but the CSFT model's response is more concise, reflecting its high verbalized confidence (e.g., 100).
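To make the length analysis concrete, here is a toy aggregation over (confidence, output-length) pairs with assumed 10-point bins; the binning and the length measure are illustrative, not the paper's exact analysis code.

```python
from collections import defaultdict

def mean_length_by_confidence_bin(records, bin_width=10):
    """Average output length per verbalized-confidence bin.
    `records` is an iterable of (confidence_0_to_100, output_length) pairs."""
    totals = defaultdict(lambda: [0, 0])          # bin index -> [length sum, count]
    for conf, length in records:
        b = min(conf // bin_width, 100 // bin_width - 1)
        totals[b][0] += length
        totals[b][1] += 1
    return {f"{b * bin_width}-{(b + 1) * bin_width}": s / n
            for b, (s, n) in sorted(totals.items())}

# Toy records showing the reported inverse trend: low confidence, longer traces.
print(mean_length_by_confidence_bin([(10, 420), (20, 380), (90, 150), (100, 120)]))
```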

Ablation Studies:

  • Prompt Position:
    • Suffix Prompting (confidence after answer): This is the main approach. It improves calibration and accuracy, even without KL regularization ($\lambda=0$). It doesn't interfere with the primary reasoning process.
    • Prefix Prompting (confidence before reasoning): Leads to lower accuracy, especially without KL regularization. The model might overfit to expressing uncertainty rather than reasoning correctly. This suggests prefix prompting needs careful regularization.
  • KL Regularization:
    • Crucial for prefix prompting to prevent performance degradation.
    • Less critical for suffix prompting, and can even be slightly detrimental (performance improves when $\lambda=0$), since the confidence is predicted post-hoc.
  • Label Quality: Using random confidence labels instead of self-consistency based labels significantly degrades performance, highlighting the importance of meaningful supervision.
  • Confidence Question: Removing the explicit "How confident are you..." question and asking for a scalar confidence directly after the answer (e.g., using another <answer> tag) caused training to collapse. This indicates the explicit prompt is necessary to ground the meaning of the confidence score.
  • CoT Visibility: Eliciting confidence before CoT generation versus after showed similar calibration profiles, suggesting the model's confidence reflects internal uncertainty rather than just relying on the observed length or content of the generated CoT.

Confidence-Guided Reasoning Path Refinement:

The paper explores a practical application: if the model expresses low confidence (elicited before CoT generation using a prefix prompt), its reasoning can be manually redirected. By providing an altered or more structured prompt to initiate a new reasoning attempt for these low-confidence samples, accuracy was substantially improved (e.g., a more than 55 percentage-point increase in the 0-10 confidence range on GSM8K). This highlights the utility of calibrated confidence for guiding reasoning-time interventions.
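A minimal sketch of such an intervention loop, assuming a `generate(prompt)` callable that returns the model's full tagged output; the confidence threshold, prompt wording, and tag parsing are assumptions rather than the paper's exact procedure.

```python
import re

LOW_CONFIDENCE_THRESHOLD = 30  # assumed cutoff for "low confidence"

def parse_confidence(text: str):
    """Extract the verbalized score from a <confidence>...</confidence> span."""
    m = re.search(r"<confidence>\s*(\d+)\s*</confidence>", text)
    return int(m.group(1)) if m else None

def answer_with_refinement(question: str, generate):
    # First pass: elicit confidence before the CoT (prefix-prompt variant).
    first = generate(f"{question}\nState your confidence (0-100) first, then reason step by step.")
    conf = parse_confidence(first)
    if conf is not None and conf < LOW_CONFIDENCE_THRESHOLD:
        # Low confidence: redirect the reasoning with a more structured prompt.
        return generate(f"{question}\nBreak the problem into explicit sub-steps, "
                        "verify each step, then give the final answer.")
    return first
```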

Conclusion and Future Work:

CSFT is a simple and effective method for training LLMs to produce well-calibrated verbalized confidence for CoT reasoning, leading to emergent self-verification behaviors. This enhances reliability and interpretability. Future directions include:

  • Using pre-CoT confidence to predict downstream reasoning costs (e.g., length, compute).
  • Developing confidence-conditioned inference policies to balance accuracy and efficiency.
  • Addressing issues where low-confidence generations lead to excessively long or redundant reasoning.
  • Exploring confidence-aware steering of CoT trajectories or triggering "rethink" interventions based on latent confidence signals.

This method provides a practical pathway towards LLMs that are not only more accurate but also more aware and communicative about their own uncertainty during complex reasoning tasks.

Authors (5)
  1. Chaeyun Jang
  2. Moonseok Choi
  3. Yegon Kim
  4. Hyungi Lee
  5. Juho Lee