
Thought calibration: Efficient and confident test-time scaling (2505.18404v1)

Published 23 May 2025 in cs.LG and cs.AI

Abstract: Reasoning LLMs achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a LLM's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the LLM's hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning LLMs and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.

Summary

  • The paper presents thought calibration to dynamically decide when to halt LLM reasoning, reducing token usage and cost.
  • It leverages a Learn-then-Test framework with lightweight probes to estimate reasoning risk from hidden representations.
  • Experiments reveal up to 60% token reduction on in-distribution tasks and 20% on OOD datasets while preserving accuracy.

This paper introduces "thought calibration," a method for dynamically deciding when an LLM can stop generating intermediate "thoughts" during reasoning tasks, thereby reducing computational cost without significantly sacrificing performance. The core idea is to view an LLM's reasoning process as the growth of an abstract reasoning tree. The goal is to terminate thinking when this tree is unlikely to grow further or yield a different final answer.

Core Problem and Proposed Solution

LLMs often achieve better reasoning performance by generating more intermediate steps ("thinking longer"). However, this "test-time scaling" incurs significant computational costs. Naively limiting the generation length can hurt performance, as not all problems require the same amount of reasoning. Thought calibration aims to address this by providing an efficient and statistically calibrated decision rule for early stopping.

Theoretical Framework

  1. Reasoning Graph/Tree: The authors formalize LLM reasoning using an abstract reasoning graph $G$, where nodes are thoughts and edges represent entailment. A reasoning trajectory $z$ is a root-to-leaf walk. The set of generated thoughts $y_t$ at step $t$ defines a subgraph $G_t$. The ideal stopping point is when $G_t$ is unlikely to change significantly with further generation ($G_t \approx G_T$, where $T$ is the maximum budget) or when it already contains the correct answer $z^*$.
    • Ideal goal (correctness): $\mathbb{P}\left( \mathbb{E} \left[ \mathbbm{1}[ z^* \not\in G_t ] \right] \le \delta \right) \ge 1-\epsilon$
    • Practical goal (consistency): $\mathbb{P}\left( \mathbb{E} \left[ \mathbbm{1}[ G_t \ne G_T ] \right] \le \delta \right) \ge 1-\epsilon$
  2. Learn then Test (LTT) for Calibration: The stopping decision is calibrated using the Learn then Test (LTT) framework, which views hyperparameter selection (here, a threshold $\lambda$ for a surrogate function $f$) as a multiple hypothesis testing problem (a minimal sketch follows this list).
    • For a set of candidate thresholds $\Lambda = \{\lambda_1, \dots, \lambda_m\}$, each $\lambda_j$ is associated with a null hypothesis $H_j : \mathbb{E}[R(y_t)] > \delta$, where $R(y_t)$ is the risk of stopping at $y_t$.
    • Using a calibration dataset $\mathcal{D}_\text{cal}$ of size $n$, a p-value $p_j$ is computed for each $H_j$ via the binomial tail bound $p_\lambda^{BT} := \mathbb{P}\left( \text{Binom}(n, \delta) \le n \hat{R}_n(\lambda) \right)$, where $\hat{R}_n(\lambda)$ is the empirical risk.
    • A family-wise error rate (FWER) controlling algorithm (fixed sequence testing in this paper) rejects hypotheses in order while $p_j \le \epsilon$ and selects $\lambda_{j-1}$, the threshold just before the first $p_j > \epsilon$, which yields the risk-control guarantee.
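
To make this concrete, here is a minimal sketch of fixed sequence testing with the binomial tail p-value. It is an illustration under assumed interfaces, not the authors' code: the name calibrate_threshold and the losses callback (which returns per-example 0/1 losses on the calibration set for a given threshold) are invented for this example, and scipy provides the binomial CDF.

import numpy as np
from scipy.stats import binom

def calibrate_threshold(candidate_lambdas, losses, delta, epsilon):
  # candidate_lambdas: thresholds ordered from most to least conservative
  #                    (stop-later thresholds are tested first).
  # losses(lam):       0/1 loss per calibration example when stopping at the
  #                    first step whose probe score reaches lam.
  # delta:             tolerated risk level (null hypothesis: risk > delta).
  # epsilon:           family-wise error rate for fixed sequence testing.
  selected = None
  for lam in candidate_lambdas:
    per_example_loss = np.asarray(losses(lam))
    n = len(per_example_loss)
    errors = int(per_example_loss.sum())
    p_value = binom.cdf(errors, n, delta)  # binomial tail bound p-value
    if p_value > epsilon:
      break              # fail to reject: stop testing further thresholds
    selected = lam       # null rejected: lam is certified at risk level delta
  return selected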

Estimating Empirical Risk with Probes

Since directly observing $G_t$ or $z^*$ during inference is hard, lightweight probes operating on the LLM's hidden representations are used to estimate surrogate functions $f$.

  1. $f_\text{correct}(y_t)$: Predicts the probability that the LLM will answer correctly given thoughts $y_t$.
    • Risk: $R_\text{correct}(y_t) := \mathbbm{1}\{\text{LLM is correct}\} \cdot (1 - f_\text{correct}(y_t)) + \mathbbm{1}\{\text{LLM is wrong}\} \cdot f_\text{correct}(y_t)$.
    • Drawback: Requires labeled data for correctness and assumes problems are solvable.
  2. $f_\text{consistent}(y_t)$: Predicts the probability that the final answer $z_t$ (derived from $y_t$) is the same as the answer $z_T$ (derived from thoughts generated up to the maximum budget $T$).
    • Risk: $R_\text{consistent}(y_t) := \mathbbm{1}\{\text{consistent}\} \cdot (1 - f_\text{consistent}(y_t)) + \mathbbm{1}\{\text{inconsistent}\} \cdot f_\text{consistent}(y_t)$ (a sketch of this risk computation follows the list).
    • Advantage: Does not require correctness labels and applies to intractable problems.
  3. $f_\text{novel leaf}(y_t)$: Estimates if the current thought $y^{(t)}$ is a novel leaf in the reasoning tree. The paper actually formulates it as detecting non-novel leaves: $\mathbb{P}(y^{(t)} \text{ is a leaf}) \cdot (1 - \mathbb{P}(y^{(t)} \text{ is novel}))$.
    • Risk: $R_\text{novel leaf}(y_t)$ is defined similarly, using consistency labels for easier verification.
    • Captures the idea that the LLM might be reiterating information.
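
All three risks share the same form: the probe is penalized when it is confident in the direction that disagrees with the eventual label. A minimal sketch, assuming a 0/1 label and a probe probability as plain floats (the function name is illustrative, not from the paper):

def probe_risk(label, probe_prob):
  # label: 1 if the desirable event held (e.g., the truncated answer agreed
  #        with the full-budget answer), else 0.
  # probe_prob: the probe's predicted probability of that event.
  # Risk is high when the probe is confidently wrong in either direction.
  return label * (1.0 - probe_prob) + (1 - label) * probe_prob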

Implementation Details

  • Thought Segmentation: Reasoning trajectories $y$ are split into steps $y^{(i)}$ using \n\n delimiters that also contain "wait" or "but".
  • Step Representation: For each step $y^{(i)}$, the last-layer hidden representations of its tokens are mean-pooled and then reduced with PCA to $d = 256$ dimensions.
  • Probe Architecture: Linear probes are trained on these step-level representations, and probabilities are averaged over a window of 10 steps for smoothness before calibration; a sketch of this pipeline follows the list. (The appendix notes experiments with MLPs and Transformers, but linear probes were chosen for simplicity and to avoid overfitting on limited training data.)
  • Ground Truth Labeling for Probes: A separate LLM (Qwen 3 32B) is prompted to generate labels for correctness, consistency, leaf identification, and novelty.
    • Correctness: Truncate thoughts, prompt for final answer, compare to ground truth.
    • Consistency: Compare intermediate answers $z_t$ to the maximum-budget answer $z_T$.
    • Leaf: Ask the LLM whether step $y^{(i)}$ attempts to answer the original question.
    • Novelty: Provide previous thoughts $y^{(1)}, \dots, y^{(i-1)}$ and ask whether $y^{(i)}$ adds new information.
  • Evaluation: Final answer correctness is evaluated using the GPT 4.1 API.
  • Models: DeepSeek-R1 distilled Qwen 2.5 32B, Llama 3.3 70B, and QwQ 32B.
  • Software: vLLM and lmdeploy for running LLMs.
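
As a rough illustration of the probe pipeline described above (mean pooling of last-layer token states, PCA to 256 dimensions, a linear probe, and a 10-step smoothing window), here is a sketch; the fitted pca and linear_probe objects and the helper name are assumptions, not the paper's implementation:

import numpy as np

def smoothed_probe_scores(step_hidden_states, pca, linear_probe, window=10):
  # step_hidden_states: one array per thinking step, each of shape
  #                     (num_tokens_in_step, d_model), from the LLM's last layer.
  # pca:                fitted projection to 256 dimensions (e.g., sklearn PCA).
  # linear_probe:       fitted linear classifier exposing predict_proba.
  pooled = np.stack([h.mean(axis=0) for h in step_hidden_states])  # (num_steps, d_model)
  reduced = pca.transform(pooled)                                  # (num_steps, 256)
  probs = linear_probe.predict_proba(reduced)[:, 1]                # per-step probabilities
  # Average over a trailing window of steps before thresholding.
  return np.array([probs[max(0, i - window + 1): i + 1].mean()
                   for i in range(len(probs))])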

Pseudocode for Thought Calibration during Inference:

def run_thought_calibration(problem_x, LLM, calibrated_threshold_lambda, probe_f, max_steps_T):
  thoughts_y = []
  hidden_states_h = []
  for t in range(max_steps_T):
    # Generate the next thought and its last-layer hidden representation.
    next_thought_y_t, next_hidden_state_h_t = LLM.generate_next_thought(problem_x, thoughts_y)
    thoughts_y.append(next_thought_y_t)
    hidden_states_h.append(next_hidden_state_h_t)

    # Preprocess the hidden state for the probe (e.g., mean pooling, PCA).
    processed_h_t = preprocess(next_hidden_state_h_t)

    # Probe score, optionally smoothed over a window of recent steps
    # (or just probe_f(processed_h_t) without smoothing).
    probe_score = probe_f(processed_h_t, history=hidden_states_h)

    # Stop thinking once the calibrated threshold is reached.
    if probe_score >= calibrated_threshold_lambda:
      break

  final_answer_z = LLM.synthesize_answer(problem_x, thoughts_y)
  return final_answer_z, thoughts_y
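
For context, a hypothetical end-to-end invocation that ties this to the calibration sketch above; llm, probe_consistent, calibration_losses, and the specific parameter values are all assumed for illustration:

import numpy as np

candidate_lambdas = np.linspace(1.0, 0.0, 101)  # most to least conservative
lam = calibrate_threshold(candidate_lambdas, calibration_losses,
                          delta=0.05, epsilon=0.1)  # illustrative levels
answer, thoughts = run_thought_calibration(
    problem_x="...", LLM=llm, calibrated_threshold_lambda=lam,
    probe_f=probe_consistent, max_steps_T=512)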

Experimental Results

  1. In-distribution (s1K-1.1 dataset):

    • Thought calibration (especially "Consistent" and "Novel Leaf" probes) preserved model performance with up to a 60% reduction in thinking tokens.
    • The "Supervised" (correctness-based) probe was poorly calibrated, likely because the test set may contain unsolvable problems, making it overconfident.
    • "Consistent" and "Novel Leaf" probes were generally well-calibrated, especially for error levels $\epsilon < 0.1$.

    Figure 1: Performance on in-distribution data (s1K-1.1). Thought calibration variants (Consistent, Leaf Novelty) reduce tokens significantly while maintaining accuracy.

  2. Generalization (AIME 24, GPQA Diamond, MATH-500):

    • Probes trained on s1K-1.1 were applied to these OOD datasets.
    • Achieved up to a 20% reduction in thinking tokens.
    • The "Consistent" probe generalized better than "Supervised" in terms of efficiency and calibration, meeting theoretical guarantees while "Supervised" remained overconfident.
    • On AIME 24, slight performance gains were observed, possibly by trimming distracting thoughts.

    Figure 2: Performance on out-of-distribution datasets. The "Consistent" probe maintains efficiency and calibration better than "Supervised".

  3. Analysis:

    • Thought calibration tends to terminate early more often for problems the LLM cannot solve even with full budget, suggesting it identifies when the model is stuck.
    • The method provides input-dependent token reduction, unlike naive cropping.
    • Qualitative examples show the "Consistent" probe's confidence aligning with the LLM's own backtracking and convergence on an answer.

    Figure 3: Example of the "Consistent" probe's confidence during a reasoning trajectory. Confidence drops on backtracking and increases upon re-converging to the answer.

Practical Implications and Applications

  • Reduced Inference Costs: Dynamically stopping generation significantly reduces the number of tokens processed, leading to lower computational requirements and latency for LLM reasoning tasks.
  • Adaptive Resource Allocation: The method allows the model to "think more" for harder problems and "less" for easier ones, optimizing resource use.
  • Deploying Resource-Intensive Models: Makes it more feasible to deploy large reasoning models in applications where inference budget is a constraint.
  • User Experience: Faster responses for problems that don't require extensive reasoning.

Limitations

  • Calibration Data Dependency: LTT guarantees hold only if the calibration data is representative of the application data.
  • Probe Simplicity: Linear probes were used due to small datasets. More complex probes might offer better performance with more training data but could also overfit.
  • Scope: Addresses early exiting from reasoning, not the broader problem of steering or guiding the reasoning process itself.

In summary, "Thought calibration" offers a principled and empirically effective approach to make LLM reasoning more efficient by learning when to stop the generation of intermediate thoughts. It leverages lightweight, learnable probes and statistical calibration (LTT) to achieve significant token reduction while preserving performance, with practical benefits for deploying LLMs in real-world scenarios. The "Consistent" probe, which checks if the current answer is likely to be the final one, shows particular promise due to its good performance and lack of need for correctness labels.