
Thought calibration: Efficient and confident test-time scaling (2505.18404v1)

Published 23 May 2025 in cs.LG and cs.AI

Abstract: Reasoning LLMs achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a LLM's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the LLM's hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning LLMs and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.

Summary

  • The paper presents thought calibration to dynamically decide when to halt LLM reasoning, reducing token usage and cost.
  • It leverages a Learn-then-Test framework with lightweight probes to estimate reasoning risk from hidden representations.
  • Experiments reveal up to 60% token reduction on in-distribution tasks and 20% on OOD datasets while preserving accuracy.

This paper introduces "thought calibration," a method for dynamically deciding when an LLM can stop generating intermediate "thoughts" during reasoning tasks, thereby reducing computational cost without significantly sacrificing performance. The core idea is to view an LLM's reasoning process as the growth of an abstract reasoning tree. The goal is to terminate thinking when this tree is unlikely to grow further or yield a different final answer.

Core Problem and Proposed Solution

LLMs often achieve better reasoning performance by generating more intermediate steps ("thinking longer"). However, this "test-time scaling" incurs significant computational costs. Naively limiting the generation length can hurt performance, as not all problems require the same amount of reasoning. Thought calibration aims to address this by providing an efficient and statistically calibrated decision rule for early stopping.

Theoretical Framework

  1. Reasoning Graph/Tree: The authors formalize LLM reasoning using an abstract reasoning graph $G$, where nodes are thoughts and edges represent entailment. A reasoning trajectory $z$ is a root-to-leaf walk. The set of generated thoughts $y_t$ at step $t$ defines a subgraph $G_t$. The ideal stopping point is when $G_t$ is unlikely to change significantly with further generation ($G_t \approx G_T$, where $T$ is the maximum budget) or when it already contains the correct answer $z^*$.
    • Ideal goal (correctness): $\mathbb{P}\left( \mathbb{E} \left[ \mathbbm{1}[ z^* \not\in G_t ] \right] \le \delta \right) \ge 1-\epsilon$
    • Practical goal (consistency): $\mathbb{P}\left( \mathbb{E} \left[ \mathbbm{1}[ G_t \ne G_T ] \right] \le \delta \right) \ge 1-\epsilon$
  2. Learn then Test (LTT) for Calibration: The stopping decision is calibrated using the Learn then Test (LTT) framework, which views hyperparameter selection (here, a threshold $\lambda$ for a surrogate function $f$) as a multiple hypothesis testing problem (a minimal sketch follows this list).
    • For a set of candidate thresholds $\Lambda = \{\lambda_1, \dots, \lambda_m\}$, each $\lambda_j$ is associated with a null hypothesis $H_j : \mathbb{E}[R(y_t)] > \delta$, where $R(y_t)$ is the risk of stopping at $y_t$.
    • Using a calibration dataset $\mathcal{D}_\text{cal}$ of size $n$, a p-value $p_j$ is computed for each $H_j$ via the binomial tail bound $p_\lambda^{BT} := \mathbb{P}\left( \text{Binom}(n, \delta) \le n \hat{R}_n(\lambda) \right)$, where $\hat{R}_n(\lambda)$ is the empirical risk.
    • A family-wise error rate (FWER) controlling algorithm (fixed sequence testing in this paper) rejects hypotheses in order while $p_j \le \epsilon$ and selects $\lambda_{j-1}$, the threshold just before the first $p_j > \epsilon$, which yields the risk-control guarantee.
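
To make this concrete, here is a minimal sketch of fixed sequence testing with the binomial tail p-value. It is an illustration under assumed interfaces, not the authors' code: the name calibrate_threshold and the losses callback (which returns per-example 0/1 losses on the calibration set for a given threshold) are invented for this example, and scipy provides the binomial CDF.

import numpy as np
from scipy.stats import binom

def calibrate_threshold(candidate_lambdas, losses, delta, epsilon):
  # candidate_lambdas: thresholds ordered from most to least conservative
  #                    (stop-later thresholds are tested first).
  # losses(lam):       0/1 loss per calibration example when stopping at the
  #                    first step whose probe score reaches lam.
  # delta:             tolerated risk level (null hypothesis: risk > delta).
  # epsilon:           family-wise error rate for fixed sequence testing.
  selected = None
  for lam in candidate_lambdas:
    per_example_loss = np.asarray(losses(lam))
    n = len(per_example_loss)
    errors = int(per_example_loss.sum())
    p_value = binom.cdf(errors, n, delta)  # binomial tail bound p-value
    if p_value > epsilon:
      break              # fail to reject: stop testing further thresholds
    selected = lam       # null rejected: lam is certified at risk level delta
  return selected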

Estimating Empirical Risk with Probes

Since directly observing $G_t$ or $z^*$ during inference is hard, lightweight probes operating on the LLM's hidden representations are used to estimate surrogate functions $f$.

  1. $f_\text{correct}(y_t)$: Predicts the probability that the LLM will answer correctly given thoughts $y_t$.
    • Risk: $R_\text{correct}(y_t) := \mathbbm{1}\{\text{LLM is correct}\} \cdot (1 - f_\text{correct}(y_t)) + \mathbbm{1}\{\text{LLM is wrong}\} \cdot f_\text{correct}(y_t)$.
    • Drawback: Requires labeled data for correctness and assumes problems are solvable.
  2. $f_\text{consistent}(y_t)$: Predicts the probability that the final answer $z_t$ (derived from $y_t$) is the same as the answer $z_T$ (derived from thoughts generated up to the maximum budget $T$).
    • Risk: $R_\text{consistent}(y_t) := \mathbbm{1}\{\text{consistent}\} \cdot (1 - f_\text{consistent}(y_t)) + \mathbbm{1}\{\text{inconsistent}\} \cdot f_\text{consistent}(y_t)$ (a sketch of this risk computation follows the list).
    • Advantage: Does not require correctness labels and applies to intractable problems.
  3. $f_\text{novel leaf}(y_t)$: Estimates if the current thought $y^{(t)}$ is a novel leaf in the reasoning tree. The paper actually formulates it as detecting non-novel leaves: $\mathbb{P}(y^{(t)} \text{ is a leaf}) \cdot (1 - \mathbb{P}(y^{(t)} \text{ is novel}))$.
    • Risk: $R_\text{novel leaf}(y_t)$ is defined similarly, using consistency labels for easier verification.
    • Captures the idea that the LLM might be reiterating information.
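
All three risks share the same form: the probe is penalized when it is confident in the direction that disagrees with the eventual label. A minimal sketch, assuming a 0/1 label and a probe probability as plain floats (the function name is illustrative, not from the paper):

def probe_risk(label, probe_prob):
  # label: 1 if the desirable event held (e.g., the truncated answer agreed
  #        with the full-budget answer), else 0.
  # probe_prob: the probe's predicted probability of that event.
  # Risk is high when the probe is confidently wrong in either direction.
  return label * (1.0 - probe_prob) + (1 - label) * probe_prob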

Implementation Details

  • Thought Segmentation: Reasoning trajectories $y$ are split into steps $y^{(i)}$ using \n\n delimiters that also contain "wait" or "but".
  • Step Representation: For each step $y^{(i)}$, the last-layer hidden representations of its tokens are mean-pooled and then reduced with PCA to $d = 256$ dimensions.
  • Probe Architecture: Linear probes are trained on these step-level representations, and probabilities are averaged over a window of 10 steps for smoothness before calibration; a sketch of this pipeline follows the list. (The appendix notes experiments with MLPs and Transformers, but linear probes were chosen for simplicity and to avoid overfitting on limited training data.)
  • Ground Truth Labeling for Probes: A separate LLM (Qwen 3 32B) is prompted to generate labels for correctness, consistency, leaf identification, and novelty.
    • Correctness: Truncate thoughts, prompt for final answer, compare to ground truth.
    • Consistency: Compare intermediate answers $z_t$ to the maximum-budget answer $z_T$.
    • Leaf: Ask the LLM whether step $y^{(i)}$ attempts to answer the original question.
    • Novelty: Provide previous thoughts $y^{(1)}, \dots, y^{(i-1)}$ and ask whether $y^{(i)}$ adds new information.
  • Evaluation: Final answer correctness is evaluated using the GPT 4.1 API.
  • Models: DeepSeek-R1 distilled Qwen 2.5 32B, Llama 3.3 70B, and QwQ 32B.
  • Software: vLLM and lmdeploy for running LLMs.
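
As a rough illustration of the probe pipeline described above (mean pooling of last-layer token states, PCA to 256 dimensions, a linear probe, and a 10-step smoothing window), here is a sketch; the fitted pca and linear_probe objects and the helper name are assumptions, not the paper's implementation:

import numpy as np

def smoothed_probe_scores(step_hidden_states, pca, linear_probe, window=10):
  # step_hidden_states: one array per thinking step, each of shape
  #                     (num_tokens_in_step, d_model), from the LLM's last layer.
  # pca:                fitted projection to 256 dimensions (e.g., sklearn PCA).
  # linear_probe:       fitted linear classifier exposing predict_proba.
  pooled = np.stack([h.mean(axis=0) for h in step_hidden_states])  # (num_steps, d_model)
  reduced = pca.transform(pooled)                                  # (num_steps, 256)
  probs = linear_probe.predict_proba(reduced)[:, 1]                # per-step probabilities
  # Average over a trailing window of steps before thresholding.
  return np.array([probs[max(0, i - window + 1): i + 1].mean()
                   for i in range(len(probs))])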

Pseudocode for Thought Calibration during Inference:

def run_thought_calibration(problem_x, LLM, calibrated_threshold_lambda, probe_f, max_steps_T):
  thoughts_y = []
  hidden_states_h = []
  for t in range(max_steps_T):
    # Generate the next thought and its last-layer hidden representation.
    next_thought_y_t, next_hidden_state_h_t = LLM.generate_next_thought(problem_x, thoughts_y)
    thoughts_y.append(next_thought_y_t)
    hidden_states_h.append(next_hidden_state_h_t)

    # Preprocess the hidden state for the probe (e.g., mean pooling, PCA).
    processed_h_t = preprocess(next_hidden_state_h_t)

    # Probe score, optionally smoothed over a window of recent steps
    # (or just probe_f(processed_h_t) without smoothing).
    probe_score = probe_f(processed_h_t, history=hidden_states_h)

    # Stop thinking once the calibrated threshold is reached.
    if probe_score >= calibrated_threshold_lambda:
      break

  final_answer_z = LLM.synthesize_answer(problem_x, thoughts_y)
  return final_answer_z, thoughts_y
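
For context, a hypothetical end-to-end invocation that ties this to the calibration sketch above; llm, probe_consistent, calibration_losses, and the specific parameter values are all assumed for illustration:

import numpy as np

candidate_lambdas = np.linspace(1.0, 0.0, 101)  # most to least conservative
lam = calibrate_threshold(candidate_lambdas, calibration_losses,
                          delta=0.05, epsilon=0.1)  # illustrative levels
answer, thoughts = run_thought_calibration(
    problem_x="...", LLM=llm, calibrated_threshold_lambda=lam,
    probe_f=probe_consistent, max_steps_T=512)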

Experimental Results

  1. In-distribution (s1K-1.1 dataset):

    • Thought calibration (especially "Consistent" and "Novel Leaf" probes) preserved model performance with up to a 60% reduction in thinking tokens.
    • The "Supervised" (correctness-based) probe was poorly calibrated, likely because the test set may contain unsolvable problems, making it overconfident.
    • "Consistent" and "Novel Leaf" probes were generally well-calibrated, especially for error levels $\epsilon < 0.1$.

    Figure 1: Performance on in-distribution data (s1K-1.1). Thought calibration variants (Consistent, Leaf Novelty) reduce tokens significantly while maintaining accuracy.

  2. Generalization (AIME 24, GPQA Diamond, MATH-500):

    • Probes trained on s1K-1.1 were applied to these OOD datasets.
    • Achieved up to a 20% reduction in thinking tokens.
    • The "Consistent" probe generalized better than "Supervised" in terms of efficiency and calibration, meeting theoretical guarantees while "Supervised" remained overconfident.
    • On AIME 24, slight performance gains were observed, possibly by trimming distracting thoughts.

    Figure 2: Performance on out-of-distribution datasets. The "Consistent" probe maintains efficiency and calibration better than "Supervised".

  3. Analysis:

    • Thought calibration tends to terminate early more often for problems the LLM cannot solve even with full budget, suggesting it identifies when the model is stuck.
    • The method provides input-dependent token reduction, unlike naive cropping.
    • Qualitative examples show the "Consistent" probe's confidence aligning with the LLM's own backtracking and convergence on an answer.

    Figure 3: Example of the "Consistent" probe's confidence during a reasoning trajectory. Confidence drops on backtracking and increases upon re-converging to the answer.

Practical Implications and Applications

  • Reduced Inference Costs: Dynamically stopping generation significantly reduces the number of tokens processed, leading to lower computational requirements and latency for LLM reasoning tasks.
  • Adaptive Resource Allocation: The method allows the model to "think more" for harder problems and "less" for easier ones, optimizing resource use.
  • Deploying Resource-Intensive Models: Makes it more feasible to deploy large reasoning models in applications where inference budget is a constraint.
  • User Experience: Faster responses for problems that don't require extensive reasoning.

Limitations

  • Calibration Data Dependency: LTT guarantees hold only if the calibration data is representative of the application data.
  • Probe Simplicity: Linear probes were used due to small datasets. More complex probes might offer better performance with more training data but could also overfit.
  • Scope: Addresses early exiting from reasoning, not the broader problem of steering or guiding the reasoning process itself.

In summary, "Thought calibration" offers a principled and empirically effective approach to make LLM reasoning more efficient by learning when to stop the generation of intermediate thoughts. It leverages lightweight, learnable probes and statistical calibration (LTT) to achieve significant token reduction while preserving performance, with practical benefits for deploying LLMs in real-world scenarios. The "Consistent" probe, which checks if the current answer is likely to be the final one, shows particular promise due to its good performance and lack of need for correctness labels.