
DeepConf: Confidence-Based LLM Reasoning

Updated 22 August 2025
  • DeepConf is a test-time reasoning framework that uses localized token-level confidence measurements to filter low-quality reasoning paths.
  • It aggregates intermediate uncertainties in offline mode for weighted trace voting, achieving near-SOTA accuracy on challenging benchmarks.
  • In online mode, early termination based on group confidence reduces token generation by up to 84.7%, enhancing efficiency in LLM serving pipelines.

Deep Think with Confidence (DeepConf) is a test-time reasoning framework for LLMs that leverages internal model confidence signals to improve both the efficiency and accuracy of multi-step generation on reasoning-intensive tasks. Unlike majority-voting self-consistency, which treats all sampled reasoning paths equally and suffers from diminishing returns and steeply growing compute costs at scale, DeepConf introduces localized statistical filtering to dynamically identify, de-emphasize, or halt low-confidence traces. This approach requires no model retraining and no task-specific hyperparameter tuning, and it can be integrated directly into serving pipelines for open-source LLMs.

1. Localized Confidence-Driven Reasoning

DeepConf fundamentally departs from standard self-consistency by aggregating intermediate token-level uncertainties generated during autoregressive reasoning. At each token position $i$ in a reasoning trace, the LLM emits a probability distribution $P_i$, and the negative log-likelihood (or entropy) is used as a proxy for confidence:

$$H_i = -\sum_j P_i(j) \log P_i(j) \qquad \text{(token entropy)}$$

$$C_i = -\frac{1}{k} \sum_{j=1}^{k} \log P_i(j) \qquad \text{(token confidence)}$$

where $C_i$ is the mean negative log-probability of the top-$k$ tokens. These token-level confidences are then aggregated over overlapping sliding windows ("groups") throughout the generation:

$$C_{G_s} = \frac{1}{|G_s|} \sum_{t \in G_s} C_t$$

where $G_s$ denotes a window of $n$ consecutive tokens, advancing in stride 1.
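A minimal sketch of these two computations, assuming the serving engine exposes per-step top-$k$ log-probabilities (as most open-source inference stacks can); the 2048-token window is an illustrative default, not a value fixed by the text above:

```python
def token_confidence(topk_logprobs: list[float]) -> float:
    """C_i: mean negative log-probability of the top-k candidate tokens
    at one decoding step. Peaked distributions yield large C_i."""
    return -sum(topk_logprobs) / len(topk_logprobs)

def group_confidences(token_confs: list[float], window: int = 2048) -> list[float]:
    """C_{G_s}: mean token confidence over sliding windows of `window`
    consecutive tokens, advancing with stride 1."""
    n = len(token_confs)
    if n <= window:  # short trace: a single group spans the whole trace
        return [sum(token_confs) / n]
    out = []
    running = sum(token_confs[:window])
    out.append(running / window)
    for t in range(window, n):
        running += token_confs[t] - token_confs[t - window]
        out.append(running / window)
    return out
```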

Critically, DeepConf uses the localized minima over such windows ("bottom 10% confidence," "lowest group confidence," etc.) rather than global average confidence, allowing the system to identify transient drops that often correspond to local reasoning errors ("error islands"). This enables both post hoc and real-time quality assessment of reasoning traces far more sensitively than holistic trace scoring.
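Building on the sketch above, the localized trace scores named here can be computed as follows; the 10% fraction mirrors the "bottom 10%" aggregation, and the function names are illustrative:

```python
def lowest_group_confidence(group_confs: list[float]) -> float:
    """Minimum sliding-window confidence: a single deep dip
    (an 'error island') dominates the trace score."""
    return min(group_confs)

def bottom_fraction_confidence(group_confs: list[float], frac: float = 0.10) -> float:
    """Mean of the lowest `frac` fraction of group confidences."""
    k = max(1, int(len(group_confs) * frac))
    worst = sorted(group_confs)[:k]
    return sum(worst) / k
```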

2. Offline and Online Integration Modes

DeepConf provides two primary operational modes for leveraging confidence signals in LLM inference:

  • Offline Mode: The model generates full-length reasoning traces in parallel. Each trace is then retrospectively filtered, either downweighted or excluded, based on aggregated local confidence (such as the mean of the bottom 10% of group confidences). Final answer selection is performed via weighted majority voting, where the vote for candidate answer $a$ is

$$V(a) = \sum_{t \in T} C_t \cdot \mathbb{I}\{\text{answer}(t) = a\}$$

Only the top-$\eta$ percent of traces (ranked by confidence) are included in voting, with $\eta$ typically set to 90 or 10 depending on the desired conservatism (see the voting sketch after this list).

  • Online Mode (Early Termination): During autoregressive decoding, every freshly generated group $G_s$ is checked against a threshold $s$ set in a brief warmup phase. If $C_{G_s} < s$, the in-progress trace is terminated immediately, eliminating resource expenditure on reasoning paths likely to yield low-quality answers. This mode can reduce total generated tokens by up to 84.7% compared to standard parallel decoding at equivalent voting budgets (a decoding-loop sketch follows below).
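A sketch of the offline filter-then-vote step, assuming each trace has already been reduced to an (answer, confidence) pair by one of the trace scores above; answer extraction and tie-breaking are simplified:

```python
from collections import defaultdict

def offline_vote(traces: list[tuple[str, float]], eta: float = 0.90) -> str:
    """Keep the top-eta fraction of traces by confidence, then take a
    confidence-weighted majority vote over the extracted answers."""
    kept = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = kept[: max(1, int(len(kept) * eta))]
    votes: defaultdict[str, float] = defaultdict(float)
    for answer, conf in kept:
        votes[answer] += conf  # V(a) = sum_t C_t * 1[answer(t) = a]
    return max(votes, key=votes.get)
```

For example, `offline_vote([("17", 3.2), ("17", 2.9), ("21", 4.1)])` returns `"17"`, since the two agreeing traces jointly outweigh the single higher-confidence outlier.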

The calibration of the threshold $s$ and the filtering percentile $\eta$ is robust and can be performed with as few as 16 warmup traces per deployment context.
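A sketch of the online warmup-then-terminate loop; the exact rule mapping warmup trace confidences to a stopping threshold is an assumption here, chosen to match the keep-top-fraction behavior described above:

```python
from collections import deque

def calibrate_threshold(warmup_scores: list[float], keep_frac: float = 0.90) -> float:
    """Choose s so that roughly the top keep_frac of warmup traces
    (ranked by lowest group confidence) would have survived."""
    ranked = sorted(warmup_scores, reverse=True)
    cut = min(len(ranked) - 1, max(0, int(len(ranked) * keep_frac) - 1))
    return ranked[cut]

def decode_with_early_stop(step_confs, s: float, window: int = 2048):
    """Consume per-token confidences as decoding proceeds; terminate the
    trace the first time a full window's mean drops below s.
    Returns (tokens_generated, completed)."""
    buf: deque = deque(maxlen=window)
    total, t = 0.0, 0
    for c in step_confs:
        t += 1
        if len(buf) == window:
            total -= buf[0]  # value about to be evicted from the window
        buf.append(c)
        total += c
        if len(buf) == window and total / window < s:
            return t, False  # early termination: C_{G_s} < s
    return t, True           # trace ran to completion
```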

3. Quantitative Gains and Scaling Properties

On competitive reasoning tasks such as AIME 2025, DeepConf in offline mode (DeepConf@512) achieves up to 99.9% accuracy with models such as GPT-OSS-120B, considerably exceeding the 97.0% attainable through conventional self-consistency majority voting given the same answer ensemble size. Across tested open-weight LLMs (Qwen3, DeepSeek, GPT-OSS) and datasets (AIME, GPQA, HMMT), the method exhibits:

| Mode | Max Accuracy | Token Reduction vs. Self-Consistency | Integration Cost |
|---|---|---|---|
| Offline | 99.9% | 0% | Minimal (API) |
| Online | Competitive | Up to 84.7% | Minimal (API) |

Most notably, DeepConf's accuracy scaling curve does not saturate as quickly as baseline methods: additional sampling budget leads to further improvements, and aggressive low-confidence filtering avoids the accumulation of "confidently wrong" contributions in the vote.

4. Comparison with Self-Consistency and Related Methods

In contrast to self-consistency, where $K$ traces are independently sampled and all votes are weighted equally, DeepConf prioritizes traces on the basis of the model's internal uncertainty. This shift is particularly impactful in tasks where model errors are locally concentrated (i.e., a critical step is "forgotten" or misapplied partway through reasoning), as global confidence scoring would otherwise mask these issues.

Other methods, such as confidence re-ranking based on a global likelihood or ensemble agreement, lack the granularity to perform real-time filtering or early termination. DeepConf also avoids the quadratic cost of ensemble cross-verification and depends solely on the standard output of the LLM's softmax layer at each step, requiring neither external retrievers nor custom reward models.

Potential drawbacks include the risk that highly confident traces may be confidently incorrect, especially in domains with pathological calibration. However, empirical results across mathematical and logical reasoning show that DeepConf's parameter-free filtering is robust; reasonable default thresholds work consistently well without per-task tuning.

5. Formal Confidence Filtering and Voting Procedures

DeepConf employs mathematically explicit criteria for filtering and aggregation:

  • Trace Voting: Weighted voting using per-trace group (or tail) confidence.
  • Thresholding: For online early termination, a trace is stopped if at any step $C_{G_s} < s$, with $s$ selected statistically from the minimum group confidences of a set of pre-generated warmup traces.
  • Aggregated Decision: Final answer

$$\hat{a} = \arg\max_{a} V(a), \quad V(a) = \sum_{t} C_t \, \mathbb{I}[\text{answer}(t) = a]$$

This ensures that only reasoning paths the model managed with high local certainty contribute materially to the system output.

6. Practical Applications and Model Integration

DeepConf is designed for seamless adoption in both batch and online LLM serving frameworks. Its lightweight modification—operating entirely at the inference level using readily available logit/probability information—removes barriers of retraining, rearchitecture, or extensive per-task validation. The method is broadly applicable to:

  • Mathematical and logical reasoning (e.g., AIME, HMMT, GPQA).
  • Production environments requiring strict latency and compute management (online mode).
  • Ensemble-based decision protocols where explainable weighting and trace auditing are critical.
  • Scenarios where calibration is vital, as local group confidence surfaces both statistical and semantic uncertainties in the LLM-generated rationales.

7. Context, Generality, and Impact

DeepConf unifies and generalizes several test-time efficiency paradigms under a principled, per-token statistical regime. By tracking transient uncertainty within each trace, the system robustly avoids both the token waste endemic to naive self-consistency and the accuracy degeneration seen in heuristics that disregard internal model confidence. The mechanism is model-agnostic, applies generically to any decoder-style LLM, and aligns with recent research demonstrating the necessity of local uncertainty modeling for reliable output selection and risk-aware deployment.

In summary, DeepConf represents a significant advance in test-time scaling for LLM reasoning. Its explicit use of localized confidence measurements for both trace filtering and early termination achieves near-SOTA accuracy and resource savings on competitive reasoning benchmarks. The approach offers a scalable, model-agnostic framework for deploying LLMs with higher reliability and cost efficiency—key considerations for academic, industrial, and safety-critical systems.
