- The paper introduces DeepConf, a method that leverages token-level confidence metrics to filter out low-quality reasoning traces for improved accuracy and efficiency.
- It employs both offline confidence-weighted voting and online early-stopping methods, achieving up to 99.9% accuracy and an 84.7% reduction in token usage.
- The approach is model-agnostic and easily deployed, requiring no additional training while utilizing internal log-probabilities to assess trace quality.
Deep Think with Confidence: Confidence-Aware Test-Time Reasoning for LLMs
Introduction
"Deep Think with Confidence" introduces DeepConf, a test-time method for improving both the efficiency and accuracy of LLMs on complex reasoning tasks. The method leverages model-internal confidence signals to filter out low-quality reasoning traces, either during (online) or after (offline) generation, without requiring any additional training or hyperparameter tuning. DeepConf is evaluated on a suite of challenging mathematical and STEM reasoning benchmarks using state-of-the-art open-source models, demonstrating substantial improvements in both accuracy and computational efficiency over standard self-consistency with majority voting.


Figure 1: DeepConf on AIME 2025 (top) and parallel thinking using DeepConf (bottom).
Motivation and Background
Self-consistency with majority voting (sampling multiple reasoning paths and aggregating their answers) has become a standard approach for boosting LLM reasoning accuracy. However, this method incurs significant computational overhead: inference cost grows linearly with the number of sampled traces, while accuracy gains diminish as more traces are added. Moreover, majority voting treats all traces equally, ignoring the substantial variance in trace quality, which can lead to stagnating or even degraded performance as the number of traces increases.
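For reference, a minimal sketch of this baseline; `sample_trace` is a hypothetical stand-in for one model call that returns an extracted final answer:

```python
from collections import Counter

def majority_vote(question, sample_trace, n_traces=64):
    """Self-consistency baseline: sample n traces, return the modal answer.

    `sample_trace` is a hypothetical callable that runs the model once on
    `question` and returns the extracted final answer.
    """
    answers = [sample_trace(question) for _ in range(n_traces)]
    return Counter(answers).most_common(1)[0][0]
```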
Recent work has explored using token-level statistics (e.g., entropy, confidence) to estimate the quality of reasoning traces. While global confidence measures (e.g., average trace confidence) can distinguish correct from incorrect traces, they obscure local reasoning failures and require full trace generation, precluding early termination of low-quality traces.
DeepConf: Confidence-Aware Reasoning
DeepConf addresses these limitations by introducing local, group-based confidence metrics and integrating them into both offline and online reasoning workflows.
Confidence Metrics
- Token Entropy: $H_i = -\sum_j P_i(j) \log P_i(j)$, where $P_i(j)$ is the probability of the $j$-th token at position $i$.
- Token Confidence: $C_i = -\frac{1}{k} \sum_{j=1}^{k} \log P_i(j)$, the negative average log-probability of the top-$k$ tokens at position $i$.
- Average Trace Confidence: $C_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} C_i$, where $N$ is the number of tokens in the trace.
- Group Confidence: Sliding-window average of token confidence over a window of $n$ tokens.
- Bottom 10% Group Confidence: Mean of the lowest 10% group confidences in a trace.
- Lowest Group Confidence: Minimum group confidence in a trace.
- Tail Confidence: Mean confidence over the final segment (e.g., the last 2048 tokens) of a trace. (These metrics are sketched in code after this list.)
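A minimal sketch of these metrics in Python, assuming per-token top-k log-probabilities are already available from the inference engine; the 2048-token window and tail sizes mirror the example above and are illustrative choices, not fixed values:

```python
import math
from statistics import mean

def token_entropy(probs):
    """Token entropy H_i = -sum_j P_i(j) * log(P_i(j))."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def token_confidence(topk_logprobs):
    """Token confidence C_i: negative mean log-probability of the top-k tokens."""
    return -mean(topk_logprobs)

def group_confidences(token_confs, window=2048):
    """Group confidence: sliding-window mean of token confidence."""
    if len(token_confs) <= window:
        return [mean(token_confs)]
    return [mean(token_confs[i:i + window])
            for i in range(len(token_confs) - window + 1)]

def bottom_group_confidence(groups, frac=0.10):
    """Bottom 10% group confidence: mean of the lowest `frac` of group scores."""
    k = max(1, int(len(groups) * frac))
    return mean(sorted(groups)[:k])

def trace_metrics(topk_logprobs_per_token, window=2048, tail=2048):
    """Trace-level metrics from per-token top-k log-probabilities."""
    confs = [token_confidence(tk) for tk in topk_logprobs_per_token]
    groups = group_confidences(confs, window)
    return {
        "average_trace_confidence": mean(confs),
        "bottom_10pct_group_confidence": bottom_group_confidence(groups),
        "lowest_group_confidence": min(groups),
        "tail_confidence": mean(confs[-tail:]),
    }
```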


Figure 2: Confidence distributions for correct vs. incorrect reasoning traces across different metrics.
Empirically, bottom 10% and tail confidence metrics provide better separation between correct and incorrect traces than global averages, indicating their utility for trace quality estimation.
Figure 3: Visualization of confidence measurements and offline thinking with confidence.
Offline DeepConf
In the offline setting, all traces are generated before aggregation. DeepConf applies confidence-weighted majority voting, optionally filtering to retain only the top η% of traces by confidence. This approach can be instantiated with any of the above confidence metrics.
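A minimal sketch of this aggregation step, assuming each trace arrives as an (answer, confidence) pair, where the confidence is any of the trace-level metrics above:

```python
from collections import defaultdict

def deepconf_offline(traces, eta=0.10, weighted=True):
    """Confidence-filtered (and optionally confidence-weighted) voting.

    `traces` is a list of (answer, confidence) pairs; `eta` is the fraction
    of highest-confidence traces to keep (0.10 and 0.90 correspond to the
    aggressive and conservative settings discussed below).
    """
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * eta))]
    votes = defaultdict(float)
    for answer, conf in kept:
        votes[answer] += conf if weighted else 1.0
    return max(votes, key=votes.get)

# Example: the two high-confidence "42" traces outweigh the lone "41".
print(deepconf_offline([("42", 1.9), ("41", 1.2), ("42", 1.7), ("13", 0.6)], eta=0.5))
```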
Online DeepConf
In the online setting, DeepConf enables early termination of low-confidence traces during generation, reducing unnecessary computation. The method uses a warmup phase to calibrate a confidence threshold, then halts traces whose group confidence falls below this threshold. Adaptive sampling is used to dynamically adjust the number of traces based on consensus among generated answers.
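The sketch below illustrates this loop under stated assumptions: `token_stream` is a hypothetical iterator of (token, confidence) pairs from the inference engine, and the warmup percentile is an illustrative choice rather than the paper's exact setting:

```python
from collections import deque

def calibrate_threshold(warmup_lowest_confs, percentile=10):
    """Derive a stopping threshold from warmup traces.

    `warmup_lowest_confs` holds each warmup trace's lowest group confidence;
    the percentile (10 here) is an illustrative assumption -- a lower value
    stops fewer traces, a higher one stops more aggressively.
    """
    ranked = sorted(warmup_lowest_confs)
    idx = max(0, int(len(ranked) * percentile / 100) - 1)
    return ranked[idx]

def generate_with_early_stop(token_stream, threshold, window=2048):
    """Stream one trace, halting when group confidence drops below threshold.

    `token_stream` is a hypothetical iterator yielding (token, confidence)
    pairs from the inference engine.
    """
    buf = deque(maxlen=window)
    total = 0.0
    tokens = []
    for token, conf in token_stream:
        if len(buf) == window:       # sliding window: evict oldest confidence
            total -= buf[0]
        buf.append(conf)
        total += conf
        tokens.append(token)
        if len(buf) == window and total / window < threshold:
            break                    # low-confidence trace: terminate early
    return tokens
```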
Figure 5: DeepConf during online generation.
Experimental Results
Offline Evaluations
Across five open-source models and five challenging benchmarks, DeepConf consistently outperforms standard majority voting. For example, on AIME 2025 with GPT-OSS-120B, DeepConf@512 achieves 99.9% accuracy (vs. 97.0% for majority voting), and similar gains are observed for smaller models and other datasets. Aggressive filtering (keeping the top 10% of traces) yields the largest improvements but can occasionally hurt performance when overconfident errors dominate the filtered set; conservative filtering (top 90%) is safer but yields smaller gains.
Online Evaluations
Online DeepConf achieves substantial token savings, up to an 84.7% reduction, while maintaining or improving accuracy. For instance, on AIME 2025 with GPT-OSS-120B, DeepConf-low reduces token usage from 3.23×10⁸ to 0.49×10⁸ with a slight accuracy gain (97.1% to 97.9%). The method is robust across model scales and datasets.

Figure 6: Generated tokens comparison across different tasks based on GPT-OSS-120B.
Figure 7: Scaling behavior: Model accuracy vs. voting size for different methods using offline DeepConf.
Figure 8: Scaling behavior: Model accuracy vs. token cost for different methods using online DeepConf.
Figure 9: Scaling behavior: Model accuracy vs. budget size for different methods on GPQA-Diamond.
Figure 10: Scaling behavior: Model accuracy vs. token cost for different methods on GPQA-Diamond.
Implementation and Deployment
DeepConf is designed for minimal integration overhead. The method requires only access to token-level log-probabilities and can be implemented with minor modifications to popular inference engines such as vLLM. The online variant requires a sliding window computation of group confidence and a simple early-stopping criterion. The method is compatible with standard OpenAI-compatible APIs and can be toggled per request.
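For example, with an OpenAI-compatible server (the endpoint and model name below are placeholders), per-token log-probabilities can be requested directly via the standard `logprobs` parameters:

```python
from openai import OpenAI

# Placeholder endpoint and model; any OpenAI-compatible server that exposes
# log-probabilities (e.g., a vLLM deployment) should behave similarly.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="my-reasoning-model",  # placeholder model name
    messages=[{"role": "user", "content": "Solve: 17 * 23 = ?"}],
    logprobs=True,    # return per-token log-probabilities
    top_logprobs=5,   # top-k alternatives per position (k=5 is a choice here)
)

# Token confidence C_i: negative mean of the top-k log-probs at each position.
token_confs = [
    -sum(alt.logprob for alt in tok.top_logprobs) / len(tok.top_logprobs)
    for tok in resp.choices[0].logprobs.content
]
```

From these per-token log-probabilities, all of the trace-level confidence metrics above can be computed client-side, without modifying the serving stack.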
Resource and Scaling Considerations
- Computational Savings: DeepConf achieves up to an 84.7% reduction in generated tokens at comparable accuracy, enabling significant cost savings in large-scale deployments.
- Model-Agnostic: The method is effective across a wide range of model sizes (8B–120B) and architectures.
- No Training Required: DeepConf operates entirely at test time, requiring no additional model training or fine-tuning.
Limitations and Future Directions
A key limitation is the potential for overconfident but incorrect traces to dominate the filtered ensemble, particularly under aggressive filtering. Future work should address confidence calibration and explore integration with uncertainty quantification methods. Extending DeepConf to reinforcement learning settings, where confidence-based early stopping could guide exploration, is a promising direction.
Conclusion
DeepConf provides a practical, scalable approach for improving both the efficiency and accuracy of LLM reasoning via confidence-aware filtering and early stopping. The method is simple to implement, model-agnostic, and delivers strong empirical gains across diverse reasoning tasks and model scales. These results underscore the value of leveraging model-internal confidence signals for test-time reasoning optimization and suggest broad applicability in real-world LLM deployments.