Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Published 4 Nov 2025 in cs.SE and cs.AI | (2511.02197v1)

Abstract: With the widespread application of LLMs in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.


Authors (5)

Summary

The paper evaluates the intrinsic confidence of LLMs on code reasoning and shows that post-hoc calibration via Platt Scaling lowers ECE and Brier Score (BS); combining it with the reassess prompt strategy improves them by up to 0.541 and 0.628, respectively, over the original performance.
It systematically compares several LLMs, finding that model scale and prompt optimization yield modest gains in code reasoning confidence reliability that depend on the model and task.
Findings highlight that even top-performing models need calibration adjustments to mitigate overconfidence in complex reasoning tasks.

Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Motivation and Context

This paper addresses the critical question of how well LLMs quantify the reliability of their predictions in code reasoning tasks. As LLMs are increasingly leveraged for code review, debugging, and test generation, quantifiable self-assessment (confidence estimation) becomes vital to inform developer trust and downstream decision-making. Existing approaches for evaluating LLMs in code-related domains primarily rely on objective correctness metrics, neglecting the calibration and interpretability of reported confidence. Systematic empirical analysis of confidence reliability in code reasoning, especially the mechanisms for improving it, has been scarce.

Methodological Framework

The authors introduce an analytical framework that evaluates both raw (intrinsic) and processed (prompt-optimized and mathematically calibrated) confidence outputs from mainstream LLMs. The core methodology involves:

Dataset Coverage: Use of REval and CRUXEval, which probe distinct code reasoning capabilities across input-output prediction, execution path analysis, branch coverage, and variable state reasoning.
Model Selection: Side-by-side assessment of DeepSeek-Reasoner, DeepSeek-Chat, GPT-3.5-Turbo, and Qwen3 series, covering a spectrum of model scales and open-source vs closed-source variants.
Confidence Extraction: Models are prompted for both answers and associated confidence, under (a) default settings, (b) modified prompt strategies (reassess and reflective paradigms), and (c) post-hoc calibration via Platt Scaling.
Evaluation Metrics: Reliability is analyzed using Expected Calibration Error (ECE), Brier Score (BS), and a bespoke Performance Score (PS), which quantifies predictive intelligence relative to naive baseline guessing.
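To make the two standard metrics concrete, here is a minimal sketch of ECE and the Brier Score for binary correctness labels. The function names, the use of NumPy, and the choice of 10 equal-width bins are assumptions for illustration and are not taken from the paper. The Performance Score is left out because this summary does not give its formula.

```python
import numpy as np

def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    p = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    p = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    # Equal-width bins on [0, 1]; a confidence of exactly 1.0 falls in the last bin.
    bin_ids = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y[mask].mean() - p[mask].mean())
        ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return float(ece)
```

For both metrics lower is better. A model that reports 0.9 on every item but is right only 75% of the time has an ECE of 0.15 under this binning.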

Core Experimental Findings

Intrinsic Confidence and Model Comparison

Analysis reveals pronounced disparities in confidence reliability across models and tasks:

Reasoning-ability advantage: DeepSeek-Reasoner consistently outperforms the other LLMs, often achieving a more than 50% reduction in ECE compared to DeepSeek-Chat and Qwen3-1.7B, and attaining a positive PS where other models fall below baseline guessing.
Model scale effects: Within the Qwen3 family, larger models show modest gains in calibration and intelligence on certain subtasks, but performance plateaus, particularly for non-trivial reasoning objectives.
Closed-source lag: GPT-3.5-Turbo exhibits high ECE and BS and a negative PS, contradicting prior assumptions about the robustness of mainstream closed-source models in confidence estimation.
Task complexity sensitivity: Calibration degrades markedly for tasks involving branch paths or intermediate state prediction relative to input/output prediction, indicating greater uncertainty in complex reasoning contexts.

Prompt Strategy Optimization

Prompt engineering with reassess (self-doubt) prompts and independent reflective agents yields marginal, model-dependent gains:

Selective improvement: Only high-performing models realize tangible increases in calibration and discriminative intelligence through prompt-based strategies. For lower-performing or less specialized models, improvements are inconsistent or negligible.
Overconfidence mitigation: Reassess strategies can reduce overconfident reporting, but reflective confidence sometimes introduces novel subjectivity, shifting calibration errors rather than correcting them.
Task- and model-specific utility: Effectiveness varies by scenario, with the most prominent gains observed for output prediction on CRUXEval tasks using top-tier LLMs.
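A reassess-style strategy could be wired up roughly as follows. This is a sketch under stated assumptions: the prompt wording, the `Confidence: <value>` reply format, and the `ask` callable (a stand-in for whatever chat API is used) are all hypothetical and are not the prompts used in the paper.

```python
import re
from typing import Callable, Optional

# Hypothetical reply format: the model ends its answer with "Confidence: 0.85" or "Confidence: 85%".
CONF_PATTERN = re.compile(r"confidence\s*[:=]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE)

def parse_confidence(text: str) -> Optional[float]:
    """Extract a confidence in [0, 1] from a reply, or None if none is stated."""
    match = CONF_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    if value > 1.0:  # treat values above 1 as percentages
        value /= 100.0
    return min(max(value, 0.0), 1.0)

def reassess_confidence(ask: Callable[[str], str], task_prompt: str) -> dict:
    """Ask for an answer with confidence, then ask the model to reassess it."""
    first = ask(task_prompt + "\nEnd with 'Confidence: <0-1>'.")
    second = ask(
        "You previously answered the task below.\n"
        f"Task:\n{task_prompt}\n\nYour answer:\n{first}\n\n"
        "Reconsider where the answer could be wrong, then restate your "
        "confidence as 'Confidence: <0-1>'."
    )
    return {
        "answer": first,
        "intrinsic": parse_confidence(first),
        "reassessed": parse_confidence(second),
    }
```

A reflective variant would send the second prompt to a separate model or session instead of the one that produced the answer. Keeping both the intrinsic and the reassessed value makes it possible to compare their calibration afterwards.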

Mathematical Calibration via Platt Scaling

Platt Scaling demonstrably provides robust global improvement in calibration:

Significant reduction in bias and error: Across all models and all tasks, post-hoc calibration reliably drives ECE and BS lower, and shifts PS from strongly negative (worse than baseline) to near-zero or positive values.
Distributional convergence: Calibration unifies the outputs of intrinsic, reassess, and reflective confidence methods, with diminished differences post-scaling.
Enhanced practical trustworthiness: Even for models with initially poor calibration, systematic bias is rectified (e.g., GPT-3.5-Turbo shows PS approaching zero post-calibration).
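As an illustration of the technique, Platt Scaling fits a two-parameter logistic map from a raw score to a calibrated probability on held-out data. The sketch below assumes the raw confidence itself is used as the score and fits the slope `a` and intercept `b` by Newton's method on the log loss. The function names are hypothetical, and the paper's exact setup (input transform, data split, any label smoothing) is not specified in this summary.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, correct, n_iter=50, tol=1e-10):
    """Fit p = sigmoid(a * score + b) by maximum likelihood (Newton's method).

    Assumes the calibration set contains both correct and incorrect examples
    and that the two classes are not perfectly separable by score.
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(correct, dtype=float)
    X = np.column_stack([s, np.ones_like(s)])  # columns: score, bias
    w = np.zeros(2)                            # w = (a, b)
    for _ in range(n_iter):
        p = _sigmoid(X @ w)
        grad = X.T @ (p - y)                   # gradient of the log loss
        # Hessian of the log loss, with a tiny ridge for numerical stability.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-8 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        w -= step
        if np.max(np.abs(step)) < tol:
            break
    return float(w[0]), float(w[1])

def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities with fitted (a, b)."""
    return _sigmoid(a * np.asarray(scores, dtype=float) + b)
```

The parameters should be fitted on a calibration split and applied to separate test data. Fitting and evaluating on the same examples would overstate the improvement.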

Limitations of Mathematical Calibration

Despite the clear benefits, calibration tends to contract the range of output confidences, reducing the granularity with which models distinguish high- and low-confidence predictions. In high-risk settings, this can obscure meaningful risk stratification. Moreover, global calibration ignores the idiosyncrasies of individual predictions and struggles under distribution shift between the calibration data and the deployment context.
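One way to see where the contraction comes from, assuming Platt Scaling is applied directly to a raw confidence $s \in [0, 1]$ with fitted slope $a > 0$ and intercept $b$ (these symbols are introduced here for illustration):

```latex
\[
\hat{p}(s) = \sigma(a s + b) = \frac{1}{1 + e^{-(a s + b)}},
\qquad
\hat{p}(s) \in [\sigma(b),\, \sigma(a + b)] \subset (0, 1).
\]
\[
\frac{d\hat{p}}{ds} = a\,\hat{p}(s)\bigl(1 - \hat{p}(s)\bigr) \le \frac{a}{4}
\quad\Longrightarrow\quad
\lvert \hat{p}(s_1) - \hat{p}(s_2) \rvert \le \frac{a}{4}\,\lvert s_1 - s_2 \rvert .
\]
```

Calibrated outputs therefore never reach 0 or 1, and whenever the fitted slope satisfies $a < 4$, any two raw confidences are mapped strictly closer together. A small slope is what the fit produces when raw confidence is only weakly related to correctness, so poorly discriminating models see the strongest compression.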

Implications and Prospects

Theoretical Advancement

The study elucidates the positive role of explicit reasoning mechanisms and model scale in code reasoning confidence, and exposes the necessity of mathematical calibration as a core step toward usable confidence reporting in LLMs. Prompt strategy optimization offers auxiliary gains, but lacks universality.

Practical Deployment

These insights are significant for engineering adoption: practitioners can select and calibrate LLMs for development workflows, deploy trusted code assistants, and structure dynamic risk management pipelines. Hybrid strategies (combining optimized prompting with Platt Scaling) can substantially increase the reliability of code analysis outputs, supporting integration of LLMs into safety-critical decision loops.

Future Directions

Open issues highlighted include:

Retention of discriminative power: Methods are needed to preserve the granularity of calibrated confidence, possibly via local calibration or adaptive interval techniques.
Generalization across models: Expansion to code-specialized LLMs (e.g., CodeLlama, StarCoder) may uncover distinct confidence phenomena and optimization opportunities.
Prompt robustness: Deeper mapping of the prompt engineering space can yield more generalizable and effective confidence extraction strategies.
Dataset limitations: Alignment between benchmark evaluations and real-world deployment remains an open challenge, with further research needed to address task heterogeneity and requirements ambiguity.

Conclusion

This paper presents a systematic empirical evaluation and enhancement of confidence reliability in LLM-driven code reasoning. Reasoning-capable and large-scale models demonstrate the best calibration and discriminative ability, but even these require post-hoc mathematical calibration to meet practical reliability standards. Prompt strategy optimization provides useful, but limited, auxiliary improvements. The field must now contend with balancing overall reliability, interpretability, and risk discrimination, while pushing toward dynamic calibration strategies and full integration of trustworthy confidence mechanisms in real-world software engineering deployments.