- The paper demonstrates that post-hoc calibration of intrinsic confidence via Platt Scaling reduces Expected Calibration Error (ECE) and Brier Score (BS), achieving up to a 50% improvement over baseline metrics.
- It systematically compares various LLMs, revealing that model scale and prompt optimization yield modest, task- and model-dependent gains in code reasoning reliability.
- Findings highlight that even top-performing models need calibration adjustments to mitigate overconfidence in complex reasoning tasks.
Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
Motivation and Context
This paper addresses the critical question of how well LLMs quantify the reliability of their predictions in code reasoning tasks. As LLMs are increasingly leveraged for code review, debugging, and test generation, quantifiable self-assessment (confidence estimation) becomes vital to inform developer trust and downstream decision-making. Existing approaches for evaluating LLMs in code-related domains primarily rely on objective correctness metrics, neglecting the calibration and interpretability of reported confidence. Systematic empirical analysis of confidence reliability in code reasoning, especially the mechanisms for improving it, has been scarce.
Methodological Framework
The authors introduce a comprehensive analytical framework, evaluating both raw (intrinsic) and processed (prompt-optimized and mathematically calibrated) confidence outputs from mainstream LLMs. The core methodology involves:
- Dataset Coverage: Use of REval and CRUXEval, which probe distinct code reasoning capabilities across input-output prediction, execution path analysis, branch coverage, and variable state reasoning.
- Model Selection: Side-by-side assessment of DeepSeek-Reasoner, DeepSeek-Chat, GPT-3.5-Turbo, and the Qwen3 series, covering a spectrum of model scales and both open-source and closed-source variants.
- Confidence Extraction: Models are prompted for both answers and associated confidence, under (a) default settings, (b) modified prompt strategies (reassess and reflective paradigms), and (c) post-hoc calibration via Platt Scaling.
- Evaluation Metrics: Reliability is analyzed using Expected Calibration Error (ECE), Brier Score (BS), and a bespoke Performance Score (PS), which quantifies predictive intelligence relative to naive baseline guessing.
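As a rough illustration of the two standard metrics named above, the following Python sketch computes the Brier Score and an equal-width binned ECE from per-example confidences and 0/1 correctness labels. The function names and the default of 10 bins are assumptions made for illustration, and the paper's own binning may differ. The bespoke PS metric is left out because its formula is not given in this summary.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    n = len(confidences)
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / n


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Sample-weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the last bin
        bins[idx].append((c, y))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

For example, a model that reports 0.9 confidence on four predictions and gets two of them right would score an ECE of about 0.4 under this definition, which is the kind of overconfidence gap the study measures.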
Core Experimental Findings
Intrinsic Confidence and Model Comparison
Analysis reveals pronounced disparities in confidence reliability across models and tasks:
- Reasoning-ability advantage: DeepSeek-Reasoner consistently outperforms the other LLMs, often achieving an ECE reduction of more than 50% relative to DeepSeek-Chat and Qwen3-1.7B, and a positive PS where other models fall below baseline guessing.
- Model scale effects: Within the Qwen3 family, larger models show modest gains in calibration and intelligence on certain subtasks, but performance plateaus, particularly for non-trivial reasoning objectives.
- Closed-source lag: GPT-3.5-Turbo exhibits high ECE and BS and negative PS, contradicting prior assumptions about mainstream closed-source models' robustness in confidence estimation.
- Task complexity sensitivity: Calibration degrades markedly for tasks involving branch paths or intermediate state prediction relative to input/output prediction, indicating greater uncertainty in complex reasoning contexts.
Prompt Strategy Optimization
Prompt engineering using reassess/self-doubt and independent reflective agents yields marginal, model-dependent gains:
- Selective improvement: Only high-performing models realize tangible increases in calibration and discriminative intelligence through prompt-based strategies. For lower-performing or less specialized models, improvements are inconsistent or negligible.
- Overconfidence mitigation: Reassess strategies can reduce overconfident reporting, but reflective confidence sometimes introduces novel subjectivity, shifting calibration errors rather than correcting them.
- Task- and model-specific utility: Effectiveness varies by scenario, with the most prominent gains observed for output prediction on CRUXEval tasks using top-tier LLMs.
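To make the prompt-based strategies more concrete, here is a minimal Python sketch of how a default prompt and a reassess variant might be built, and how a self-reported confidence could be parsed from a reply. The prompt wording, the `Answer:`/`Confidence:` reply format, and the function names are all hypothetical; the paper's actual prompts are not reproduced in this summary. A reflective variant would presumably pass the first answer to a separate model call, which is not shown.

```python
import re
from typing import Optional

# Hypothetical prompt wording, for illustration only.
BASE_PROMPT = (
    "Predict the output of the following code.\n{code}\n"
    "Reply as:\nAnswer: <output>\nConfidence: <number from 0 to 100>"
)
REASSESS_SUFFIX = (
    "\nBefore replying, reconsider whether your answer could be wrong "
    "and adjust the confidence accordingly."
)


def build_prompt(code: str, reassess: bool = False) -> str:
    """Return the default prompt, optionally extended with a reassess instruction."""
    prompt = BASE_PROMPT.format(code=code)
    return prompt + REASSESS_SUFFIX if reassess else prompt


def parse_confidence(reply: str) -> Optional[float]:
    """Extract 'Confidence: N' from a reply and rescale it to [0, 1]."""
    match = re.search(r"Confidence:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None  # the model did not follow the assumed reply format
    value = float(match.group(1)) / 100.0
    return min(max(value, 0.0), 1.0)  # clamp out-of-range reports
```

Returning `None` for malformed replies, rather than a default value, keeps unparseable outputs from silently skewing calibration statistics.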
Mathematical Calibration via Platt Scaling
Platt Scaling demonstrably provides robust global improvement in calibration:
- Significant reduction in bias and error: Across all models and all tasks, post-hoc calibration reliably drives ECE and BS lower, and shifts PS from strongly negative (worse than baseline) to near-zero or positive values.
- Distributional convergence: Calibration unifies the outputs of intrinsic, reassess, and reflective confidence methods, with diminished differences post-scaling.
- Enhanced practical trustworthiness: Even for models with initially poor calibration, systematic bias is rectified (e.g., GPT-3.5-Turbo shows PS approaching zero post-calibration).
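The calibration step can be sketched in a few lines of Python. Platt Scaling fits a logistic map from a raw score to a calibrated probability using held-out examples with known correctness. The version below is a minimal sketch that assumes the raw confidence (in [0, 1]) is used directly as the score and fits the two parameters by plain gradient descent on the log loss. The function names, learning rate, and step count are illustrative choices, not the authors' implementation.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(
    scores: Sequence[float],
    labels: Sequence[int],
    lr: float = 0.5,      # assumption: fixed step size
    steps: int = 5000,    # assumption: fixed iteration budget
) -> Tuple[float, float]:
    """Fit (a, b) so that sigmoid(a * score + b) approximates P(correct)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(score: float, a: float, b: float) -> float:
    """Map a raw confidence to its calibrated value."""
    return _sigmoid(a * score + b)
```

Fitting on a held-out calibration split and applying the resulting `(a, b)` to new predictions mirrors the usual post-hoc setup; a model that reports 0.95 but is right only 60% of the time on the calibration split would have 0.95 mapped to roughly 0.6.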
Limitations of Mathematical Calibration
Despite the clear benefits, calibration tends to contract the range of outputted confidences, reducing the granularity with which models distinguish high- and low-confidence predictions. In high-risk settings, this can obscure meaningful risk stratification. Moreover, global calibration neglects local prediction idiosyncrasies and struggles under distribution shifts between calibration data and deployment contexts.
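The range contraction follows directly from the form of the calibration map. Assuming Platt Scaling is applied to a raw confidence s in [0, 1] with a positive slope a (an assumption about the setup, not a detail confirmed by this summary), the calibrated output is confined to a sub-interval of [0, 1]:

```latex
\hat{p} = \sigma(a s + b) = \frac{1}{1 + e^{-(a s + b)}}, \qquad s \in [0, 1],\ a > 0
\;\Longrightarrow\; \hat{p} \in [\sigma(b),\ \sigma(a + b)].

% Illustrative values (not from the paper): a = 2, b = -1
\sigma(-1) \approx 0.269, \qquad \sigma(1) \approx 0.731
```

With these illustrative parameters, every calibrated confidence falls between roughly 0.27 and 0.73, so predictions the model originally rated at 0.99 and at 0.80 end up much closer together than before.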
Implications and Prospects
Theoretical Advancement
The study elucidates the positive role of explicit reasoning mechanisms and model scale in code reasoning confidence, and exposes the necessity of mathematical calibration as a core step toward usable confidence reporting in LLMs. Prompt strategy optimization offers auxiliary gains, but lacks universality.
Practical Deployment
These insights are significant for engineering adoption: practitioners can select and calibrate LLMs for development workflows, deploy trusted code assistants, and structure dynamic risk management pipelines. Hybrid strategies (combining optimized prompting with Platt Scaling) can substantially increase the reliability of code analysis outputs, supporting integration of LLMs into safety-critical decision loops.
Future Directions
Open issues highlighted include:
- Retention of discriminative power: Methods are needed to preserve the granularity of calibrated confidence, possibly via local calibration or adaptive interval techniques.
- Generalization across models: Expansion to code-specialized LLMs (e.g., CodeLlama, StarCoder) may uncover distinct confidence phenomena and optimization opportunities.
- Prompt robustness: Deeper mapping of the prompt engineering space can yield more generalizable and effective confidence extraction strategies.
- Dataset limitations: Alignment between benchmark evaluations and real-world deployment remains an open challenge, with further research needed to address task heterogeneity and requirements ambiguity.
Conclusion
This paper presents a systematic empirical evaluation and enhancement of confidence reliability in LLM-driven code reasoning. Reasoning-capable and large-scale models demonstrate the best calibration and discriminative ability, but even these require post-hoc mathematical calibration to meet practical reliability standards. Prompt strategy optimization provides useful, but limited, auxiliary improvements. The field must now contend with balancing overall reliability, interpretability, and risk discrimination, while pushing toward dynamic calibration strategies and full integration of trustworthy confidence mechanisms in real-world software engineering deployments.
References
- (2511.02197) (the reviewed paper)