
Agentic Confidence Calibration (ACC)

Updated 25 March 2026
  • Agentic Confidence Calibration (ACC) is a framework that measures AI agents’ self-assessed confidence against empirical outcomes in multi-step, tool-integrated workflows.
  • It employs protocols at pre-, mid-, and post-execution stages to diagnose biases such as overconfidence and confirmation bias during task evaluation.
  • Empirical findings show that adversarial prompting can reduce Expected Calibration Error by up to 35%, making agent self-assessment more trustworthy.

Agentic Confidence Calibration (ACC) quantifies and improves the alignment between an AI agent’s self-reported probability of success (its “agentic uncertainty”) and empirical task outcomes, particularly in multi-step, tool-integrated workflows. Unlike classical calibration concepts developed for static, single-turn models, ACC targets the unique epistemic and process-level uncertainties encountered by autonomous agents, including compounding error propagation, information state drift, and confirmation bias during self-evaluation. ACC frameworks are designed to both diagnose systematic miscalibration—most notably pervasive overconfidence in state-of-the-art coding and tool-use agents—and to enable deployment-time protocols and learning-based interventions that yield more trustworthy, actionable agent uncertainty estimates (Kaddour et al., 6 Feb 2026).

1. Foundational Principles and Formal Definition

Agentic Confidence Calibration is formally concerned with the self-assigned probability that an agent will successfully complete a complex task, conditional on the information available at the time of confidence elicitation. Let $\hat{p}_i \in [0,1]$ be the agent’s reported probability of success on instance $i$, with actual outcome $y_i \in \{0,1\}$ indicating success or failure. Over $n$ instances, calibration is assessed by comparing the mean of $\hat{p}_i$ to the empirical base rate:

$$\text{Overconfidence} = \frac{1}{n}\sum_{i=1}^n \hat{p}_i - \frac{1}{n}\sum_{i=1}^n y_i$$

Expected Calibration Error (ECE) and Brier Score are used as principal metrics:

$$\mathrm{ECE} = \sum_{b=1}^B \frac{|B_b|}{n}\,\bigl|\mathrm{acc}(B_b) - \mathrm{conf}(B_b)\bigr|$$

$$\mathrm{Brier} = \frac{1}{n}\sum_{i=1}^n (\hat{p}_i - y_i)^2$$

where $B_b$ is a confidence bin, $\mathrm{acc}(B_b)$ is empirical accuracy, and $\mathrm{conf}(B_b)$ is mean predicted confidence within $B_b$ (Kaddour et al., 6 Feb 2026).
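
The three metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation code of the cited work: the function names, the equal-width binning, and the default of 10 bins are assumptions made here.

```python
from typing import List, Sequence, Tuple


def overconfidence(conf: Sequence[float], outcome: Sequence[int]) -> float:
    """Mean reported confidence minus the empirical success rate."""
    n = len(conf)
    return sum(conf) / n - sum(outcome) / n


def brier_score(conf: Sequence[float], outcome: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    n = len(conf)
    return sum((p - y) ** 2 for p, y in zip(conf, outcome)) / n


def expected_calibration_error(
    conf: Sequence[float], outcome: Sequence[int], num_bins: int = 10
) -> float:
    """ECE with equal-width bins on [0, 1] (the binning scheme is an assumption)."""
    n = len(conf)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(num_bins)]
    for p, y in zip(conf, outcome):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(p * num_bins), num_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece
```

For the toy inputs `conf = [0.9, 0.8, 0.7, 0.6]` and `outcome = [1, 0, 0, 1]`, these functions give an overconfidence of about 0.25, a Brier score of about 0.325, and an ECE of about 0.5 (each instance lands in its own bin).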

Agentic calibration generalizes the “probability that I know” ($P(\mathrm{IK})$) concept to dynamic, multi-step workflows by explicitly modeling $P(\mathrm{IS}) = P(\mathrm{agent}\ M\ \text{succeeds on}\ t \mid \mathcal{I})$, where $\mathcal{I}$ denotes the agent’s information state, which can correspond to various stages: pre-execution (task description only), mid-execution (partial trajectory observed), or post-execution (completed solution/prediction) (Kaddour et al., 6 Feb 2026).

2. Calibration Protocols and Experimental Regimes

ACC requires eliciting and evaluating agent confidences at distinct workflow checkpoints to probe informational and cognitive biases. Experimental protocols include:

  • Pre-Execution Elicitation: Confidence is assessed with only static task information (description, codebase). This measures a priori beliefs about difficulty before any attempt.
  • Mid-Execution Elicitation: Confidence is queried after 25%, 50%, or 75% of the agent's tool-use actions, providing insight into how uncertainty evolves as the trajectory unfolds.
  • Post-Execution Elicitation: After generating a solution (e.g., code patch), the agent estimates the probability of correctness.
  • Adversarial Post-Execution: An enhanced protocol in which the agent is prompted to “find bugs” or critically falsify its solution before providing a final confidence estimate, thereby countering confirmation bias.

All elicited confidences are collected under strictly sandboxed conditions to prevent agents from accessing ground-truth test results, isolating intrinsic uncertainty from retrospective feedback (Kaddour et al., 6 Feb 2026).
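
One way to picture the four regimes is as different views of the same episode, each exposing a different information state to the agent. The sketch below is illustrative only: `InfoState`, `build_info_states`, and the regime labels are hypothetical names, and it assumes that mid-execution checkpoints truncate the action list by fraction of steps. Ground-truth test results never appear in any state, mirroring the sandboxing requirement.

```python
from dataclasses import dataclass
from typing import List, Optional

# Fractions of the trajectory shown at mid-execution checkpoints (from the text).
MID_FRACTIONS = (0.25, 0.50, 0.75)


@dataclass
class InfoState:
    """What the agent may see when asked for its confidence (hypothetical type)."""
    regime: str
    task_description: str
    visible_steps: List[str]
    final_solution: Optional[str] = None
    adversarial: bool = False  # True: ask the agent to look for bugs first


def build_info_states(
    task_description: str, trajectory: List[str], final_solution: str
) -> List[InfoState]:
    """Return one information state per elicitation checkpoint."""
    states = [InfoState("pre", task_description, [])]
    for frac in MID_FRACTIONS:
        cutoff = int(len(trajectory) * frac)
        label = f"mid@{int(frac * 100)}"
        states.append(InfoState(label, task_description, trajectory[:cutoff]))
    states.append(
        InfoState("post", task_description, list(trajectory), final_solution)
    )
    states.append(
        InfoState(
            "adv-post", task_description, list(trajectory), final_solution,
            adversarial=True,
        )
    )
    return states
```

Each returned state would then be rendered into a prompt and sent to the agent; the prompt wording itself is outside the scope of this sketch.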

Three LLM-based coding agents (GPT-5.2-Codex, Gemini-3-Pro, and Claude Opus 4.5) have been evaluated on the SWE-bench Pro bug-fixing suite, with detailed calibration metrics computed in each regime.

3. Empirical Findings: Agentic Overconfidence and Adversarial Protocols

Quantitative results demonstrate that leading coding agents are systematically and sometimes extremely overconfident:

  • Gemini-3-Pro claims an average success probability of $\hat{p}\approx 0.77$ yet achieves only 22% empirical success (overconfidence gap: +55 pp).
  • All agents exhibit overconfidence across regimes, but calibration curves lie farther below the diagonal (i.e., greater overconfidence) for post-execution estimates relative to pre-execution.
  • Adversarial prompting—soliciting bug-finding or falsification—yields substantial improvements: ECE reduced by up to 35% (Claude: 0.37→0.24), overconfidence lowered by 15 pp (Gemini: 0.55→0.40).
  • Pre-execution confidences, despite strictly less information, tend to show better AUROC (i.e., superior discrimination between success and failure) than post-execution, with the latter susceptible to confirmation bias (superficial plausibility anchoring) (Kaddour et al., 6 Feb 2026).
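
The headline figures in the list above follow from simple arithmetic on the reported values. The worked lines below assume that the 35% figure is a relative reduction in ECE and that the percentage-point figures are absolute differences, which is how the quoted numbers line up.

```latex
\begin{align*}
\text{Overconfidence gap (Gemini-3-Pro)} &= 0.77 - 0.22 = 0.55 \quad (+55\ \text{pp}) \\
\text{Relative ECE reduction (Claude)} &= \frac{0.37 - 0.24}{0.37} \approx 0.35 \quad (35\%) \\
\text{Overconfidence change (Gemini-3-Pro)} &= 0.55 - 0.40 = 0.15 \quad (15\ \text{pp})
\end{align*}
```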

Notably, while agent confidence typically declines during mid-execution (“cold feet” effect), this dynamic is uninformative: failures and successes are not cleanly separable by their confidence trajectories (AUROC near chance).
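
AUROC here measures how often a successful run received a higher confidence than a failed one, so a value near 0.5 means confidence carries almost no ranking signal. A minimal pairwise sketch follows; the function name `auroc` is chosen for illustration, and the quadratic loop is meant for clarity rather than for large datasets.

```python
from typing import Sequence


def auroc(conf: Sequence[float], outcome: Sequence[int]) -> float:
    """Probability that a random success outranks a random failure.

    Ties between a success and a failure count as half a win.
    """
    successes = [p for p, y in zip(conf, outcome) if y == 1]
    failures = [p for p, y in zip(conf, outcome) if y == 0]
    if not successes or not failures:
        raise ValueError("AUROC needs at least one success and one failure")

    wins = 0.0
    for s in successes:
        for f in failures:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(successes) * len(failures))
```

An agent that reports the same confidence on every task scores exactly 0.5 under this definition, regardless of how overconfident that single value is. This is why AUROC (discrimination) and ECE (calibration) are reported separately.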

| Regime | Model | Base Rate | AUROC | Overconf. | ECE | Brier |
|---|---|---|---|---|---|---|
| Pre-exec | GPT-5.2-Codex | 35% | 0.62 | +0.35 | 0.35 | 0.33 |
| Post-exec | GPT-5.2-Codex | 35% | 0.58 | +0.39 | 0.42 | 0.40 |
| Adv-Post | GPT-5.2-Codex | 35% | 0.55 | +0.26 | 0.30 | 0.31 |
| Pre-exec | Gemini-3-Pro | 22% | 0.53 | +0.77 | 0.77 | |
| Post-exec | Gemini-3-Pro | 22% | 0.51 | +0.55 | 0.66 | |
| Adv-Post | Gemini-3-Pro | 22% | 0.57 | +0.40 | 0.53 | |

Adversarial prompting not only reduces global confidence but, for some agents, more strongly targets likely failures, thus increasing discrimination (shift-vs-signal analysis). However, not all agents respond identically: certain LLMs (e.g., GPT-5.2-Codex) mainly show a uniform confidence downscaling under adversarial cues, suggesting that post hoc calibration (e.g., Platt scaling) remains necessary for optimal alignment (Kaddour et al., 6 Feb 2026).
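
When adversarial prompting only shifts confidences uniformly, a post hoc recalibration map fitted on held-out outcomes can correct the remaining offset. The sketch below fits a two-parameter logistic map on the logit of the reported confidence, a common variant of Platt scaling. It is an illustration under stated assumptions rather than the procedure of the cited work: the names `fit_platt` and `apply_platt`, the learning rate, and the step count are all chosen here.

```python
import math
from typing import Sequence, Tuple


def _logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(
    conf: Sequence[float], outcome: Sequence[int],
    lr: float = 0.1, steps: int = 5000,
) -> Tuple[float, float]:
    """Fit sigmoid(a * logit(p) + b) to outcomes by gradient descent on log loss."""
    xs = [_logit(p) for p in conf]
    n = len(xs)
    a, b = 1.0, 0.0  # start from the identity map
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcome):
            err = _sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(p: float, a: float, b: float) -> float:
    """Map a raw reported confidence to a recalibrated one."""
    return _sigmoid(a * _logit(p) + b)
```

Because the map is monotone whenever `a` is positive, it changes ECE and overconfidence but leaves the ranking of instances, and therefore AUROC, unchanged. Recalibration of this kind cannot substitute for an elicitation protocol that improves discrimination.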

4. Mechanistic Interpretations and Design Implications

Analysis explains the paradoxical result that pre-execution estimates outperform post-execution ones in discriminative power. Pre-execution requires holistic reasoning about codebase complexity, error message clarity, and prior statistical difficulty, resulting in more abstract, task-grounded uncertainty. In contrast, post-execution review is susceptible to “plausibility anchoring”: if the generated artifact “looks right,” the agent’s confidence increases despite the lack of ground-truth validation—a manifestation of confirmation bias.

Adversarial prompting interrupts this bias by reframing the agent’s task orientation from solution justification to error discovery, eliciting greater epistemic humility and yielding better-calibrated confidence estimates.

Robust design guidance emerges:

  • Avoid reliance on post-execution self-assessment for high-stakes accept/reject decisions.
  • Prefer pre-execution confidence for early routing or triage.
  • For accept/reject gates at the end of workflows, ensemble approaches (e.g., using the minimum of pre/post confidences) reduce calibration error without loss of discrimination.
  • Mid-execution confidence drops, while reflecting agent “nervousness,” do not reliably aid failure prediction and are better deployed as early warning triggers for human escalation rather than automated intervention (Kaddour et al., 6 Feb 2026).
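
The routing and gating guidance above can be sketched as two small decision functions. The thresholds, the returned labels, and the function names are hypothetical placeholders; the only element taken from the text is the use of the minimum of the pre- and post-execution confidences as the ensemble score.

```python
def ensemble_confidence(pre_conf: float, post_conf: float) -> float:
    """Conservative ensemble: the lower of the two self-reported confidences."""
    return min(pre_conf, post_conf)


def route_before_execution(pre_conf: float, triage_threshold: float = 0.3) -> str:
    """Early triage on pre-execution confidence (threshold is a placeholder)."""
    return "attempt" if pre_conf >= triage_threshold else "send_to_human"


def gate_after_execution(
    pre_conf: float, post_conf: float, accept_threshold: float = 0.8
) -> str:
    """Accept/reject gate using the min-ensemble instead of post-exec alone."""
    score = ensemble_confidence(pre_conf, post_conf)
    return "accept" if score >= accept_threshold else "escalate"
```

Taking the minimum means a high post-execution estimate cannot override an initially sceptical pre-execution one. This targets the plausibility-anchoring failure mode described earlier in this section. In practice the thresholds would need to be tuned on held-out outcomes.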

5. Comparison with Other Calibration Paradigms

Classical calibration and uncertainty quantification methods (e.g., Platt scaling, temperature scaling, isotonic regression) remain partially applicable, especially for post hoc adjustment of confidence scales. However, ACC highlights unique agentic pathologies such as compounding epistemic error, tool-use-induced noise, and confirmation bias that are not addressed by single-turn calibration frameworks (Kaddour et al., 6 Feb 2026).

Recent works further articulate agentic-specific configurations:

  • Multi-agent deliberation and debate can improve calibration by simulating collective rationalization and integrating diverse epistemic perspectives, typically lowering ECE and Brier Score relative to single-agent estimates (Yang et al., 2024).
  • In tool-use settings, calibration strategies must be adapted to the type of tool in use (evidence vs verification), given the “confidence dichotomy” driven by the nature of feedback, with RL-based fine-tuning frameworks now explicitly targeting joint optimization of performance and calibration (Xuan et al., 12 Jan 2026).
  • Process-level calibration frameworks extract token- or step-level features across trajectories to diagnose and improve calibration beyond what is achievable through local (last-step) or scalar verbalized confidences (Zhang et al., 22 Jan 2026).

6. Future Challenges and Research Directions

Key open directions include:

  • Extension of ACC frameworks beyond code and formal engineering domains to partially subjective or ambiguous success criteria (e.g., creative writing, open-domain navigation).
  • Training learned verifier models, including process- and outcome-based reward models, on interaction traces so that they systematically outperform prompt-based uncertainty estimates.
  • Comprehensive studies of calibration scaling laws, elucidating the effects of model capacity and architecture on agentic miscalibration.
  • Systematic investigation of calibration in hierarchical agentic workflows (e.g., planner–executor–critic architectures), particularly exploring how uncertainty propagates and composes across agent roles and hand-off points (Kaddour et al., 6 Feb 2026).
  • Evaluation on larger task suites, so that small AUROC differences can be resolved and comparisons are statistically rigorous.

7. Summary Table: ACC Regimes and Key Results

| Elicitation Regime | Major Cognitive Bias | Best Use | Calibration Metrics | Adversarial Protocol Impact |
|---|---|---|---|---|
| Pre-Execution | Abstract reasoning | Early triage, routing | Often better AUROC, lower ECE | Not applicable |
| Mid-Execution | Cold feet, drift | Early escalation (not reliable for auto-intervention) | Little discrimination | Not analyzed |
| Post-Execution | Confirmation bias | Accept/reject (unsafe alone) | Worst ECE, overconf. | Adversarial prompt improves ECE, AUROC |
| Adversarial Post-Exec | Falsification focus | Final error checking, conservative acceptance | Reduced ECE, overconf., ↑ AUROC | Up to 35% ECE reduction, best discrimination |

All entries represent findings for state-of-the-art LLM-based coding agents on real-world benchmarks (Kaddour et al., 6 Feb 2026).


Agentic Confidence Calibration establishes rigorous methodologies and design principles for quantifying and mitigating epistemic miscalibration in autonomous agents. Its diagnostic protocols and adversarial interventions are now essential components for deploying trustworthy, high-stakes AI systems capable of accurate self-assessment. Continued advances, especially in adversarial prompting, process-level representation, and learned verification models, are expected to play a central role in achieving reliable agentic uncertainty under real-world complexity.
