Hierarchical Platt Scaling in Automated Code Revision

Updated 23 April 2026

Hierarchical Platt Scaling is a two-stage post-hoc calibration approach that fuses token-level signal extraction with global logistic regression for reliable confidence estimation in code revision tasks.
It leverages fine-grained token scores—such as minimum token probability, low-K averages, and attention-weighted uncertainty—to capture local uncertainties in predictions.
Empirical results on multiple ACR benchmarks show reduced Expected Calibration Error and increased bin coverage, validating its practical impact for program and vulnerability repair.

Hierarchical Platt scaling is a two-stage post-hoc confidence calibration approach for LLMs in automated code revision (ACR) tasks. It addresses the limitations of conventional sequence-level Platt-scaling by incorporating fine-grained, token-level signal extraction combined with local and global calibration layers. This methodology yields calibrated instance-level confidence scores that more faithfully reflect ground-truth correctness, which is critical for downstream tasks such as program repair, vulnerability repair, and code refinement (Lin et al., 8 Apr 2026).

1. Platt-scaling Formalism

Platt-scaling refers to the application of a logistic regression model to raw model scores to produce calibrated probabilities. Given a score $s$ (such as a logit or model-produced likelihood) for an output $\hat{y}$ and binary ground-truth label $y\in\{0,1\}$, Platt-scaling fits the following transformation:

$\hat{p}(y=1 \mid s) = \sigma(A s + B)$

where $\sigma(z) = 1/(1+\exp(-z))$, and parameters $A, B \in \mathbb{R}$ are estimated by minimizing regularized negative log-likelihood over a held-out calibration set:

$(A, B) = \underset{A,B}{\mathrm{argmin}} \sum_{i=1}^N \left[-y_i \log \sigma(A s_i + B) - (1 - y_i)\log(1 - \sigma(A s_i + B)) \right] + \lambda(A^2 + B^2)$

with L2-regularization parameter $\lambda \ge 0$ (Lin et al., 8 Apr 2026).
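As an illustration of the objective above, the following is a minimal sketch that fits $A$ and $B$ by plain gradient descent on the regularized negative log-likelihood. The function names, the learning rate, the step count, and the toy calibration data are assumptions made for this example; the source does not specify an optimizer.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lam=1e-3, lr=1.0, steps=3000):
    """Fit (A, B) by gradient descent on the L2-regularized NLL.

    lam, lr and steps are illustrative defaults, not values from the source.
    """
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        # Gradient of the penalty term lam * (A^2 + B^2).
        grad_A, grad_B = 2.0 * lam * A, 2.0 * lam * B
        for s, y in zip(scores, labels):
            err = sigmoid(A * s + B) - y  # derivative of the NLL w.r.t. the logit
            grad_A += err * s
            grad_B += err
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

# Hypothetical calibration set: raw confidence scores and correctness labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

A, B = fit_platt(scores, labels)
p_low = sigmoid(A * 0.1 + B)
p_high = sigmoid(A * 0.9 + B)
```

Because higher scores coincide with correct outputs more often in this toy set, the fitted slope is positive and the calibrated probability increases with the raw score.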

2. Fine-grained Confidence Scores

Hierarchical Platt scaling leverages token-level softmax traces from the autoregressive decoding process to compute confidence features, rather than relying solely on sequence-level aggregate scores. Three fine-grained scores are extracted for each predicted sequence:

Minimum Token Probability ($s_{\mathrm{min}}$):

$s_{\mathrm{min}} = P_{\mathrm{min}}(\hat{y}) = \min_{1 \le t \le T} P(\hat{y}_t \mid \hat{y}_{<t}, X)$

This feature identifies the least confident token.

Lowest-K Token Probability ($s_{\mathrm{low}K}$):

Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(T)}$ be the sorted token probabilities, with $K$ determined via Kneedle elbow detection:

$s_{\mathrm{low}K} = \frac{1}{K} \sum_{i=1}^{K} p_{(i)}$

Attention-weighted Uncertainty ($s_{\mathrm{attn}}$):

Attention rollout is used to propagate downstream attention $a_t$ per token (using Abnar & Zuidema, 2020). Tokens are ranked by $a_t$, and the top $K$ are used for averaging:

$s_{\mathrm{attn}} = \frac{1}{K} \sum_{t \in \mathcal{T}_K} P(\hat{y}_t \mid \hat{y}_{<t}, X)$

with $\mathcal{T}_K$ the indices of the $K$ largest $a_t$.

These features enable capture of local uncertainties critical in ACR tasks, where a globally correct-looking sequence may conceal token-level uncertainties indicative of error (Lin et al., 8 Apr 2026).
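The three scores can be sketched from a list of per-token probabilities as follows. This is an illustration under stated assumptions: `elbow_k` is a simplified stand-in for Kneedle (it picks the point of the sorted curve farthest above the chord joining its endpoints), the per-token importance weights are taken as given rather than computed by attention rollout, and all function names and numbers are hypothetical.

```python
def min_token_prob(token_probs):
    # Probability of the least confident generated token.
    return min(token_probs)

def elbow_k(sorted_probs):
    """Simplified elbow detection (a stand-in for Kneedle, not the real algorithm).

    Returns the number of tokens lying before the point where the ascending
    probability curve rises farthest above the straight line between its ends.
    """
    n = len(sorted_probs)
    if n < 3:
        return n
    first, last = sorted_probs[0], sorted_probs[-1]
    best_i, best_gap = 0, 0.0
    for i, p in enumerate(sorted_probs):
        chord = first + (last - first) * i / (n - 1)
        if p - chord > best_gap:
            best_i, best_gap = i, p - chord
    return max(1, best_i)

def low_k_prob(token_probs):
    # Average of the K lowest token probabilities, K chosen by the elbow.
    ordered = sorted(token_probs)
    k = elbow_k(ordered)
    return sum(ordered[:k]) / k

def attention_weighted_prob(token_probs, importance, k):
    # Average probability of the k tokens with the largest importance weight.
    # `importance` is assumed to come from attention rollout computed elsewhere.
    ranked = sorted(range(len(token_probs)), key=lambda t: importance[t], reverse=True)
    top = ranked[:k]
    return sum(token_probs[t] for t in top) / len(top)

# Hypothetical decoding trace: two uncertain tokens inside a confident sequence.
token_probs = [0.99, 0.95, 0.2, 0.97, 0.3, 0.9]
importance = [0.05, 0.1, 0.4, 0.05, 0.3, 0.1]

s_min = min_token_prob(token_probs)
s_low = low_k_prob(token_probs)
s_attn = attention_weighted_prob(token_probs, importance, k=2)
```

In this trace the mean token probability is high (about 0.72), yet all three fine-grained scores stay at or below 0.25 because they concentrate on the two uncertain tokens.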

3. Hierarchical (Two-stage) Calibration Pipeline

The two-stage calibration pipeline comprises:

Stage 1: Local Platt-scaling

Each input–output pair $(X, \hat{y})$ is embedded into

$z_j = [\,u;\, s_j\,]$

where $u$ is the UMAP reduction applied to the concatenated text embeddings of $X$ and $\hat{y}$, and $s_j$ is the scalar score for $\hat{y}$, with $j \in \{\mathrm{min}, \mathrm{low}K, \mathrm{attn}\}$. Embeddings are clustered using HDBSCAN, yielding $C$ clusters.

For each cluster $c$ and score $s_j$, local sigmoid calibrators are fitted:

$\hat{p}_j = \sigma(A_{c,j}\, s_j + B_{c,j})$

Outliers use a default backoff calibrator $f_{0,j}$ (either a global sigmoid or identity).
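The local stage can be sketched as one Platt calibrator per cluster plus a backoff for outliers. The sketch below assumes cluster labels have already been computed (for example by HDBSCAN, which marks noise points with the label -1) and uses a global sigmoid as the backoff. The helper names, the `min_size` threshold, the gradient-descent settings, and the data are assumptions for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lam=1e-3, lr=1.0, steps=3000):
    # Gradient descent on the L2-regularized negative log-likelihood.
    A, B, n = 0.0, 0.0, len(scores)
    for _ in range(steps):
        grad_A, grad_B = 2.0 * lam * A, 2.0 * lam * B
        for s, y in zip(scores, labels):
            err = sigmoid(A * s + B) - y
            grad_A += err * s
            grad_B += err
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

def fit_local_calibrators(scores, labels, clusters, min_size=4):
    """Return ({cluster_id: (A, B)}, backoff) for one score type.

    Noise points (label -1) and clusters smaller than min_size (an assumed
    threshold) fall back to a sigmoid fitted on all samples.
    """
    backoff = fit_platt(scores, labels)
    local = {}
    for c in set(clusters):
        idx = [i for i, ci in enumerate(clusters) if ci == c]
        if c == -1 or len(idx) < min_size:
            continue
        local[c] = fit_platt([scores[i] for i in idx], [labels[i] for i in idx])
    return local, backoff

def calibrate_local(score, cluster, local, backoff):
    A, B = local.get(cluster, backoff)
    return sigmoid(A * score + B)

# Hypothetical data: cluster 0 is mostly correct at moderate scores,
# cluster 1 is only correct at very high scores; the last sample is noise.
scores = [0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.5]
labels = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
clusters = [0] * 6 + [1] * 6 + [-1]

local, backoff = fit_local_calibrators(scores, labels, clusters)
p_c0 = calibrate_local(0.7, 0, local, backoff)
p_c1 = calibrate_local(0.7, 1, local, backoff)
p_noise = calibrate_local(0.7, -1, local, backoff)
```

The same raw score of 0.7 maps to a higher calibrated probability in cluster 0 than in cluster 1, which is the effect a single global sigmoid cannot express.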

Stage 2: Global Platt-scaling over Local Outputs

The locally calibrated probability triplet $(\hat{p}_{\mathrm{min}}, \hat{p}_{\mathrm{low}K}, \hat{p}_{\mathrm{attn}})$ forms the feature vector for a global logistic regression:

$\hat{p} = \sigma\left(\sum_{j} w_j\, \hat{p}_j + b\right)$

The combined formula is:

$\hat{p} = \sigma\left(\sum_{j} w_j\, \sigma(A_{c,j}\, s_j + B_{c,j}) + b\right)$

when the embedded pair $z_j$ falls within cluster $c$, where $j$ ranges over the three score types and $(A_{c,j}, B_{c,j})$ are the local Platt parameters of cluster $c$ for score $s_j$.
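The global stage is an ordinary logistic regression whose inputs are the three locally calibrated probabilities. A minimal sketch follows, assuming gradient descent as the optimizer and an L2 penalty on both weights and bias. The function names, hyperparameters, and the six training triplets are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_global(features, labels, lam=1e-3, lr=1.0, steps=3000):
    """Fit weights w and bias b of a logistic regression by gradient descent."""
    d, n = len(features[0]), len(labels)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w = [2.0 * lam * wj for wj in w]
        grad_b = 2.0 * lam * b
        for x, y in zip(features, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def global_confidence(local_probs, w, b):
    # Final confidence from the triplet of locally calibrated probabilities.
    return sigmoid(sum(wj * pj for wj, pj in zip(w, local_probs)) + b)

# Hypothetical triplets (min, low-K, attention) after local calibration.
local_probs = [
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.85, 0.7, 0.8],
    [0.2, 0.3, 0.4], [0.3, 0.2, 0.5], [0.1, 0.4, 0.3],
]
labels = [1, 1, 1, 0, 0, 0]

w, b = fit_global(local_probs, labels)
p_confident = global_confidence([0.9, 0.9, 0.9], w, b)
p_doubtful = global_confidence([0.1, 0.1, 0.1], w, b)
```

The learned weights indicate how much each fine-grained score contributes once the local calibrators have put all three on a common probability scale.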

4. Calibration Metrics

Three metrics are utilized to quantify calibration on the test set:

Expected Calibration Error (ECE): Discretizes $\hat{p}$ into $M$ equal-width bins. For each bin $S_m$, with $\mathrm{acc}(S_m) = \frac{1}{|S_m|}\sum_{i \in S_m} y_i$ and $\mathrm{conf}(S_m) = \frac{1}{|S_m|}\sum_{i \in S_m} \hat{p}_i$:

$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|S_m|}{N} \left| \mathrm{acc}(S_m) - \mathrm{conf}(S_m) \right|$

Brier Score ($\mathcal{B}$):

$\mathcal{B} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$

Bin Coverage (BC): Number of bins populated with at least one sample (ideal is $M$).

These metrics collectively assess both global calibration and the discriminative granularity of the confidence estimator (Lin et al., 8 Apr 2026).
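The three metrics are straightforward to compute together. The sketch below uses the standard equal-width binning definition of ECE; the function name, the choice of ten bins, and the toy predictions are assumptions for illustration.

```python
def calibration_metrics(probs, labels, n_bins=10):
    """Return (ECE, Brier score, bin coverage) using equal-width bins on [0, 1]."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        conf = sum(p for p, _ in members) / len(members)  # mean confidence
        acc = sum(y for _, y in members) / len(members)   # empirical accuracy
        ece += len(members) / n * abs(acc - conf)

    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    coverage = sum(1 for members in bins if members)
    return ece, brier, coverage

# A spread-out estimator versus one that always answers 0.5.
spread = calibration_metrics([0.05, 0.15, 0.95, 0.85], [0, 0, 1, 1])
constant = calibration_metrics([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
```

The constant estimator reaches an ECE of zero while populating a single bin, which illustrates why bin coverage is reported alongside ECE: a low ECE by itself does not show that the confidence scores distinguish correct from incorrect outputs.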

5. Algorithmic Structure and Pseudocode

The pipeline proceeds as follows:

For each fine-grained score type $j \in \{\mathrm{min}, \mathrm{low}K, \mathrm{attn}\}$:
- Compute score $s_j$ for each calibration sample.
- Compute embedded feature $z_j$.
- Cluster $z_j$ using HDBSCAN.
- Fit logistic regression calibrators $(A_{c,j}, B_{c,j})$ for each cluster.
- Select backoff $f_{0,j}$ for outliers.
For each calibration sample:
- Compute locally calibrated probabilities $\hat{p}_j$ using cluster-appropriate or backoff calibrators.
- Stack $(\hat{p}_{\mathrm{min}}, \hat{p}_{\mathrm{low}K}, \hat{p}_{\mathrm{attn}})$ and fit a global logistic regression $(w, b)$.
For inference, compute the three fine-grained scores, embed, assign cluster or outlier, calibrate locally, then apply the global logit for final $\hat{p}$.

Pseudocode precisely formalizing these steps is provided in (Lin et al., 8 Apr 2026).
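The inference path can be sketched as a single function once all parameters are fitted. This is not the authors' pseudocode: the data layout (dictionaries keyed by score type), the function name, and every numeric parameter below are hypothetical, and cluster assignment is assumed to have been done upstream, with -1 marking an outlier.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def hierarchical_confidence(scores, clusters, local_params, backoff_params, w, b):
    """Two-stage confidence for one prediction.

    scores:         {score_type: raw fine-grained score}
    clusters:       {score_type: assigned cluster id, -1 for outliers}
    local_params:   {score_type: {cluster_id: (A, B)}}
    backoff_params: {score_type: (A, B)} used when no local calibrator applies
    w, b:           global logistic regression weights (per score type) and bias
    """
    logit = b
    for name, s in scores.items():
        # Stage 1: cluster-specific sigmoid, or the backoff for outliers.
        A, B = local_params[name].get(clusters[name], backoff_params[name])
        local_prob = sigmoid(A * s + B)
        # Stage 2: weighted combination in the global logit.
        logit += w[name] * local_prob
    return sigmoid(logit)

# All values below are made up for illustration.
scores = {"min": 0.42, "lowk": 0.55, "attn": 0.61}
clusters = {"min": 2, "lowk": -1, "attn": 0}
local_params = {"min": {2: (6.0, -3.0)}, "lowk": {}, "attn": {0: (4.0, -2.0)}}
backoff_params = {"min": (5.0, -2.5), "lowk": (5.0, -2.5), "attn": (5.0, -2.5)}
w = {"min": 1.5, "lowk": 1.0, "attn": 0.5}
b = -1.5

confidence = hierarchical_confidence(scores, clusters, local_params, backoff_params, w, b)
```

Here the low-K score was assigned to no cluster, so it passes through the backoff calibrator, while the other two scores use their cluster-specific parameters.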

6. Empirical Results and Recommendations

Extensive evaluation across three ACR benchmarks—DCF-Bug (program repair), DCF-Vul (vulnerability repair), and CR-Trans (code refinement)—using 14 open-source, decoder-only LLMs (Llama-3.1, CodeLlama, Qwen2.5, Qwen2.5-Coder, DeepSeek; 7B–72B params) demonstrates:

- Sequence-level Platt-scaling often produces low bin coverage (BC = 1–3) and ECE ≳ 0.1, indicating poor calibration granularity.
- Hierarchical Platt-scaling with fine-grained scores achieves lower ECE and higher BC (≈5–9) for program and vulnerability repair (DCF-Bug, DCF-Vul). Minimum token probability is the most effective feature (lowest ECE and Brier score, highest BC).
- Local Platt-scaling is essential for code refinement (CR-Trans): it reduces ECE and increases BC by 2–5 bins. For DCF-Bug and DCF-Vul, local gains are smaller but consistent.
- Practical guidance: For DCF-Bug/DCF-Vul, global Platt-scaling with minimum token probability is sufficient for well-calibrated output. Local Platt-scaling should be used for code refinement, or when further ECE reduction is required (Lin et al., 8 Apr 2026).

7. Significance and Implications

Hierarchical Platt scaling provides a principled and empirically validated framework for confidence calibration in ACR tasks, addressing the shortcomings of global-only approaches. By leveraging local context via token-level features and cluster-specific calibrators, it produces more informative probability outputs. All code, calibrators, and replication scripts are available in the corresponding repository (Lin et al., 8 Apr 2026). This approach enables reliable, instance-level decision-making, facilitating trustworthy integration of LLMs into practical software engineering pipelines.


References (1)

Lin et al. (8 Apr 2026). Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision.
