
Hierarchical Platt Scaling in Automated Code Revision

Updated 23 April 2026
  • Hierarchical Platt Scaling is a two-stage post-hoc calibration approach that fuses token-level signal extraction with global logistic regression for reliable confidence estimation in code revision tasks.
  • It leverages fine-grained token scores—such as minimum token probability, low-K averages, and attention-weighted uncertainty—to capture local uncertainties in predictions.
  • Empirical results on multiple ACR benchmarks show reduced Expected Calibration Error and increased bin coverage, validating its practical impact for program and vulnerability repair.

Hierarchical Platt scaling is a two-stage post-hoc confidence calibration approach for LLMs in automated code revision (ACR) tasks. It addresses the limitations of conventional sequence-level Platt-scaling by incorporating fine-grained, token-level signal extraction combined with local and global calibration layers. This methodology yields calibrated instance-level confidence scores that more faithfully reflect ground-truth correctness, which is critical for downstream tasks such as program repair, vulnerability repair, and code refinement (Lin et al., 8 Apr 2026).

1. Platt-scaling Formalism

Platt-scaling refers to the application of a logistic regression model to raw model scores to produce calibrated probabilities. Given a score $s$ (such as a logit or model-produced likelihood) for an output $\hat{y}$ and binary ground-truth label $y \in \{0,1\}$, Platt-scaling fits the following transformation:

$$\hat{p}(y=1 \mid s) = \sigma(A s + B)$$

where $\sigma(z) = 1/(1+\exp(-z))$, and parameters $A, B \in \mathbb{R}$ are estimated by minimizing regularized negative log-likelihood over a held-out calibration set:

$$(A, B) = \underset{A,B}{\mathrm{argmin}} \sum_{i=1}^N \left[-y_i \log \sigma(A s_i + B) - (1 - y_i)\log(1 - \sigma(A s_i + B)) \right] + \lambda(A^2 + B^2)$$

with L2-regularization parameter $\lambda \ge 0$ (Lin et al., 8 Apr 2026).
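As a minimal sketch of the fit above, the following pure-Python example minimizes the regularized negative log-likelihood by plain gradient descent. The optimizer, the learning rate, the step count, the value of $\lambda$, and the toy calibration data are all assumptions made for illustration; the source does not specify how the minimization is carried out.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lam=1e-3, lr=0.5, steps=5000):
    """Fit (A, B) by gradient descent on sum_i NLL_i + lam * (A^2 + B^2)."""
    A, B = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_A = 2.0 * lam * A
        grad_B = 2.0 * lam * B
        for s, y in zip(scores, labels):
            err = sigmoid(A * s + B) - y  # derivative of NLL_i w.r.t. the logit
            grad_A += err * s
            grad_B += err
        # Dividing by n only rescales the step; the minimizer is unchanged.
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

def platt_predict(s, A, B):
    """Calibrated probability p(y=1 | s) = sigmoid(A*s + B)."""
    return sigmoid(A * s + B)

# Hypothetical calibration set: raw scores and binary correctness labels.
scores = [0.05, 0.10, 0.30, 0.45, 0.60, 0.75, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
A, B = fit_platt(scores, labels)
print(A, B, platt_predict(0.8, A, B))
```

Because higher scores go with correct outputs in this toy set, the fitted slope $A$ is positive and the calibrated probability increases with $s$.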

2. Fine-grained Confidence Scores

Hierarchical Platt scaling leverages token-level softmax traces from the autoregressive decoding process to compute confidence features, rather than relying solely on sequence-level aggregate scores. Three fine-grained scores are extracted for each predicted sequence:

  • Minimum Token Probability ($s_{\mathrm{min}}$):

$$s_{\mathrm{min}} = P_{\mathrm{min}}(\hat{y}) = \min_{1 \le t \le T} P(\hat{y}_t \mid \hat{y}_{<t}, X)$$

This feature identifies the least confident token.

  • Lowest-K Token Probability ($s_{\mathrm{low}K}$):

Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(T)}$ be sorted token probabilities, $K$ determined via Kneedle elbow detection:

$$s_{\mathrm{low}K} = \frac{1}{K} \sum_{i=1}^{K} p_{(i)}$$

  • Attention-weighted Uncertainty ($s_{\mathrm{attn}}$):

Attention rollout is used to propagate downstream attention $a_t$ per token (using Abnar & Zuidema, 2020). Tokens are ranked by $a_t$, and the top $K$ are used for averaging:

$$s_{\mathrm{attn}} = \frac{1}{K} \sum_{t \in \mathcal{I}_K} P(\hat{y}_t \mid \hat{y}_{<t}, X)$$

with $\mathcal{I}_K$ the indices of the $K$ largest $a_t$.

These features enable capture of local uncertainties critical in ACR tasks, where a globally correct-looking sequence may conceal token-level uncertainties indicative of error (Lin et al., 8 Apr 2026).
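The three scores can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `k` is passed in as a fixed value rather than chosen by Kneedle elbow detection, the per-token attention weights are assumed to be precomputed by attention rollout, and both averaged scores are taken to be plain means of token probabilities. All function names and example values are hypothetical.

```python
def min_token_prob(token_probs):
    """Probability of the least confident token in the generated sequence."""
    return min(token_probs)

def lowest_k_prob(token_probs, k):
    """Mean of the k smallest token probabilities (k fixed here by assumption)."""
    lowest = sorted(token_probs)[:k]
    return sum(lowest) / len(lowest)

def attention_weighted_prob(token_probs, attn_weights, k):
    """Mean token probability over the k tokens with the largest attention weight."""
    ranked = sorted(range(len(token_probs)),
                    key=lambda t: attn_weights[t], reverse=True)
    top = ranked[:k]
    return sum(token_probs[t] for t in top) / len(top)

# Hypothetical per-token softmax probabilities and rollout attention weights.
token_probs = [0.98, 0.91, 0.40, 0.99, 0.75]
attn = [0.40, 0.05, 0.35, 0.05, 0.15]

s_min = min_token_prob(token_probs)
s_low = lowest_k_prob(token_probs, k=2)
s_attn = attention_weighted_prob(token_probs, attn, k=2)
print(s_min, s_low, s_attn)
```

In this toy sequence a single weak token (probability 0.40) dominates the minimum and lowest-K scores, while the attention-based score depends on whether that token is among the most attended ones.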

3. Hierarchical (Two-stage) Calibration Pipeline

The two-stage calibration pipeline comprises:

Stage 1: Local Platt-scaling

Each input–output pair $(X, \hat{y})$ is embedded, for each score type $k \in \{\mathrm{min}, \mathrm{low}K, \mathrm{attn}\}$, into

$$z^{(k)} = [\,e;\ s^{(k)}\,]$$

where $e$ is obtained by applying UMAP reduction to the concatenated text embeddings of $X$ and $\hat{y}$, and $s^{(k)}$ is the scalar score for $(X, \hat{y})$. Embeddings are clustered using HDBSCAN, yielding $C$ clusters.

For each cluster $c$ and score $s^{(k)}$, local sigmoid calibrators are fitted:

$$\hat{p}^{(k)}_{\mathrm{local}} = \sigma\big(A^{(k)}_c\, s^{(k)} + B^{(k)}_c\big)$$

Outliers use a default backoff calibrator $f^{(k)}_{\mathrm{backoff}}$ (either a global sigmoid or identity).

Stage 2: Global Platt-scaling over Local Outputs

The locally calibrated probability triplet $\hat{\mathbf{p}} = (\hat{p}^{(\mathrm{min})}_{\mathrm{local}}, \hat{p}^{(\mathrm{low}K)}_{\mathrm{local}}, \hat{p}^{(\mathrm{attn})}_{\mathrm{local}})$ forms the feature vector for a global logistic regression:

$$\hat{p}(y=1 \mid X, \hat{y}) = \sigma\big(w^\top \hat{\mathbf{p}} + b\big)$$

The combined formula is:

$$\hat{p}(y=1 \mid X, \hat{y}) = \sigma\Big(\sum_{k} w_k\, \sigma\big(A^{(k)}_{c_k}\, s^{(k)} + B^{(k)}_{c_k}\big) + b\Big)$$

when $z^{(k)}$ falls within cluster $c_k$.
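To make the composition of the two stages concrete, here is a small inference-time sketch. It assumes the calibrators have already been fitted and that each score has already been assigned a cluster id, with `-1` marking an outlier (the convention HDBSCAN uses for noise points). Every parameter value, dictionary layout, and name below is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

SCORE_TYPES = ("min", "lowk", "attn")

def hierarchical_confidence(scores, cluster_ids, local_params, backoff_params, w, b):
    """Apply per-cluster sigmoids (stage 1), then a global logistic layer (stage 2).

    scores:         score type -> raw fine-grained score
    cluster_ids:    score type -> cluster id (-1 or unknown id means outlier)
    local_params:   score type -> {cluster id: (A, B)}
    backoff_params: score type -> (A, B) used for outliers
    w, b:           global logistic weights (one per score type) and bias
    """
    local_probs = []
    for name in SCORE_TYPES:
        A, B = local_params[name].get(cluster_ids[name], backoff_params[name])
        local_probs.append(sigmoid(A * scores[name] + B))
    logit = b + sum(wk * pk for wk, pk in zip(w, local_probs))
    return sigmoid(logit), local_probs

# Hypothetical fitted parameters.
local_params = {
    "min": {0: (6.0, -3.0)},
    "lowk": {0: (5.0, -2.5)},
    "attn": {1: (4.0, -2.0)},
}
backoff_params = {"min": (1.0, 0.0), "lowk": (1.0, 0.0), "attn": (1.0, 0.0)}
w, b = [1.5, 1.0, 0.5], -1.5

scores = {"min": 0.80, "lowk": 0.85, "attn": 0.90}
clusters = {"min": 0, "lowk": 0, "attn": -1}  # the attention score is an outlier

p, local_probs = hierarchical_confidence(
    scores, clusters, local_params, backoff_params, w, b)
print(p, local_probs)
```

Here the first two scores use their cluster-specific calibrators, while the attention-based score falls back to the backoff sigmoid because its cluster id is `-1`.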

4. Calibration Metrics

Three metrics are utilized to quantify calibration on the test set:

  • Expected Calibration Error (ECE): Discretizes $\hat{p}$ into $M$ equal-width bins. For each bin $B_m$, with accuracy $\mathrm{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} y_i$ and mean confidence $\mathrm{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat{p}_i$:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\, \big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|$$

  • Brier Score ($\mathrm{BS}$):

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$$

  • Bin Coverage (BC): Number of bins populated with at least one sample (ideal is $M$).

These metrics collectively assess both global calibration and the discriminative granularity of the confidence estimator (Lin et al., 8 Apr 2026).
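A compact way to compute all three metrics is sketched below. The number of bins (10) and the half-open binning convention, with a probability of exactly 1.0 placed in the last bin, are assumptions; the function name and the toy inputs are hypothetical.

```python
def calibration_metrics(probs, labels, n_bins=10):
    """Return (ECE, Brier score, bin coverage) using equal-width bins on [0, 1]."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing to ECE
        conf = sum(p for p, _ in members) / len(members)
        acc = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(acc - conf)

    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    coverage = sum(1 for members in bins if members)
    return ece, brier, coverage

# Hypothetical calibrated confidences and correctness labels.
probs = [0.15, 0.15, 0.85, 0.85]
labels = [0, 0, 1, 1]
ece, brier, bc = calibration_metrics(probs, labels)
print(ece, brier, bc)
```

In this toy case only two of the ten bins are populated, which illustrates why a low ECE alone says little about how finely the estimator separates instances, and why bin coverage is reported alongside it.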

5. Algorithmic Structure and Pseudocode

The pipeline proceeds as follows:

  1. For each fine-grained score type $k \in \{\mathrm{min}, \mathrm{low}K, \mathrm{attn}\}$:
    • Compute score $s^{(k)}_i$ for each calibration sample.
    • Compute embedded feature $z^{(k)}_i$.
    • Cluster $\{z^{(k)}_i\}$ using HDBSCAN.
    • Fit logistic regression calibrators $(A^{(k)}_c, B^{(k)}_c)$ for each cluster.
    • Select backoff $f^{(k)}_{\mathrm{backoff}}$ for outliers.
  2. For each calibration sample:
    • Compute locally calibrated probabilities $\hat{p}^{(k)}_i$ using cluster-appropriate or backoff calibrators.
    • Stack $(\hat{p}^{(\mathrm{min})}_i, \hat{p}^{(\mathrm{low}K)}_i, \hat{p}^{(\mathrm{attn})}_i)$ and fit a global logistic regression $(w, b)$.
  3. For inference, compute the three fine-grained scores, embed, assign cluster or outlier, calibrate locally, then apply the global logit for final $\hat{p}$.

Pseudocode precisely formalizing these steps is provided in (Lin et al., 8 Apr 2026).
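The fitting side of these steps can be sketched as follows. This is not the pseudocode from the paper: it assumes cluster labels are already available (for example from HDBSCAN over the embedded features, with `-1` for outliers), uses a global sigmoid as the backoff, skips clusters that contain only one class, and fits every logistic model by plain gradient descent. All names, hyperparameters, and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lam=1e-3, lr=0.5, steps=2000):
    """L2-regularized logistic regression by gradient descent; returns (w, b)."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [2 * lam * wj for wj in w], 2 * lam * b
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def fit_hierarchical(score_table, cluster_table, labels):
    """Fit per-cluster local calibrators, then a global layer on their outputs."""
    names = sorted(score_table)
    local, backoff = {}, {}
    for k in names:
        scores, clusters = score_table[k], cluster_table[k]
        backoff[k] = fit_logistic([[s] for s in scores], labels)  # global sigmoid
        local[k] = {}
        for c in set(clusters) - {-1}:
            idx = [i for i, ci in enumerate(clusters) if ci == c]
            ys = [labels[i] for i in idx]
            if len(set(ys)) == 2:  # assumption: single-class clusters use backoff
                local[k][c] = fit_logistic([[scores[i]] for i in idx], ys)

    def local_prob(k, s, c):
        w, b = local[k].get(c, backoff[k])
        return sigmoid(w[0] * s + b)

    feats = [[local_prob(k, score_table[k][i], cluster_table[k][i]) for k in names]
             for i in range(len(labels))]
    gw, gb = fit_logistic(feats, labels)

    def predict(scores, clusters):
        p = [local_prob(k, scores[k], clusters[k]) for k in names]
        return sigmoid(sum(wj * pj for wj, pj in zip(gw, p)) + gb)

    return predict

# Hypothetical calibration data: 8 samples, 3 score types, shared cluster labels.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
score_table = {
    "min":  [0.10, 0.20, 0.70, 0.80, 0.30, 0.90, 0.50, 0.60],
    "lowk": [0.20, 0.30, 0.75, 0.85, 0.35, 0.92, 0.55, 0.65],
    "attn": [0.15, 0.25, 0.72, 0.88, 0.40, 0.95, 0.45, 0.70],
}
clusters = [0, 0, 0, 0, 1, 1, 1, -1]
cluster_table = {k: clusters for k in score_table}

predict = fit_hierarchical(score_table, cluster_table, labels)
same = lambda v: {k: v for k in score_table}
p_hi = predict(same(0.9), same(0))
p_lo = predict(same(0.1), same(0))
p_out = predict(same(0.5), same(-1))
print(p_hi, p_lo, p_out)
```

The returned `predict` closure mirrors the inference step: it looks up the cluster-specific calibrator for each score, falls back to the backoff for outliers or unseen cluster ids, and passes the three local probabilities through the global logistic layer.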

6. Empirical Results and Recommendations

Extensive evaluation across three ACR benchmarks—DCF-Bug (program repair), DCF-Vul (vulnerability repair), and CR-Trans (code refinement)—using 14 open-source, decoder-only LLMs (Llama-3.1, CodeLlama, Qwen2.5, Qwen2.5-Coder, DeepSeek; 7B–72B params) demonstrates:

  • Sequence-level Platt-scaling often produces low bin coverage (BC=1–3) and ECE ≳ 0.1, indicating poor calibration granularity.
  • Hierarchical Platt-scaling with fine-grained scores achieves ECE […] and raised BC (≈5–9) for program and vulnerability repair (DCF-Bug, DCF-Vul). Minimum token probability is the most effective feature (lowest ECE, Brier, highest BC).
  • Local Platt-scaling is essential for code refinement (CR-Trans): It reduces ECE by up to […] (down to ≈[…]) and increases BC by +2–5 bins. For DCF-Bug and DCF-Vul, local gains are smaller but consistent ($\Delta$ECE ≈ […]; $\Delta$BC ≈ […]).
  • Practical guidance: For DCF-Bug/DCF-Vul, global Platt-scaling with the minimum token probability is sufficient for well-calibrated output. Local Platt-scaling should be used for code refinement, or when further ECE reduction is required (Lin et al., 8 Apr 2026).

7. Significance and Implications

Hierarchical Platt scaling provides a principled and empirically validated framework for confidence calibration in ACR tasks, addressing the shortcomings of global-only approaches. By leveraging local context via token-level features and cluster-specific calibrators, it produces more informative probability outputs. All code, calibrators, and replication scripts are available in the corresponding repository (Lin et al., 8 Apr 2026). This approach enables reliable, instance-level decision-making, facilitating trustworthy integration of LLMs into practical software engineering pipelines.
