Hierarchical Platt Scaling in Automated Code Revision
- Hierarchical Platt Scaling is a two-stage post-hoc calibration approach that fuses token-level signal extraction with global logistic regression for reliable confidence estimation in code revision tasks.
- It leverages fine-grained token scores—such as minimum token probability, low-K averages, and attention-weighted uncertainty—to capture local uncertainties in predictions.
- Empirical results on multiple ACR benchmarks show reduced Expected Calibration Error and increased bin coverage, validating its practical impact for program and vulnerability repair.
Hierarchical Platt scaling is a two-stage post-hoc confidence calibration approach for LLMs in automated code revision (ACR) tasks. It addresses the limitations of conventional sequence-level Platt-scaling by incorporating fine-grained, token-level signal extraction combined with local and global calibration layers. This methodology yields calibrated instance-level confidence scores that more faithfully reflect ground-truth correctness, which is critical for downstream tasks such as program repair, vulnerability repair, and code refinement (Lin et al., 8 Apr 2026).
1. Platt-scaling Formalism
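Platt-scaling refers to the application of a logistic regression model to raw model scores to produce calibrated probabilities. Given a score $s$ (such as a logit or model-produced likelihood) for an output and a binary ground-truth label $y \in \{0, 1\}$, Platt-scaling fits the following transformation:
$$\hat{p} = \sigma(a s + b)$$
where $\sigma(z) = 1 / (1 + e^{-z})$, and the parameters $(a, b)$ are estimated by minimizing the regularized negative log-likelihood over a held-out calibration set $\{(s_i, y_i)\}_{i=1}^{N}$:
$$\min_{a, b} \; -\sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right] + \lambda \left( a^2 + b^2 \right)$$
with L2-regularization parameter $\lambda$ (Lin et al., 8 Apr 2026).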
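As a rough illustration of the logistic fit described above, the following sketch estimates the slope and intercept by plain gradient descent. The function name `fit_platt`, the hyperparameters `lam`, `lr`, and `steps`, and the toy scores are assumptions made for this example; it also averages the log-likelihood over samples for a stable step size, which may differ from the exact objective used by Lin et al.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lam=1e-3, lr=0.5, steps=5000):
    """Fit p = sigmoid(a*s + b) on the L2-regularized mean negative log-likelihood.

    Hypothetical helper: plain gradient descent, no convergence check.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 2.0 * lam * a  # gradient of the L2 penalty
        grad_b = 2.0 * lam * b
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the NLL w.r.t. the logit
            grad_a += err * s / n
            grad_b += err / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Toy calibration set (made up): raw scores and binary correctness labels.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
```

In practice an off-the-shelf logistic regression solver would likely replace the hand-written loop; the loop is spelled out here only to make the objective visible.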
2. Fine-grained Confidence Scores
Hierarchical Platt scaling leverages token-level softmax traces from the autoregressive decoding process to compute confidence features, rather than relying solely on sequence-level aggregate scores. Three fine-grained scores are extracted for each predicted sequence:
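- Minimum Token Probability ($s_{\min}$):
$$s_{\min} = \min_{1 \le t \le T} p_t$$
where $p_t$ is the softmax probability of the $t$-th generated token and $T$ is the output length. This feature identifies the least confident token.
- Lowest-K Token Probability ($s_{\text{low-}K}$):
Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(T)}$ be the sorted token probabilities, with $K$ determined via Kneedle elbow detection:
$$s_{\text{low-}K} = \frac{1}{K} \sum_{k=1}^{K} p_{(k)}$$
- Attention-weighted Uncertainty ($s_{\text{attn}}$):
Attention rollout (Abnar & Zuidema, 2020) is used to propagate downstream attention $a_t$ per token. Tokens are ranked by $a_t$, and the top $K$ are used for averaging:
$$s_{\text{attn}} = \frac{1}{K} \sum_{t \in \mathcal{T}_K} p_t$$
with $\mathcal{T}_K$ the indices of the $K$ largest $a_t$.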
These features enable capture of local uncertainties critical in ACR tasks, where a globally correct-looking sequence may conceal token-level uncertainties indicative of error (Lin et al., 8 Apr 2026).
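The three scores can be sketched in a few lines, assuming the per-token probabilities and per-token attention weights have already been extracted. The function names are hypothetical, `k` is passed in directly rather than chosen by Kneedle elbow detection, and the `attention` values are made-up stand-ins for attention-rollout output.

```python
def min_token_prob(probs):
    """Probability of the least confident token."""
    return min(probs)


def lowest_k_prob(probs, k):
    """Mean of the k smallest token probabilities (k supplied by the caller)."""
    if k < 1 or not probs:
        raise ValueError("need at least one token and k >= 1")
    lowest = sorted(probs)[:k]
    return sum(lowest) / len(lowest)


def attention_topk_prob(probs, attention, k):
    """Mean probability of the k tokens with the largest attention weight."""
    if k < 1 or len(probs) != len(attention) or not probs:
        raise ValueError("probs and attention must align, and k >= 1")
    ranked = sorted(range(len(probs)), key=lambda t: attention[t], reverse=True)[:k]
    return sum(probs[t] for t in ranked) / len(ranked)


# Toy token trace (made up): one probability and one attention weight per token.
probs = [0.99, 0.42, 0.97, 0.80, 0.95]
attention = [0.40, 0.10, 0.05, 0.30, 0.15]

s_min = min_token_prob(probs)                      # least confident token: 0.42
s_lowk = lowest_k_prob(probs, k=2)                 # mean of 0.42 and 0.80
s_attn = attention_topk_prob(probs, attention, 2)  # mean of tokens 0 and 3
```

Note how the sequence looks confident on average (mean probability above 0.8) while `s_min` exposes a single weak token, which is the kind of local signal these features are meant to surface.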
3. Hierarchical (Two-stage) Calibration Pipeline
The two-stage calibration pipeline comprises:
Stage 1: Local Platt-scaling
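Each input–output pair $(x, \hat{y})$ is embedded into
$$z_j = \left[ u(x, \hat{y}) \, ; \, s_j \right]$$
where $j \in \{\min, \text{low-}K, \text{attn}\}$ indexes the score type, $u(x, \hat{y})$ is a UMAP reduction applied to the concatenated text embeddings of $x$ and $\hat{y}$, and $s_j$ is the scalar score for type $j$. Embeddings are clustered using HDBSCAN, yielding $C$ clusters.
For each cluster $c$ and score $s_j$, local sigmoid calibrators are fitted:
$$\hat{p}_j = \sigma(a_{c,j} \, s_j + b_{c,j})$$
Outliers use a default backoff $g_j$ (either a global sigmoid or identity).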
Stage 2: Global Platt-scaling over Local Outputs
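The locally calibrated probability triplet $(\hat{p}_{\min}, \hat{p}_{\text{low-}K}, \hat{p}_{\text{attn}})$ forms the feature vector for a global logistic regression:
$$\hat{p} = \sigma\Big( \sum_{j} w_j \, \hat{p}_j + w_0 \Big)$$
The combined formula is:
$$\hat{p} = \sigma\Big( \sum_{j} w_j \, \sigma(a_{c,j} \, s_j + b_{c,j}) + w_0 \Big)$$
when the embedding $z_j$ falls within cluster $c$.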
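A small worked example may help show how the two stages compose at inference time. Every parameter value below is invented for illustration, and the names `LOCAL`, `BACKOFF`, `GLOBAL_W`, `GLOBAL_B`, and `calibrated_confidence` are hypothetical; cluster id `-1` is used for outliers, following the HDBSCAN labelling convention.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters (illustrative values, not taken from the paper).
LOCAL = {  # (score type, cluster id) -> (slope, intercept)
    ("min", 0): (6.0, -3.0),
    ("lowk", 0): (5.0, -2.5),
    ("attn", 0): (4.0, -2.0),
}
BACKOFF = {  # global sigmoid used when a sample is an outlier
    "min": (2.0, -1.0),
    "lowk": (2.0, -1.0),
    "attn": (2.0, -1.0),
}
GLOBAL_W = {"min": 2.0, "lowk": 1.0, "attn": 0.5}
GLOBAL_B = -1.75


def calibrated_confidence(scores, clusters):
    """Stage 1: per-cluster sigmoid per score; stage 2: global logistic layer."""
    local = {}
    for name, s in scores.items():
        slope, intercept = LOCAL.get((name, clusters[name]), BACKOFF[name])
        local[name] = sigmoid(slope * s + intercept)
    logit = GLOBAL_B + sum(GLOBAL_W[name] * p for name, p in local.items())
    return sigmoid(logit), local


confidence, local = calibrated_confidence(
    {"min": 0.9, "lowk": 0.8, "attn": 0.7},
    {"min": 0, "lowk": 0, "attn": -1},  # the attention score falls back to BACKOFF
)
```

Keeping the local outputs alongside the final confidence makes it easier to see which score type drove a low estimate.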
4. Calibration Metrics
Three metrics are utilized to quantify calibration on the test set:
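- Expected Calibration Error (ECE): Discretizes $\hat{p}$ into $M$ equal-width bins. For each bin $B_m$, with size $|B_m|$, empirical accuracy $\mathrm{acc}(B_m)$, and mean confidence $\mathrm{conf}(B_m)$:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
- Brier Score ($\mathrm{BS}$):
$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$$
- Bin Coverage (BC): Number of bins populated with at least one sample (ideal is $M$).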
These metrics collectively assess both global calibration and the discriminative granularity of the confidence estimator (Lin et al., 8 Apr 2026).
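The three metrics are simple enough to compute together in one pass over the bins. The sketch below assumes 10 equal-width bins by default; the function name `calibration_metrics` and the toy predictions are made up for this example.

```python
def calibration_metrics(probs, labels, n_bins=10):
    """Return (ECE, Brier score, bin coverage) for predicted probabilities."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        conf = sum(p for p, _ in members) / len(members)  # mean confidence
        acc = sum(y for _, y in members) / len(members)   # empirical accuracy
        ece += len(members) / n * abs(acc - conf)

    brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    coverage = sum(1 for members in bins if members)
    return ece, brier, coverage


# Toy predictions (made up): four samples spread over four different bins.
probs = [0.05, 0.15, 0.85, 0.95]
labels = [0, 0, 1, 1]
ece, brier, coverage = calibration_metrics(probs, labels)

# An estimator that always outputs 0.5 populates a single bin.
_, _, collapsed_coverage = calibration_metrics([0.5] * 4, labels)
```

The second call illustrates why bin coverage is reported alongside ECE: a constant prediction can have a small calibration error while carrying no instance-level information.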
5. Algorithmic Structure and Pseudocode
The pipeline proceeds as follows:
- For each fine-grained score type $j$:
  - Compute score $s_j$ for each calibration sample.
  - Compute embedded feature $z_j$.
  - Cluster $z_j$ using HDBSCAN.
  - Fit logistic regression calibrators $(a_{c,j}, b_{c,j})$ for each cluster $c$.
  - Select backoff $g_j$ for outliers.
- For each calibration sample:
  - Compute locally calibrated probabilities $\hat{p}_j$ using cluster-appropriate or backoff calibrators.
- Stack the $\hat{p}_j$ into a feature vector and fit a global logistic regression with weights $w$.
- For inference, compute the three fine-grained scores, embed, assign cluster or outlier, calibrate locally, then apply the global logit for the final $\hat{p}$.
Pseudocode formalizing these steps is provided by Lin et al. (8 Apr 2026).
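The fitting and inference steps listed above can be approximated in a compact sketch. This is not the authors' pseudocode: it assumes cluster labels have already been produced (with `-1` marking outliers, as HDBSCAN does), skips the embedding and UMAP steps entirely, uses a global sigmoid as the backoff, and relies on a hand-written gradient-descent fit. All function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))  # clipped for safety


def fit_logreg(X, y, lam=1e-3, lr=0.5, steps=2000):
    """L2-regularized logistic regression by gradient descent -> (weights, bias)."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [2.0 * lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def predict(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def local_probs(local, backoff, sample_scores, sample_clusters):
    """Apply the cluster calibrator for each score type, or the backoff if missing."""
    return [
        predict(local[j].get(c, backoff[j]), [s])
        for j, (s, c) in enumerate(zip(sample_scores, sample_clusters))
    ]


def fit_hierarchical(scores, clusters, labels):
    """scores[j][i]: score type j of sample i; clusters[j][i]: cluster id (-1 = outlier)."""
    local, backoff = [], []
    for s_j, c_j in zip(scores, clusters):
        backoff.append(fit_logreg([[s] for s in s_j], labels))  # global sigmoid backoff
        models = {}
        for c in set(c_j) - {-1}:
            idx = [i for i, ci in enumerate(c_j) if ci == c]
            ys = [labels[i] for i in idx]
            if len(set(ys)) == 2:  # a sigmoid needs both classes in the cluster
                models[c] = fit_logreg([[s_j[i]] for i in idx], ys)
        local.append(models)
    stacked = [
        local_probs(local, backoff, [s[i] for s in scores], [c[i] for c in clusters])
        for i in range(len(labels))
    ]
    return local, backoff, fit_logreg(stacked, labels)


def infer(model, sample_scores, sample_clusters):
    local, backoff, global_model = model
    return predict(global_model, local_probs(local, backoff, sample_scores, sample_clusters))


# Toy calibration set (made up): three score types, eight samples, two clusters.
labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [
    [0.10, 0.30, 0.60, 0.90, 0.20, 0.80, 0.50, 0.40],  # minimum token probability
    [0.20, 0.35, 0.70, 0.95, 0.30, 0.85, 0.60, 0.45],  # lowest-K mean
    [0.15, 0.40, 0.65, 0.85, 0.25, 0.90, 0.55, 0.50],  # attention-weighted score
]
clusters = [[0, 0, 0, 0, 1, 1, 1, -1]] * 3  # same assignment reused for brevity

model = fit_hierarchical(scores, clusters, labels)
p_high = infer(model, [0.90, 0.95, 0.85], [0, 0, 0])
p_low = infer(model, [0.10, 0.20, 0.15], [0, 0, 0])
p_outlier = infer(model, [0.50, 0.50, 0.50], [-1, -1, -1])
```

Falling back to the global sigmoid for clusters that contain only one class is a design choice of this sketch; the source only states that outliers use a backoff calibrator.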
6. Empirical Results and Recommendations
Extensive evaluation across three ACR benchmarks—DCF-Bug (program repair), DCF-Vul (vulnerability repair), and CR-Trans (code refinement)—using 14 open-source, decoder-only LLMs (Llama-3.1, CodeLlama, Qwen2.5, Qwen2.5-Coder, DeepSeek; 7B–72B params) demonstrates:
- Sequence-level Platt-scaling often produces low bin coverage (BC=1–3) and ECE ≳ 0.1, indicating poor calibration granularity.
- Hierarchical Platt-scaling with fine-grained scores achieves ECE 8 and raised BC (≈5–9) for program and vulnerability repair (DCF-Bug, DCF-Vul). Minimum token probability is the most effective feature (lowest ECE, Brier, highest BC).
- Local Platt-scaling is essential for code refinement (CR-Trans): It reduces ECE by up to 9 (down to ≈0) and increases BC by +2–5 bins. For DCF-Bug and DCF-Vul, local gains are smaller but consistent (1ECE ≈2; 3BC ≈4).
- Practical guidance: For DCF-Bug and DCF-Vul, global Platt-scaling with the minimum token probability is sufficient for well-calibrated output. Local Platt-scaling should be used for code refinement, or when further ECE reduction is required (Lin et al., 8 Apr 2026).
7. Significance and Implications
Hierarchical Platt scaling provides a principled and empirically validated framework for confidence calibration in ACR tasks, addressing the shortcomings of global-only approaches. By leveraging local context via token-level features and cluster-specific calibrators, it produces more informative probability outputs. All code, calibrators, and replication scripts are available in the corresponding repository (Lin et al., 8 Apr 2026). This approach enables reliable, instance-level decision-making, facilitating trustworthy integration of LLMs into practical software engineering pipelines.