TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

Published 17 Apr 2026 in cs.LG | (2604.15950v1)

Abstract: Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.


Authors (6)

Summary

The paper introduces an isotonic regression-based calibration method that aligns voxel-level segmentation predictions with the empirical mean human response from multiple experts.
It employs a two-stage approach with a coarse nnU-Net for localization and a high-resolution ensemble for detailed segmentation, refined by post-hoc calibration.
TwinTrack improves soft Dice scores and reduces calibration errors, enhancing uncertainty quantification and supporting more reliable clinical decisions in ambiguous imaging tasks.


Motivation and Problem Statement

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT exemplifies the challenges of ambiguous delineation tasks, where substantial inter-expert variability is a prevalent and irreducible part of the annotation process. Existing deep learning segmentation models, typically supervised with a single consensus mask, fail to capture the genuine uncertainty reflected in disparate expert annotations. Probabilistic outputs from such models are generally calibrated to a single, often artificially definitive, ground truth, so their confidence estimates are unreliable where label ambiguity is non-negligible. Prior efforts have largely focused on ambiguity-aware training paradigms or on single-rater calibration, leaving a methodological gap: post-hoc calibration schemes that explicitly target multi-rater, voxel-level label distributions.

Methodology Overview

TwinTrack introduces a principled, post-hoc calibration framework designed to align the probabilistic segmentation outputs of a fixed deep learning ensemble with the empirical mean human response (MHR)—that is, the fraction of annotators agreeing on the tumor class at each voxel. The segmentation pipeline adopts a two-stage strategy: coarse localization of the pancreatic region using a low-resolution nnU-Net, followed by high-resolution refinement via an ensemble of three independently trained nnU-Nets, yielding averaged voxelwise probabilistic estimates.
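The two quantities named in this paragraph can be illustrated with a minimal NumPy sketch. The function names, array layouts, and toy values below are assumptions made for illustration; they are not taken from the paper's code.

```python
import numpy as np

def mean_human_response(rater_masks: np.ndarray) -> np.ndarray:
    """Fraction of raters labeling each voxel as tumor.

    rater_masks: binary array of shape (R, ...) with one mask per rater
    (hypothetical layout, assumed for this sketch).
    """
    return rater_masks.astype(np.float64).mean(axis=0)

def ensemble_probability(member_probs: np.ndarray) -> np.ndarray:
    """Voxelwise average of the ensemble members' tumor probabilities.

    member_probs: array of shape (M, ...) with one probability map per model.
    """
    return np.asarray(member_probs, dtype=np.float64).mean(axis=0)

# Toy example: 4 raters and 3 ensemble members on a 4-voxel "volume".
masks = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])
member_probs = np.array([
    [0.9, 0.8, 0.4, 0.1],
    [0.8, 0.6, 0.2, 0.0],
    [1.0, 0.7, 0.3, 0.2],
])

mhr = mean_human_response(masks)              # [1.0, 0.75, 0.25, 0.0]
ens_prob = ensemble_probability(member_probs)  # approx. [0.9, 0.7, 0.3, 0.1]
```

Voxels on which the raters split (the second and third here) receive fractional MHR values, which is the target the calibration step is meant to reproduce.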

Figure 1: The TwinTrack pipeline comprises coarse ROI definition, high-resolution ensemble inference, and a post-hoc isotonic regression calibration step mapping predictions to the mean human response (MHR).

The core contribution is the isotonic regression-based calibration layer. This monotonic mapping is learned using a relatively small calibration set with multiple expert annotations, leveraging the property that in the multi-rater regime, the theoretically optimal calibration target coincides with the voxelwise MHR. Detailed justification is provided via a reduction of the multi-rater isotonic regression objective to squared-error minimization with respect to the MHR, as formalized in the appendices.
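The reduction mentioned above follows from a standard bias-variance style identity. The sketch below uses notation assumed here (the appendix of the paper may use different symbols): $p_i$ is the ensemble probability at voxel $i$, $y_{i,r} \in \{0,1\}$ is the label from rater $r = 1, \dots, R$, $\bar{y}_i = \frac{1}{R}\sum_{r} y_{i,r}$ is the MHR, and $f$ is any non-decreasing calibration map. It also assumes every voxel is labeled by the same number of raters $R$.

```latex
\begin{align}
\sum_{i}\sum_{r=1}^{R} \bigl(f(p_i) - y_{i,r}\bigr)^2
  &= \sum_{i}\sum_{r=1}^{R} \bigl(f(p_i) - \bar{y}_i + \bar{y}_i - y_{i,r}\bigr)^2 \\
  &= R\sum_{i} \bigl(f(p_i) - \bar{y}_i\bigr)^2
   + 2\sum_{i} \bigl(f(p_i) - \bar{y}_i\bigr)\underbrace{\sum_{r=1}^{R}\bigl(\bar{y}_i - y_{i,r}\bigr)}_{=\,0}
   + \sum_{i}\sum_{r=1}^{R} \bigl(\bar{y}_i - y_{i,r}\bigr)^2 \\
  &= R\sum_{i} \bigl(f(p_i) - \bar{y}_i\bigr)^2
   + \underbrace{\sum_{i}\sum_{r=1}^{R} \bigl(y_{i,r} - \bar{y}_i\bigr)^2}_{\text{independent of } f}
\end{align}
```

Because the last term does not depend on $f$, minimizing the squared error against every individual rater label over monotone $f$ has the same minimizer as minimizing the squared error against the MHR alone.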

The calibration step confers several key advantages:

Calibration is robust to pervasive rater disagreement, translating highly variable voxels to intermediate probabilities rather than enforcing spurious hard labels.
Probabilistic outputs become directly interpretable as the expected consensus fraction among annotators, supporting downstream uncertainty quantification.
No retraining or architectural modifications of the underlying segmentation model are required; only the final probabilistic outputs are transformed post-hoc.
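The calibration step described above can be sketched with scikit-learn's `IsotonicRegression`. This is a minimal illustration on synthetic data, not the authors' implementation: the helper names, the synthetic "overconfident model" relationship, and the choice of five simulated raters are all assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_mhr_calibrator(probs, mhr) -> IsotonicRegression:
    """Fit a non-decreasing map from ensemble probabilities to the MHR.

    Voxels from all calibration cases are flattened and pooled
    (an assumption of this sketch).
    """
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True,
                             out_of_bounds="clip")
    iso.fit(np.ravel(probs), np.ravel(mhr))
    return iso

def apply_calibrator(iso: IsotonicRegression, probs) -> np.ndarray:
    """Transform probabilities post hoc; the segmentation model is untouched."""
    probs = np.asarray(probs, dtype=np.float64)
    return iso.predict(probs.ravel()).reshape(probs.shape)

# Synthetic calibration data: the model's probability overstates the chance
# that a rater marks the voxel (hypothetical relationship q = p**2).
rng = np.random.default_rng(0)
probs = rng.random((50, 40))                        # ensemble output
agreement = probs ** 2                              # latent per-rater tumor rate
rater_labels = rng.random((5, 50, 40)) < agreement  # 5 simulated raters
mhr = rater_labels.mean(axis=0)                     # calibration target

iso = fit_mhr_calibrator(probs, mhr)
cal = apply_calibrator(iso, probs)

mse_before = np.mean((probs - mhr) ** 2)
mse_after = np.mean((cal - mhr) ** 2)
```

On the calibration set itself, `mse_after` cannot exceed `mse_before`, because the identity map is one of the monotone functions the fit searches over. Whether the gain carries over to held-out cases has to be checked on separate data.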

Experimental Results

Extensive evaluation is conducted on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark. Segmentation ensembles are trained on single-annotator labels, with the post-hoc calibration fit exclusively on multi-rater calibration data. Comparative baselines include models calibrated to a single-rater target as well as to hard-label targets derived from binary masks.

Across primary and secondary metrics—including Thresholding Dice Score (TDSC) for soft segmentation performance, Expected Calibration Error (ECE), Continuous Ranked Probability Score (CRPS), and vessel-specific vascular invasion (VI) scores—TwinTrack consistently achieves superior outcomes.
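To make the calibration metric concrete, the following is a generic sketch of a binned Expected Calibration Error computed against a soft target such as the MHR. The equal-width binning, the bin count, and the function name `soft_ece` are assumptions; the benchmark's official ECE and CRPS definitions may differ in detail.

```python
import numpy as np

def soft_ece(probs, target, n_bins: int = 10) -> float:
    """Binned ECE: weighted mean |mean prediction - mean target| per bin.

    probs:  predicted tumor probabilities in [0, 1]
    target: soft or hard labels in [0, 1] (e.g. the MHR)
    """
    probs = np.ravel(probs).astype(np.float64)
    target = np.ravel(target).astype(np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    bin_idx = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - target[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of voxels in bin
    return float(ece)

# Two bins are populated, each with a gap of 0.15 and weight 0.5.
example_probs = np.array([0.15, 0.15, 0.85, 0.85])
example_target = np.array([0.0, 0.0, 1.0, 1.0])
example_ece = soft_ece(example_probs, example_target)  # approx. 0.15
```

A prediction that equals its target at every voxel gives an ECE of zero under this definition, regardless of how many voxels carry fractional targets.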

Notably, calibration to the MHR raises TDSC over the uncalibrated pipeline (0.553 to 0.569) and reduces both ECE and CRPS, indicating closer alignment with the empirical distribution of expert opinions. In vascular invasion assessment, TwinTrack attains the lowest average error for four of five evaluated vessels.

Figure 2: Reliability diagrams demonstrate improved calibration with TwinTrack; calibrated predictions more closely adhere to the ideal diagonal when compared to uncalibrated ensemble outputs across the multi-rater cohort.

Qualitative analysis shows that uncalibrated models frequently exhibit overconfident predictions inconsistent with areas of high expert disagreement, whereas TwinTrack produces confidence maps that more faithfully represent human annotation uncertainty.

Figure 3: In select cases, TwinTrack calibration yields spatially coherent probabilistic predictions matching the graded uncertainty in expert labels, rectifying the overconfidence seen in typical uncalibrated models.

Numerical highlights include:

An increase in soft Dice (TDSC) from 0.553 (uncalibrated) to 0.569 (TwinTrack), an absolute gain of 0.016 (about 2.9% relative), and a corresponding decrease in ECE from 0.0156 to 0.0147 (about 5.8% relative).
In VI metrics, TwinTrack achieves up to 20% lower error compared to alternative calibration targets on specific vessels.

Theoretical and Practical Implications

TwinTrack establishes calibration to the MHR via isotonic regression as the theoretically justified and empirically robust approach to translating ensemble outputs into meaningful probabilistic predictions in multi-rater settings. The method circumvents the need for annotation harmonization or consensus-simulating schemes, instead taking rater variability as central to the segmentation uncertainty landscape. From a practical standpoint, this enables more nuanced uncertainty quantification for downstream clinical tasks such as treatment planning, risk assessment, and human-in-the-loop workflows, particularly for ill-posed delineation tasks.

The explicit post-hoc nature of the approach ensures immediate applicability to existing segmentation pipelines with minimal data requirements for calibration fitting. TwinTrack's success in the CURVAS-PDACVI challenge substantiates its viability for standardized, challenge-level evaluation protocols and its potential utility in broader, multi-rater medical imaging contexts.

Future Directions

Potential extensions involve adaptation to other ambiguous, multi-rater segmentation domains such as neuroimaging, histopathology, and radiotherapy planning. Further work may consider integration with online recalibration mechanisms, scalable binning schemes, or theoretical exploration of alternative calibration mappings when annotator populations are large or rater reliability varies over cases.

Conclusion

TwinTrack addresses a critical gap in medical image segmentation by introducing a theoretically principled and empirically validated post-hoc calibration method tailored to multi-rater annotation contexts. Through isotonic regression calibration to the MHR, probabilistic outputs from segmentation ensembles become better calibrated and interpretable as estimates of inter-expert consensus, directly supporting both quantitative evaluation and uncertainty-aware clinical decision-making (2604.15950).
