Structured Uncertainty Similarity Score
- SUSS is a family of metrics that quantify similarity via structured, model-specific decompositions, in both vision and clinical applications.
- The approach uses probabilistic models and ranking-based evaluations to offer interpretable, localized uncertainty measurements and error attributions.
- Empirical findings demonstrate that SUSS outperforms classical methods, delivering robust calibration in image quality assessment and reliable uncertainty stratification in survival analysis.
The Structured Uncertainty Similarity Score (SUSS) is a family of statistical metrics for quantifying similarity or uncertainty based on the structural correspondence between data representations and their predicted behavior under learned stochastic or discriminative models. SUSS is instantiated as two distinct frameworks: a probabilistic, interpretable perceptual metric for image comparison in computer vision (Seidler et al., 3 Dec 2025), and an individual-level uncertainty quantification index for patient-level survival models in clinical prediction (Wang et al., 2023). Each variant formalizes similarity via internally structured, model-specific decompositions and ranking-based or probabilistic evaluation schemes, applicable to both deep learning and statistical modeling.
1. Probabilistic Perceptual Similarity in Computer Vision
SUSS for image similarity is grounded in a generative, self-supervised probabilistic model that decomposes an image into perceptual components, such as multi-scale luminance and chrominance channels. For each component $c$, the model predicts a structured multivariate Normal distribution over human-imperceptible perturbations of the reference image $x$:

$$p_c(\cdot \mid x) = \mathcal{N}\!\big(\mu_c(x),\, \Sigma_c(x)\big),$$

with both the mean $\mu_c(x)$ and the covariance $\Sigma_c(x)$ parameterized as image-dependent functions. To ensure tractability and local interpretability, the precision matrix is represented via a sparse Cholesky factor $L_c(x)$, such that

$$\Sigma_c(x)^{-1} = L_c(x)\, L_c(x)^{\top},$$

where $L_c(x)$ is lower-triangular and nonzero only in local neighborhoods. This structure supports localized decorrelation and enables efficient log-density computation.
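The following minimal sketch illustrates this efficient log-density computation for a single component, assuming a banded lower-triangular factor stored densely; the function name and the toy band width are illustrative, not from the paper:

```python
import numpy as np

def structured_gaussian_logpdf(r, L):
    """Log-density of N(0, Sigma) at residual r, where the precision is
    Sigma^{-1} = L @ L.T and L is a sparse/banded lower-triangular factor.

    The Mahalanobis term reduces to ||L.T @ r||^2 and the log-determinant of the
    precision to 2 * sum(log(diag(L))), so no matrix inversion is required.
    """
    d = r.shape[0]
    z = L.T @ r                                   # whitened residual
    mahalanobis = float(z @ z)                    # r^T Sigma^{-1} r
    logdet_precision = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (logdet_precision - mahalanobis - d * np.log(2.0 * np.pi))

# Toy usage: a small banded factor over a flattened image component.
d = 6
L = np.tril(0.1 * np.random.rand(d, d)) + np.eye(d)
L[np.tril_indices(d, k=-3)] = 0.0                 # keep only a local band
r = 0.05 * np.random.randn(d)                     # small (imperceptible) residual
print(structured_gaussian_logpdf(r, L))
```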
Given a reference image $x$ and a test image $y$, each component yields a residual vector $r_c$ between the component representation of $y$ and the predicted mean $\mu_c(x)$. The log-likelihood under the component's Gaussian is

$$\log p_c(y \mid x) = \tfrac{1}{2}\Big( \log\big|\Sigma_c(x)^{-1}\big| - r_c^{\top}\, \Sigma_c(x)^{-1}\, r_c - d_c \log 2\pi \Big),$$

with $d_c$ the dimensionality of the component, the Mahalanobis term efficiently computable as $r_c^{\top}\Sigma_c(x)^{-1} r_c = \big\| L_c(x)^{\top} r_c \big\|_2^2$, and $\log\big|\Sigma_c(x)^{-1}\big| = 2\sum_i \log \big(L_c(x)\big)_{ii}$.
The global SUSS score between $x$ and $y$ is a weighted sum:

$$\mathrm{SUSS}(x, y) = \sum_{c} w_c\, \log p_c(y \mid x),$$

where the nonnegative weights $w_c \ge 0$ are learned from human-labeled pairwise preference data, via a cross-entropy loss applied to difference-of-score logits on two-alternative forced choice (2AFC) triplets.
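A minimal sketch of the 2AFC weight learning, assuming precomputed per-component log-likelihoods for both candidates of each triplet; the softplus reparameterization of the nonnegative weights and all names are illustrative choices:

```python
import torch

def suss_score(logps, w_raw):
    """Weighted sum of per-component log-likelihoods; weights are kept
    nonnegative via softplus. `logps` has shape (batch, n_components)."""
    w = torch.nn.functional.softplus(w_raw)
    return logps @ w

def twoafc_loss(logps_a, logps_b, human_prefers_a, w_raw):
    """Cross-entropy on the difference-of-score logit for a 2AFC triplet
    (reference x, candidates a and b): the higher SUSS score should go to
    the candidate humans judged closer to the reference."""
    logit = suss_score(logps_a, w_raw) - suss_score(logps_b, w_raw)
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logit, human_prefers_a.float()
    )

# Toy usage: 4 triplets, 3 perceptual components.
w_raw = torch.zeros(3, requires_grad=True)
logps_a = torch.randn(4, 3)
logps_b = torch.randn(4, 3)
labels = torch.tensor([1, 0, 1, 1])
loss = twoafc_loss(logps_a, logps_b, labels, w_raw)
loss.backward()
```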
2. Self-Supervised and Human-Calibrated Learning for Perceptual Models
The model for SUSS in perceptual tasks is trained by maximizing the likelihood of small, human-imperceptible augmentations applied at multiple scales and augmentation levels. The generative goal is to encourage high log-probability for such minimally-distorted variants:

$$\max_{\theta}\; \mathbb{E}_{x}\, \mathbb{E}_{\tilde{x} \sim A_{\sigma}(x)} \Big[ \textstyle\sum_{c} \log p_{c,\theta}(\tilde{x} \mid x) \Big],$$

where $A_{\sigma}(x)$ denotes an augmentation distribution of strength $\sigma$, with smaller augmentation strengths favoring stricter invariances.
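A minimal self-supervised training sketch consistent with this objective, using a toy one-component model with a diagonal Cholesky factor and Gaussian-noise augmentation; the predictor network, noise level, and all names are illustrative assumptions rather than the paper's setup:

```python
import torch

# Toy self-supervised loop: one perceptual component, flat "images" of size d,
# and a diagonal Cholesky factor, so L^T r reduces to diag * r.
d, batch = 32, 16
predictor = torch.nn.Linear(d, 2 * d)              # outputs [mu, raw_diag] per image
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

for step in range(200):
    x = torch.rand(batch, d)                       # reference images
    x_aug = x + 0.01 * torch.randn_like(x)         # human-imperceptible perturbation
    mu, raw_diag = predictor(x).chunk(2, dim=1)
    diag = torch.nn.functional.softplus(raw_diag)  # positive diagonal of L
    r = x_aug - mu                                 # residual of the augmented variant
    z = diag * r                                   # whitened residual L^T r (diagonal L)
    logp = torch.log(diag).sum(dim=1) - 0.5 * (z * z).sum(dim=1)  # up to a constant
    loss = -logp.mean()                            # maximize log-probability of the variant
    opt.zero_grad()
    loss.backward()
    opt.step()
```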
Image-specific whitening transforms, instantiated as $z_c = L_c(x)^{\top} r_c$, provide explicit insight into perceptually salient residuals: high-magnitude coordinates of $z_c$ correspond to pixel neighborhoods and features that the trained SUSS model deems important for similarity judgments. The sparsity and locality of $L_c(x)$ (implemented using U-Net architectures with sparse connectivity) accentuate model transparency.
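A minimal sketch of this whitening-based introspection under the same banded-factor assumption; selecting the top-$k$ coordinates by magnitude is an illustrative way to surface the salient neighborhoods:

```python
import numpy as np

def whitened_saliency(r, L, top_k=10):
    """Whiten a residual with the image-specific factor (z = L^T r) and return
    the indices of the largest-magnitude coordinates, i.e. the pixel
    neighborhoods the model treats as most important for the similarity judgment."""
    z = L.T @ r
    order = np.argsort(-np.abs(z))
    return z, order[:top_k]

# Toy usage with a banded lower-triangular factor.
d = 64
L = np.tril(0.1 * np.random.rand(d, d)) + np.eye(d)
L[np.tril_indices(d, k=-4)] = 0.0
r = 0.05 * np.random.randn(d)
z, salient = whitened_saliency(r, L)
print("most salient coordinates:", salient)
```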
3. Sampling-Based Local Explanations and Generative Introspection
Sampling from the structured component Gaussians enables visualization and exploration of perceptually plausible images in the vicinity of the reference $x$:

$$\tilde{x}_c = \mu_c(x) + L_c(x)^{-\top} \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).$$
By generating images at various quantiles of the log-likelihood, one can empirically demonstrate the tightness of each component’s invariance and provide localized, human-interpretable error attributions.
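A minimal sampling sketch under the same assumptions; since the covariance is $L^{-\top} L^{-1}$, a draw is obtained by solving the triangular system $L^{\top} s = \varepsilon$:

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_component(mu, L, n_samples=5, rng=None):
    """Draw samples from N(mu, Sigma) with precision Sigma^{-1} = L @ L.T.
    A sample is mu + L^{-T} eps with eps ~ N(0, I), computed by solving
    the upper-triangular system L.T s = eps."""
    rng = np.random.default_rng() if rng is None else rng
    d = mu.shape[0]
    eps = rng.standard_normal((n_samples, d))
    return np.array([mu + solve_triangular(L.T, e, lower=False) for e in eps])

# Toy usage: perturbations around a reference component mean.
d = 32
L = np.tril(0.05 * np.random.rand(d, d)) + np.eye(d)
mu = np.random.rand(d)
perturbed = sample_component(mu, L)
print(perturbed.shape)   # (5, 32)
```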
4. Empirical Performance and Benchmarks
On 2AFC human perceptual benchmarks (BAPPS, PieAPP, PIPAL), SUSS-Base (without human fine-tuning) outperforms classic metrics such as PSNR, SSIM, and MS-SSIM, and approaches the accuracy of deep feature-based metrics such as LPIPS. With fine-tuning (e.g., SUSS-BAPPS-RH, SUSS-PieAPP-RH), SUSS achieves 62–64% 2AFC accuracy on BAPPS (LPIPS ∼68%), and exhibits competitive Spearman rank correlation (SRCC) and Pearson linear correlation (PLCC) on PieAPP and PIPAL.
On the KADID-10k dataset, SUSS exhibits strong perceptual calibration across blur, noise, and compression categories, yielding the lowest category-wise KL divergences to human mean opinion scores (MOS), indicating uniform distance assignments aligned with human percepts. Violin plots of SUSS distributions show tight demarcation between imperceptible and clearly perceptible distortions.
As a training loss, SUSS ensures stable optimization and artifact-free image reconstructions, matching or exceeding the qualitative sharpness and cleanliness of results produced by LPIPS and SSIM losses, while possessing formally convex local structure (by Mahalanobis norm properties) and Lipschitz-continuous gradients (Seidler et al., 3 Dec 2025).
5. Patient-Level Uncertainty Quantification in Survival Models
A distinct SUSS framework is defined for uncertainty quantification in survival prediction models for metastatic brain tumor patients (Wang et al., 2023). For an individual patient $i$, SUSS assigns a certainty score based on the concordance between two rank orderings computed across the training set (a computational sketch follows the list below):
- Feature-space similarity ranking: For each training patient $j$, compute a feature-wise dissimilarity loss $d_{ij}$ to patient $i$ (combining clinical nomogram differences and feature mismatch counts), and rank the training patients in ascending order of $d_{ij}$.
- Prediction-space (model output) ranking: Compute the model predictions for patient $i$ and for all training patients, average the training-patient predictions within clusters grouped by the feature-space ranking, then calculate each cluster's squared prediction error relative to patient $i$'s prediction and rank the clusters by this error.
- Pairwise concordance (C-index): The patient's SUSS score is the fraction of group pairs for which the two orderings agree:

$$\mathrm{SUSS}_i = \frac{2}{G(G-1)} \sum_{1 \le g < h \le G} \mathbb{1}\big[\pi(g) < \pi(h)\big],$$

where groups are indexed by increasing feature dissimilarity to patient $i$, $\pi(g)$ is the rank of group $g$ by squared prediction error, and $G$ is the number of patient groups.
Values near $1$ indicate high agreement (prediction is similar to most feature-similar patients—low uncertainty); values near $0.5$ denote low information (prediction does not track with feature proximity).
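A minimal sketch of the patient-level computation, assuming a precomputed feature dissimilarity vector and equal-size groups formed by splitting the feature-similarity ordering; the grouping scheme and function names are illustrative rather than the exact procedure of (Wang et al., 2023):

```python
import numpy as np

def patient_suss(d_feat, preds_train, pred_patient, n_groups=10):
    """Concordance (C-index-style) between the feature-space and prediction-space
    orderings for one test patient.

    d_feat       : (n_train,) feature dissimilarity of each training patient to the test patient
    preds_train  : (n_train,) model predictions for the training patients
    pred_patient : scalar model prediction for the test patient
    """
    order = np.argsort(d_feat)                       # ascending feature dissimilarity
    groups = np.array_split(order, n_groups)         # clusters of increasingly dissimilar patients
    # Squared prediction error of each group's mean prediction vs. the patient's prediction.
    group_err = np.array([(preds_train[g].mean() - pred_patient) ** 2 for g in groups])
    err_rank = np.argsort(np.argsort(group_err))     # rank of each group by prediction error
    # Fraction of group pairs whose error ranking agrees with the feature ordering
    # (group 0 is the most feature-similar, so its error rank should be smallest).
    G = len(groups)
    concordant = sum(
        err_rank[g] < err_rank[h]
        for g in range(G) for h in range(g + 1, G)
    )
    return concordant / (G * (G - 1) / 2)

# Toy usage.
rng = np.random.default_rng(0)
d_feat = rng.random(200)
preds_train = rng.random(200)
print(patient_suss(d_feat, preds_train, pred_patient=0.4))
```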
6. Model-Level Uncertainty and Empirical Findings in Clinical Prediction
Model-level uncertainty is quantified via the increase in time-dependent AUC (C-index) obtained by restricting evaluation to patients whose SUSS exceeds a given threshold $\tau$:

$$\Delta\mathrm{AUC}(\tau) = \mathrm{AUC}\big(\{\, i : \mathrm{SUSS}_i \ge \tau \,\}\big) - \mathrm{AUC}(\text{all test patients}).$$
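A minimal sketch of this restriction, using an ordinary ROC AUC as a stand-in for the time-dependent AUC used in the survival setting; all inputs and names here are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def model_uncertainty(suss_scores, risk_scores, events, tau):
    """Increase in discrimination obtained by restricting evaluation to
    patients whose SUSS certainty exceeds the threshold tau.
    (A plain ROC AUC stands in for the time-dependent AUC / C-index.)"""
    base_auc = roc_auc_score(events, risk_scores)
    keep = suss_scores >= tau
    restricted_auc = roc_auc_score(events[keep], risk_scores[keep])
    return restricted_auc - base_auc

# Toy usage.
rng = np.random.default_rng(1)
n = 500
events = rng.integers(0, 2, n)
risk_scores = events * 0.5 + rng.normal(0, 1, n)   # predictions weakly related to outcome
suss_scores = rng.uniform(0.5, 1.0, n)
print(model_uncertainty(suss_scores, risk_scores, events, tau=0.8))
```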
Empirical studies on 1383 brain metastasis patients and several survival models (CoxPH, CSF, NMTLR) report the following with this metric:
- NMTLR exhibits the lowest uncertainty (e.g., 1.6% on ICP, 2.0% on OS), followed by CSF and CoxPH.
- Uncertainty is lowest on endpoints with simple progression (ICP) and highest on complex composites (ICPD).
- Restricting test sets by high SUSS thresholds yields time-dependent AUC gains of up to 15–20% relative to baseline, indicating SUSS effectively stratifies by predictive certainty.
Pseudocode is provided in (Wang et al., 2023) for reproducible computation of patient-level SUSS.
7. Summary and Interpretability
The SUSS framework, across both domains, is characterized by the following properties:
- It formalizes similarity as either a log-likelihood under a structured local generative model (computer vision) or as a ranking-concordance index (clinical prediction).
- It is explicitly probabilistic, interpretable, and enables localized explanations through whitening transformations or group-wise stratification.
- Training leverages self-supervised invariance (vision) and domain-appropriate feature metrics (clinical).
- Empirically, SUSS delivers competitive alignment with human judgment, robust perceptual calibration, and reliable uncertainty quantification without reducing to opaque feature metrics.
References: (Seidler et al., 3 Dec 2025, Wang et al., 2023).