SICL: Style Invariance as Correctness Likelihood
- The paper introduces SICL, a post-hoc uncertainty calibration method that estimates instance-wise prediction correctness via style perturbations.
- It leverages controlled, content-preserving transformations to measure consistency across style-altered predictions using a simple scoring mechanism.
- SICL achieves significant reductions in Expected Calibration Error (ECE) and integrates seamlessly with any test-time adaptation framework without backpropagation.
Style Invariance as a Correctness Likelihood (SICL) is a post-hoc uncertainty calibration framework for test-time adaptation (TTA) of deep learning models. SICL leverages the principle that model predictions should be invariant to style alterations (perturbations orthogonal to task-relevant content) while remaining sensitive to content changes, and uses this to estimate the instance-wise likelihood of prediction correctness. This plug-and-play, backpropagation-free module can be used with any TTA method, addressing the predictive miscalibration that arises when models are continually adapted on non-stationary, unlabeled test distributions (Nam et al., 8 Dec 2025).
1. Theoretical Rationale
TTA procedures adapt deployed models on-the-fly to out-of-distribution data predominantly via entropy minimization or similar objectives, but this often results in overconfident and miscalibrated outputs, especially under distributional shift. Conventional calibration techniques, including temperature scaling and MC Dropout, generally presuppose frozen model parameters or static data distributions, and are therefore inadequate for calibration under dynamic test-time adaptation.
SICL is conceptually grounded in the “style–content” disentanglement established in representation learning (e.g., Gatys et al., Huang & Belongie, von Kügelgen et al.). Let $x$ denote input data, generated by two independent factors: $c$ (content, task-relevant) and $s$ (style, nuisance). In this causal model:
- An ideal classifier’s decision is exclusively a function of $c$.
- Perturbing the style $s$ of $x$ while preserving $c$ should not affect the model’s prediction.
From this, prediction consistency under controlled, content-preserving style transformations can serve as a proxy for a model’s correctness: high consistency across style-altered versions of $x$ correlates with genuine confidence, while discordance signals unreliability, especially for samples near decision boundaries or under covariate shift.
2. Mathematical Formulation
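Given a model $f_\theta$ (possibly updated at each TTA step) with softmax output over $K$ classes, SICL assesses correctness likelihood for an input $x$ as follows:
- Generate $N$ style-perturbed variants $\tilde{x}_1, \dots, \tilde{x}_N$ of $x$ via feature-level perturbations (see Section 3).
- Let $p_0 = f_\theta(x)$ and $p_i = f_\theta(\tilde{x}_i)$ for $i = 1, \dots, N$, with corresponding predictions $\hat{y}_0 = \arg\max_k p_{0,k}$ and $\hat{y}_i = \arg\max_k p_{i,k}$.
The vanilla style-invariance score is the fraction of style-perturbed predictions that agree with the original prediction:
$$\mathrm{SI}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = \hat{y}_0\right].$$
Alternative consistency metrics, such as a KL-based score, are possible, but the agreement-based score is preferred in practice.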
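The agreement-based score (the fraction of style-perturbed predictions that match the original prediction) can be sketched in a few lines. The function name and the use of plain integer labels are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def style_invariance_score(base_label: int, variant_labels: Sequence[int]) -> float:
    """Fraction of style-perturbed predictions that agree with the original one."""
    if not variant_labels:
        raise ValueError("need at least one style-perturbed prediction")
    matches = sum(1 for y in variant_labels if y == base_label)
    return matches / len(variant_labels)


# Example: the original prediction is class 3 and 4 of the 5 variants agree.
print(style_invariance_score(3, [3, 3, 1, 3, 3]))  # 0.8
```

A score near 1 means the prediction survived every style change; a low score flags a sample whose label flips under nuisance perturbations.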
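To mitigate spurious agreement due to style or feature collapse during adaptation, SICL introduces a content relaxation:
- Generate content-perturbed variants of $x$ (whiten, add noise, reproject).
- Compute a prediction-consistency score on these content-perturbed variants.
- Set a relaxation weight from this score.
- Combine the relaxation weight with the style-invariance score to obtain the final correctness likelihood, which serves as a calibrated, instance-wise confidence.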
3. SICL Algorithmic Workflow
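For each sample $x$ after any TTA update, SICL proceeds as follows:
- Forward $x$ through $f_\theta$ to obtain the softmax output $p_0$ and the prediction $\hat{y}_0$.
- Extract an early feature tensor $z$ (e.g., first ResNet block or ViT patch embedding).
- Compute the channel-wise mean $\mu$ and standard deviation $\sigma$ of $z$.
- For $i = 1, \dots, N$, sample Gaussian noise, compute style-perturbed statistics $\tilde{\mu}_i$, $\tilde{\sigma}_i$, and reconstruct the feature $\tilde{z}_i$ with these statistics. Pass $\tilde{z}_i$ through the remaining layers to obtain $p_i$, $\hat{y}_i$.
- Compute the style-invariance score as the fraction of $\hat{y}_i$ matching $\hat{y}_0$.
- (Optional) For $j = 1, \dots, N$, generate content-perturbed features by whitening $z$, adding noise, re-projecting, and evaluating model outputs to compute the content-based consistency score.
- Set the relaxation weight and the final correctness likelihood.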
SICL operates without gradient updates; the computational overhead is $N$ additional forward passes per instance (typically up to $N = 20$).
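The style-perturbation steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the additive Gaussian noise on the channel statistics, the fixed `noise_scale`, the AdaIN-style re-normalization, and the `head` callable (a stand-in for the layers after the feature tap) are all hypothetical choices made for the example.

```python
import numpy as np


def style_perturb(z, n_variants=20, noise_scale=0.1, rng=None, eps=1e-6):
    """Return n_variants restyled copies of a feature map z of shape (C, H, W)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = z.mean(axis=(1, 2), keepdims=True)           # channel-wise mean, (C, 1, 1)
    sigma = z.std(axis=(1, 2), keepdims=True) + eps   # channel-wise std, (C, 1, 1)
    z_norm = (z - mu) / sigma                         # style-free (normalized) feature
    # Assumption: additive Gaussian noise on the statistics, one draw per variant.
    mu_new = mu + noise_scale * rng.standard_normal((n_variants,) + mu.shape)
    sigma_new = np.abs(sigma + noise_scale * rng.standard_normal((n_variants,) + sigma.shape))
    # Broadcasting (N, C, 1, 1) against (C, H, W) yields (N, C, H, W).
    return sigma_new * z_norm + mu_new


def style_agreement(z, head, n_variants=20, noise_scale=0.1, rng=None):
    """Fraction of restyled features whose predicted class matches the original."""
    y0 = int(np.argmax(head(z)))
    variants = style_perturb(z, n_variants, noise_scale, rng)
    labels = [int(np.argmax(head(v))) for v in variants]
    return float(np.mean([y == y0 for y in labels]))


# Toy usage: a hypothetical head that average-pools and applies a linear layer.
rng = np.random.default_rng(0)
z = rng.standard_normal((8, 4, 4))
W = rng.standard_normal((3, 8))
head = lambda feat: W @ feat.mean(axis=(1, 2))
print(style_agreement(z, head, n_variants=20, noise_scale=0.1, rng=rng))
```

No gradients are involved: each variant costs one forward pass through `head`, which matches the backpropagation-free character of the method.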
4. Integration and Applicability
SICL is architected for plug-and-play deployment:
- Interface: the correctness likelihood is computed per sample after each test-time adaptation step, replacing or modulating softmax confidence. TTA algorithms (e.g., entropy minimization, BatchNorm adaptation) are unaffected.
- Computation: $N$ per-sample forward passes; for batch size $B$ this is $B \cdot N$ forward passes per batch, with full parallelizability across the variants.
- Hyperparameters:
  - $N$: number of style/content variants (default 20).
  - Noise scale: scaling of the feature-statistics noise (empirical standard deviation of the feature statistics in the batch).
  - Feature extraction layer: shallow layers are preferred for style capture.
No additional parameters are learned. SICL is agnostic to backbone architecture and can be applied to any TTA procedure.
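Because the variants are independent, the per-sample forward passes can be folded into the batch axis and evaluated in one call. The sketch below illustrates this with NumPy; `tail` is a hypothetical stand-in for the layers after the feature tap, and flat feature vectors of size `D` are used only to keep the example short.

```python
import numpy as np


def batched_agreement(variants, base_logits, tail):
    """Per-sample agreement fraction.

    variants:    (B, N, D) perturbed features, N variants per sample
    base_logits: (B, K) logits of the unperturbed samples
    tail:        callable mapping (M, D) features to (M, K) logits
    """
    B, N = variants.shape[:2]
    flat = variants.reshape(B * N, *variants.shape[2:])  # fold variants into the batch axis
    logits = tail(flat).reshape(B, N, -1)                # a single call covers all B*N rows
    y0 = base_logits.argmax(axis=1)                      # (B,)
    yi = logits.argmax(axis=2)                           # (B, N)
    return (yi == y0[:, None]).mean(axis=1)              # (B,)


# Toy usage with a hypothetical linear tail.
rng = np.random.default_rng(0)
B, N, D, K = 4, 20, 16, 10
W = rng.standard_normal((K, D))
tail = lambda f: f @ W.T
feats = rng.standard_normal((B, D))
variants = feats[:, None, :] + 0.1 * rng.standard_normal((B, N, D))
print(batched_agreement(variants, tail(feats), tail))
```

The memory cost grows with `B * N`, so on constrained hardware the folded batch can be processed in chunks without changing the result.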
5. Empirical Evaluation
SICL was evaluated on CIFAR-10-C, CIFAR-100-C, and ImageNet-C (severity 5), using backbones ResNet-50, ResNet-101, and ViT (Small/Base). TTA methods included TENT, EATA, SAR, RoTTA, and SoTTA (and their LN-only ViT variants). Comparator calibration methods were Vanilla Temperature Scaling (TS), TransCal, PseudoCal, and MC Dropout.
The metric was cumulative Expected Calibration Error (ECE, %), evaluated on two data regimes: Benign (i.i.d. single corruption) and Dynamic (Dirichlet-mixed, time-varying corruptions).
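For reference, the standard binned ECE can be computed as below. This is a generic sketch: the equal-width binning, the default of 10 bins, and the function name are assumptions for illustration, and the paper's exact cumulative protocol over the test stream may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += 1.0 if ok else 0.0
        count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += (count[b] / n) * gap
    return ece


# Two predictions at 0.9 (both right) and two at 0.6 (one right): each bin is off by 0.1.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
print(round(ece, 4))  # 0.1
```

Multiplying the result by 100 gives the percentage form used in the tables; for SICL the `confidences` would be the per-sample correctness likelihoods rather than raw softmax maxima.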
Summary (averaged over 72 conditions):
- MC Dropout: ECE 15–20% (CIFAR-10-C)
- PseudoCal: ECE 6–8% (best static TS)
- SICL: ECE 7.7% (Benign), reducing ECE by 10.4 percentage points from PseudoCal and 15.6 points in Dynamic. Overall, SICL achieves an average reduction in ECE of 13 percentage points relative to the next-best approach.
Table: ECE (%) excerpt from the Benign stream, ResNet-50, CIFAR-10-C
| Method | TENT | EATA | SAR | RoTTA | SoTTA (avg) |
|---|---|---|---|---|---|
| Vanilla TS | 22.7 | 22.3 | 22.2 | 22.9 | 22.8 |
| MC Dropout | 16.3 | 11.1 | 10.2 | 10.0 | 10.4 |
| PseudoCal | 9.4 | 7.8 | 5.2 | 7.9 | 8.4 |
| SICL | 6.4 | 6.9 | 7.5 | 7.1 | 7.5 |
Comparable improvements hold under the Dynamic regime and across datasets and architectures.
6. Ablation and Qualitative Analysis
Several ablation studies and analyses isolate crucial aspects of SICL:
- Style vs. Content Perturbation: Using content perturbations as ensemble variants increases ECE fourfold compared to style perturbations; content preservation is critical.
- Candidate Quality: “ContentVariance” measured by Mahalanobis distance to class centroids is significantly lower for SICL’s style-perturbed variants than for MC Dropout, indicating that SICL preserves task-relevant geometry.
- Style Variance: Style variance via Gram-matrix loss shows that SICL explores a broader style space than MixStyle or Dropout, yielding a richer ensemble.
- Component Ablation: Removing the relaxation factor increases ECE by 1–2 points. Coarse -perturbation is the dominant style perturbation; combining and is optimal.
- Sensitivity to N: ECE10% is reached with as few as style variants; performance improves up to .
7. Significance and Implications
SICL provides a lightweight, robust calibration layer for any test-time adaptation protocol. By operationalizing the style-invariance principle through style-perturbed prediction consistency and a collapse relaxation criterion, SICL offers a correctness likelihood that tracks true accuracy under both stationary and dynamically shifting test distributions. This addresses a key limitation of classical calibration under continual adaptation and yields substantial improvements on standard benchmarks, reducing calibration error by approximately 13 percentage points compared to the strongest prior alternatives (Nam et al., 8 Dec 2025).