SICL: Style Invariance as Correctness Likelihood

Updated 15 December 2025

The paper introduces SICL, a post-hoc uncertainty calibration method that estimates instance-wise prediction correctness via style perturbations.
It leverages controlled, content-preserving transformations to measure consistency across style-altered predictions using a simple scoring mechanism.
SICL reduces Expected Calibration Error (ECE) by about 13 percentage points on average relative to the next-best calibration method, and it plugs into any test-time adaptation framework without backpropagation.

Style Invariance as a Correctness Likelihood (SICL) is a post-hoc uncertainty calibration framework for test-time adaptation (TTA) of deep learning models. SICL leverages the principle that model predictions should be invariant to style alterations (perturbations orthogonal to task-relevant content) while remaining sensitive to content changes, and uses this to estimate the instance-wise likelihood that a prediction is correct. This plug-and-play, backpropagation-free module can be used with any TTA method, addressing the predictive miscalibration that arises when models are continually adapted on non-stationary, unlabeled test distributions (Nam et al., 8 Dec 2025).

1. Theoretical Rationale

TTA procedures adapt deployed models on-the-fly to out-of-distribution data predominantly via entropy minimization or similar objectives, but this often results in overconfident and miscalibrated outputs, especially under distributional shift. Conventional calibration techniques, including temperature scaling and MC Dropout, generally presuppose frozen model parameters or static data distributions, and are therefore inadequate for calibration under dynamic test-time adaptation.

SICL is conceptually grounded in the “style–content” disentanglement established in representation learning (e.g., Gatys et al., Huang & Belongie, von Kügelgen et al.). Let $x$ denote input data, generated by two independent factors: $c$ (content, task-relevant) and $s$ (style, nuisance). In this causal model:

An ideal classifier’s decision is exclusively a function of $c$.
Style-perturbing $x$ while preserving $c$ should not affect the model’s prediction.

From this, prediction consistency under controlled, content-preserving style transformations can serve as a proxy for a model’s correctness: high consistency across style-altered versions of $x$ correlates with genuine confidence, while discordance signals unreliability, especially for samples near decision boundaries or under covariate shift.
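The argument above can be written out in a few lines. The generator $g$, the content-only decision rule $h$, and the ideal classifier $f^{*}$ are notation introduced here for illustration; they are not taken from the paper.

```latex
\begin{align}
  x &= g(c, s), \qquad f^{*}(x) = h(c)
      && \text{(ideal classifier depends on content only)} \\
  \tau(x) &= g(c, s') \;\Longrightarrow\; f^{*}(\tau(x)) = h(c) = f^{*}(x)
      && \text{(style change leaves the decision intact)} \\
  \hat y(\tau_j(x)) &\neq \hat y(x) \ \text{for some } j
      \;\Longrightarrow\; f \text{ depends on } s \text{ at } x
      && \text{(disagreement exposes style dependence)}
\end{align}
```

The last line is the contrapositive of the second: if a content-preserving transformation flips the prediction of the deployed model $f$, then $f$ cannot be a function of $c$ alone at that input, which is why disagreement is treated as a sign of unreliability.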

2. Mathematical Formulation

Given a model $f$ (potentially updated per TTA) with softmax output $p(x) \in \Delta^{K-1}$ for $K$ classes, SICL assesses correctness likelihood as follows:

Generate $N$ style-perturbed variants $\{\tau_j(x)\}_{j=1}^N$ of $x$ via feature-level perturbations (see Section 3).
Let $p_0 = p(x)$ and $p_j = p(\tau_j(x))$ for $j=1,\ldots,N$, with corresponding predictions $\hat y_0 = \arg\max_k p_0[k]$ and $\hat y_j = \arg\max_k p_j[k]$.

The vanilla style-invariance score is

$\gamma^{\rm style}(x) = \frac{1}{N} \sum_{j=1}^N \mathbf{1}\{\hat y_j = \hat y_0\}$

Alternative consistency metrics include $L_{\ell_1}(x) = 1 - \frac{1}{N} \sum_{j=1}^N \|p_0 - p_j\|_1$ or a KL-based score, but $\gamma^{\rm style}$ is preferred in practice.
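As a minimal sketch, both scores can be computed directly from lists of softmax vectors. The function names and the example probabilities below are illustrative assumptions, not part of the paper.

```python
from typing import Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest entry (the first index wins ties)."""
    return max(range(len(probs)), key=lambda k: probs[k])


def style_invariance_score(p0: Sequence[float],
                           variants: Sequence[Sequence[float]]) -> float:
    """gamma_style: fraction of variants whose argmax equals that of p0."""
    if not variants:
        raise ValueError("need at least one perturbed prediction")
    y0 = argmax(p0)
    return sum(argmax(p) == y0 for p in variants) / len(variants)


def l1_consistency(p0: Sequence[float],
                   variants: Sequence[Sequence[float]]) -> float:
    """Soft alternative: 1 minus the mean L1 distance between p0 and each p_j."""
    if not variants:
        raise ValueError("need at least one perturbed prediction")
    mean_dist = sum(
        sum(abs(a - b) for a, b in zip(p0, p)) for p in variants
    ) / len(variants)
    return 1.0 - mean_dist


# Hypothetical softmax outputs for one input and N = 4 style variants.
p0 = [0.7, 0.2, 0.1]
variants = [[0.6, 0.3, 0.1], [0.4, 0.5, 0.1], [0.8, 0.1, 0.1], [0.5, 0.3, 0.2]]

gamma_style = style_invariance_score(p0, variants)  # 3 of 4 variants agree -> 0.75
soft_score = l1_consistency(p0, variants)           # approximately 0.65
```

The hard-vote score only takes values in $\{0, 1/N, \ldots, 1\}$, whereas the $\ell_1$ score is continuous; note that the $\ell_1$ distance between two probability vectors can reach 2, so the soft score is not confined to $[0,1]$ without further scaling.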

To mitigate spurious agreement due to style or feature collapse during adaptation, SICL introduces a content-relaxation:

Generate $N$ content-perturbed variants $\{\zeta_j(x)\}$ (whiten, add noise, reproject).
Compute

$\gamma^{\rm content}(x) = \frac{1}{N} \sum_{j=1}^N \mathbf{1}\{\hat y(\zeta_j(x)) = \hat y_0\}$

The relaxation weight is $\omega_{\rm relax}(x) = 1 - \gamma^{\rm content}(x)$.
The final correctness likelihood is

$L(x) = \omega_{\rm relax}(x) \cdot \gamma^{\rm style}(x)$

yielding $L(x) \in [0,1]$ as a calibrated, instance-wise confidence.
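A short sketch of how the two scores combine, using made-up values, shows why the relaxation weight matters: a model that has collapsed to a single prediction agrees with itself under every perturbation, and the weight drives its likelihood to zero. The function name and the numbers are assumptions for illustration.

```python
def correctness_likelihood(gamma_style: float, gamma_content: float) -> float:
    """L(x) = (1 - gamma_content) * gamma_style, both inputs in [0, 1]."""
    for name, value in (("gamma_style", gamma_style),
                        ("gamma_content", gamma_content)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    omega_relax = 1.0 - gamma_content  # low when content changes are ignored
    return omega_relax * gamma_style


# Stable under style changes, responsive to content changes.
healthy = correctness_likelihood(gamma_style=0.9, gamma_content=0.2)    # 0.8 * 0.9

# Collapsed model: the prediction never changes, whatever is perturbed.
collapsed = correctness_likelihood(gamma_style=1.0, gamma_content=1.0)  # 0.0
```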

3. SICL Algorithmic Workflow

For each sample $x$ after any TTA update, SICL proceeds as follows:

Forward $x$ through $f$ to obtain $p_0$, $\hat y_0$.
Extract early feature tensor $z$ (e.g., first ResNet block or ViT patch embedding).
Compute channel-wise mean $\mu$ and standard deviation $\sigma$ of $z$.
For $j=1,\ldots,N$, sample Gaussian noise $(\epsilon_\mu, \epsilon_\sigma)$, compute style-perturbed statistics $\mu' = \mu + \delta\epsilon_\mu$, $\sigma' = \sigma + \delta\epsilon_\sigma$, and reconstruct the feature $z' = \sigma' \cdot (z-\mu)/\sigma + \mu'$. Reconstruct $x_j'$ from $z'$ and obtain $p_j$, $\hat y_j$.
Compute $\gamma^{\rm style}$ as the fraction of $\hat y_j$ matching $\hat y_0$.
(Optional) For $j=1,\ldots,N$, generate content-perturbed features by whitening $z$, adding noise, and re-projecting, then evaluate the model outputs to compute $\gamma^{\rm content}$.
Set $\omega_{\rm relax} = 1 - \gamma^{\rm content}$ and $L(x) = \omega_{\rm relax} \cdot \gamma^{\rm style}$.

SICL operates without gradient updates; the computational overhead is $N$ additional forward passes per instance (typically $N = 5$ to $20$).
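The style-perturbation steps above can be sketched with NumPy for a single feature tensor of shape (C, H, W). This is an illustration under stated assumptions, not the authors' implementation: `head` stands in for the layers after the feature extraction point, `toy_head` is a made-up classifier that is style-invariant by construction, `delta` is passed in explicitly, and no safeguard is applied if $\sigma'$ becomes non-positive for large `delta`.

```python
import numpy as np


def channel_stats(z, eps=1e-5):
    """Channel-wise mean and std of a (C, H, W) feature tensor."""
    mu = z.mean(axis=(1, 2), keepdims=True)
    sigma = z.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    return mu, sigma


def style_perturb(z, delta, rng):
    """z' = sigma' * (z - mu) / sigma + mu' with Gaussian-perturbed statistics."""
    mu, sigma = channel_stats(z)
    mu_new = mu + delta * rng.standard_normal(mu.shape)
    sigma_new = sigma + delta * rng.standard_normal(sigma.shape)
    return sigma_new * (z - mu) / sigma + mu_new


def style_score(z, head, n, delta, rng):
    """gamma_style: share of n style variants that keep the original prediction."""
    y0 = int(np.argmax(head(z)))
    matches = sum(
        int(np.argmax(head(style_perturb(z, delta, rng)))) == y0 for _ in range(n)
    )
    return matches / n


def toy_head(z):
    """Hypothetical head: normalizes each channel, then compares image halves."""
    mu, sigma = channel_stats(z)
    z_norm = (z - mu) / sigma
    score = z_norm[0, :, : z.shape[2] // 2].mean()  # left half of channel 0
    logits = np.array([score, -score])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


rng = np.random.default_rng(0)
z = rng.standard_normal((3, 4, 4))  # stand-in for an early feature map
gamma_style = style_score(z, toy_head, n=10, delta=0.1, rng=rng)
```

Because `toy_head` re-normalizes each channel before classifying, shifting or rescaling the channel statistics cannot change its decision, so every variant agrees with the original prediction. A head that reads the raw channel means would instead see its agreement drop as `delta` grows.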

4. Integration and Applicability

SICL is architected for plug-and-play deployment:

Interface: $L(x)$ is computed per sample after each test-time adaptation step, replacing or modulating softmax confidence. TTA algorithms (e.g., entropy minimization, BatchNorm adaptation) are unaffected.
Computation: $O(N)$ forward passes per sample; for batch size $B$ this is $O(BN)$, fully parallelizable across the $N$ variants.
Hyperparameters:
- $N$: number of style/content variants (default 20).
- $\delta$: scale of the noise added to the feature statistics (set to the empirical standard deviation of $\mu$ within the batch).
- Feature extraction layer: shallow layers are preferred for style capture.

No additional parameters are learned. SICL is agnostic to backbone architecture and can be applied to any TTA procedure.
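One way to realize the $O(BN)$ cost in a single forward pass is to stack all $B \times N$ variants into one batch. The sketch below does this with NumPy and sets $\delta$ per channel from the spread of $\mu$ across the batch, which is one reading of the hyperparameter description above. The function names and the style-sensitive `toy_batch_head` are assumptions for illustration.

```python
import numpy as np


def batched_style_scores(z, head, n, rng, eps=1e-5):
    """Per-sample gamma_style for features z of shape (B, C, H, W).

    `head` maps (M, C, H, W) features to (M, K) class probabilities.
    """
    b = z.shape[0]
    mu = z.mean(axis=(2, 3), keepdims=True)          # (B, C, 1, 1)
    sigma = z.std(axis=(2, 3), keepdims=True) + eps  # (B, C, 1, 1)
    delta = mu.std(axis=0, keepdims=True)            # (1, C, 1, 1), batch spread of mu
    z_norm = (z - mu) / sigma

    # Insert a variant axis and draw independent noise for each of the B*N variants.
    noise_shape = (b, n) + mu.shape[1:]              # (B, N, C, 1, 1)
    mu_new = mu[:, None] + delta[:, None] * rng.standard_normal(noise_shape)
    sigma_new = sigma[:, None] + delta[:, None] * rng.standard_normal(noise_shape)
    variants = sigma_new * z_norm[:, None] + mu_new  # (B, N, C, H, W)

    # One stacked forward pass over all variants instead of N separate ones.
    flat = variants.reshape((b * n,) + z.shape[1:])
    y0 = head(z).argmax(axis=1)                      # (B,)
    yj = head(flat).argmax(axis=1).reshape(b, n)     # (B, N)
    return (yj == y0[:, None]).mean(axis=1)          # (B,)


def toy_batch_head(z):
    """Hypothetical head that reads the raw channel-0 mean, a style-sensitive cue."""
    score = z[:, 0].mean(axis=(1, 2))
    logits = np.stack([score, -score], axis=1)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3, 8, 8))
scores = batched_style_scores(z, toy_batch_head, n=20, rng=rng)
```

With a batch of one sample the batch spread of $\mu$ is zero, so this particular choice of $\delta$ produces no perturbation; a deployment would need a fallback, such as a running estimate, which is not specified here.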

5. Empirical Evaluation

SICL was evaluated on CIFAR-10-C, CIFAR-100-C, and ImageNet-C (severity 5), using backbones ResNet-50, ResNet-101, and ViT (Small/Base). TTA methods included TENT, EATA, SAR, RoTTA, and SoTTA (and their LN-only ViT variants). Comparator calibration methods were Vanilla Temperature Scaling (TS), TransCal, PseudoCal, and MC Dropout.

The metric was cumulative Expected Calibration Error (ECE, %), evaluated on two data regimes: Benign (i.i.d. single corruption) and Dynamic (Dirichlet-mixed, time-varying corruptions).
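For reference, the standard binned ECE can be sketched as follows. The equal-width binning, the default of 15 bins, and the function name are assumptions; the paper's exact cumulative protocol over the test stream is not reproduced here.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 15) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Overconfident: 90% confidence but only half of the predictions are right.
overconfident = expected_calibration_error([0.9] * 4, [True, True, False, False],
                                           n_bins=10)

# Well calibrated: 75% confidence and three of four predictions are right.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False],
                                        n_bins=10)
```

The function returns a fraction; multiply by 100 to compare with the percentages reported below. Any per-sample confidence can be passed in, whether the maximum softmax probability or $L(x)$.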

Summary (averaged over 72 conditions):

MC Dropout: ECE $\approx$ 15–20% (CIFAR-10-C)
PseudoCal: ECE $\approx$ 6–8% (best static TS)
SICL: ECE $\approx$ 7.7% (Benign), a reported reduction of 10.4 percentage points relative to PseudoCal in the Benign regime and 15.6 points in the Dynamic regime. Overall, SICL achieves an average reduction in ECE of 13 percentage points relative to the next-best approach.

Table: Excerpt from Benign stream, ResNet-50, CIFAR-10-C (ECE in %, lower is better)

Method	TENT	EATA	SAR	RoTTA	SoTTA (avg)
Vanilla TS	22.7	22.3	22.2	22.9	22.8
MC Dropout	16.3	11.1	10.2	10.0	10.4
PseudoCal	9.4	7.8	5.2	7.9	8.4
SICL	6.4	6.9	7.5	7.1	7.5

Comparable improvements hold under the Dynamic regime and across datasets and architectures.

6. Ablation and Qualitative Analysis

Several ablation studies and analyses isolate crucial aspects of SICL:

Style vs. Content Perturbation: Using content perturbations as ensemble variants increases ECE fourfold compared to style perturbations; content preservation is critical.
Candidate Quality: “ContentVariance” measured by Mahalanobis distance to class centroids is significantly lower for SICL’s style-perturbed variants than for MC Dropout, indicating that SICL preserves task-relevant geometry.
Style Variance: Style variance via Gram-matrix loss shows that SICL explores a broader style space than MixStyle or Dropout, yielding a richer ensemble.
Component Ablation: Removing the relaxation factor $\omega_{\rm relax}$ increases ECE by 1–2 points. Coarse $\mu$-perturbation is the dominant style perturbation; combining $\mu$ and $\sigma$ is optimal.
Sensitivity to N: ECE $<$ 10% is reached with as few as $N=5$ style variants; performance improves up to $N=20$.

7. Significance and Implications

SICL provides a lightweight, robust calibration layer for any test-time adaptation protocol. By operationalizing the style-invariance principle through style-perturbed prediction consistency and a collapse relaxation criterion, SICL offers a correctness likelihood that tracks true accuracy under both stationary and dynamically shifting test distributions. This addresses a key limitation of classical calibration under continual adaptation and yields substantial improvements on standard benchmarks, reducing calibration error by approximately 13 percentage points compared to the strongest prior alternatives (Nam et al., 8 Dec 2025).

References

Nam et al. (2025). Towards Reliable Test-Time Adaptation: Style Invariance as a Correctness Likelihood.
